building-with-llms


Building with LLMs

Help the user build effective AI applications using practical techniques from 60 product leaders and AI practitioners.

How to Help

When the user asks for help building with LLMs:
  1. Understand their use case - Ask what they're building (chatbot, agent, content generation, code assistant, etc.)
  2. Diagnose the problem - Help identify whether issues are prompt-related, context-related, or model-selection-related
  3. Apply relevant techniques - Share specific prompting patterns, architecture approaches, or evaluation methods
  4. Challenge common mistakes - Push back on over-reliance on vibes, skipping evals, or using the wrong model for the task

Core Principles

Prompting

  • Few-shot examples beat descriptions. Sander Schulhoff: "If there's one technique I'd recommend, it's few-shot prompting—giving examples of what you want. Instead of describing your writing style, paste a few previous emails and say 'write like this.'"
  • Provide your point of view. Wes Kao: "Sharing my POV makes output way better. Don't just ask 'What would you say?' Tell it: 'I want to say no, but I'd like to preserve the relationship. Here's what I'd ideally do...'"
  • Use decomposition for complex tasks. Sander Schulhoff: "Ask 'What subproblems need solving first?' Get the list, solve each one, then synthesize. Don't ask the model to solve everything at once."
  • Self-criticism improves output. Sander Schulhoff: "Ask the LLM to check and critique its own response, then improve it. Models can catch their own errors when prompted to look."
  • Roles help style, not accuracy. Sander Schulhoff: "Roles like 'Act as a professor' don't help accuracy tasks. But they're great for controlling tone and style in creative work."
  • Put context at the beginning. Sander Schulhoff: "Place long context at the start of your prompt. It gets cached (cheaper), and the model won't forget its task when processing."
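The few-shot advice above can be sketched as a small helper that builds a chat-style message list: the task instruction first, then example input/output pairs, then the real query. This is a minimal illustration assuming the common system/user/assistant message shape; the actual model call is omitted, and the example emails are placeholders.

```python
# Few-shot prompting: show the model examples of what you want instead
# of describing it. Each example becomes a user/assistant message pair
# preceding the real request.

def build_few_shot_prompt(task: str, examples: list[tuple[str, str]], query: str) -> list[dict]:
    """Build a chat-style message list: instruction, example pairs, then the query."""
    messages = [{"role": "system", "content": task}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages

# Instead of describing a writing style, paste previous emails as examples.
examples = [
    ("Decline a meeting invite", "Thanks for thinking of me! I can't make it this week..."),
    ("Ask for a deadline extension", "Quick heads-up: I need two more days on the draft..."),
]
messages = build_few_shot_prompt(
    "Write emails in the user's voice, matching the examples.",
    examples,
    "Follow up on an unpaid invoice",
)
```

The resulting `messages` list can be passed to any chat-completion API that accepts role/content pairs.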
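The decomposition pattern can likewise be sketched as a three-step pipeline: ask the model what subproblems need solving first, solve each one, then synthesize. `call_model` below is a stub standing in for a real chat-completion call, so the flow can be shown end to end; its canned responses are purely illustrative.

```python
# Decomposition: instead of one monolithic request, first ask for the
# subproblems, solve each independently, then synthesize the results.

def call_model(prompt: str) -> str:
    # Placeholder: replace with a real model call. The stub returns a
    # numbered subproblem list when asked, and echoes other prompts.
    if prompt.startswith("List the subproblems"):
        return "1. Parse the input\n2. Validate the fields\n3. Format the report"
    return f"[answer to: {prompt[:40]}]"

def solve_by_decomposition(task: str) -> str:
    # Step 1: ask "What subproblems need solving first?"
    listing = call_model(f"List the subproblems that must be solved first for: {task}")
    subproblems = [line.split(". ", 1)[1] for line in listing.splitlines() if ". " in line]

    # Step 2: solve each subproblem on its own.
    partials = [call_model(f"Solve this subproblem of '{task}': {sub}") for sub in subproblems]

    # Step 3: synthesize the partial answers into a final result.
    joined = "\n".join(partials)
    return call_model(f"Combine these partial solutions into a final answer for '{task}':\n{joined}")

result = solve_by_decomposition("Generate a weekly metrics report")
```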

Architecture

  • Context engineering > prompt engineering. Bret Taylor: "If a model makes a bad decision, it's usually lack of context. Fix it at the root—feed better data via MCP or RAG."
  • RAG quality = data prep quality. Chip Huyen: "The biggest gains come from data preparation, not vector database choice. Rewrite source data into Q&A format. Add annotations for context humans take for granted."
  • Layer models for robustness. Bret Taylor: "Having AI supervise AI is effective. Layer cognitive steps—one model generates, another reviews. This moves you from 90% to 99% accuracy."
  • Use specialized models for specialized tasks. Amjad Masad: "We use Claude Sonnet for coding, other models for critiquing. A 'society of models' with different roles outperforms one general model."
  • 200ms is the latency threshold. Ryan J. Salva (GitHub Copilot): "The sweet spot for real-time suggestions is ~200ms. Slower feels like an interruption. Design your architecture around this constraint."
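The generate-then-review layering can be sketched as a loop where one model drafts and a second model critiques, accepting output only once the reviewer approves. Both `generate` and `review` are stubs for calls to two different models, and the "APPROVED" convention is an assumption of this sketch, not a library API.

```python
# AI supervising AI: a generator model drafts, a separate reviewer model
# critiques, and the critique is fed back until the draft passes.

def generate(task: str, feedback: str = "") -> str:
    # Placeholder for the drafting model; a real call would include the
    # reviewer's feedback in the prompt.
    return f"draft for {task}" + (" (revised)" if feedback else "")

def review(draft: str) -> str:
    # Placeholder for the reviewing model. Convention here: return
    # "APPROVED" or concrete feedback for the next revision.
    return "APPROVED" if "revised" in draft else "Add error handling."

def generate_with_review(task: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        verdict = review(draft)
        if verdict == "APPROVED":
            return draft
        feedback = verdict  # feed the critique into the next draft
    return draft  # best effort after max_rounds

result = generate_with_review("summarize the incident report")
```

Using two different models for the two roles (per the "society of models" point above) keeps the reviewer from sharing the generator's blind spots.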

Evaluation

  • Evals are mandatory, not optional. Kevin Weil (OpenAI): "Writing evals is becoming a core product skill. A 60% reliable model needs different UX than 95% or 99.5%. You can't design without knowing your accuracy."
  • Binary scores > Likert scales. Hamel Husain: "Force Pass/Fail, not 1-5 scores. Scales produce meaningless averages like '3.7'. Binary forces real decisions."
  • Start with vibes, evolve to evals. Howie Liu: "For novel products, start with open-ended vibes testing. Only move to formal evals once use cases converge."
  • Validate your LLM judge. Hamel Husain: "If using LLM-as-judge, you must eval the eval. Measure agreement with human experts. Iterate until it aligns."
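The binary-scoring and judge-validation points above can be sketched together: score each example Pass/Fail with an LLM judge, then measure how often the judge agrees with human expert labels. The judge here is a keyword stub standing in for a real model call, and the labeled examples are invented for illustration.

```python
# Binary evals plus judge validation: score Pass/Fail (no 1-5 scales)
# and "eval the eval" by measuring agreement with human labels.

def llm_judge(question: str, answer: str) -> bool:
    # Placeholder judge: a real implementation would prompt a model for
    # a strict pass/fail verdict on the answer.
    return "refund" in answer.lower()

# Each example: (question, model answer, human expert pass/fail label).
labeled_examples = [
    ("How do I get a refund?", "Request a refund from your order page.", True),
    ("How do I get a refund?", "Have you tried turning it off and on?", False),
    ("Can I cancel my plan?", "Yes, refunds are issued within 5 days.", False),
]

def judge_agreement(examples) -> float:
    """Fraction of examples where the LLM judge matches the human label."""
    matches = sum(llm_judge(q, a) == human for q, a, human in examples)
    return matches / len(examples)

agreement = judge_agreement(labeled_examples)
# Iterate on the judge prompt until agreement is high enough to trust.
```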

Building & Iteration

  • Retry failures—models are stochastic. Benjamin Mann (Anthropic): "If it fails, try the exact same prompt again. Success rates are much higher on retry than on banging on a broken approach."
  • Be ambitious in your asks. Benjamin Mann: "The difference between effective and ineffective Claude Code users: ambitious requests. Ask for the big change, not incremental tweaks."
  • Cross-pollinate between models. Guillermo Rauch: "When stuck after 100+ iterations, copy the code to a different model (e.g., from v0 to ChatGPT o1). Fresh perspective unblocks you."
  • Compounding engineering. Dan Shipper: "For every unit of work, make the next unit easier. Save prompts that work. Build a library. Your team's AI effectiveness compounds."
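Because model output is stochastic, the retry advice above can be sketched as a wrapper that re-sends the exact same prompt a bounded number of times before giving up. `call_model` and its failure behavior are illustrative stubs (here it fails randomly to mimic a flaky generation); the success check would be whatever validity test fits your output, such as parseable JSON.

```python
import random

# Models are stochastic: the same prompt can fail once and succeed on a
# retry. Retry the identical prompt before rewriting your approach.

def call_model(prompt: str, rng: random.Random) -> str:
    # Placeholder: returns an empty string ~half the time to mimic a
    # flaky generation. Replace with a real API call.
    return "ok: " + prompt if rng.random() < 0.5 else ""

def call_with_retries(prompt: str, attempts: int = 5, seed: int = 0) -> str:
    rng = random.Random(seed)  # seeded only so this sketch is deterministic
    for _ in range(attempts):
        result = call_model(prompt, rng)  # identical prompt each time
        if result:  # swap in your own success check (schema, parse, etc.)
            return result
    raise RuntimeError(f"no valid output after {attempts} attempts")

answer = call_with_retries("Summarize the changelog")
```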

Working with AI Tools

  • Learn to read and debug, not memorize syntax. Amjad Masad: "The ROI on coding doubles every 6 months because AI amplifies it. Focus on reading code and debugging—syntax is handled."
  • Use chat mode to understand. Anton Osika: "Use 'chat mode' to ask the AI to explain its logic. 'Why did you do this? What am I missing?' Treat it as a tutor."
  • Vibe coding is a real skill. Elena Verna: "I put vibe coding on my resume. Build functional prototypes with natural language before handing to engineering."

Questions to Help Users

  • "What are you building and what's the core user problem?"
  • "What does the model get wrong most often?"
  • "Are you measuring success systematically or going on vibes?"
  • "What context does the model have access to?"
  • "Have you tried few-shot examples?"
  • "What happens when you retry failed prompts?"

Common Mistakes to Flag

  • Vibes forever - Eventually you need real evals, not just "it feels good"
  • Prompt-only thinking - Often the fix is better context, not better prompts
  • One model for everything - Different models excel at different tasks
  • Giving up after one failure - Stochastic systems need retries
  • Skipping the human review - AI output needs human validation, especially early on

Deep Dive

For all 110 insights from 60 guests, see
references/guest-insights.md

Related Skills

  • AI Product Strategy
  • AI Evals
  • Vibe Coding
  • Evaluating New Technology