Prompt Engineering for LLM Pipelines

Core Philosophy: Bridging the Two Gulfs

Effective prompt engineering is fundamentally about closing two gaps between human intent and model behavior. Understanding which gap you're dealing with determines whether prompt refinement will actually solve your problem.

The Gulf of Specification (Developer → LLM)

This gulf separates what you mean from what you actually wrote in the prompt. Your intent — the task you want the LLM to perform — is often only loosely captured by the words you write. Specifying tasks precisely in natural language is surprisingly hard.
Even prompts that seem clear often leave crucial details unstated. For example:
"Extract the sender's name and summarize the key requests in this email."
This sounds specific, but critical questions are left unanswered:
  • Should the summary be a paragraph or a bulleted list?
  • Should the sender be the display name, the full email address, or both?
  • Should the summary include implicit requests, or only explicit ones?
  • How concise or detailed should the summary be?
Without complete instructions, the model is forced to guess your true intent, producing inconsistent outputs. Underspecified prompts are usually a direct result of not looking at real data — you don't know what edge cases and ambiguities exist until you see them.
Key insight: Prompt clarity often matters as much as task complexity. Many teams rush to build evaluators for preferences they never specified in the prompt (like concise responses or a specific structure). The better approach is to first include such instructions explicitly, and only create an evaluator if the LLM still fails to follow them.
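
The open questions above can each be answered directly in the prompt. A minimal sketch of a fully specified version, as a Python template (the rule wording is illustrative, not from the original):

```python
# Each bullet-point ambiguity from above is resolved with an explicit rule.
EMAIL_PROMPT = """\
Extract the sender's name and summarize the key requests in this email.

Rules:
- Report the sender as both display name and email address; write
  "Unknown" for whichever is missing.
- Summarize as a bulleted list, one bullet per request.
- Include implicit requests and mark them with "(implied)".
- Keep each bullet under 15 words.

Email:
{email}
"""

def build_prompt(email: str) -> str:
    """Fill the template with one concrete email."""
    return EMAIL_PROMPT.format(email=email)
```

Whether these are the right answers matters less than that they are stated at all; any answer removes the model's need to guess.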

The Gulf of Generalization (Data → LLM)

This gulf separates your data from the model's actual behavior across diverse inputs. Even if prompts are carefully written, LLMs may behave inconsistently on different inputs.
Example: An email processing pipeline might encounter an email mentioning a public figure like "Elon Musk" in the body. The model might mistakenly extract that name as the sender, even though it's unrelated to the actual email metadata. This is not a prompting error — it's a generalization failure where the model applies instructions incorrectly on unusual inputs.
The Gulf of Generalization will always exist to some degree. No model will ever be perfectly accurate on all inputs.

Why This Distinction Matters

Fix specification first, then measure generalization. There are two reasons:
  1. Efficiency: Many specification failures can be resolved rapidly by simply adding clarity or detail to an existing prompt. It's wasteful to build an automated evaluator for a failure mode that a prompt edit would fix.
  2. Measurement validity: You want evaluations to reflect the LLM's ability to generalize from clear instructions, not its capacity to decipher your ambiguous intent. Evaluating poorly specified tasks essentially measures how well the LLM can "read your mind," which isn't reliable.
Decision framework when you see a failure:
  • Ask: "Did I clearly specify what I wanted?" If no → fix the prompt (Specification issue).
  • Ask: "Were the instructions clear but the model still got it wrong?" If yes → this is a Generalization issue worth building an evaluator for.
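
The two triage questions can be encoded as a tiny routing helper; the labels are hypothetical, mirroring the framework above:

```python
def classify_failure(clearly_specified: bool, model_still_wrong: bool) -> str:
    """Route an observed failure using the two triage questions above.

    clearly_specified: did the prompt explicitly state the desired behavior?
    model_still_wrong: did the model violate instructions that were clear?
    """
    if not clearly_specified:
        return "specification"    # fix the prompt first
    if model_still_wrong:
        return "generalization"   # worth building an evaluator
    return "resolved"             # no remaining failure
```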

Prompt Structure: Seven Components

A well-structured prompt typically includes several key pieces. Not every prompt needs all of them, but knowing the full toolkit helps you decide what's needed.

1. Role and Objective

Clearly define the persona or role the LLM should adopt and its overall goal. This sets the stage for desired behavior and helps guide tone and reasoning style, especially for open-ended tasks.
You are an expert technical writer tasked with explaining complex AI concepts to a non-technical audience.
You are a careful tax advisor reviewing client filings for potential issues.

2. Instructions / Response Rules

This is the core component. Provide clear, specific, and unambiguous directives. Modern models interpret instructions literally, so be explicit about what to do and what not to do.
Use bullet points or numbered lists for multiple instructions. For complex instruction sets, break them into sub-categories.

Task

Summarize the following research paper abstract.

Constraints

  • The summary must be exactly three sentences long.
  • Avoid using technical jargon above a high-school reading level.
  • Do not include any personal opinions or interpretations.

Tone and Style

  • Use active voice.
  • Write for a general audience.

3. Context

The relevant background information, data, or text the LLM needs. This could be a customer email, a document to summarize, a code snippet to debug, or user dialogue history.
When providing multiple documents or long context, clear delimiters are crucial (see Component 7).
<customer_email>
[Insert the full text of the customer email here]
</customer_email>

4. Examples (Few-Shot Prompting)

Provide one or more examples of desired input-output pairs. This is highly effective for guiding the model towards the correct format, style, and level of detail. Examples can also clarify nuanced instructions or demonstrate complex tool usage.
Critical rule: Ensure that any important behavior demonstrated in your examples is also explicitly stated in your rules/instructions. Examples illustrate; rules specify.

Example

Input email: "Hi team, can we move the Thursday standup to 2pm? Also, please review the Q3 deck before Friday."
Output: { "sender": "Unknown (no signature)", "requests": [ "Reschedule Thursday standup to 2pm", "Review Q3 deck before Friday" ], "urgency": "medium" }
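
Few-shot pairs like the one above are often passed as alternating chat turns rather than inline text; a sketch, with the message schema (`role`/`content` dicts) as an assumption about your chat API:

```python
def few_shot_messages(system: str,
                      examples: list[tuple[str, str]],
                      query: str) -> list[dict]:
    """Encode input/output example pairs as user/assistant turns."""
    messages = [{"role": "system", "content": system}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages
```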

5. Reasoning Steps (Chain-of-Thought)

For complex problems, instruct the model to think step-by-step or outline a specific reasoning process. This encourages the model to break down the problem and leads to more accurate outputs.
Before generating the summary, first identify the main hypothesis, then list the key supporting evidence, and finally explain the primary conclusion. Then, write the summary.

6. Output Formatting Constraints

Explicitly define the desired structure, format, or constraints for the response. This is critical for programmatic use of the output.
Respond using only JSON format with the following keys:
- sender_name (string)
- main_issue (string) 
- suggested_action_items (array of strings)
Ensure your response is a single paragraph and ends with a question to the user.
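
When the output must be machine-readable JSON as in the first example, it pays to validate it immediately after the model call. A sketch assuming the three keys shown above:

```python
import json

REQUIRED_KEYS = {"sender_name", "main_issue", "suggested_action_items"}

def parse_response(raw: str) -> dict:
    """Parse a model response and enforce the output contract above."""
    data = json.loads(raw)  # JSONDecodeError (a ValueError) on non-JSON output
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["suggested_action_items"], list):
        raise ValueError("suggested_action_items must be an array")
    return data
```

Failing fast here turns silent format drift into an explicit error you can count during evaluation.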

7. Delimiters and Structure

Use clear delimiters (Markdown headers, triple backticks, XML tags) to separate different parts of your prompt. This helps the model understand distinct components, especially in long or complex prompts.
Recommended organization for complex prompts:
  1. Overarching instructions or role definitions at the beginning
  2. Context and examples in the middle
  3. Reiterate key instructions or output format requirements at the end
For cache efficiency: Place static instructions before any user-provided or changing data. This maximizes KV cache reuse across requests and reduces inference cost.
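
The ordering advice can be followed mechanically when assembling messages. A sketch in which the static system prompt (wording illustrative) always precedes per-request data, so the shared prefix stays byte-identical across calls:

```python
SYSTEM_PROMPT = (
    "You are an email-processing assistant.\n"
    "Follow the extraction rules exactly and reply in JSON."
)  # static: identical for every request, so its KV cache can be reused

def build_messages(user_email: str) -> list[dict]:
    """Static instructions first, changing data last."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": f"<customer_email>\n{user_email}\n</customer_email>"},
    ]
```

Interleaving per-request data into the system prompt would change the prefix on every call and defeat cache reuse.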

The Iterative Refinement Process

Finding the perfect prompt is rarely immediate. It's an iterative cycle:
  1. Write a prompt
  2. Test it on various inputs (especially real data when available)
  3. Analyze the outputs — identify failure modes
  4. Classify each failure: Specification or Generalization?
  5. Refine the prompt for Specification failures
  6. Repeat
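
The cycle above can be sketched as a loop; `run_model`, `is_failure`, and `fix_prompt` are placeholders for your pipeline call, your failure analysis, and your (usually manual) prompt revision:

```python
def refine(prompt, inputs, run_model, is_failure, fix_prompt, max_rounds=3):
    """Iterate: test on real inputs, collect failures, refine the prompt.

    run_model(prompt, x) -> output; is_failure(x, out) -> bool;
    fix_prompt(prompt, failures) -> revised prompt. All are stand-ins.
    """
    failures = []
    for _ in range(max_rounds):
        failures = [(x, out) for x in inputs
                    if is_failure(x, out := run_model(prompt, x))]
        if not failures:
            break
        prompt = fix_prompt(prompt, failures)
    return prompt, failures
```

In practice `fix_prompt` is you, reading the failure list and editing the prompt; the loop just makes the bookkeeping explicit.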

Important Warning on Prompt Optimization Tools

Avoid automated prompt-writing and optimization tools in the early stages of development. Writing the prompt yourself forces you to externalize your specification and clarify your thinking. People who delegate prompt writing to a black box too aggressively struggle to fully understand their failure modes. After you have experience looking at your data and understanding failures, you can introduce these tools — but do so carefully.

Quick Wins for Prompt Improvement

When a prompt isn't working well, try these low-effort, high-impact changes first:
  1. Clarify ambiguous wording. If the model gets confused about phrasing (e.g., "West Berkeley" vs. "Berkeley West"), update the prompt to be more explicit.
  2. Add a few examples. Include 2–3 representative input/output pairs targeting observed failure cases.
  3. Use role-based guidance. A persona like "You are a careful tax advisor..." can guide tone and reasoning, especially for open-ended tasks.
  4. Ask for step-by-step reasoning. For tasks involving logic or multiple steps, explicitly asking the model to "think step by step" improves correctness and completeness.
  5. Specify what NOT to do. Often, failures come from the model doing something you didn't want but also didn't explicitly prohibit.
  6. Break complex tasks into subtasks. Instead of one massive prompt, decompose into sequential steps (extract → filter → summarize → format).
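
Point 6 can be sketched as a chain of small prompt calls instead of one monolith; `llm` is a placeholder for a single model call, and the step instructions are illustrative:

```python
def process_email(email: str, llm) -> str:
    """Decompose one big task into sequential, individually testable steps.

    llm(instruction, text) -> str stands in for one model call.
    """
    requests = llm("List every request in this email, one per line.", email)
    actionable = llm("Keep only requests addressed to our team.", requests)
    summary = llm("Summarize these requests in two sentences.", actionable)
    return llm("Format this summary as a bulleted list.", summary)
```

Each step now has its own prompt to refine and its own output to inspect, which makes failure triage far easier than with a single monolithic prompt.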

LLM-as-Judge Prompt Design

When building automated evaluators that use an LLM to judge outputs, the same principles apply — but with additional structure. Each evaluator should target a single failure mode with a binary Pass/Fail judgment.

Four Essential Components

  1. Clear task and evaluation criterion. Focus on one well-scoped failure mode. Vague criteria lead to unreliable judgments. Instead of asking whether an email is "good," ask whether "the tone is appropriate for a luxury buyer persona."
  2. Precise Pass/Fail definitions. Define exactly what counts as Pass (failure absent) and Fail (failure present), grounded in your observed failure descriptions.
  3. Few-shot examples. Include labeled outputs that clearly Pass and clearly Fail. Draw these from human-labeled traces when possible. If using finer-grained scales (e.g., 1–3 severity), include examples for every point on the scale.
  4. Structured output format. The judge should respond in a consistent, machine-readable format, typically JSON with a reasoning key (a 1–2 sentence explanation) and an answer key ("Pass" or "Fail").
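
Given the structured format in point 4, judge responses can be parsed and aggregated into a pass rate. A sketch assuming each raw response is a JSON object with reasoning and answer keys:

```python
import json

def parse_judgment(raw: str) -> bool:
    """Return True for Pass, False for Fail; reject anything else."""
    data = json.loads(raw)
    answer = data["answer"]
    if answer not in ("Pass", "Fail"):
        raise ValueError(f"unexpected answer: {answer!r}")
    return answer == "Pass"

def pass_rate(raw_judgments: list[str]) -> float:
    """Fraction of outputs the judge marked Pass."""
    results = [parse_judgment(r) for r in raw_judgments]
    return sum(results) / len(results)
```

Rejecting anything other than the two allowed labels keeps a drifting judge from silently corrupting your metrics.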

Example Judge Prompt

You are an expert evaluator assessing outputs from a real estate assistant chatbot.

Your Task: Determine if the assistant-generated email to a client uses a tone appropriate for the specified client persona.

Evaluation Criterion: Tone Appropriateness

Definition of Pass/Fail:
- Fail: The email's tone, language, or level of formality is inconsistent with or unsuitable for the described client persona.
- Pass: The email's tone, language, and formality align well with the client persona's expectations.

Client Personas Overview:
- Luxury Buyers: Expect polished, highly professional, and deferential language. Avoid slang or excessive casualness.
- First-Time Homebuyers: Benefit from a friendly, reassuring, and patient tone. Avoid overly complex jargon.
- Investors: Prefer concise, data-driven, and direct communication. Avoid effusiveness.

Output Format: Return your evaluation as a JSON object with two keys:
1. reasoning: A brief explanation (1-2 sentences) for your decision.
2. answer: Either "Pass" or "Fail".

Examples:
---
Input:
Client Persona: Luxury Buyer
Generated Email: "Hey there! Got some awesome listings for you. Super views, totally posh. Wanna check 'em out?"

Evaluation: {"reasoning": "Uses excessive slang and an overly casual tone unsuitable for a Luxury Buyer persona.", "answer": "Fail"}
---
Input:
Client Persona: First-Time Homebuyer
Generated Email: "Good morning! I've found a few properties that seem like a great fit for getting started in the market, keeping your budget in mind."

Evaluation: {"reasoning": "The tone is friendly, reassuring, and avoids jargon — appropriate for a first-time homebuyer.", "answer": "Pass"}
---

Now evaluate the following:

Client Persona: {persona}
Generated Email: {email}
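
The {persona} and {email} placeholders above are filled at call time. Note that str.format would choke on the literal JSON braces inside the few-shot examples, so plain replacement is safer; a sketch in which JUDGE_TEMPLATE abbreviates the full prompt above:

```python
JUDGE_TEMPLATE = """\
...judge instructions and few-shot examples as above...
Now evaluate the following:

Client Persona: {persona}
Generated Email: {email}
"""

def fill_judge_prompt(persona: str, email: str) -> str:
    """Substitute placeholders without disturbing the JSON braces
    that appear inside the few-shot examples."""
    return (JUDGE_TEMPLATE
            .replace("{persona}", persona)
            .replace("{email}", email))
```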

Anti-Patterns to Avoid

  • Vague instructions: "Make it good" or "summarize well" — these force the model to guess your criteria.
  • Missing edge case handling: Not specifying what to do when expected data is absent (e.g., no sender signature in an email).
  • Likert scales in evaluation: Complex scoring scales (1-5) are harder to calibrate than binary Pass/Fail. Prefer binary judgments.
  • Examples without rules: Demonstrating behavior in examples that isn't also stated in the instructions. The model may not generalize from examples alone.
  • Overly long monolithic prompts: When a prompt exceeds a few hundred words, consider decomposing the task into steps.
  • Evaluating unspecified behavior: Building evaluators for things you never told the model to do. Fix the prompt first.

Workflow: Applying This Skill

When asked to help with a prompt, follow this process:
  1. Understand the task: What is the LLM supposed to do? What data does it operate on?
  2. Identify the current failure: Is the output wrong? Inconsistent? In the wrong format?
  3. Classify the failure: Is this a Specification gap (unclear instructions) or a Generalization gap (model limitation)?
  4. For Specification failures: Apply the seven components framework. Which components are missing or underspecified?
  5. For Generalization failures: Consider task decomposition, adding examples targeting the failure case, or recommending an evaluator.
  6. Review against anti-patterns: Check the prompt doesn't fall into common traps.
  7. Recommend iteration: Suggest testing the revised prompt on diverse inputs, especially edge cases from real data.
When writing prompts from scratch, start with Components 1 (Role), 2 (Instructions), and 6 (Output Format) as the minimum viable prompt, then layer in Context, Examples, Reasoning Steps, and Delimiters as complexity demands.
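
That minimum viable prompt can be sketched as three assembled components; the section labels are illustrative:

```python
def minimal_prompt(role: str, instructions: list[str], output_format: str) -> str:
    """Assemble Components 1 (Role), 2 (Instructions), and 6 (Output Format)."""
    rules = "\n".join(f"- {rule}" for rule in instructions)
    return f"{role}\n\nInstructions:\n{rules}\n\nOutput format:\n{output_format}"
```

Context, examples, reasoning steps, and delimiters can then be layered in as additional sections once the task demands them.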