prompt-engineer

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Prompt Engineer

提示词工程师

Overview

概述

This skill covers systematic design and iteration of prompts for large language models (LLMs). It applies proven techniques — zero-shot, few-shot learning, chain-of-thought reasoning, role prompting, structured output constraints, system prompt design, and multi-step prompt chaining — to improve accuracy, consistency, and reliability of LLM outputs. The skill is applicable to any LLM API (OpenAI, Anthropic, Gemini, Mistral, open-source models) and covers both single-turn and multi-turn conversation design, as well as production-grade prompt templates with variable injection.
本技能涵盖大语言模型(LLM)提示词的系统化设计与迭代。它应用经过验证的技术——零样本、少样本学习、思维链推理、角色提示、结构化输出约束、系统提示词设计以及多步骤提示词链——来提升LLM输出的准确性、一致性和可靠性。该技能适用于所有LLM API(OpenAI、Anthropic、Gemini、Mistral以及开源模型),涵盖单轮和多轮对话设计,以及支持变量注入的生产级提示词模板。

When to Use

适用场景

  • A prompt produces inconsistent, vague, or off-format outputs and needs iteration
  • Designing prompts that must return structured JSON, XML, Markdown, or CSV output
  • Building few-shot examples to guide classification, extraction, or transformation tasks
  • Creating system prompts that establish persona, tone, constraints, or output rules
  • Chaining multiple prompts together for complex multi-step reasoning tasks
  • Reducing hallucination by adding grounding instructions, citations requirements, or self-checks
  • Optimizing a prompt for a specific model (GPT-4, Claude, Llama, etc.) given its strengths
  • Converting a vague user request into a precise, production-ready prompt template
  • 提示词输出不一致、模糊或格式不符,需要迭代优化
  • 设计需返回结构化JSON、XML、Markdown或CSV格式的提示词
  • 构建少样本示例以指导分类、提取或转换任务
  • 创建用于设定角色、语气、约束或输出规则的系统提示词
  • 将多个提示词链接起来完成复杂的多步骤推理任务
  • 通过添加基础指令、引用要求或自我检查来减少幻觉输出
  • 根据目标模型(GPT-4、Claude、Llama等)的优势优化提示词
  • 将模糊的用户请求转换为精准的生产级提示词模板

When NOT to Use

不适用场景

  • Fine-tuning or training a model on new data (use model training skills)
  • Evaluating model quality across a benchmark suite (use eval-designer skill)
  • Writing application code that calls the LLM API (use a coding skill)
  • Comparing different LLMs for a use case (use model-comparator skill)
  • RAG pipeline design (retrieval-augmented generation requires its own architecture skill)
  • 在新数据上微调或训练模型(请使用模型训练类技能)
  • 跨基准套件评估模型质量(请使用eval-designer技能)
  • 编写调用LLM API的应用代码(请使用编码类技能)
  • 针对特定用例对比不同LLM(请使用model-comparator技能)
  • RAG管道设计(检索增强生成需要专用的架构类技能)

Quick Reference

快速参考

TaskApproach
Get consistent structured outputAdd explicit format spec + JSON schema example in the prompt
Improve reasoning accuracyUse chain-of-thought: "Think step by step before answering"
Classify text reliablyProvide 2–3 labeled few-shot examples per class
Set model persona and constraintsWrite a detailed system prompt before the user turn
Handle long complex tasksBreak into a prompt chain with intermediate outputs
Reduce hallucinationsInstruct model to cite sources or say "I don't know" explicitly
Make outputs deterministicLower temperature + explicit format constraints
任务方法
获取一致的结构化输出在提示词中添加明确的格式规范+JSON schema示例
提升推理准确性使用思维链:“在回答前逐步思考”
可靠地分类文本为每个类别提供2-3个带标签的少样本示例
设定模型角色与约束在用户轮次前编写详细的系统提示词
处理长复杂任务将其拆分为带中间输出的提示词链
减少幻觉输出指示模型引用来源或明确说“我不知道”
让输出具有确定性降低温度参数+添加明确的格式约束

Instructions

操作步骤

  1. Define the task precisely — Write a one-sentence task definition: what the model must do, what the input is, and what the output must look like. Vague tasks produce vague outputs. Example: "Extract all company names and their associated revenue figures from the following earnings call transcript and return them as a JSON array."
  2. Choose the prompting technique — Select based on task complexity:
    • Zero-shot: Simple tasks with clear instructions. No examples needed.
    • Few-shot: Classification, formatting, or style tasks. Provide 2–5 labeled examples.
    • Chain-of-thought (CoT): Math, logic, multi-step reasoning. Add "Think step by step."
    • Role prompting: Tasks requiring expertise or persona. "You are a senior tax attorney…"
    • Self-consistency: Run the same CoT prompt N times and majority-vote the answer.
    • Prompt chaining: Decompose a complex task into sequential prompts where each feeds the next.
  3. Write the system prompt — For APIs that support system prompts (OpenAI, Anthropic), put persistent instructions here: role, output format, constraints, what to do when uncertain. Keep it under 500 tokens unless the task genuinely requires more context.
  4. Structure the user prompt — Use clear delimiters to separate instructions from input data. Use XML tags (
    <document>
    ,
    <query>
    ), triple backticks, or
    ---
    separators. Place instructions before the data, not after.
  5. Specify output format explicitly — Tell the model exactly what format to use. If JSON, provide the schema or a filled example. If Markdown, show the heading structure. If a list, show how items should be formatted. Include a negative example if there is a common wrong format to avoid.
  6. Add few-shot examples — For classification or extraction tasks, include 2–5 examples in the prompt. Format them identically to the real input/output pair. Choose examples that cover edge cases and are representative of the real distribution.
  7. Iterate and test — Test on at least 10 representative inputs. Track: did the model follow the format? Did it hallucinate? Was the reasoning correct? Identify failure patterns and add instructions or examples to address them.
  8. Version and document the prompt — Save prompts in a template file with variable placeholders (
    {{input}}
    ). Document what model version it was tested on, what temperature, and what the expected pass rate is.
  9. Optimize for the target model — Different models respond differently: Claude prefers XML tags and explicit role instructions; GPT-4 responds well to numbered instructions; open-source models often need more explicit format constraints. Test the same prompt on the target model even if it worked on another.
  10. Add safety and fallback instructions — Include: what to do if the input is out of scope, how to handle ambiguous inputs, whether to ask for clarification or make a best-effort attempt, and how to indicate low confidence.
  1. 精准定义任务 — 用一句话定义任务:模型必须做什么、输入是什么、输出格式是什么。模糊的任务会产生模糊的输出。示例:“从以下财报电话会议记录中提取所有公司名称及其相关收入数据,并以JSON数组形式返回。”
  2. 选择提示词技术 — 根据任务复杂度选择:
    • 零样本(Zero-shot):简单且指令清晰的任务,无需示例。
    • 少样本(Few-shot):分类、格式化或风格类任务,提供2-5个带标签的示例。
    • 思维链(Chain-of-thought, CoT):数学、逻辑、多步骤推理任务,添加“逐步思考”指令。
    • 角色提示(Role prompting):需要专业知识或特定角色的任务,比如“你是一名资深税务律师……”
    • 自一致性(Self-consistency):对同一个思维链提示词运行N次,取多数投票结果作为答案。
    • 提示词链(Prompt chaining):将复杂任务分解为连续的提示词,每个提示词的输出作为下一个的输入。
  3. 编写系统提示词 — 对于支持系统提示词的API(OpenAI、Anthropic),在此处添加持久化指令:角色、输出格式、约束、不确定时的处理方式。除非任务确实需要更多上下文,否则请控制在500token以内。
  4. 结构化用户提示词 — 使用清晰的分隔符区分指令和输入数据,比如XML标签(
    <document>
    <query>
    )、三重反引号或
    ---
    分隔符。将指令放在数据之前,而非之后。
  5. 明确指定输出格式 — 准确告知模型要使用的格式。如果是JSON,提供schema或填充好的示例;如果是Markdown,展示标题结构;如果是列表,展示项目格式。如果有常见的错误格式需要避免,可以添加负面示例。
  6. 添加少样本示例 — 对于分类或提取任务,在提示词中包含2-5个示例,格式要与真实输入输出对完全一致。选择能覆盖边缘情况且代表真实数据分布的示例。
  7. 迭代与测试 — 至少用10个有代表性的输入进行测试。跟踪:模型是否遵循格式?是否产生幻觉?推理是否正确?识别失败模式并添加指令或示例来解决问题。
  8. 版本化并记录提示词 — 将提示词保存到带变量占位符(
    {{input}}
    )的模板文件中。记录测试时使用的模型版本、温度参数以及预期通过率。
  9. 针对目标模型优化 — 不同模型的响应不同:Claude偏好XML标签和明确的角色指令;GPT-4对编号指令响应良好;开源模型通常需要更明确的格式约束。即使提示词在其他模型上有效,也要在目标模型上测试。
  10. 添加安全与回退指令 — 包含:输入超出范围时的处理方式、如何处理模糊输入、是请求澄清还是尽力尝试,以及如何表示低置信度。

Examples

示例

Example 1: Improving a Vague Prompt to Get Structured JSON Output

示例1:优化模糊提示词以获取结构化JSON输出

Input:
Original vague prompt: "Summarize this customer review and tell me if it's positive or negative."
Output:
Improved prompt (system + user):
SYSTEM:
You are a sentiment analysis assistant. You always respond with valid JSON only — no prose, no markdown fences. If you cannot determine sentiment, set "sentiment" to "unknown" and explain in "reason".

USER:
Analyze the following customer review. Return a JSON object with exactly these fields:
{
  "sentiment": "positive" | "negative" | "neutral" | "unknown",
  "confidence": 0.0–1.0,
  "key_themes": ["theme1", "theme2"],
  "summary": "one sentence summary",
  "reason": "brief explanation of sentiment classification"
}

Review:
<review>
{{customer_review_text}}
</review>
Why this works:
  • System prompt enforces JSON-only output at the model level
  • Explicit schema with types eliminates ambiguity
  • XML tag; the model from treating the review text as instructions
  • "unknown"
    fallback; hallucinated confident answers
  • confidence
    field enables downstream filtering of low-quality classifications
输入:
原始模糊提示词:“总结这条客户评价,并告诉我它是正面还是负面的。”
输出:
优化后的提示词(系统+用户):
SYSTEM:
你是一名情感分析助手。你始终只返回有效的JSON——不返回散文、不返回Markdown围栏。如果无法确定情感,将"sentiment"设为"unknown"并在"reason"中说明。

USER:
分析以下客户评价。返回一个包含以下字段的JSON对象:
{
  "sentiment": "positive" | "negative" | "neutral" | "unknown",
  "confidence": 0.0–1.0,
  "key_themes": ["theme1", "theme2"],
  "summary": "一句话总结",
  "reason": "情感分类的简要说明"
}

评价:
<review>
{{customer_review_text}}
</review>
为何有效:
  • 系统提示词在模型层面强制要求仅输出JSON
  • 带类型的明确schema消除了歧义
  • XML标签分隔符防止模型将评价文本视为指令
  • "unknown"回退避免了产生虚假的高置信度答案
  • "confidence"字段支持下游过滤低质量分类结果

Example 2: Few-Shot Chain-of-Thought Prompt for Legal Clause Classification

示例2:用于法律条款分类的少样本思维链提示词

Input:
Task: Classify contract clauses as "liability-limiting", "indemnification", "IP-assignment", or "other".
The model keeps misclassifying indemnification clauses as liability-limiting.
Output:
Few-shot CoT system prompt:
SYSTEM:
You are a contract analysis assistant specializing in clause classification. Think through each clause step by step before classifying. Respond only in the format shown in the examples.

USER:
Classify each contract clause. Think step by step, then provide your classification.

---
EXAMPLE 1:
Clause: "Neither party shall be liable for indirect, incidental, or consequential damages arising out of this agreement."
Reasoning: This clause restricts what types of damages can be claimed. It limits liability exposure — it does not require one party to protect another. This is a liability cap, not an indemnification.
Classification: liability-limiting

---
EXAMPLE 2:
Clause: "Vendor shall indemnify, defend, and hold harmless Client from any third-party claims arising from Vendor's breach of this agreement."
Reasoning: This clause requires the Vendor to actively protect the Client from external claims. It creates an obligation to defend and compensate — this is indemnification, not merely limiting damages.
Classification: indemnification

---
EXAMPLE 3:
Clause: "All work product created by Contractor under this agreement is assigned to Client upon creation."
Reasoning: This clause transfers ownership of created works. It concerns who owns intellectual property, not liability or indemnification.
Classification: IP-assignment

---
Now classify this clause:
Clause: "{{clause_text}}"
Reasoning:
Why this works:
  • Chain-of-thought examples show the model the reasoning pattern to distinguish similar categories
  • The two easily-confused categories (liability vs indemnification) each get an explicit contrasting example
  • Format template forces reasoning before classification, reducing snap-judgment errors
  • Reasoning:
    at the end primes the model to complete the reasoning before the answer
输入:
任务:将合同条款分类为"liability-limiting(限制责任)"、"indemnification(赔偿保障)"、"IP-assignment(知识产权转让)"或"other(其他)"。
模型一直将赔偿保障条款误分类为限制责任。
输出:
少样本思维链系统提示词:
SYSTEM:
你是一名专注于条款分类的合同分析助手。在分类前请逐步思考每个条款。仅按照示例所示格式回复。

USER:
对每个合同条款进行分类。先逐步思考,再给出分类结果。

---
示例1:
条款:“任何一方均不对因本协议产生的间接、附带或 consequential damages( consequential损失)承担责任。”
推理:本条款限制了可主张的损失类型,它限制了责任风险——不要求一方保护另一方。这是责任上限,而非赔偿保障。
分类:liability-limiting

---
示例2:
条款:“供应商应赔偿、辩护并使客户免受因供应商违反本协议而产生的任何第三方索赔。”
推理:本条款要求供应商主动保护客户免受外部索赔,它设定了辩护和赔偿的义务——这是赔偿保障,而非仅仅限制损失。
分类:indemnification

---
示例3:
条款:“承包商根据本协议创作的所有工作成果在创作完成后即转让给客户。”
推理:本条款转让了创作成果的所有权,涉及知识产权归属,与责任或赔偿保障无关。
分类:IP-assignment

---
现在对以下条款进行分类:
条款:"{{clause_text}}"
推理:
为何有效:
  • 思维链示例向模型展示了区分相似类别的推理模式
  • 两个易混淆的类别(责任限制vs赔偿保障)各有一个明确的对比示例
  • 格式模板强制要求先推理再分类,减少了仓促判断的错误
  • 末尾的“推理:”引导模型在给出答案前先完成推理过程

Best Practices

最佳实践

  • Always test prompts on at least 10 real inputs before declaring them production-ready
  • Version prompts in code just like application code — breaking changes in prompts are real bugs
  • Use the lowest effective temperature: 0.0 for deterministic extraction, 0.7–1.0 for creative tasks
  • Prefer XML tags over triple backticks as delimiters — they're less likely to appear in real input
  • Put the most important instructions at the beginning AND end of the prompt (primacy + recency effect)
  • When using few-shot examples, ensure they cover the edge cases you care about most
  • Keep system prompts focused — every sentence should earn its token budget
  • Use "You must" and "Always" for hard constraints; use "prefer" or "try to" for soft preferences
  • 在宣布提示词可用于生产前,始终用至少10个真实输入进行测试
  • 像对待应用代码一样对提示词进行版本控制——提示词的破坏性变更属于真实bug
  • 使用最低的有效温度参数:0.0用于确定性提取,0.7-1.0用于创意任务
  • 优先使用XML标签而非三重反引号作为分隔符——它们在真实输入中出现的概率更低
  • 将最重要的指令放在提示词的开头和结尾(首因+近因效应)
  • 使用少样本示例时,确保覆盖你最关心的边缘情况
  • 保持系统提示词聚焦——每一句话都要值得消耗token
  • 对硬性约束使用“必须”和“始终”;对软性偏好使用“偏好”或“尽量”

Common Mistakes

常见错误

  • Giving vague instructions like "be helpful and accurate" without specifying what accuracy means
  • Not specifying what to do when the model is uncertain — it will hallucinate a confident answer
  • Using too many few-shot examples (>10) which can cause the model to pattern-match instead of reason
  • Forgetting to test the prompt on the actual target model — prompts are not portable across models
  • Putting instructions after the input data — models weight early context more heavily
  • Asking multiple distinct tasks in one prompt — split into separate prompts or clearly numbered steps
  • Assuming the same prompt works at different temperatures — always co-tune prompt and temperature
  • 给出模糊的指令,比如“要有用且准确”,却不定义准确性的标准
  • 未指定模型不确定时的处理方式——它会产生虚假的高置信度答案
  • 使用过多的少样本示例(>10个)——这会导致模型模式匹配而非推理
  • 忘记在实际目标模型上测试提示词——提示词无法在不同模型间通用
  • 将指令放在输入数据之后——模型更重视早期上下文
  • 在一个提示词中要求多个不同的任务——拆分为单独的提示词或明确编号的步骤
  • 假设同一个提示词在不同温度参数下都有效——始终要同时调整提示词和温度参数

Tips & Tricks

技巧与窍门

  • Add "Do not add any text before or after the JSON" to enforce clean parseable JSON output
  • Use "If you are unsure, say 'I don't know' rather than guessing" to reduce hallucination
  • For long documents, use "Here is the most relevant section:" before the content to focus attention
  • Chain-of-thought is most effective in the middle of a response — put format instructions last
  • "Let's think step by step" reliably improves math and logic; for simpler tasks it wastes tokens
  • Test prompt robustness by deliberately injecting adversarial inputs (e.g., "ignore previous instructions")
  • Use Anthropic's Constitutional AI principles or OpenAI's system prompt best practices as references
  • 添加“不要在JSON前后添加任何文本”以确保输出可干净解析
  • 使用“如果不确定,请说‘我不知道’而非猜测”来减少幻觉输出
  • 对于长文档,在内容前添加“以下是最相关的部分:”来引导模型注意力
  • 思维链在响应中间时效果最佳——将格式指令放在最后
  • “让我们逐步思考”能可靠地提升数学和逻辑任务的表现;对于简单任务则会浪费token
  • 通过故意注入对抗性输入(比如“忽略之前的指令”)来测试提示词的鲁棒性
  • 参考Anthropic的Constitutional AI原则或OpenAI的系统提示词最佳实践

Related Skills

相关技能

  • dataset-curator
  • eval-designer
  • model-comparator
  • dataset-curator
  • eval-designer
  • model-comparator