senior-prompt-engineer
Senior Prompt Engineer
Overview
Design, test, and optimize prompts for large language models. This skill covers systematic prompt engineering including few-shot example design, chain-of-thought reasoning, system prompt architecture, structured output specification, parameter tuning, evaluation methodology, A/B testing, and prompt version management.
Announce at start: "I'm using the senior-prompt-engineer skill for prompt design and optimization."
Phase 1: Requirements
Goal: Define the task objective, quality criteria, and constraints before writing any prompt.
Actions
- Define the task objective clearly
- Identify input/output format requirements
- Determine quality criteria (accuracy, tone, format)
- Assess edge cases and failure modes
- Choose model and parameter constraints
STOP — Do NOT proceed to Phase 2 until:
- Task objective is stated in one sentence
- Input format and output format are defined
- Quality criteria are measurable
- Edge cases are listed
- Model selection is justified
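The Phase 1 gate can be checked mechanically. The sketch below is one possible way to record the requirements and list which gate conditions are still open. The `PromptRequirements` class, its field names, and the punctuation-based "one sentence" heuristic are assumptions made for illustration, not part of the skill.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRequirements:
    """Phase 1 outputs. All field names are illustrative."""
    objective: str = ""
    input_format: str = ""
    output_format: str = ""
    quality_criteria: dict[str, float] = field(default_factory=dict)  # metric -> threshold
    edge_cases: list[str] = field(default_factory=list)
    model: str = ""
    model_rationale: str = ""


def unmet_gates(req: PromptRequirements) -> list[str]:
    """Return the Phase 1 gate conditions that are not yet satisfied."""
    problems = []
    # Rough proxy for "one sentence": non-empty, at most one terminator.
    text = req.objective.strip()
    if not text or sum(text.count(ch) for ch in ".!?") > 1:
        problems.append("objective must be a single sentence")
    if not (req.input_format and req.output_format):
        problems.append("input and output formats must be defined")
    if not req.quality_criteria:
        problems.append("quality criteria need numeric thresholds")
    if not req.edge_cases:
        problems.append("edge cases must be listed")
    if not (req.model and req.model_rationale):
        problems.append("model selection must be justified")
    return problems
```

An empty list from `unmet_gates` means the checklist is satisfied on paper; whether the criteria are actually meaningful still needs human review.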
Phase 2: Prompt Design
Goal: Draft the prompt with proper architecture, examples, and constraints.
Actions
- Draft system prompt with role, constraints, and format
- Design few-shot examples (3-5 representative cases)
- Add chain-of-thought scaffolding if reasoning is needed
- Specify output structure (JSON, markdown, etc.)
- Add error handling instructions
Prompt Architecture Layers
| Layer | Purpose | Example |
|---|---|---|
| 1. Identity | Who the model is | "You are a sentiment classifier..." |
| 2. Context | What it knows/has access to | "You have access to product reviews..." |
| 3. Task | What to do | "Classify each review as positive/negative/neutral" |
| 4. Constraints | What NOT to do | "Never include PII in output" |
| 5. Format | How to structure output | "Respond in JSON: {classification, confidence}" |
| 6. Examples | Demonstrations | 3-5 representative input/output pairs |
| 7. Metacognition | Handling uncertainty | "If uncertain, classify as neutral and explain" |
System Prompt Template
```
[Role] You are a [specific role] that [specific capability].
[Context] You have access to [tools/knowledge]. The user will provide [input type].
[Instructions]
1. First, [step 1]
2. Then, [step 2]
3. Finally, [step 3]
[Constraints]
- Always [requirement]
- Never [prohibition]
- If uncertain, [fallback behavior]
[Output Format]
Respond in the following format:
[format specification]
[Examples]
<example>
Input: [sample input]
Output: [sample output]
</example>
```
STOP — Do NOT proceed to Phase 3 until:
- All 7 layers are addressed (or intentionally omitted with rationale)
- Examples are representative and diverse
- Output format is unambiguous
- Constraints are specific (not vague)
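The layered architecture and the system prompt template from Phase 2 can also be assembled in code, which keeps each layer explicit and easy to vary in later A/B tests. The following is a minimal sketch; `build_system_prompt` and its parameter names are hypothetical, and the sample values reuse the sentiment classifier from the architecture table.

```python
def build_system_prompt(role, context, steps, constraints, output_format, examples, fallback):
    """Assemble a system prompt string from the architecture layers."""
    lines = [f"[Role] {role}", f"[Context] {context}", "[Instructions]"]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    lines.append("[Constraints]")
    lines += [f"- {c}" for c in constraints]
    lines.append(f"- If uncertain, {fallback}")  # metacognition layer
    lines += ["[Output Format]", output_format, "[Examples]"]
    for sample_input, sample_output in examples:
        lines += ["<example>", f"Input: {sample_input}", f"Output: {sample_output}", "</example>"]
    return "\n".join(lines)


prompt = build_system_prompt(
    role="You are a sentiment classifier for product reviews.",
    context="The user will provide one review per message.",
    steps=["Read the review", "Classify it as positive, negative, or neutral"],
    constraints=["Always respond in JSON", "Never include PII in output"],
    output_format='{"classification": "...", "confidence": 0.0}',
    examples=[("Great phone, battery lasts all day.",
               '{"classification": "positive", "confidence": 0.95}')],
    fallback="classify as neutral and explain",
)
print(prompt)
```

Keeping layers as separate arguments makes it straightforward to omit one intentionally and record the rationale, as the Phase 2 gate requires.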
Phase 3: Evaluation and Iteration
Goal: Measure prompt quality and iterate toward targets.
Actions
- Create evaluation dataset (at least 50 examples)
- Define scoring rubric (automated + human metrics)
- Run baseline evaluation
- Iterate on prompt with targeted improvements
- A/B test promising variants
- Version and document the final prompt
STOP — Evaluation complete when:
- Evaluation dataset covers all input categories
- Metrics meet defined quality thresholds
- A/B test shows statistical significance (p < 0.05)
- Final prompt is versioned with metrics
Few-Shot Example Design
Selection Criteria Decision Table
| Criterion | Explanation | Example |
|---|---|---|
| Representative | Cover the most common input types | Include typical emails, not just edge cases |
| Diverse | Include edge cases and boundaries | Short + long, positive + negative |
| Ordered | Simple to complex progression | Obvious case first, ambiguous last |
| Balanced | Equal representation of categories | Not 4 positive and 1 negative |
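Two of the criteria above, Balanced and Ordered, are easy to enforce in code. The sketch below picks examples round-robin across labels and then sorts them from obvious to ambiguous. The `difficulty` score and the dict keys are assumptions for this illustration; representativeness and diversity still need a human choosing what goes into the pool.

```python
from collections import defaultdict
from itertools import zip_longest


def select_few_shot(pool, k):
    """Pick up to k examples, balanced across labels, ordered simple to complex.

    Each pool item is a dict with "input", "output", "label", and a
    hypothetical numeric "difficulty" (lower means more obvious).
    """
    by_label = defaultdict(list)
    for ex in pool:
        by_label[ex["label"]].append(ex)
    for group in by_label.values():
        group.sort(key=lambda ex: ex["difficulty"])
    # Round-robin across labels so no single category dominates.
    picked = []
    for round_ in zip_longest(*by_label.values()):
        for ex in round_:
            if ex is not None and len(picked) < k:
                picked.append(ex)
    # Present obvious cases first, ambiguous ones last.
    return sorted(picked, key=lambda ex: ex["difficulty"])
```

With a pool of four positive, one negative, and one neutral example and `k=3`, this returns one example per label rather than three positives.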
Example Count Guidelines
| Task Complexity | Examples Needed |
|---|---|
| Simple classification | 2-3 |
| Moderate generation | 3-5 |
| Complex reasoning | 5-8 |
| Format-sensitive | 3-5 (focus on format consistency) |
Example Format
```
<example>
<input>
[Representative input]
</input>
<reasoning>
[Optional: show the thinking process]
</reasoning>
<output>
[Expected output in exact target format]
</output>
</example>
```
Chain-of-Thought Patterns
CoT Pattern Decision Table
| Pattern | Use When | Example |
|---|---|---|
| Standard CoT | Multi-step reasoning | "Think step by step: 1. Identify... 2. Analyze..." |
| Structured CoT | Need parseable reasoning | Reasoning wrapped in XML tags |
| Self-Consistency | High-stakes decisions | Generate 3 solutions, pick most common |
| No CoT | Simple factual lookups, format conversion | Skip reasoning overhead |
When to Use CoT
| Task Type | Use CoT? | Rationale |
|---|---|---|
| Mathematical reasoning | Yes | Step-by-step prevents errors |
| Multi-step logic | Yes | Makes reasoning transparent |
| Classification with justification | Yes | Improves accuracy and explainability |
| Simple factual lookup | No | Adds latency without accuracy gain |
| Direct format conversion | No | No reasoning needed |
| Very short responses | No | CoT overhead exceeds benefit |
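The Self-Consistency pattern from the decision table amounts to sampling several answers and taking a majority vote. A minimal sketch follows, assuming a caller-supplied `sample` function (hypothetical) that runs the prompt once at a non-zero temperature and returns only the final answer with the reasoning already stripped.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample: Callable[[], str], n: int = 3) -> tuple[str, float]:
    """Call `sample` n times (n >= 1) and return (majority answer, agreement rate).

    Answers are normalized by trimming whitespace and lowercasing so that
    trivially different strings count as the same vote.
    """
    answers = [sample().strip().lower() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n
```

A low agreement rate is itself a useful signal: it can be logged or used to route the input to human review instead of trusting the majority answer.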
Structured Output
Output Format Decision Table
| Format | Use When | Parsing |
|---|---|---|
| JSON | Machine-consumed output | JSON parser |
| Markdown | Human-readable structured text | Regex or markdown parser |
| XML tags | Sections need clear boundaries | XML parser or regex |
| YAML | Configuration-like output | YAML parser |
| Plain text | Simple, unstructured response | No parsing needed |
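For machine-consumed JSON, parsing alone is rarely enough; the fields usually need checking as well. The sketch below uses Python's standard `json` module and assumes the sentiment shape used elsewhere in this skill (`classification` and `confidence`). The function name and error messages are illustrative.

```python
import json

ALLOWED_LABELS = ("positive", "negative", "neutral")


def parse_classification(raw: str) -> dict:
    """Parse a model response and check it against the expected fields.

    Raises ValueError with a specific message so failures can be logged
    or fed back into a retry.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("classification") not in ALLOWED_LABELS:
        raise ValueError("classification must be positive, negative, or neutral")
    conf = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise ValueError("confidence must be a number between 0 and 1")
    return data
```

Counting how often this function raises over an evaluation set gives a simple format-compliance metric.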
JSON Output Example
```
Respond with a JSON object matching this schema:
{
  "classification": "positive" | "negative" | "neutral",
  "confidence": number between 0 and 1,
  "reasoning": "brief explanation",
  "key_phrases": ["array", "of", "phrases"]
}
Do not include any text outside the JSON object.
```
Temperature and Top-P Tuning
| Use Case | Temperature | Top-P | Rationale |
|---|---|---|---|
| Code generation | 0.0-0.2 | 0.9 | Deterministic, correct |
| Classification | 0.0 | 1.0 | Consistent results |
| Creative writing | 0.7-1.0 | 0.95 | Diverse, interesting |
| Summarization | 0.2-0.4 | 0.9 | Faithful but fluent |
| Brainstorming | 0.8-1.2 | 0.95 | Maximum diversity |
| Data extraction | 0.0 | 0.9 | Precise, reliable |
Rules
- Temperature 0 for tasks requiring consistency and correctness
- Higher temperature for creative tasks
- Top-P rarely needs tuning (keep at 0.9-1.0)
- Do not use both high temperature AND low top-p (contradictory)
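The tuning table and rules above can be kept as data next to the prompt, with a small check for the contradictory combination. This is a sketch under stated assumptions: the preset keys are made-up names, ranges from the table are reduced to their lower bound, and the 0.7 and 0.9 cut-offs for "high temperature" and "low top-p" are illustrative rather than fixed limits.

```python
# Starting points taken from the table: use case -> (temperature, top_p).
# Where the table gives a range, the lower bound is used.
SAMPLING_PRESETS = {
    "code_generation": (0.0, 0.9),
    "classification": (0.0, 1.0),
    "creative_writing": (0.7, 0.95),
    "summarization": (0.2, 0.9),
    "brainstorming": (0.8, 0.95),
    "data_extraction": (0.0, 0.9),
}


def sampling_warnings(temperature: float, top_p: float) -> list[str]:
    """Flag settings that conflict with the tuning rules."""
    warnings = []
    if not 0.0 <= top_p <= 1.0:
        warnings.append("top_p must be between 0 and 1")
    if temperature < 0.0:
        warnings.append("temperature must not be negative")
    # Assumed thresholds: >= 0.7 counts as high temperature, < 0.9 as low top_p.
    if temperature >= 0.7 and top_p < 0.9:
        warnings.append("high temperature with low top_p work against each other")
    return warnings
```

The upper bound on temperature differs between providers, so it is deliberately not checked here; consult the target API's documentation before using values above 1.0.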
Evaluation Metrics
Automated Metrics
| Metric | Measures | Use For |
|---|---|---|
| Exact Match | Output equals expected | Classification, extraction |
| F1 Score | Harmonic mean of precision and recall | Multi-label tasks |
| BLEU/ROUGE | N-gram overlap | Summarization, translation |
| JSON validity | Parseable structured output | Structured generation |
| Regex match | Output matches pattern | Format compliance |
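Three of the automated metrics in the table need nothing beyond the standard library. The functions below are a minimal sketch with illustrative names; they assume non-empty input lists and, for F1, a single label treated as the positive class.

```python
import json


def exact_match(predictions, expected):
    """Fraction of outputs that equal the expected string after trimming."""
    hits = sum(p.strip() == e.strip() for p, e in zip(predictions, expected))
    return hits / len(expected)


def f1_score(predictions, expected, positive_label):
    """F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(predictions, expected))
    tp = sum(p == positive_label and e == positive_label for p, e in pairs)
    fp = sum(p == positive_label and e != positive_label for p, e in pairs)
    fn = sum(p != positive_label and e == positive_label for p, e in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def json_validity(outputs):
    """Fraction of outputs that parse as JSON."""
    def parses(text):
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    return sum(parses(o) for o in outputs) / len(outputs)
```

For multi-label tasks, `f1_score` can be computed per label and averaged; whether to use a macro or weighted average depends on how imbalanced the categories are.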
Human Evaluation Dimensions
| Dimension | Scale | Description |
|---|---|---|
| Accuracy | 1-5 | Factual correctness |
| Relevance | 1-5 | Addresses the actual question |
| Coherence | 1-5 | Logical flow and structure |
| Completeness | 1-5 | Covers all required aspects |
| Tone | 1-5 | Matches desired voice |
| Conciseness | 1-5 | No unnecessary content |
Evaluation Dataset Requirements
- At least 50 examples so metric estimates are reasonably stable; detecting small differences between prompts needs more
- Cover all input categories proportionally
- Include edge cases (10-20% of dataset)
- Gold labels reviewed by 2+ evaluators
- Version-controlled alongside prompts
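The size, coverage, and edge-case requirements above can be checked automatically before a dataset is used. A sketch follows, assuming each example is a dict with a `category` string and an `is_edge_case` flag; both key names are hypothetical. Gold-label review and version control remain process steps outside this check.

```python
from collections import Counter


def dataset_issues(examples, required_categories, min_size=50, edge_share=(0.10, 0.20)):
    """Check an evaluation set against the size, coverage, and edge-case rules."""
    issues = []
    if len(examples) < min_size:
        issues.append(f"only {len(examples)} examples, need at least {min_size}")
    counts = Counter(ex["category"] for ex in examples)
    missing = sorted(set(required_categories) - set(counts))
    if missing:
        issues.append(f"no examples for categories: {', '.join(missing)}")
    if examples:
        share = sum(bool(ex["is_edge_case"]) for ex in examples) / len(examples)
        low, high = edge_share
        if not low <= share <= high:
            issues.append(f"edge cases are {share:.0%} of the set, target is {low:.0%}-{high:.0%}")
    return issues
```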
A/B Testing Process
- Define hypothesis: "Prompt B will improve [metric] by [amount]"
- Hold all variables constant except the prompt change
- Run both variants on the same evaluation set
- Calculate metric differences with confidence intervals
- Require statistical significance (p < 0.05) before adopting
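Because both variants run on the same evaluation set, the results are paired, and a paired test is a natural fit. The sketch below is one possible approach, not the only valid one: an exact McNemar test for the p-value and a percentile bootstrap over example pairs for the confidence interval. It assumes each variant's result is a list of per-example pass/fail booleans in the same order; the function names are illustrative.

```python
import random
from math import comb


def mcnemar_exact_p(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired pass/fail results."""
    pairs = list(zip(correct_a, correct_b))
    only_a = sum(bool(a) and not b for a, b in pairs)  # A right, B wrong
    only_b = sum(bool(b) and not a for a, b in pairs)  # B right, A wrong
    n = only_a + only_b
    if n == 0:
        return 1.0  # the variants never disagree
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bootstrap_diff_ci(correct_a, correct_b, rounds=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(B) - accuracy(A), resampling pairs."""
    rng = random.Random(seed)
    pairs = [(int(bool(a)), int(bool(b))) for a, b in zip(correct_a, correct_b)]
    diffs = []
    for _ in range(rounds):
        sample = rng.choices(pairs, k=len(pairs))
        diffs.append(sum(b - a for a, b in sample) / len(sample))
    diffs.sort()
    low = diffs[int(rounds * alpha / 2)]
    high = diffs[int(rounds * (1 - alpha / 2)) - 1]
    return low, high
```

Only the examples where the two prompts disagree carry information for the McNemar test, which is why a 50-example set can fail to reach p < 0.05 even when one variant looks clearly better.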
What to A/B Test
| Variable | Expected Impact |
|---|---|
| Instruction phrasing (imperative vs descriptive) | Moderate |
| Number of few-shot examples | Moderate |
| Example ordering | Low-moderate |
| CoT presence/absence | High for reasoning tasks |
| Output format specification | High for structured output |
| Constraint placement (beginning vs end) | Low |
Prompt Versioning
Version File Format
```yaml
id: classify-sentiment
version: "2.1"  # quoted so YAML keeps it a string (2.10 would otherwise load as 2.1)
model: claude-sonnet-4-20250514
temperature: 0.0
created: 2025-03-01
author: team
changelog: "Added edge case examples for sarcasm detection"
metrics:
  accuracy: 0.94
  f1: 0.92
eval_dataset: sentiment-eval-v3
system_prompt: |
  You are a sentiment classifier...
examples:
  - input: "..."
    output: "..."
```
Versioning Rules
- Semantic-style versioning: major.minor (major = behavior change, minor = refinement)
- Every version includes evaluation metrics
- Link to evaluation dataset version
- Document what changed and why
- Keep previous versions for rollback
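The versioning rules can be enforced with two small helpers: one that applies the major.minor bump rule and one that rejects a version record with missing fields. This is a sketch; the function names and the choice of required fields are assumptions based on the version file format shown earlier, and the record is assumed to be already loaded into a dict.

```python
REQUIRED_FIELDS = ("id", "version", "model", "changelog",
                   "metrics", "eval_dataset", "system_prompt")


def bump_version(version: str, behavior_change: bool) -> str:
    """Apply the major.minor rule: major for behavior changes, minor for refinements."""
    major, minor = (int(part) for part in version.split("."))
    return f"{major + 1}.0" if behavior_change else f"{major}.{minor + 1}"


def missing_fields(record: dict) -> list[str]:
    """List required fields that are absent or empty in a version record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]
```

`bump_version` expects the version as a string. In YAML, an unquoted `2.10` is read as the number 2.1, so storing the version as a quoted string avoids silently merging distinct versions.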
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Is Wrong | Correct Approach |
|---|---|---|
| Vague instructions ("be helpful") | Unreliable, inconsistent output | Specific instructions with examples |
| Contradictory constraints | Model cannot satisfy both | Review for consistency |
| Examples that do not match task | Confuses the model | Examples must reflect real use |
| Over-engineering simple tasks | Wasted tokens, slower | Match prompt complexity to task complexity |
| No evaluation framework | Guessing at quality | Define metrics before iterating |
| Optimizing for single example | Overfitting to one case | Optimize for the distribution |
| Assuming cross-model portability | Different models need different prompts | Test on target model |
| Skipping version control | Cannot rollback or compare | Version every prompt with metrics |
Integration Points
| Skill | Relationship |
|---|---|
| | LLM-as-judge evaluates prompt output quality |
| | Prompt evaluation datasets serve as acceptance tests |
| | Prompt testing follows the evaluation methodology |
| | Statistical testing validates A/B results |
| | Prompt changes reviewed like code changes |
| | Prompt readability follows clean code naming principles |
Skill Type
FLEXIBLE — Adapt prompting techniques to the specific model, task, and quality requirements. The evaluation and versioning practices are strongly recommended but can be scaled to project size. Always version production prompts.