senior-prompt-engineer


Senior Prompt Engineer


Overview


Design, test, and optimize prompts for large language models. This skill covers systematic prompt engineering including few-shot example design, chain-of-thought reasoning, system prompt architecture, structured output specification, parameter tuning, evaluation methodology, A/B testing, and prompt version management.
Announce at start: "I'm using the senior-prompt-engineer skill for prompt design and optimization."


Phase 1: Requirements


Goal: Define the task objective, quality criteria, and constraints before writing any prompt.

Actions


  1. Define the task objective clearly
  2. Identify input/output format requirements
  3. Determine quality criteria (accuracy, tone, format)
  4. Assess edge cases and failure modes
  5. Choose model and parameter constraints

STOP — Do NOT proceed to Phase 2 until:


  • Task objective is stated in one sentence
  • Input format and output format are defined
  • Quality criteria are measurable
  • Edge cases are listed
  • Model selection is justified

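The gate above can be made explicit in tooling by capturing the checklist as a lightweight spec object; a minimal Python sketch (field and method names are illustrative, not part of the skill):

```python
from dataclasses import dataclass, field

@dataclass
class PromptSpec:
    """Phase 1 requirements gate (illustrative names)."""
    objective: str = ""        # one-sentence task objective
    input_format: str = ""
    output_format: str = ""
    quality_criteria: list = field(default_factory=list)  # measurable criteria
    edge_cases: list = field(default_factory=list)
    model_rationale: str = ""  # why this model was chosen

    def ready_for_phase_2(self) -> bool:
        # All five STOP conditions must hold before prompt design begins.
        return all([
            self.objective,
            self.input_format and self.output_format,
            self.quality_criteria,
            self.edge_cases,
            self.model_rationale,
        ])
```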

Phase 2: Prompt Design


Goal: Draft the prompt with proper architecture, examples, and constraints.

Actions


  1. Draft system prompt with role, constraints, and format
  2. Design few-shot examples (3-5 representative cases)
  3. Add chain-of-thought scaffolding if reasoning is needed
  4. Specify output structure (JSON, markdown, etc.)
  5. Add error handling instructions

Prompt Architecture Layers


| Layer | Purpose | Example |
|---|---|---|
| 1. Identity | Who the model is | "You are a sentiment classifier..." |
| 2. Context | What it knows/has access to | "You have access to product reviews..." |
| 3. Task | What to do | "Classify each review as positive/negative/neutral" |
| 4. Constraints | What NOT to do | "Never include PII in output" |
| 5. Format | How to structure output | "Respond in JSON: {classification, confidence}" |
| 6. Examples | Demonstrations | 3-5 representative input/output pairs |
| 7. Metacognition | Handling uncertainty | "If uncertain, classify as neutral and explain" |

System Prompt Template


```
[Role] You are a [specific role] that [specific capability].

[Context] You have access to [tools/knowledge]. The user will provide [input type].

[Instructions]
1. First, [step 1]
2. Then, [step 2]
3. Finally, [step 3]

[Constraints]
- Always [requirement]
- Never [prohibition]
- If uncertain, [fallback behavior]

[Output Format]
Respond in the following format:
[format specification]

[Examples]
<example>
Input: [sample input]
Output: [sample output]
</example>
```
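Filling this template programmatically keeps production prompts consistent across versions; a minimal Python sketch (function name and argument shapes are illustrative):

```python
def build_system_prompt(role, context, instructions, constraints,
                        output_format, examples, fallback):
    """Assemble a system prompt from the template layers above.

    instructions/constraints are lists of strings; examples is a list
    of (input, output) pairs.
    """
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(instructions, 1))
    rules = "\n".join(f"- {c}" for c in constraints)
    shots = "\n".join(
        f"<example>\nInput: {inp}\nOutput: {out}\n</example>"
        for inp, out in examples
    )
    return (
        f"[Role] {role}\n\n"
        f"[Context] {context}\n\n"
        f"[Instructions]\n{steps}\n\n"
        f"[Constraints]\n{rules}\n- If uncertain, {fallback}\n\n"
        f"[Output Format]\n{output_format}\n\n"
        f"[Examples]\n{shots}"
    )
```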

STOP — Do NOT proceed to Phase 3 until:


  • All 7 layers are addressed (or intentionally omitted with rationale)
  • Examples are representative and diverse
  • Output format is unambiguous
  • Constraints are specific (not vague)


Phase 3: Evaluation and Iteration


Goal: Measure prompt quality and iterate toward targets.

Actions


  1. Create evaluation dataset (50+ examples minimum)
  2. Define scoring rubric (automated + human metrics)
  3. Run baseline evaluation
  4. Iterate on prompt with targeted improvements
  5. A/B test promising variants
  6. Version and document the final prompt

STOP — Evaluation complete when:


  • Evaluation dataset covers all input categories
  • Metrics meet defined quality thresholds
  • A/B test shows statistical significance (p < 0.05)
  • Final prompt is versioned with metrics


Few-Shot Example Design


Selection Criteria Decision Table


| Criterion | Explanation | Example |
|---|---|---|
| Representative | Cover most common input types | Include typical emails, not just edge cases |
| Diverse | Include edge cases and boundaries | Short + long, positive + negative |
| Ordered | Simple to complex progression | Obvious case first, ambiguous last |
| Balanced | Equal representation of categories | Not 4 positive and 1 negative |

Example Count Guidelines


| Task Complexity | Examples Needed |
|---|---|
| Simple classification | 2-3 |
| Moderate generation | 3-5 |
| Complex reasoning | 5-8 |
| Format-sensitive | 3-5 (focus on format consistency) |

Example Format


```
<example>
<input>
[Representative input]
</input>
<reasoning>
[Optional: show the thinking process]
</reasoning>
<output>
[Expected output in exact target format]
</output>
</example>
```


Chain-of-Thought Patterns


CoT Pattern Decision Table


| Pattern | Use When | Example |
|---|---|---|
| Standard CoT | Multi-step reasoning | "Think step by step: 1. Identify... 2. Analyze..." |
| Structured CoT | Need parseable reasoning | XML tags: `<analysis>...</analysis>` then `<answer>...</answer>` |
| Self-Consistency | High-stakes decisions | Generate 3 solutions, pick most common |
| No CoT | Simple factual lookups, format conversion | Skip reasoning overhead |
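The Self-Consistency pattern in the table above can be sketched in a few lines; `sample_fn` stands in for a single model call (no specific SDK is assumed):

```python
from collections import Counter

def self_consistency(sample_fn, n: int = 3) -> str:
    """Self-Consistency CoT: sample n independent answers at non-zero
    temperature and return the most common one.

    `sample_fn` is any zero-argument callable that queries the model
    once and returns its final answer as a string.
    """
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```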

When to Use CoT


| Task Type | Use CoT? | Rationale |
|---|---|---|
| Mathematical reasoning | Yes | Step-by-step prevents errors |
| Multi-step logic | Yes | Makes reasoning transparent |
| Classification with justification | Yes | Improves accuracy and explainability |
| Simple factual lookup | No | Adds latency without accuracy gain |
| Direct format conversion | No | No reasoning needed |
| Very short responses | No | CoT overhead exceeds benefit |


Structured Output


Output Format Decision Table


| Format | Use When | Parsing |
|---|---|---|
| JSON | Machine-consumed output | `JSON.parse()` |
| Markdown | Human-readable structured text | Regex or markdown parser |
| XML tags | Sections need clear boundaries | XML parser or regex |
| YAML | Configuration-like output | YAML parser |
| Plain text | Simple, unstructured response | No parsing needed |

JSON Output Example


```
Respond with a JSON object matching this schema:
{
  "classification": "positive" | "negative" | "neutral",
  "confidence": number between 0 and 1,
  "reasoning": "brief explanation",
  "key_phrases": ["array", "of", "phrases"]
}

Do not include any text outside the JSON object.
```

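On the consuming side, a strict parser makes any format drift fail loudly instead of silently; a Python sketch for the schema above (the specific field checks are illustrative):

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}

def parse_classification(raw: str) -> dict:
    """Validate a model response against the JSON schema above.

    Raises ValueError (or json.JSONDecodeError) on any deviation,
    including extra text outside the JSON object.
    """
    data = json.loads(raw)  # fails if text surrounds the JSON object
    if data.get("classification") not in ALLOWED_LABELS:
        raise ValueError("classification outside allowed labels")
    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise ValueError("confidence must be a number in [0, 1]")
    if not isinstance(data.get("key_phrases"), list):
        raise ValueError("key_phrases must be an array")
    return data
```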

Temperature and Top-P Tuning


| Use Case | Temperature | Top-P | Rationale |
|---|---|---|---|
| Code generation | 0.0-0.2 | 0.9 | Deterministic, correct |
| Classification | 0.0 | 1.0 | Consistent results |
| Creative writing | 0.7-1.0 | 0.95 | Diverse, interesting |
| Summarization | 0.2-0.4 | 0.9 | Faithful but fluent |
| Brainstorming | 0.8-1.2 | 0.95 | Maximum diversity |
| Data extraction | 0.0 | 0.9 | Precise, reliable |

Rules


  • Temperature 0 for tasks requiring consistency and correctness
  • Higher temperature for creative tasks
  • Top-P rarely needs tuning (keep at 0.9-1.0)
  • Do not use both high temperature AND low top-p (contradictory)

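These presets can live in code next to the prompt so parameter choices are versioned too; a sketch with the table's ranges collapsed to single illustrative values:

```python
# Sampling presets from the tuning table above; ranged values are
# collapsed to one illustrative midpoint per use case.
PRESETS = {
    "code_generation":  {"temperature": 0.1, "top_p": 0.9},
    "classification":   {"temperature": 0.0, "top_p": 1.0},
    "creative_writing": {"temperature": 0.9, "top_p": 0.95},
    "summarization":    {"temperature": 0.3, "top_p": 0.9},
    "brainstorming":    {"temperature": 1.0, "top_p": 0.95},
    "data_extraction":  {"temperature": 0.0, "top_p": 0.9},
}

def sampling_params(use_case: str) -> dict:
    """Look up sampling parameters, defaulting to deterministic settings
    for unknown use cases (consistency-first fallback)."""
    return PRESETS.get(use_case, {"temperature": 0.0, "top_p": 1.0})
```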

Evaluation Metrics


Automated Metrics


| Metric | Measures | Use For |
|---|---|---|
| Exact Match | Output equals expected | Classification, extraction |
| F1 Score | Precision + recall balance | Multi-label tasks |
| BLEU/ROUGE | N-gram overlap | Summarization, translation |
| JSON validity | Parseable structured output | Structured generation |
| Regex match | Output matches pattern | Format compliance |
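Exact match and per-class F1 are simple enough to implement inline rather than pulling in a metrics library; a dependency-free Python sketch:

```python
def exact_match(preds, golds) -> float:
    """Fraction of predictions that equal the gold label exactly."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def f1(preds, golds, positive) -> float:
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```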

Human Evaluation Dimensions


| Dimension | Scale | Description |
|---|---|---|
| Accuracy | 1-5 | Factual correctness |
| Relevance | 1-5 | Addresses the actual question |
| Coherence | 1-5 | Logical flow and structure |
| Completeness | 1-5 | Covers all required aspects |
| Tone | 1-5 | Matches desired voice |
| Conciseness | 1-5 | No unnecessary content |

Evaluation Dataset Requirements


  • Minimum 50 examples for statistical significance
  • Cover all input categories proportionally
  • Include edge cases (10-20% of dataset)
  • Gold labels reviewed by 2+ evaluators
  • Version-controlled alongside prompts

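The size, edge-case, and coverage requirements can be enforced mechanically before an eval run; a sketch assuming each example carries `category` and `is_edge_case` fields (illustrative names):

```python
from collections import Counter

def check_dataset(examples) -> list:
    """Check the dataset requirements above on a list of dicts with
    'category' and 'is_edge_case' keys. Returns violations (empty = pass)."""
    problems = []
    if len(examples) < 50:
        problems.append("fewer than 50 examples")
    edge = sum(e["is_edge_case"] for e in examples)
    share = edge / len(examples) if examples else 0
    if not 0.10 <= share <= 0.20:
        problems.append(f"edge-case share {share:.0%} outside 10-20%")
    if len(Counter(e["category"] for e in examples)) < 2:
        problems.append("only one input category covered")
    return problems
```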

A/B Testing Process


  1. Define hypothesis: "Prompt B will improve [metric] by [amount]"
  2. Hold all variables constant except the prompt change
  3. Run both variants on the same evaluation set
  4. Calculate metric differences with confidence intervals
  5. Require statistical significance (p < 0.05) before adopting
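When the metric is an accuracy-style success rate, step 5 can use a standard two-proportion z-test; a stdlib-only Python sketch using the normal approximation (reasonable at the 50+ example minimum above):

```python
import math

def two_proportion_p_value(success_a, n_a, success_b, n_b) -> float:
    """Two-sided two-proportion z-test for comparing prompt variants
    A and B on pass/fail results (normal approximation)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # identical, degenerate proportions
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

Adopt variant B only when the returned p-value is below 0.05, per the rule above.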

What to A/B Test


| Variable | Expected Impact |
|---|---|
| Instruction phrasing (imperative vs descriptive) | Moderate |
| Number of few-shot examples | Moderate |
| Example ordering | Low-moderate |
| CoT presence/absence | High for reasoning tasks |
| Output format specification | High for structured output |
| Constraint placement (beginning vs end) | Low |


Prompt Versioning


Version File Format


```yaml
id: classify-sentiment
version: 2.1
model: claude-sonnet-4-20250514
temperature: 0.0
created: 2025-03-01
author: team
changelog: "Added edge case examples for sarcasm detection"
metrics:
  accuracy: 0.94
  f1: 0.92
  eval_dataset: sentiment-eval-v3
system_prompt: |
  You are a sentiment classifier...
examples:
  - input: "..."
    output: "..."
```

Versioning Rules


  • Semantic versioning: major.minor (major = behavior change, minor = refinement)
  • Every version includes evaluation metrics
  • Link to evaluation dataset version
  • Document what changed and why
  • Keep previous versions for rollback


Anti-Patterns / Common Mistakes


| Anti-Pattern | Why It Is Wrong | Correct Approach |
|---|---|---|
| Vague instructions ("be helpful") | Unreliable, inconsistent output | Specific instructions with examples |
| Contradictory constraints | Model cannot satisfy both | Review for consistency |
| Examples that do not match task | Confuses the model | Examples must reflect real use |
| Over-engineering simple tasks | Wasted tokens, slower | Match prompt complexity to task complexity |
| No evaluation framework | Guessing at quality | Define metrics before iterating |
| Optimizing for single example | Overfitting to one case | Optimize for the distribution |
| Assuming cross-model portability | Different models need different prompts | Test on target model |
| Skipping version control | Cannot rollback or compare | Version every prompt with metrics |


Integration Points


| Skill | Relationship |
|---|---|
| llm-as-judge | LLM-as-judge evaluates prompt output quality |
| acceptance-testing | Prompt evaluation datasets serve as acceptance tests |
| testing-strategy | Prompt testing follows the evaluation methodology |
| senior-data-scientist | Statistical testing validates A/B results |
| code-review | Prompt changes reviewed like code changes |
| clean-code | Prompt readability follows clean code naming principles |


Skill Type


FLEXIBLE — Adapt prompting techniques to the specific model, task, and quality requirements. The evaluation and versioning practices are strongly recommended but can be scaled to project size. Always version production prompts.