senior-prompt-engineer

Senior Prompt Engineer

Overview

Design, test, and optimize prompts for large language models. This skill covers systematic prompt engineering including few-shot example design, chain-of-thought reasoning, system prompt architecture, structured output specification, parameter tuning, evaluation methodology, A/B testing, and prompt version management.

Announce at start: "I'm using the senior-prompt-engineer skill for prompt design and optimization."

Phase 1: Requirements

Goal: Define the task objective, quality criteria, and constraints before writing any prompt.

Actions

Define the task objective clearly
Identify input/output format requirements
Determine quality criteria (accuracy, tone, format)
Assess edge cases and failure modes
Choose model and parameter constraints

STOP — Do NOT proceed to Phase 2 until:

Phase 2: Prompt Design

Goal: Draft the prompt with proper architecture, examples, and constraints.

Actions

Draft system prompt with role, constraints, and format
Design few-shot examples (3-5 representative cases)
Add chain-of-thought scaffolding if reasoning is needed
Specify output structure (JSON, markdown, etc.)
Add error handling instructions

Prompt Architecture Layers

Layer	Purpose	Example
1. Identity	Who the model is	"You are a sentiment classifier..."
2. Context	What it knows/has access to	"You have access to product reviews..."
3. Task	What to do	"Classify each review as positive/negative/neutral"
4. Constraints	What NOT to do	"Never include PII in output"
5. Format	How to structure output	"Respond in JSON: {classification, confidence}"
6. Examples	Demonstrations	3-5 representative input/output pairs
7. Metacognition	Handling uncertainty	"If uncertain, classify as neutral and explain"

System Prompt Template

[Role] You are a [specific role] that [specific capability].

[Context] You have access to [tools/knowledge]. The user will provide [input type].

[Instructions]
1. First, [step 1]
2. Then, [step 2]
3. Finally, [step 3]

[Constraints]
- Always [requirement]
- Never [prohibition]
- If uncertain, [fallback behavior]

[Output Format]
Respond in the following format:
[format specification]

[Examples]
<example>
Input: [sample input]
Output: [sample output]
</example>
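
The template above can also be filled in programmatically, which keeps the section order stable across prompt versions. The following is a minimal sketch, not a prescribed implementation: `PromptSpec` and `build_system_prompt` are hypothetical names introduced here for illustration, and the bracketed section labels are copied from the template.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container holding one value per template section."""
    role: str
    context: str
    steps: list[str]
    constraints: list[str]
    output_format: str
    examples: list[tuple[str, str]] = field(default_factory=list)


def build_system_prompt(spec: PromptSpec) -> str:
    """Render the sections in the same order as the template."""
    lines = [f"[Role] {spec.role}", "", f"[Context] {spec.context}", "", "[Instructions]"]
    lines += [f"{i}. {step}" for i, step in enumerate(spec.steps, start=1)]
    lines += ["", "[Constraints]"]
    lines += [f"- {constraint}" for constraint in spec.constraints]
    lines += ["", "[Output Format]", spec.output_format]
    if spec.examples:
        lines += ["", "[Examples]"]
        for sample_input, sample_output in spec.examples:
            lines += ["<example>", f"Input: {sample_input}", f"Output: {sample_output}", "</example>"]
    return "\n".join(lines)
```

Keeping the sections as separate fields makes it easier to change one layer at a time, which matters later when a single variable is A/B tested.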

STOP — Do NOT proceed to Phase 3 until:

All 7 layers are addressed (or intentionally omitted with rationale)
Examples are representative and diverse
Output format is unambiguous
Constraints are specific (not vague)

Phase 3: Evaluation and Iteration

Goal: Measure prompt quality and iterate toward targets.

Actions

Create evaluation dataset (50+ examples minimum)
Define scoring rubric (automated + human metrics)
Run baseline evaluation
Iterate on prompt with targeted improvements
A/B test promising variants
Version and document the final prompt

STOP — Evaluation complete when:

Evaluation dataset covers all input categories
Metrics meet defined quality thresholds
A/B test shows statistical significance (p < 0.05)
Final prompt is versioned with metrics

Few-Shot Example Design

Selection Criteria Decision Table

Criterion	Explanation	Example
Representative	Cover most common input types	Include typical emails, not just edge cases
Diverse	Include edge cases and boundaries	Short + long, positive + negative
Ordered	Simple to complex progression	Obvious case first, ambiguous last
Balanced	Equal representation of categories	Not 4 positive and 1 negative

Example Count Guidelines

Task Complexity	Examples Needed
Simple classification	2-3
Moderate generation	3-5
Complex reasoning	5-8
Format-sensitive	3-5 (focus on format consistency)

Example Format

<example>
<input>
[Representative input]
</input>
<reasoning>
[Optional: show the thinking process]
</reasoning>
<output>
[Expected output in exact target format]
</output>
</example>
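
As an illustration of the format above, the sketch below renders one example into the tagged layout and counts labels so that an unbalanced set (the "Balanced" criterion) is easy to spot. The helper names `render_example` and `label_counts` are assumptions made for this sketch.

```python
from collections import Counter
from typing import Optional


def render_example(input_text: str, output_text: str, reasoning: Optional[str] = None) -> str:
    """Render one few-shot example; the <reasoning> block is emitted only when given."""
    parts = ["<example>", "<input>", input_text, "</input>"]
    if reasoning is not None:
        parts += ["<reasoning>", reasoning, "</reasoning>"]
    parts += ["<output>", output_text, "</output>", "</example>"]
    return "\n".join(parts)


def label_counts(examples: list[tuple[str, str]]) -> Counter:
    """Count how often each output label appears across (input, output) pairs."""
    return Counter(output for _, output in examples)
```

For a classification prompt, a result such as four "positive" against one "negative" from `label_counts` would signal the imbalance the selection table warns about.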

Chain-of-Thought Patterns

CoT Pattern Decision Table

Pattern	Use When	Example
Standard CoT	Multi-step reasoning	"Think step by step: 1. Identify... 2. Analyze..."
Structured CoT	Need parseable reasoning	XML tags: `<analysis>...</analysis>` then `<answer>...</answer>`
Self-Consistency	High-stakes decisions	Generate 3 solutions, pick most common
No CoT	Simple factual lookups, format conversion	Skip reasoning overhead
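
The Structured CoT and Self-Consistency rows combine naturally: sample several completions, extract each `<answer>` block, and keep the most frequent one. The sketch below assumes the completions have already been collected as strings; `extract_answer` and `majority_answer` are hypothetical helper names, and no model API is called.

```python
import re
from collections import Counter
from typing import Optional

# Matches the <answer>...</answer> block used by the Structured CoT pattern.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion: str) -> Optional[str]:
    """Return the stripped text inside the first <answer> tag, or None if absent."""
    match = ANSWER_RE.search(completion)
    return match.group(1).strip() if match else None


def majority_answer(completions: list[str]) -> Optional[str]:
    """Pick the most common extracted answer; completions without a tag are ignored.

    On a tie, the answer that appeared first wins.
    """
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```

Voting on the extracted answer rather than the full completion matters because the reasoning text usually differs between samples even when the conclusions agree.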

When to Use CoT

Task Type	Use CoT?	Rationale
Mathematical reasoning	Yes	Step-by-step prevents errors
Multi-step logic	Yes	Makes reasoning transparent
Classification with justification	Yes	Improves accuracy and explainability
Simple factual lookup	No	Adds latency without accuracy gain
Direct format conversion	No	No reasoning needed
Very short responses	No	CoT overhead exceeds benefit

Structured Output

Output Format Decision Table

Format	Use When	Parsing
JSON	Machine-consumed output	`JSON.parse()`
Markdown	Human-readable structured text	Regex or markdown parser
XML tags	Sections need clear boundaries	XML parser or regex
YAML	Configuration-like output	YAML parser
Plain text	Simple, unstructured response	No parsing needed

JSON Output Example

Respond with a JSON object matching this schema:
{
  "classification": "positive" | "negative" | "neutral",
  "confidence": number between 0 and 1,
  "reasoning": "brief explanation",
  "key_phrases": ["array", "of", "phrases"]
}

Do not include any text outside the JSON object.
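
On the consuming side, a response to this instruction can be checked before it is used. The following is a minimal sketch using only the standard library; the function name `parse_classification` is an assumption, and the checks mirror the four fields in the schema above.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}


def parse_classification(raw: str) -> dict:
    """Parse a model response and check it against the schema above.

    Raises ValueError on any violation; json.JSONDecodeError is a ValueError
    subclass, so stray text outside the JSON object is reported the same way.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    label = data.get("classification")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError("classification must be positive, negative, or neutral")
    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    phrases = data.get("key_phrases")
    if not isinstance(phrases, list) or not all(isinstance(p, str) for p in phrases):
        raise ValueError("key_phrases must be a list of strings")
    return data
```

The share of responses that pass this check is one way to compute the "JSON validity" metric for an evaluation run.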

Temperature and Top-P Tuning

Use Case	Temperature	Top-P	Rationale
Code generation	0.0-0.2	0.9	Deterministic, correct
Classification	0.0	1.0	Consistent results
Creative writing	0.7-1.0	0.95	Diverse, interesting
Summarization	0.2-0.4	0.9	Faithful but fluent
Brainstorming	0.8-1.2 (only where the API accepts values above 1.0)	0.95	Maximum diversity
Data extraction	0.0	0.9	Precise, reliable

Rules

Temperature 0 for tasks requiring consistency and correctness
Higher temperature for creative tasks
Top-P rarely needs tuning (keep at 0.9-1.0)
Do not use both high temperature AND low top-p (contradictory)
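
These rules can be turned into a small pre-flight check on sampling settings. The sketch below is illustrative only: the function name is hypothetical, and the cut-off used for "high" temperature (0.8) is an assumption, since the rules do not define one.

```python
def sampling_warnings(temperature: float, top_p: float) -> list[str]:
    """Return human-readable warnings for settings that conflict with the rules above."""
    warnings = []
    if not 0.9 <= top_p <= 1.0:
        warnings.append("top_p is outside the 0.9-1.0 range the rules suggest keeping")
    # Assumed threshold: treat 0.8 and above as "high" temperature.
    if temperature >= 0.8 and top_p < 0.9:
        warnings.append("high temperature combined with low top_p")
    return warnings
```

An empty list means the settings are consistent with the rules; it says nothing about whether they suit the task.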

Evaluation Metrics

Automated Metrics

Metric	Measures	Use For
Exact Match	Output equals expected	Classification, extraction
F1 Score	Harmonic mean of precision and recall	Multi-label tasks
BLEU/ROUGE	N-gram overlap	Summarization, translation
JSON validity	Parseable structured output	Structured generation
Regex match	Output matches pattern	Format compliance
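
Two of the metrics in the table are simple enough to compute without a library. The sketch below shows exact match and a per-label F1 score; the function names are assumptions, and a production setup might prefer an established metrics package instead.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference exactly."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def f1_for_label(predictions: Sequence[str], references: Sequence[str], label: str) -> float:
    """F1 for one label: the harmonic mean of its precision and recall."""
    pairs = list(zip(predictions, references))
    tp = sum(p == label and r == label for p, r in pairs)
    fp = sum(p == label and r != label for p, r in pairs)
    fn = sum(p != label and r == label for p, r in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Averaging `f1_for_label` over all labels gives a macro F1, which weights rare and common categories equally.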

Human Evaluation Dimensions

Dimension	Scale	Description
Accuracy	1-5	Factual correctness
Relevance	1-5	Addresses the actual question
Coherence	1-5	Logical flow and structure
Completeness	1-5	Covers all required aspects
Tone	1-5	Matches desired voice
Conciseness	1-5	No unnecessary content

Evaluation Dataset Requirements

At least 50 examples; smaller sets rarely have enough statistical power to separate two prompts, and small effects need more
Cover all input categories proportionally
Include edge cases (10-20% of dataset)
Gold labels reviewed by 2+ evaluators
Version-controlled alongside prompts
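
Some of these requirements can be checked mechanically before an evaluation run. The sketch below assumes each dataset row is a dict with a `category` string and an `is_edge_case` flag; that row shape and the function name are illustrative assumptions, not part of the skill.

```python
def dataset_issues(rows: list[dict], expected_categories: set[str], min_size: int = 50) -> list[str]:
    """List the requirements an evaluation dataset fails; an empty list means none found."""
    issues = []
    if len(rows) < min_size:
        issues.append(f"only {len(rows)} examples, need at least {min_size}")
    present = {row["category"] for row in rows}
    missing = expected_categories - present
    if missing:
        issues.append("missing categories: " + ", ".join(sorted(missing)))
    if rows:
        edge_share = sum(1 for row in rows if row["is_edge_case"]) / len(rows)
        if not 0.10 <= edge_share <= 0.20:
            issues.append(f"edge cases are {edge_share:.0%} of the dataset, expected 10-20%")
    return issues
```

Reviewer agreement on gold labels and proportional category coverage still need a human check; this only catches the countable problems.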

A/B Testing Process

Define hypothesis: "Prompt B will improve [metric] by [amount]"
Hold all variables constant except the prompt change
Run both variants on the same evaluation set
Calculate metric differences with confidence intervals
Require statistical significance (p < 0.05) before adopting
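
Because both variants run on the same evaluation set, the per-example results are paired, and a paired test is appropriate. The process above does not name a test; the sketch below uses an exact McNemar test as one reasonable choice for pass/fail metrics, with a hypothetical function name and only the standard library.

```python
from math import comb


def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-example correctness.

    Only examples where exactly one variant is correct carry information;
    under the null hypothesis each such example favors A or B with probability 0.5.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both variants must be scored on the same examples")
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # the variants never disagree
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, if Prompt B fixes six examples that Prompt A gets wrong and breaks none, the p-value is 2 × (1/64) ≈ 0.031, which clears the 0.05 threshold. For graded metrics such as 1-5 human scores, a different paired test or a bootstrap interval would be needed.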

What to A/B Test

Variable	Expected Impact
Instruction phrasing (imperative vs descriptive)	Moderate
Number of few-shot examples	Moderate
Example ordering	Low-moderate
CoT presence/absence	High for reasoning tasks
Output format specification	High for structured output
Constraint placement (beginning vs end)	Low

Prompt Versioning

Version File Format

yaml

id: classify-sentiment
version: "2.1"  # quoted so YAML keeps it a string instead of parsing a float
model: claude-sonnet-4-20250514
temperature: 0.0
created: 2025-03-01
author: team
changelog: "Added edge case examples for sarcasm detection"
metrics:
  accuracy: 0.94
  f1: 0.92
  eval_dataset: sentiment-eval-v3
system_prompt: |
  You are a sentiment classifier...
examples:
  - input: "..."
    output: "..."

Versioning Rules

Semantic-style versioning in major.minor form (major = behavior change, minor = refinement)
Every version includes evaluation metrics
Link to evaluation dataset version
Document what changed and why
Keep previous versions for rollback
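
A major.minor version is best handled as a pair of integers rather than a number: read as a float, "2.10" and "2.1" are the same value, and as plain strings "2.10" sorts before "2.9". The sketch below shows parsing and bumping under the rules above; `parse_version` and `bump` are hypothetical helper names.

```python
def parse_version(text: str) -> tuple[int, int]:
    """Split a 'major.minor' string into integers so versions compare correctly."""
    major, minor = text.split(".")
    return int(major), int(minor)


def bump(text: str, behavior_change: bool) -> str:
    """Return the next version: a new major for behavior changes, else a new minor."""
    major, minor = parse_version(text)
    return f"{major + 1}.0" if behavior_change else f"{major}.{minor + 1}"
```

With this, `bump("2.1", behavior_change=False)` gives "2.2" and `bump("2.1", behavior_change=True)` gives "3.0".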

Anti-Patterns / Common Mistakes

Anti-Pattern	Why It Is Wrong	Correct Approach
Vague instructions ("be helpful")	Unreliable, inconsistent output	Specific instructions with examples
Contradictory constraints	Model cannot satisfy both	Review for consistency
Examples that do not match task	Confuses the model	Examples must reflect real use
Over-engineering simple tasks	Wasted tokens, slower	Match prompt complexity to task complexity
No evaluation framework	Guessing at quality	Define metrics before iterating
Optimizing for single example	Overfitting to one case	Optimize for the distribution
Assuming cross-model portability	Different models need different prompts	Test on target model
Skipping version control	Cannot rollback or compare	Version every prompt with metrics

Integration Points

Skill	Relationship
`llm-as-judge`	LLM-as-judge evaluates prompt output quality
`acceptance-testing`	Prompt evaluation datasets serve as acceptance tests
`testing-strategy`	Prompt testing follows the evaluation methodology
`senior-data-scientist`	Statistical testing validates A/B results
`code-review`	Prompt changes reviewed like code changes
`clean-code`	Prompt readability follows clean code naming principles

Skill Type

FLEXIBLE — Adapt prompting techniques to the specific model, task, and quality requirements. The evaluation and versioning practices are strongly recommended but can be scaled to project size. Always version production prompts.
