prompt-engineer-toolkit

Prompt Engineer Toolkit

Overview

Use this skill to move prompts from ad-hoc drafts to production assets with repeatable testing, versioning, and regression safety. It emphasizes measurable quality over intuition. Apply it when launching a new LLM feature that needs reliable outputs, when prompt quality degrades after model or instruction changes, when multiple team members edit prompts and need history/diffs, when you need evidence-based prompt choice for production rollout, or when you want consistent prompt governance across environments.

Core Capabilities

A/B prompt evaluation against structured test cases
Quantitative scoring for adherence, relevance, and safety checks
Prompt version tracking with immutable history and changelog
Prompt diffs to review behavior-impacting edits
Reusable prompt templates and selection guidance
Regression-friendly workflows for model/prompt updates

Key Workflows

1. Run Prompt A/B Test

Prepare JSON test cases and run:

bash

python3 scripts/prompt_tester.py \
  --prompt-a-file prompts/a.txt \
  --prompt-b-file prompts/b.txt \
  --cases-file testcases.json \
  --runner-cmd 'my-llm-cli --prompt {prompt} --input {input}' \
  --format text

Input can also come from stdin or an `--input` JSON payload.
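
The `--runner-cmd` value above is a command template with `{prompt}` and `{input}` placeholders. How `prompt_tester.py` expands it is not shown here, so the following is only a minimal sketch of one plausible approach, assuming each placeholder stands for a whole argument. The helper name `build_runner_argv` and the sample prompt and input are hypothetical.

```python
import shlex


def build_runner_argv(template: str, prompt: str, case_input: str) -> list[str]:
    """Turn a runner template into an argv list (hypothetical helper).

    The template is split into tokens first, and only tokens that are exactly
    a placeholder are replaced. Prompt text with spaces or quotes therefore
    stays inside a single argument and needs no shell quoting.
    """
    values = {"{prompt}": prompt, "{input}": case_input}
    return [values.get(token, token) for token in shlex.split(template)]


argv = build_runner_argv(
    "my-llm-cli --prompt {prompt} --input {input}",
    prompt="Classify the ticket. Reply with one label.",
    case_input='{"text": "refund please"}',
)
print(argv)
```

A list built this way could be handed to `subprocess.run(argv, capture_output=True, text=True)` without `shell=True`, which avoids shell injection through prompt or input text.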

2. Choose Winner With Evidence

The tester scores outputs per case and aggregates:

expected content coverage
forbidden content violations
regex/format compliance
output length sanity

Use the higher-scoring prompt as the candidate baseline, then run the regression suite.
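
The four checks above can be sketched as a per-case scoring function plus a simple aggregate. This is an illustrative sketch, not the scoring used by `prompt_tester.py`: the 60/30/10 point weights, the `max_chars` limit, case-insensitive matching, and the rule that any forbidden hit zeroes the score are all assumptions. The case field names follow the Evaluation Design section.

```python
import re


def score_output(output: str, case: dict, max_chars: int = 2000) -> dict:
    """Score one model output against one test case (assumed weights)."""
    expected = case.get("expected_contains", [])
    forbidden = case.get("forbidden_contains", [])
    pattern = case.get("expected_regex")
    lowered = output.lower()

    # Expected content coverage: fraction of required markers present.
    hits = sum(1 for marker in expected if marker.lower() in lowered)
    coverage = hits / len(expected) if expected else 1.0
    # Forbidden content violations: count of disallowed phrases present.
    violations = sum(1 for phrase in forbidden if phrase.lower() in lowered)
    # Regex/format compliance: pattern must match somewhere in the output.
    regex_ok = bool(re.search(pattern, output)) if pattern else True
    # Output length sanity: non-empty and under the assumed limit.
    length_ok = 0 < len(output.strip()) <= max_chars

    # Assumed policy: a single violation overrides every other signal.
    score = 0.0 if violations else 60 * coverage + 30 * regex_ok + 10 * length_ok
    return {"score": score, "violations": violations}


def aggregate(results: list[dict]) -> dict:
    """Average the per-case scores and total the violations."""
    return {
        "mean_score": sum(r["score"] for r in results) / len(results),
        "violations": sum(r["violations"] for r in results),
    }


case = {
    "expected_contains": ["billing", "priority"],
    "forbidden_contains": ["guarantee"],
    "expected_regex": r"^label: \w+",
}
good = score_output("label: billing\npriority: high", case)
bad = score_output("We guarantee a refund.", case)
summary = aggregate([good, bad])
print(good, bad, summary)
```

Reporting violations separately from the score keeps a high average from hiding an unsafe output.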

3. Version Prompts

bash

# Add version
python3 scripts/prompt_versioner.py add \
  --name support_classifier \
  --prompt-file prompts/support_v3.txt \
  --author alice

# Diff versions
python3 scripts/prompt_versioner.py diff --name support_classifier --from-version 2 --to-version 3

# Changelog
python3 scripts/prompt_versioner.py changelog --name support_classifier

4. Regression Loop

Store baseline version.
Propose prompt edits.
Re-run A/B test.
Promote only if the score improves and the safety constraints still hold.

Script Interfaces

```
python3 scripts/prompt_tester.py --help
```
- Reads prompts/cases from stdin or `--input`
- Optional external runner command
- Emits text or JSON metrics

```
python3 scripts/prompt_versioner.py --help
```
- Manages prompt history (`add`, `list`, `diff`, `changelog`)
- Stores metadata and content snapshots locally

Pitfalls, Best Practices & Review Checklist

Avoid these mistakes:

Picking prompts from single-case outputs — use a realistic, edge-case-rich test suite.
Changing prompt and model simultaneously — always isolate variables.
Missing `must_not_contain` (forbidden-content) checks in evaluation criteria.
Editing prompts without version metadata, author, or change rationale.
Skipping semantic diffs before deploying a new prompt version.
Optimizing one benchmark while harming edge cases — track the full suite.
Swapping models without rerunning the baseline A/B suite.

Before promoting any prompt, confirm:

Task intent is explicit and unambiguous.
Output schema/format is explicit.
Safety and exclusion constraints are explicit.
No contradictory instructions.
No unnecessary verbosity tokens.
A/B score improves and violation count stays at zero.

References

references/prompt-templates.md
references/technique-guide.md
references/evaluation-rubric.md
README.md

Evaluation Design

Each test case should define:

`input`: realistic production-like input
`expected_contains`: required markers/content
`forbidden_contains`: disallowed phrases or unsafe content
`expected_regex`: required structural patterns

This enables deterministic grading across prompt variants.
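
As an illustration, a cases file with these four fields might be assembled and checked as below. The exact schema `prompt_tester.py` accepts is not shown in this document, so the JSON-array layout, the `validate_case` helper, and the sample case are assumptions.

```python
import json
import re

# Field names come from the list above. Treating only `input` as required
# is an assumption.
REQUIRED_KEYS = {"input"}
OPTIONAL_KEYS = {"expected_contains", "forbidden_contains", "expected_regex"}


def validate_case(case: dict) -> list[str]:
    """Return a list of problems found in one test case (empty if none)."""
    problems = []
    if "input" not in case:
        problems.append("missing input")
    unknown = set(case) - REQUIRED_KEYS - OPTIONAL_KEYS
    if unknown:
        problems.append(f"unknown keys: {sorted(unknown)}")
    if "expected_regex" in case:
        try:
            re.compile(case["expected_regex"])
        except re.error as exc:
            problems.append(f"bad regex: {exc}")
    return problems


cases = [
    {
        "input": "I was charged twice for my subscription this month.",
        "expected_contains": ["billing"],
        "forbidden_contains": ["guarantee"],
        "expected_regex": r"^label: \w+",
    },
]

for case in cases:
    assert validate_case(case) == [], validate_case(case)

# Assumed file layout: a JSON array of case objects.
print(json.dumps(cases, indent=2))
```

Validating cases before a run catches typos in field names that would otherwise silently disable a check.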

Versioning Policy

Use semantic prompt identifiers per feature (`support_classifier`, `ad_copy_shortform`).
Record author + change note for every revision.
Never overwrite historical versions.
Diff before promoting a new prompt to production.
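
The policy above can be sketched as an append-only, in-memory history. This is not how `prompt_versioner.py` stores data, which this document does not describe. The class names, fields, and sample prompts are hypothetical, and the diff uses the standard-library `difflib`.

```python
import difflib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a stored version cannot be edited in place
class PromptVersion:
    version: int
    text: str
    author: str
    note: str
    created_at: str


@dataclass
class PromptHistory:
    name: str
    versions: list[PromptVersion] = field(default_factory=list)

    def add(self, text: str, author: str, note: str) -> PromptVersion:
        """Append a new version. Author and change note are mandatory."""
        if not author or not note:
            raise ValueError("author and change note are required")
        stamp = datetime.now(timezone.utc).isoformat()
        entry = PromptVersion(len(self.versions) + 1, text, author, note, stamp)
        self.versions.append(entry)  # history only grows, never overwrites
        return entry

    def get(self, version: int) -> PromptVersion:
        if not 1 <= version <= len(self.versions):
            raise KeyError(f"{self.name} has no version {version}")
        return self.versions[version - 1]

    def diff(self, from_version: int, to_version: int) -> str:
        """Unified diff between two stored versions."""
        old, new = self.get(from_version), self.get(to_version)
        lines = difflib.unified_diff(
            old.text.splitlines(),
            new.text.splitlines(),
            fromfile=f"{self.name}@v{old.version}",
            tofile=f"{self.name}@v{new.version}",
            lineterm="",
        )
        return "\n".join(lines)


history = PromptHistory("support_classifier")
v1 = history.add("Classify the support ticket.\nReply with a label.", "alice", "initial")
v2 = history.add(
    "Classify the support ticket.\nReply with exactly one label.",
    "alice",
    "tighten output format",
)
print(history.diff(1, 2))
```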

Rollout Strategy

Create baseline prompt version.
Propose candidate prompt.
Run A/B suite against same cases.
Promote only if winner improves average and keeps violation count at zero.
Track post-release feedback and feed new failure cases back into test suite.
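
The promotion rule in this list can be expressed as a small gate. The sketch below assumes each A/B run is summarized by a mean score and a violation count. The `SuiteResult` type, the `should_promote` helper, and the `min_gain` threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    mean_score: float  # average score over the shared test cases
    violations: int    # total forbidden-content hits across the suite


def should_promote(
    baseline: SuiteResult, candidate: SuiteResult, min_gain: float = 0.0
) -> bool:
    """Promote only on a strict score gain with zero violations."""
    if candidate.violations > 0:
        return False  # safety takes precedence over any score gain
    return candidate.mean_score - baseline.mean_score > min_gain


baseline = SuiteResult(mean_score=82.0, violations=0)
print(should_promote(baseline, SuiteResult(88.5, 0)))  # higher score, clean
print(should_promote(baseline, SuiteResult(95.0, 1)))  # higher score, unsafe
print(should_promote(baseline, SuiteResult(82.0, 0)))  # no improvement
```

Setting `min_gain` above zero is one way to avoid promoting on differences that are within run-to-run noise.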
