quality-playbook
Quality Playbook Generator
Generate a complete quality system tailored to a specific codebase. Unlike test stub generators that work mechanically from source code, this skill explores the project first — understanding its domain, architecture, specifications, and failure history — then produces a quality playbook grounded in what it finds.
Why This Exists
Most software projects have tests, but few have a quality system. Tests check whether code works. A quality system answers harder questions: what does "working correctly" mean for this specific project? What are the ways it could fail that wouldn't be caught by tests? What should every developer (human or AI) know before touching this code?
Without a quality playbook, every new contributor (and every new AI session) starts from scratch — guessing at what matters, writing tests that look good but don't catch real bugs, and rediscovering failure modes that were already found and fixed months ago. A quality playbook makes the bar explicit, persistent, and inherited.
What This Skill Produces
Six files that together form a repeatable quality system:
| File | Purpose | Why It Matters | Executes Code? |
|---|---|---|---|
| `quality/QUALITY.md` | Quality constitution — coverage targets, fitness-to-purpose scenarios, theater prevention | Every AI session reads this first. It tells them what "good enough" means so they don't guess. | No |
| Functional test file (named per the project's language and test framework conventions) | Automated functional tests derived from specifications | The safety net. Tests tied to what the spec says should happen, not just what the code does. | Yes |
| `quality/RUN_CODE_REVIEW.md` | Code review protocol with guardrails that prevent hallucinated findings | AI code reviews without guardrails produce confident but wrong findings. The guardrails (line numbers, grep before claiming, read bodies) often improve accuracy. | No |
| `quality/RUN_INTEGRATION_TESTS.md` | Integration test protocol — end-to-end pipeline across all variants | Unit tests pass, but does the system actually work end-to-end with real external services? | Yes |
| `quality/RUN_SPEC_AUDIT.md` | Council of Three multi-model spec audit protocol | No single AI model catches everything. Three independent models with different blind spots catch defects that any one alone would miss. | No |
| Bootstrap context file | Bootstrap context for any AI session working on this project | The "read this first" file. Without it, AI sessions waste their first hour figuring out what's going on. | No |
Plus output directories: `quality/code_reviews/`, `quality/spec_audits/`, `quality/results/`.

The critical deliverable is the functional test file (named for the project's language and test framework conventions). The Markdown protocols are documentation for humans and AI agents. The functional tests are the automated safety net.
How to Use
Point this skill at any codebase:
- "Generate a quality playbook for this project."
- "Update the functional tests — the quality playbook already exists."
- "Run the spec audit protocol."

If a quality playbook already exists (`quality/QUALITY.md`, functional tests, etc.), read the existing files first, then evaluate them against the self-check benchmarks in the verification phase. Don't assume existing files are complete — treat them as a starting point.
Phase 1: Explore the Codebase (Do Not Write Yet)
Spend the first phase understanding the project. The quality playbook must be grounded in this specific codebase — not generic advice.
Why explore first? The most common failure in AI-generated quality playbooks is producing generic content — coverage targets that could apply to any project, scenarios that describe theoretical failures, tests that exercise language builtins instead of project code. Exploration prevents this by forcing every output to reference something real: a specific function, a specific schema, a specific defensive code pattern. If you can't point to where something lives in the code, you're guessing — and guesses produce quality playbooks nobody trusts.
Scaling for large codebases: For projects with more than ~50 source files, don't try to read everything. Focus exploration on the 3–5 core modules (the ones that handle the primary data flow, the most complex logic, and the most failure-prone operations). Read representative tests from each subsystem rather than every test file. The goal is depth on what matters, not breadth across everything.
Step 0: Ask About Development History
Before exploring code, ask the user one question:
"Do you have exported AI chat history from developing this project — Claude exports, Gemini takeouts, ChatGPT exports, Claude Code transcripts, or similar? If so, point me to the folder. The design discussions, incident reports, and quality decisions in those chats will make the generated quality playbook significantly better."
If the user provides a chat history folder:
- Scan for an index file first. Look for files named `README.md`, `INDEX*`, `CONTEXT.md`, or similar navigation aids. If one exists, read it — it will tell you what's there and how to find things.
- Search for quality-relevant conversations. Look for messages mentioning: quality, testing, coverage, bugs, failures, incidents, crashes, validation, retry, recovery, spec, fitness, audit, review. Also search for the project name.
- Extract design decisions and incident history. The most valuable content is: (a) incident reports — what went wrong, how many records affected, how it was detected, (b) design discussions — why a particular approach was chosen, what alternatives were rejected, (c) quality framework discussions — coverage targets, testing philosophy, model review experiences, (d) cross-model feedback — where different AI models disagreed about the code.
- Don't try to read everything. Chat histories can be enormous. Use the index to find the most relevant conversations, then search within those for quality-related content. 10 minutes of targeted searching beats 2 hours of exhaustive reading.
This context is gold. A chat history where the developer discussed "why we chose this concurrency model" or "the time we lost 1,693 records in production" transforms generic scenarios into authoritative ones.
If the user doesn't have chat history, proceed normally — the skill works without it, just with less context.
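The targeted keyword search above can be sketched in a few lines of Python — a minimal sketch assuming a folder of plain-text or Markdown exports (the folder layout and keyword coverage are illustrative, not part of the skill):

```python
import os
import re

# Quality-relevant keywords from the step above (illustrative subset).
KEYWORDS = re.compile(
    r"\b(quality|testing|coverage|bugs?|failures?|incidents?|crash(es)?|"
    r"validation|retry|recovery|spec|fitness|audit|review)\b",
    re.IGNORECASE,
)

def scan_chat_exports(folder):
    """Return {path: hit_count} for exports that mention quality topics."""
    hits = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if not name.endswith((".md", ".txt", ".json")):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                count = len(KEYWORDS.findall(f.read()))
            if count:
                hits[path] = count
    # Read the highest-scoring conversations first.
    return dict(sorted(hits.items(), key=lambda kv: -kv[1]))
```

Ten minutes with a scorer like this points you at the few conversations worth reading in full.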
Step 1: Identify Domain, Stack, and Specifications
Read the README, existing documentation, and build config (`pyproject.toml` / `package.json` / `Cargo.toml`). Answer:
- What does this project do? (One sentence.)
- What language and key dependencies?
- What external systems does it talk to?
- What is the primary output?
Find the specifications. Specs are the source of truth for functional tests. Search in order: `AGENTS.md` / `CLAUDE.md` in root, `specs/`, `docs/`, `spec/`, `design/`, `architecture/`, `adr/`, then `.md` files in root. Record the paths.

If no formal spec documents exist, the skill still works — but you need to assemble requirements from other sources. In order of preference:
- Ask the user — they often know the requirements even if they're not written down.
- README and inline documentation — many projects embed requirements in their README, API docs, or code comments.
- Existing test suite — tests are implicit specifications. If a test asserts `process(x) == y`, that's a requirement.
- Type signatures and validation rules — schemas, type annotations, and validators define what the system accepts and rejects.
- Infer from code behavior — as a last resort, read the code and infer what it's supposed to do. Mark these as inferred requirements in QUALITY.md and flag them for user confirmation.
When working from non-formal requirements, label each scenario and test with a requirement tag that includes a confidence tier and source:
- `[Req: formal — README §3]` — written by humans in a spec document. Authoritative.
- `[Req: user-confirmed — "must handle empty input"]` — stated by the user but not in a formal doc. Treat as authoritative.
- `[Req: inferred — from validate_input() behavior]` — deduced from code. Flag for user review.
Use this exact tag format in QUALITY.md scenarios, functional test documentation, and spec audit findings. It makes clear which requirements are authoritative and which need validation.
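In a Python test file, the tags land in docstrings so audits can grep for them — a hypothetical sketch (`normalize` is an invented stand-in for project code, and the cited requirements are illustrative):

```python
def normalize(text):
    """Hypothetical project function under test."""
    return " ".join(text.split())

def test_normalize_collapses_whitespace():
    """[Req: formal — README §3] Whitespace runs collapse to one space."""
    assert normalize("a  b\tc") == "a b c"

def test_normalize_handles_empty_input():
    """[Req: user-confirmed — "must handle empty input"] Empty in, empty out."""
    assert normalize("") == ""

def test_normalize_strips_edges():
    """[Req: inferred — from normalize() behavior] Edge whitespace removed."""
    assert normalize("  a ") == "a"
```

A later audit can then run `grep -rn "\[Req: inferred"` to list every requirement still awaiting user confirmation.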
Step 2: Map the Architecture
List source directories and their purposes. Read the main entry point, trace execution flow. Identify:
- The 3–5 major subsystems
- The data flow (Input → Processing → Output)
- The most complex module
- The most fragile module
Step 3: Read Existing Tests
Read the existing test files — all of them for small/medium projects, or a representative sample from each subsystem for large ones. Identify: test count, coverage patterns, gaps, and any coverage theater (tests that look good but don't catch real bugs).
Critical: Record the import pattern. How do existing tests import project modules? Every language has its own conventions (Python `sys.path` manipulation, Java/Scala package imports, TypeScript relative paths or aliases, Go package/module paths, Rust `use crate::` or `use myproject::`). You must use the exact same pattern in your functional tests — getting this wrong means every test fails with import/resolution errors. See `references/functional_tests.md` § "Import Pattern" for the full six-language matrix.

Identify integration test runners. Look for scripts or test files that exercise the system end-to-end against real external services (APIs, databases, etc.). Note their patterns — you'll need them for `RUN_INTEGRATION_TESTS.md`.
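If the existing suite uses Python `sys.path` manipulation, for example, matching it means copying the header verbatim — a sketch of the kind of preamble you might find and reuse (the `src/` layout and module name are illustrative):

```python
# tests/test_functional.py — mirror the exact header the existing tests use.
import os
import sys

# Where this test file lives (fall back to CWD when run interactively).
HERE = os.path.dirname(os.path.abspath(globals().get("__file__", os.getcwd())))

# Existing tests prepend the repo's src/ directory; do the same, verbatim.
sys.path.insert(0, os.path.join(HERE, "..", "src"))

# Now project imports resolve the same way they do in the existing suite:
# from mypackage import pipeline   # hypothetical module name
```

The point is not this particular header — it is that the new file's import mechanics must be byte-for-byte the suite's own.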
Step 4: Read the Specifications
Walk each spec document section by section. For every section, ask: "What testable requirement does this state?" Record spec requirements without corresponding tests — these are the gaps the functional tests must close.
If using inferred requirements (from tests, types, or code behavior), tag each with its confidence tier using the `[Req: tier — source]` format defined in Step 1. Inferred requirements feed into QUALITY.md scenarios and should be flagged for user review in Phase 4.
Step 4b: Read Function Signatures and Real Data
Before writing any test, you must know exactly how each function is called. For every module you identified in Step 2:
- Read the actual function signatures — parameter names, types, defaults. Don't guess from usage context — read the function definition and any documentation (Python docstrings, Java/Scala Javadoc/ScalaDoc, TypeScript type annotations, Go godoc comments, Rust doc comments and type signatures).
- Read real data files — If the project has items files, fixture files, config files, or sample data (in `pipelines/`, `fixtures/`, `test_data/`, `examples/`), read them. Your test fixtures must match the real data shape exactly.
- Read existing test fixtures — How do existing tests create test data? Copy their patterns. If they build config dicts with specific keys, use those exact keys.
- Check library versions — Check the project's dependency manifest (`requirements.txt`, `build.sbt`, `package.json`/`pom.xml`, `build.gradle`, `go.mod`, `Cargo.toml`) to see what's actually available. Don't write tests that depend on library features that aren't installed. If a dependency might be missing, use the test framework's skip mechanism — see `references/functional_tests.md` § "Library version awareness" for framework-specific examples.
Record a function call map: for each function you plan to test, write down its name, module, parameters, and what it returns. This map prevents the most common test failure: calling functions with wrong arguments.
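The map needs no tooling — a literal at the top of the test file works. A hypothetical sketch (module, function names, and signatures are invented for illustration):

```python
# Function call map: one entry per function the tests will exercise.
# Filled in only after reading each actual signature — never from memory.
CALL_MAP = [
    {
        "function": "save_state",
        "module": "persistence",                 # hypothetical module
        "params": ["state: dict", "path: str"],
        "returns": "None (writes JSON to path)",
    },
    {
        "function": "load_state",
        "module": "persistence",
        "params": ["path: str"],
        "returns": "dict, or None if file missing/corrupt",
    },
]

def params_of(name):
    """Look up the recorded parameters before writing a test call."""
    return next(e["params"] for e in CALL_MAP if e["function"] == name)
```

Consulting the map before every test call is what prevents wrong-argument failures.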
Step 5: Find the Skeletons
This is the most important step. Search for defensive code patterns — each one is evidence of a past failure or known risk.
Why this matters: Developers don't write `try/except` blocks, null checks, or retry logic for fun. Every piece of defensive code exists because someone got burned. A `try/except` around a JSON parse means malformed JSON happened in production. A null check on a field means that field was missing when it shouldn't have been. These patterns are the codebase whispering its history of failures. Each one becomes a fitness-to-purpose scenario and a boundary test.

Read `references/defensive_patterns.md` for the systematic search approach, grep patterns, and how to convert findings into fitness-to-purpose scenarios and boundary tests.

Minimum bar: at least 2–3 defensive patterns per core source file. If you find fewer, you're skimming — read function bodies, not just signatures.
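The search pass can be sketched in Python — a minimal sketch whose pattern list is only a starting point (the systematic set lives in `references/defensive_patterns.md`):

```python
import re

# Each regex flags a defensive idiom; every hit is a candidate boundary test.
DEFENSIVE_PATTERNS = {
    "exception handler": re.compile(r"\bexcept\b"),
    "null guard":        re.compile(r"\bis (not )?None\b"),
    "retry logic":       re.compile(r"\bretr(y|ies)\b", re.IGNORECASE),
    "fallback default":  re.compile(r"\.get\(.*,"),
}

def find_defensive_lines(source):
    """Return (line_no, label, line) for every defensive idiom in a source string."""
    found = []
    for no, line in enumerate(source.splitlines(), start=1):
        for label, pat in DEFENSIVE_PATTERNS.items():
            if pat.search(line):
                found.append((no, label, line.strip()))
    return found
```

Run it per core source file; a file with fewer than 2–3 hits usually means the pattern list is too narrow, not that the file is defenseless.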
Step 5b: Map Schema Types
If the project has a validation layer (Pydantic models in Python, JSON Schema, TypeScript interfaces/Zod schemas, Java Bean Validation annotations, Scala case class codecs), read the schema definitions now. For every field you found a defensive pattern for, record what the schema accepts vs. rejects.
Read `references/schema_mapping.md` for the mapping format and why this matters for writing valid boundary tests.
Step 6: Identify Quality Risks (Code + Domain Knowledge)
Every project has a different failure profile. This step uses two sources — not just code exploration, but your training knowledge of what goes wrong in similar systems.
From code exploration, ask:
- What does "silently wrong" look like for this project?
- What external dependencies can change without warning?
- What looks simple but is actually complex?
- Where do cross-cutting concerns hide?
From domain knowledge, ask:
- "What goes wrong in systems like this?" — If it's a batch processor, think about crash recovery, idempotency, silent data loss, state corruption. If it's a web app, think about auth edge cases, race conditions, input validation bypasses. If it handles randomness or statistics, think about seeding, correlation, distribution bias.
- "What produces correct-looking output that is actually wrong?" — This is the most dangerous class of bug: output that passes all checks but is subtly corrupted.
- "What happens at 10x scale that doesn't happen at 1x?" — Chunk boundaries, rate limits, timeout cascading, memory pressure.
- "What happens when this process is killed at the worst possible moment?" — Mid-write, mid-transaction, mid-batch-submission.
Generate realistic failure scenarios from this knowledge. You don't need to have observed these failures — you know from training that they happen to systems of this type. Write them as architectural vulnerability analyses with specific quantities and consequences. Frame each as "this architecture permits the following failure mode" — not as a fabricated incident report. Use concrete numbers to make the severity non-negotiable: "If the process crashes mid-write during a 10,000-record batch, `save_state()` without an atomic rename pattern will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume without manual intervention." Then ground them in the actual code you explored: "Read persistence.py line ~340 (`save_state`): verify temp file + rename pattern."
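The temp-file-plus-rename pattern such a scenario asks the reviewer to verify looks roughly like this in Python — a generic sketch, not the project's actual `save_state`:

```python
import json
import os
import tempfile

def save_state(state, path):
    """Crash-safe save: a kill mid-write leaves the previous file intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())      # data hits disk before the rename
        os.replace(tmp, path)         # atomic on POSIX and Windows
    finally:
        if os.path.exists(tmp):       # clean up if the write failed early
            os.remove(tmp)
```

A naive `open(path, "w")` rewrite is the vulnerable variant: a crash between truncate and close is exactly what produces the corrupted-state-file scenario above.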
Phase 2: Generate the Quality Playbook
Now write the six files. For each one, follow the structure below and consult the relevant reference file for detailed guidance.
Why six files instead of just tests? Tests catch regressions but don't prevent new categories of bugs. The quality constitution (`QUALITY.md`) tells future sessions what "correct" means before they start writing code. The protocols (`RUN_*.md`) provide structured processes for review, integration testing, and spec auditing that produce repeatable results — instead of leaving quality to whatever the AI feels like checking. Together, these files create a quality system where each piece reinforces the others: scenarios in QUALITY.md map to tests in the functional test file, which are verified by the integration protocol, which is audited by the Council of Three.
QUALITY.mdRUN_*.mdFile 1: quality/QUALITY.md
— Quality Constitution
Read `references/constitution.md` for the full template and examples.

The constitution has six sections:
- Purpose — What quality means for this project, grounded in Deming (built in, not inspected), Juran (fitness for use), Crosby (quality is free). Apply these specifically: what does "fitness for use" mean for this system? Not "tests pass" but the actual operational requirement.
- Coverage Targets — Table mapping each subsystem to a target with rationale referencing real risks. Every target must have a "why" grounded in a specific scenario — without it, a future AI session will argue the target down.
- Coverage Theater Prevention — Project-specific examples of fake tests, derived from what you saw during exploration. (Why: AI-generated tests often pad coverage numbers without catching real bugs — asserting that imports worked, that dicts have keys, or that mocks return what they were configured to return. Calling this out explicitly stops the pattern.)
- Fitness-to-Purpose Scenarios — The heart of it. Each scenario documents a realistic failure mode with code references and verification method. Aim for 2+ scenarios per core module — typically 8–10 total for a medium project, fewer for small projects, more for complex ones. Quality matters more than count: a scenario that precisely captures a real architectural vulnerability is worth more than three generic ones. (Why: Coverage percentages tell you how much code ran, not whether it ran correctly. A system can have 95% coverage and still lose records silently. Fitness scenarios define what "working correctly" actually means in concrete terms that no one can argue down.)
- AI Session Quality Discipline — Rules every AI session must follow
- The Human Gate — Things requiring human judgment
Scenario voice is critical. Write "What happened" as architectural vulnerability analyses with specific quantities, cascade consequences, and detection difficulty — not as abstract specifications. "Because `save_state()` lacks an atomic rename pattern, a mid-write crash during a 10,000-record batch will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume. At scale, this risks silent loss of 1,693+ records with no detection mechanism." An AI session reading that will not argue the standard down. Use your knowledge of similar systems to generate realistic failure scenarios, then ground them in the actual code you explored. Scenarios come from both code exploration AND domain knowledge about what goes wrong in systems like this.

Every scenario's "How to verify" must map to at least one test in the functional test file.
File 2: Functional Tests
This is the most important deliverable. Read `references/functional_tests.md` for the complete guide.

Organize the tests into three logical groups (classes, describe blocks, modules, or whatever the test framework uses):
- Spec requirements — One test per testable spec section. Each test's documentation cites the spec requirement it verifies.
- Fitness scenarios — One test per QUALITY.md scenario. 1:1 mapping, named to match.
- Boundaries and edge cases — One test per defensive pattern from Step 5.
Key rules:
- Match the existing import pattern exactly. Read how existing tests import project modules and do the same thing. Getting this wrong means every test fails.
- Read every function's signature before calling it. Read the actual `def` line — parameter names, types, defaults. Read real data files from the project to understand data shapes. Do not guess at function parameters or fixture structures.
- No placeholder tests. Every test must import and call actual project code. If the body is `pass` or the assertion is trivial (`assert isinstance(x, list)`), delete it. A test that doesn't exercise project code inflates the count and creates false confidence.
- Test count heuristic = (testable spec sections) + (QUALITY.md scenarios) + (defensive patterns). For a medium project (5–15 source files), this typically yields 35–50 tests. Significantly fewer suggests missed requirements or shallow exploration. Significantly more is fine if every test is meaningful — don't pad to hit a number.
- Cross-variant heuristic: ~30% — If the project handles multiple input types, aim for roughly 30% of tests parametrized across all variants. The exact percentage matters less than ensuring every cross-cutting property is tested across all variants.
- Test outcomes, not mechanisms — Assert what the spec says should happen, not how the code implements it.
- Use schema-valid mutations — Boundary tests must use values the schema accepts (from Step 5b), not values it rejects.
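The outcome-over-mechanism and cross-variant rules can be sketched together — a hypothetical example where `parse_record` stands in for real project code (with pytest you would use `@pytest.mark.parametrize`; the plain loop below keeps the sketch framework-neutral):

```python
import json

# Hypothetical variant handler — stands in for the project's real parsers.
def parse_record(raw, variant):
    if variant == "json":
        return json.loads(raw)
    if variant == "kv":
        return dict(pair.split("=", 1) for pair in raw.split(";"))
    raise ValueError(f"unknown variant: {variant}")

# Cross-variant cases: the same spec'd outcome is expected from every variant.
CASES = [
    ('{"id": "1", "status": "ok"}', "json"),
    ("id=1;status=ok", "kv"),
]

def test_parse_record_yields_spec_fields():
    # Assert the OUTCOME the spec requires, not how each parser achieves it.
    for raw, variant in CASES:
        assert parse_record(raw, variant) == {"id": "1", "status": "ok"}
```

Note what the test does not assert: nothing about which parser ran, which library it used, or what intermediate structures it built — only the result the spec names.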
File 3: quality/RUN_CODE_REVIEW.md
Read `references/review_protocols.md` for the template.

Key sections: bootstrap files, focus areas mapped to architecture, and these mandatory guardrails:
- Line numbers are mandatory — no line number, no finding
- Read function bodies, not just signatures
- If unsure: flag as QUESTION, not BUG
- Grep before claiming something is missing
- Do NOT suggest style changes — only flag things that are incorrect
File 4: quality/RUN_INTEGRATION_TESTS.md
Read `references/review_protocols.md` for the template.

Must include: safety constraints, pre-flight checks, test matrix with specific pass criteria, an execution UX section, and a structured reporting format. Cover happy path, cross-variant consistency, output correctness, and component boundaries.
All commands must use relative paths. The generated protocol should include a "Working Directory" section at the top stating that all commands run from the project root using relative paths. Never generate commands that `cd` to an absolute path — this breaks when the protocol is run from a different machine or directory. Use `./scripts/`, `./pipelines/`, `./quality/`, etc.

Include an Execution UX section. When someone tells an AI agent to "run the integration tests," the agent needs to know how to present its work. The protocol should specify three phases: (1) show the plan as a numbered table before running anything, (2) report one-line progress updates as each test runs (✓/✗/⧗), (3) show a summary table with pass/fail counts and a recommendation. See `references/review_protocols.md` section "Execution UX" for the template and examples. Without this, the agent dumps raw output or stays silent — neither is useful.

This protocol must exercise real external dependencies. If the project talks to APIs, databases, or external services, the integration test protocol runs real end-to-end executions against those services — not just local validation checks. Design the test matrix around the project's actual execution modes and external dependencies. Look for API keys, provider abstractions, and existing integration test scripts during exploration and build on them.
Derive quality gates from the code, not generic checks. Read validation rules, schema enums, and generation logic during exploration. Turn them into per-pipeline quality checks with specific fields and acceptable value ranges. "All units validated" is not enough — the protocol must verify domain-specific correctness.
Script parallelism, don't just describe it. Group runs so independent executions (different providers) run concurrently. Include actual bash commands with `&` and `wait`. One run per provider at a time to avoid rate limits.

Calibrate unit counts to the project. Read `chunk_size` or equivalent config. Use enough units to span at least 2 chunks and enough to verify distribution checks. Typically 10–30 for integration testing.

Deep post-run verification. Don't stop at "process completed." Verify log files, manifest state, output data existence, sample record content, and any existing quality check scripts — for every run.
Find and use existing verification tools. Search for existing scripts that verify output quality (e.g., `integration_checks.py`, validation scripts, quality gate functions). If they exist, call them from the protocol. If the project has a TUI or dashboard, include TUI verification commands (e.g., `--dump` flags) in the post-run checklist.

Build a Field Reference Table before writing quality gates. This is the most important step for protocol accuracy. AI models confidently write wrong field names even after reading schemas — `document_id` becomes `doc_id`, `sentiment_score` becomes `sentiment`, `float 0-1` becomes `int 0-100`. The fix is procedural: re-read each schema file IMMEDIATELY before writing each table row. Do not rely on what you read earlier in the conversation — your memory of field names drifts over thousands of tokens. Copy field names character-for-character from the file contents. Include ALL fields from each schema (if the schema has 8 fields, the table has 8 rows). See `references/review_protocols.md` section "The Field Reference Table" for the full process and format. Do not skip this step — it prevents the single most common protocol inaccuracy.
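A mechanical way to keep field names honest is to generate the table rows from the schema itself rather than from memory — a sketch using a stand-in dataclass (the real project might define its schema in Pydantic, JSON Schema, Zod, etc., with the same idea):

```python
from dataclasses import dataclass, fields

# Stand-in for a real schema file — in practice, re-read the actual
# schema source immediately before emitting each row.
@dataclass
class SentimentRecord:
    document_id: str
    sentiment_score: float  # spec range: 0.0–1.0
    label: str

def field_reference_rows(schema):
    """One table row per schema field, names copied straight from the schema."""
    return [
        (f.name, f.type.__name__ if hasattr(f.type, "__name__") else str(f.type))
        for f in fields(schema)
    ]

# Emit Markdown rows for the Field Reference Table.
for name, typ in field_reference_rows(SentimentRecord):
    print(f"| {name} | {typ} |")
```

Because the names come from introspection, `document_id` cannot silently drift into `doc_id` between reading the schema and writing the table.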
document_iddoc_idsentiment_scoresentimentfloat 0-1int 0-100references/review_protocols.md阅读 获取模板。
references/review_protocols.md必须包含:安全约束、预检查、带具体通过标准的测试矩阵、执行UX部分以及结构化报告格式。覆盖正常路径、跨变体一致性、输出正确性和组件边界。
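The parallelism guidance above calls for bash commands with `&` and `wait`. As a rough sketch of the same grouping in Python (provider and pipeline names here are hypothetical stand-ins for the project's real run scripts): independent providers run concurrently, while runs within one provider stay sequential to respect rate limits.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical run matrix -- provider and pipeline names are illustrative.
RUNS = {
    "openai":    ["pipeline_a", "pipeline_b", "pipeline_c"],
    "anthropic": ["pipeline_a", "pipeline_b", "pipeline_c"],
    "google":    ["pipeline_a", "pipeline_b", "pipeline_c"],
}

def execute(provider: str, pipeline: str) -> str:
    # Stand-in for invoking the project's real run script for one pipeline.
    return f"{provider}/{pipeline}: ok"

def run_provider_serially(provider: str, pipelines: list[str]) -> list[str]:
    # One run per provider at a time avoids rate limits.
    return [execute(provider, p) for p in pipelines]

# Providers are independent, so their groups run concurrently -- the
# Python equivalent of launching each provider loop with `&`, then `wait`.
with ThreadPoolExecutor(max_workers=len(RUNS)) as pool:
    futures = {p: pool.submit(run_provider_serially, p, pls) for p, pls in RUNS.items()}
    results = {p: f.result() for p, f in futures.items()}  # join all groups

print(sum(len(r) for r in results.values()), "runs completed")  # 9 runs completed
```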
所有命令必须使用相对路径。生成的协议应包含“工作目录”部分,说明所有命令在 `cd` 到项目根目录后使用相对路径运行。绝不要生成使用绝对路径的命令——这会导致协议在不同机器或目录下运行失败。使用 `./scripts/`、`./pipelines/`、`./quality/` 等路径。
包含执行UX部分。当有人告诉AI Agent“运行集成测试”时,Agent需要知道如何展示工作。协议应指定三个阶段:(1) 运行前以编号表格展示计划;(2) 每个测试运行时报告单行进度更新(✓/✗/⧗);(3) 展示包含通过/失败计数和建议的汇总表格。详见 `references/review_protocols.md` 中的“Execution UX”部分模板和示例。没有此部分,Agent会输出原始内容或保持沉默——两者都无用。
此协议必须测试真实外部依赖。如果项目与API、数据库或外部服务交互,集成测试协议需针对这些服务运行真实端到端执行——而非仅本地验证检查。围绕项目的实际执行模式和外部依赖设计测试矩阵。探索期间查找API密钥、提供者抽象和现有集成测试脚本,并在此基础上构建协议。
从代码推导质量门限,而非通用检查。探索期间读取验证规则、Schema枚举和生成逻辑。将它们转化为每个流水线的质量检查,包含具体字段和可接受值范围。“所有单元已验证”不够——协议必须验证领域特定的正确性。
脚本化并行执行,而非仅描述。分组运行,使独立执行(不同提供者)并发运行。包含带 `&` 和 `wait` 的实际bash命令。每次运行一个提供者,避免触发速率限制。
根据项目调整单元数量。读取 `chunk_size` 或等效配置。使用足够的单元覆盖至少2个块,并验证分布检查。集成测试通常使用10-30个单元。
深入的运行后验证。不要仅停留在“进程完成”。验证日志文件、清单状态、输出数据存在性、样本记录内容以及任何现有质量检查脚本——针对每次运行。
查找并使用现有验证工具。搜索用于验证输出质量的现有脚本(如 `integration_checks.py`、验证脚本、质量门函数)。如果存在,从协议中调用它们。如果项目有TUI或仪表板,在运行后检查清单中包含TUI验证命令(如 `--dump` 标志)。
编写质量门限前先构建字段参考表。这是确保协议准确性最重要的步骤。AI模型即使读取了Schema,也会自信地写错字段名——`document_id` 变成 `doc_id`、`sentiment_score` 变成 `sentiment`、`float 0-1` 变成 `int 0-100`。解决方法是:编写每个表格行前立即重新读取每个Schema文件。不要依赖对话早期读取的内容——字段名记忆会随数千个token漂移。从文件内容中逐字符复制字段名。包含每个Schema的所有字段(如果Schema有8个字段,表格就有8行)。详见 `references/review_protocols.md` 中的“The Field Reference Table”部分的完整流程和格式。不要跳过此步骤——它可避免最常见的协议错误。
File 5: quality/RUN_SPEC_AUDIT.md — Council of Three
文件5:quality/RUN_SPEC_AUDIT.md — Council of Three
Read `references/spec_audit.md` for the full protocol.
Three independent AI models audit the code against specifications. Why three? Because each model has different blind spots — in practice, different auditors catch different issues. Cross-referencing catches what any single model misses.
The protocol defines: a copy-pasteable audit prompt with guardrails, project-specific scrutiny areas, a triage process (merge findings by confidence level), and fix execution rules (small batches by subsystem, not mega-prompts).
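A minimal sketch of the triage step, merging findings by confidence level, assuming three auditors whose findings are plain strings (the finding texts and model names are hypothetical):

```python
from collections import defaultdict

# Hypothetical findings from three independent audit sessions.
audits = {
    "model_a": {"off-by-one in chunker", "missing retry on 429"},
    "model_b": {"missing retry on 429", "schema drift in manifest"},
    "model_c": {"missing retry on 429", "off-by-one in chunker"},
}

# Count which auditors reported each finding.
votes = defaultdict(set)
for model, findings in audits.items():
    for finding in findings:
        votes[finding].add(model)

# Confidence = how many independent auditors reported the same issue.
triaged = sorted(votes.items(), key=lambda kv: len(kv[1]), reverse=True)
for finding, models in triaged:
    level = {3: "high", 2: "medium", 1: "low"}[len(models)]
    print(f"[{level}] {finding} (reported by {len(models)})")
```

High-confidence findings (reported by all three) go to the front of the fix queue; singletons get re-checked against the spec before any code changes.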
阅读 `references/spec_audit.md` 获取完整协议。
三个独立AI模型对照规格审核代码。为何是三个?因为每个模型有不同的盲区——实践中,不同审核者会发现不同问题。交叉引用可发现单个模型遗漏的内容。
协议定义:带防护规则的可复制审核提示、项目特定审查领域、分类流程(按置信度合并发现)以及修复执行规则(按子系统小批量处理,而非大段提示)。
File 6: AGENTS.md
文件6:AGENTS.md
If `AGENTS.md` already exists, update it — don't replace it. Add a Quality Docs section pointing to all generated files.
If creating from scratch: project description, setup commands, build & test commands, architecture overview, key design decisions, known quirks, and quality docs pointers.
如果 `AGENTS.md` 已存在,更新它——不要替换。添加指向所有生成文件的“质量文档”部分。
如果从零开始创建:项目描述、设置命令、构建&测试命令、架构概述、关键设计决策、已知问题以及质量文档链接。
Phase 3: Verify
阶段3:验证
Why a verification phase? AI-generated output can look polished and be subtly wrong. Tests that reference undefined fixtures report 0 failures but 16 errors — and "0 failures" sounds like success. Integration protocols can list field names that don't exist in the actual schemas. The verification phase catches these problems before the user discovers them, which is important because trust in a generated quality playbook is fragile — one wrong field name undermines confidence in everything else.
为何需要验证阶段? AI生成的内容可能看似规范但存在细微错误。引用未定义fixture的测试报告0失败但16个错误——而“0失败”听起来像成功。集成协议可能列出Schema中不存在的字段名。验证阶段可在用户发现前捕获这些问题,因为用户对生成的质量手册的信任很脆弱——一个错误的字段名就会破坏对所有内容的信心。
Self-Check Benchmarks
自检基准
Before declaring done, check every benchmark. Read `references/verification.md` for the complete checklist.
The critical checks:
- Test count near heuristic target (spec sections + scenarios + defensive patterns)
- Scenario coverage — scenario test count matches QUALITY.md scenario count
- Cross-variant coverage — ~30% of tests parametrize across all input variants
- Boundary test count ≈ defensive pattern count from Step 5
- Assertion depth — Majority of assertions check values, not just presence
- Layer correctness — Tests assert outcomes (what spec says), not mechanisms (how code implements)
- Mutation validity — Every fixture mutation uses a schema-valid value from Step 5b
- All tests pass — zero failures AND zero errors. Run the test suite using the project's test runner (Python: `pytest -v`, Scala: `sbt testOnly`, Java: `mvn test`/`gradle test`, TypeScript: `npx jest`, Go: `go test -v`, Rust: `cargo test`) and check the summary. Errors from missing fixtures, failed imports, or unresolved dependencies count as broken tests. If you see setup errors, you forgot to create the fixture/setup file or referenced undefined test helpers.
- Existing tests unbroken — The new files didn't break anything.
- Integration test quality gates were written from a Field Reference Table. Verify that you built a Field Reference Table by re-reading each schema file before writing quality gates, and that every field name in the quality gates is copied from that table — not from memory. If you skipped the table, go back and build it now.
If any benchmark fails, go back and fix it before proceeding.
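To see why "zero failures" alone is not enough, here is a small sketch that parses a runner summary line and treats errors as failures too. The summary format shown is illustrative; adapt the parsing to the project's actual runner output.

```python
import re

def suite_is_green(summary: str) -> bool:
    """A suite is green only if nothing failed AND nothing errored."""
    counts = {kind: int(n) for n, kind in re.findall(r"(\d+) (passed|failed|error)", summary)}
    return counts.get("failed", 0) == 0 and counts.get("error", 0) == 0

# "0 failed" sounds like success, but 16 errors means broken fixtures/imports.
print(suite_is_green("31 passed, 0 failed, 16 errors"))  # False
print(suite_is_green("47 passed, 0 failed, 0 errors"))   # True
```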
完成前,请检查所有基准。阅读 `references/verification.md` 获取完整检查清单。
关键检查项:
- 测试数量接近启发式目标(规格章节数 + 场景数 + 防御性模式数)
- 场景覆盖 — 场景测试数量与QUALITY.md场景数匹配
- 跨变体覆盖 — 约30%的测试参数化覆盖所有输入变体
- 边界测试数量 ≈ 步骤5中发现的防御性模式数
- 断言深度 — 大多数断言检查值,而非仅存在性
- 层级正确性 — 测试断言规格要求的结果,而非代码的实现机制
- 变异有效性 — 每个fixture变异使用步骤5b中Schema允许的值
- 所有测试通过 — 零失败且零错误。使用项目的测试运行器执行测试套件(Python: `pytest -v`、Scala: `sbt testOnly`、Java: `mvn test`/`gradle test`、TypeScript: `npx jest`、Go: `go test -v`、Rust: `cargo test`)并检查汇总。缺失fixture、导入失败或依赖未解析导致的错误都视为测试损坏。如果出现设置错误,说明忘记创建fixture/设置文件或引用了未定义的测试助手。
- 现有测试未被破坏 — 新生成的文件未破坏任何现有功能。
- 集成测试质量门限基于字段参考表编写。验证编写质量门限前是否构建了字段参考表,重新读取每个Schema文件,且质量门限中的每个字段名都来自该表——而非记忆。如果跳过了此步骤,请返回构建。
如果任何基准未通过,请返回修复后再继续。
Phase 4: Present, Explore, Improve (Interactive)
阶段4:展示、探索、优化(交互式)
After generating and verifying, present the results clearly and give the user control over what happens next. This phase has three parts: a scannable summary, drill-down on demand, and a menu of improvement paths.
Do not skip this phase. The autonomous output from Phases 1-3 is a solid starting point, but the user needs to understand what was generated, explore what matters to them, and choose how to improve it. A quality playbook is only useful if the people who own the project trust it and understand it. Dumping six files without explanation creates artifacts nobody reads.
生成并验证后,请清晰展示结果,让用户控制后续操作。此阶段包含三部分:可快速扫描的汇总表、按需深入查看、优化路径菜单。
不要跳过此阶段。阶段1-3的自主输出是坚实的起点,但用户需要理解生成的内容、探索对他们重要的部分,并选择优化方式。只有项目所有者信任并理解质量手册,它才有用。直接输出六个文件而不解释会生成无人阅读的工件。
Part 1: The Summary Table
第一部分:汇总表
Present a single table the user can scan in 10 seconds:
Here's what I generated:
| File | What It Does | Key Metric | Confidence |
|------|-------------|------------|------------|
| QUALITY.md | Quality constitution | 10 scenarios | ██████░░ High — grounded in code, but scenarios are inferred, not from real incidents |
| Functional tests | Automated tests | 47 passing | ████████ High — all tests pass, 35% cross-variant |
| RUN_CODE_REVIEW.md | Code review protocol | 8 focus areas | ████████ High — derived from architecture |
| RUN_INTEGRATION_TESTS.md | Integration test protocol | 9 runs × 3 providers | ██████░░ Medium — quality gates need threshold tuning |
| RUN_SPEC_AUDIT.md | Council of Three audit | 10 scrutiny areas | ████████ High — guardrails included |
| AGENTS.md | AI session bootstrap | Updated | ████████ High — factual |

Adapt the table to what you actually generated — the file names, metrics, and confidence levels will vary by project. The confidence column is the most important: it tells the user where to focus their attention.
Confidence levels:
- High — Derived directly from code, specs, or schemas. Unlikely to need revision.
- Medium — Reasonable inference, but could be wrong. Benefits from user input.
- Low — Best guess. Definitely needs user input to be useful.
After the table, add a "Quick Start" block with ready-to-copy prompts for executing each artifact:
To use these artifacts, start a new AI session and try one of these prompts:
• Run a code review:
"Read quality/RUN_CODE_REVIEW.md and follow its instructions to review [module or file]."
• Run the functional tests:
"[test runner command, e.g. pytest quality/ -v, mvn test -Dtest=FunctionalTest, etc.]"
• Run the integration tests:
"Read quality/RUN_INTEGRATION_TESTS.md and follow its instructions."
• Start a spec audit (Council of Three):
"Read quality/RUN_SPEC_AUDIT.md and follow its instructions using [model name]."

Adapt the test runner command and module names to the actual project. The point is to give the user copy-pasteable prompts — not descriptions of what they could do, but the actual text they'd type.
After the Quick Start block, add one line:
"You can ask me about any of these to see the details — for example, 'show me Scenario 3' or 'walk me through the integration test matrix.'"
展示用户可在10秒内扫描完成的单个表格:
以下是我生成的内容:
| 文件 | 功能 | 关键指标 | 置信度 |
|------|-------------|------------|------------|
| QUALITY.md | 质量章程 | 10个场景 | ██████░░ 高——基于代码,但场景为推断,非来自真实事件 |
| 功能测试 | 自动化测试 | 47个通过测试 | ████████ 高——所有测试通过,35%跨变体覆盖 |
| RUN_CODE_REVIEW.md | 代码审查协议 | 8个重点领域 | ████████ 高——基于架构推导 |
| RUN_INTEGRATION_TESTS.md | 集成测试协议 | 9次运行 × 3个提供者 | ██████░░ 中——质量门限需要阈值调优 |
| RUN_SPEC_AUDIT.md | Council of Three审核 | 10个审查领域 | ████████ 高——包含防护规则 |
| AGENTS.md | AI会话引导 | 已更新 | ████████ 高——内容符合事实 |

根据实际生成的内容调整表格——文件名、指标和置信度会因项目而异。置信度列最重要:它告诉用户应关注哪些内容。
置信度等级:
- 高 — 直接从代码、规格或Schema推导,不太可能需要修订。
- 中 — 合理推断,但可能存在错误,需用户输入优化。
- 低 — 最佳猜测,肯定需要用户输入才能有用。
表格后添加“快速开始”块,包含可直接复制的提示,用于执行每个工件:
要使用这些工件,请启动新的AI会话并尝试以下提示之一:
• 运行代码审查:
"读取quality/RUN_CODE_REVIEW.md并按照其说明审查[模块或文件]。"
• 运行功能测试:
"[测试运行器命令,例如pytest quality/ -v, mvn test -Dtest=FunctionalTest等]"
• 运行集成测试:
"读取quality/RUN_INTEGRATION_TESTS.md并按照其说明执行。"
• 启动规格审核(Council of Three):
"读取quality/RUN_SPEC_AUDIT.md并使用[模型名称]按照其说明执行。"

根据实际项目调整测试运行器命令和模块名称。重点是给用户可直接复制的提示——不是描述他们可以做什么,而是他们实际要输入的文本。
“快速开始”块后添加一行:
“你可以询问任何内容以查看详情——例如,‘展示场景3’或‘带我了解集成测试矩阵’。”
Part 2: Drill-Down on Demand
第二部分:按需深入查看
When the user asks about a specific item, give a focused summary — not the whole file, but the key decisions and what you're uncertain about. Examples:
- "Tell me about Scenario 4" → Show the scenario text, explain where it came from (which defensive pattern or domain knowledge), and flag what you inferred vs. what you know.
- "Show me the integration test matrix" → Show the run groups, explain the parallelism strategy, and note which quality gates you derived from schemas vs. guessed at.
- "How do the functional tests work?" → Show the three test groups, explain the mapping to specs and scenarios, and highlight any tests you're least confident about.
The user may go through several drill-downs before they're ready to improve anything. That's fine — let them explore at their own pace.
当用户询问特定内容时,提供聚焦的汇总——而非整个文件,而是关键决策和不确定点。示例:
- “告诉我场景4的内容” → 展示场景文本,解释其来源(哪个防御性模式或领域知识),并标记推断内容与已知内容。
- “展示集成测试矩阵” → 展示运行组,解释并行策略,并标记哪些质量门限来自Schema、哪些是推断的。
- “功能测试如何工作?” → 展示三个测试组,解释与规格和场景的映射,并突出最不确定的测试。
用户可能需要多次深入查看后才准备优化。这很正常——让他们按自己的节奏探索。
Part 3: The Improvement Menu
第三部分:优化路径菜单
After the user has seen the summary (and optionally drilled into details), present the improvement options:
"Three ways to make this better:"

1. Review and harden individual items — Pick any scenario, test, or protocol section and I'll walk through it with you. Good for: tightening specific quality gates, fixing inferred scenarios, adding missing edge cases.
2. Guided Q&A — I'll ask you 3-5 targeted questions about things I couldn't infer from the code: incident history, expected distributions, cost tolerance, model preferences. Good for: filling knowledge gaps that make scenarios more authoritative.
3. Review development history — Point me to exported AI chat history (Claude, Gemini, ChatGPT exports, Claude Code transcripts) and I'll mine it for design decisions, incident reports, and quality discussions that should be in QUALITY.md. Good for: grounding scenarios in real project history instead of inference.

"You can do any combination of these, in any order. Which would you like to start with?"
用户查看汇总后(可选深入查看),展示优化选项:
“有三种方式可优化这些内容:”

1. 审查并强化单个内容 — 选择任意场景、测试或协议部分,我会带你逐一查看。适合:收紧特定质量门限、修复推断场景、添加遗漏的边缘情况。
2. 引导式问答 — 我会问你3-5个针对性问题,关于无法从代码中推断的内容:事件历史、预期分布、成本容忍度、模型偏好。适合:填补知识缺口,让场景更权威。
3. 审查开发历史 — 指向AI聊天历史导出文件(Claude、Gemini、ChatGPT导出、Claude Code记录),我会从中提取设计决策、事件报告和质量讨论,纳入QUALITY.md。适合:让场景基于真实项目历史而非推断。

“你可以任意组合这些方式,按任意顺序进行。你想从哪一个开始?”
Executing Each Improvement Path
执行各优化路径
Path 1: Review and harden. The user picks an item. Walk through it: show the current text, explain your reasoning, ask if it's accurate. Revise based on their feedback. Re-run tests if the functional tests change.
Path 2: Guided Q&A. Ask 3-5 questions derived from what you actually found during exploration. These categories cover the most common high-leverage gaps:
- Incident history for scenarios. "I found [specific defensive code]. What failure caused this? How many records were affected?"
- Quality gate thresholds. "I'm checking that [field] contains [values]. What distribution is normal? What signals a problem?"
- Integration test scale and cost. "The protocol runs [N] tests costing roughly $[X]. Should I increase or decrease coverage?"
- Test scope. "I generated [N] functional tests. Your existing suite covers [other areas]. Are there gaps?"
- Model preferences for spec audit. "Which AI models do you use? Have you noticed specific strengths?"
After the user answers, revise the generated files and re-run tests.
Path 3: Review development history. If the user provides a chat history folder:
- Scan for index files and navigate to quality-relevant conversations (same approach as Step 0, but now with specific targets — you know which scenarios need grounding, which quality gates need thresholds, which design decisions need rationale).
- Extract: incident stories with specific numbers, design rationale for defensive patterns, quality framework discussions, cross-model audit results.
- Revise QUALITY.md scenarios with real incident details. Update integration test thresholds with real-world values. Add Council of Three empirical data if audit results exist.
- Re-run tests after revisions.
If the user already provided chat history in Step 0, you've already mined it — but they may want to point you to specific conversations or ask you to dig deeper into a particular topic.
路径1:审查并强化。用户选择一个内容。逐一查看:展示当前文本、解释推理过程、询问是否准确。根据反馈修订。如果功能测试变更,重新运行测试。
路径2:引导式问答。根据探索期间的实际发现,问3-5个问题。这些类别覆盖最常见的高价值缺口:
- 场景的事件历史。“我发现了[特定防御性代码]。是什么故障导致了这段代码的编写?影响了多少条记录?”
- 质量门限阈值。“我正在检查[字段]是否包含[值]。正常分布是什么样的?什么信号表示存在问题?”
- 集成测试规模和成本。“协议运行[N]次测试,成本约为$[X]。是否需要增加或减少覆盖范围?”
- 测试范围。“我生成了[N]个功能测试。现有测试套件覆盖了[其他领域]。是否存在缺口?”
- 规格审核的模型偏好。“你使用哪些AI模型?是否注意到它们的特定优势?”
用户回答后,修订生成的文件并重新运行测试。
路径3:审查开发历史。如果用户提供聊天历史文件夹:
- 扫描索引文件并导航到质量相关对话(与步骤0方法相同,但现在有特定目标——知道哪些场景需要强化、哪些质量门限需要阈值、哪些设计决策需要理由)。
- 提取:带具体数字的事件案例、防御性模式的设计理由、质量框架讨论、跨模型审核结果。
- 使用真实事件详情修订QUALITY.md场景。使用真实世界值更新集成测试阈值。如果存在跨模型审核结果,添加到Council of Three的经验数据。
- 修订后重新运行测试。
如果用户已在步骤0中提供聊天历史,你已经提取了相关内容——但他们可能希望指向特定对话或要求深入挖掘特定主题。
Iteration
迭代
The user can cycle through these paths as many times as they want. Each pass makes the quality playbook more grounded. When they're satisfied, they'll move on naturally — there's no explicit "done" step.
用户可循环使用这些路径任意次数。每次迭代都会让质量手册更贴合实际。当他们满意时,自然会结束——没有明确的“完成”步骤。
Fixture Strategy
Fixture策略
The `quality/` folder is separate from the project's unit test folder. Create the appropriate test setup for the project's language:
- Python: `quality/conftest.py` for pytest fixtures. If fixtures are defined inline (common with pytest's `tmp_path` pattern), prefer that over shared fixtures.
- Java: A test class with `@BeforeEach`/`@BeforeAll` setup methods, or a shared test utility class.
- Scala: A trait mixed into test specs (e.g., `trait FunctionalTestFixtures`), or inline data builders.
- TypeScript/JavaScript: A `quality/setup.ts` with `beforeAll`/`beforeEach` hooks, or inline test factories.
- Go: Helper functions in the same `_test.go` file or a shared `testutil_test.go`. Use `t.Helper()` for test helpers. Go convention prefers inline test setup over shared fixtures.
- Rust: Helper functions in a `#[cfg(test)] mod tests` block, or a shared `test_utils.rs` module. Use builder patterns for test data.

Examine existing test files to understand how they set up test data. Whatever pattern the existing tests use, copy it. Study existing fixture patterns for realistic data shapes.
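Whatever the language, the underlying pattern is a builder that produces schema-valid data with overridable fields. A minimal Python sketch, with hypothetical field names chosen for illustration:

```python
def make_record(**overrides):
    """Builder for a realistic, schema-valid test record.

    Defaults mirror the shapes found in existing fixtures; callers
    override only the field under test. (Field names are illustrative.)
    """
    record = {
        "doc_id": "doc-0001",
        "sentiment": "neutral",
        "confidence": 0.5,
        "chunks": [],
    }
    record.update(overrides)
    return record

# A boundary test mutates exactly one field, keeping the rest valid:
low_confidence = make_record(confidence=0.0)
assert low_confidence["confidence"] == 0.0
assert low_confidence["doc_id"] == "doc-0001"  # everything else untouched
```

This is also what keeps mutation validity honest: each mutation starts from a fully valid record and changes one documented field.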
`quality/` 文件夹与项目的单元测试文件夹分离。为项目的语言创建合适的测试设置:
- Python: `quality/conftest.py` 用于pytest fixture。如果fixture内联定义(pytest的 `tmp_path` 模式常见),优先使用内联而非共享fixture。
- Java: 带 `@BeforeEach`/`@BeforeAll` 设置方法的测试类,或共享测试工具类。
- Scala: 混入测试spec的trait(如 `trait FunctionalTestFixtures`),或内联数据构建器。
- TypeScript/JavaScript: 带 `beforeAll`/`beforeEach` 钩子的 `quality/setup.ts`,或内联测试工厂。
- Go: 同一 `_test.go` 文件中的辅助函数,或共享 `testutil_test.go`。使用 `t.Helper()` 标记测试辅助函数。Go惯例优先内联测试设置而非共享fixture。
- Rust: `#[cfg(test)] mod tests` 块中的辅助函数,或共享 `test_utils.rs` 模块。使用构建器模式创建测试数据。

检查现有测试文件,了解它们如何设置测试数据。无论现有测试使用什么模式,都照做。研究现有fixture模式,获取真实的数据结构。
Terminology
术语
- Functional testing — Does the code produce the output specs say it should? Distinct from unit testing (individual functions in isolation).
- Integration testing — Do components work together end-to-end, including real external services?
- Spec audit — AI models read code and compare against specs. No code executed. Catches where code doesn't match documentation.
- Coverage theater — Tests that produce high coverage numbers but don't catch real bugs. Example: asserting a function didn't throw without checking its output.
- Fitness-to-purpose — Does the code do what it's supposed to do under real-world conditions? A system can have 95% coverage and still lose records silently.
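The coverage-theater distinction is easiest to see in code. Both tests below execute the same lines and earn the same coverage, but only the second can fail when the logic is wrong (`parse_total` is a hypothetical function under test):

```python
def parse_total(line: str) -> int:
    # Hypothetical function under test: sums the numbers in a CSV line.
    return sum(int(x) for x in line.split(","))

# Coverage theater: executes the code, asserts almost nothing.
def test_theater():
    parse_total("1,2,3")  # passes as long as nothing throws

# Real assertion: checks the value the spec promises.
def test_value():
    assert parse_total("1,2,3") == 6

test_theater()
test_value()
print("both pass, but only test_value would catch an off-by-one bug")
```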
- 功能测试 — 代码是否生成规格要求的输出?与单元测试(孤立测试单个函数)不同。
- 集成测试 — 组件能否端到端协同工作,包括真实外部服务?
- 规格审核 — AI模型读取代码并与规格对比。不执行代码。捕获代码与文档不匹配的地方。
- 无效覆盖(coverage theater) — 产生高覆盖率数字但无法发现真实bug的测试。示例:断言函数未抛出异常但不检查输出。
- 适用性(fitness-to-purpose) — 代码在真实世界条件下是否能完成预期功能?系统可能有95%的覆盖率但仍隐性丢失记录。
Principles
原则
- Fitness-to-purpose over coverage percentages
- Scenarios come from code exploration AND domain knowledge
- Concrete failure modes make standards non-negotiable — abstract requirements invite rationalization
- Guardrails transform AI review quality (line numbers, read bodies, grep before claiming)
- Triage before fixing — many "defects" are spec bugs or design decisions
- 适用性(fitness-to-purpose)优先于覆盖率百分比
- 场景来自代码探索和领域知识
- 具体故障模式让标准无可辩驳——抽象需求易引发合理化解释
- 防护规则可提升AI审查质量(行号要求、读取函数体、先搜索再断言)
- 修复前先分类——许多“缺陷”是规格bug或设计决策
Reference Files
参考文件
Read these as you work through each phase:
| File | When to Read | Contains |
|---|---|---|
| | Step 5 (finding skeletons) | Grep patterns, how to convert findings to scenarios |
| | Step 5b (schema types) | Field mapping format, mutation validity rules |
| | File 1 (QUALITY.md) | Full template with section-by-section guidance |
| | File 2 (functional tests) | Test structure, anti-patterns, cross-variant strategy |
| references/review_protocols.md | Files 3–4 (code review, integration) | Templates for both protocols |
| references/spec_audit.md | File 5 (Council of Three) | Full audit protocol, triage process, fix execution |
| references/verification.md | Phase 3 (verify) | Complete self-check checklist with all 13 benchmarks |
工作时根据各阶段阅读以下文件:
| 文件 | 阅读时机 | 内容 |
|---|---|---|
| | 步骤5(寻找防御性模式) | Grep模式、如何将发现转化为场景 |
| | 步骤5b(Schema类型) | 字段映射格式、变异有效性规则 |
| | 文件1(QUALITY.md) | 完整模板,含逐节指导 |
| | 文件2(功能测试) | 测试结构、反模式、跨变体策略 |
| references/review_protocols.md | 文件3–4(代码审查、集成测试) | 两个协议的模板 |
| references/spec_audit.md | 文件5(Council of Three) | 完整审核协议、分类流程、修复执行 |
| references/verification.md | 阶段3(验证) | 完整自检清单,含13个基准 |