testing-boss

Testing Boss

A consolidated doctrine for writing tests that reveal bugs, not just pass — for human-authored code, AI-generated code, LLM-powered features, and the CI that gates them all.
The cardinal premise: tests exist to expose defects, not to keep CI green. A test that fails has done its job. A test that passes for the wrong reason is worse than no test.
This skill collapses the old `test-antipatterns` skill plus a much larger corpus on test placement, framework idioms, flaky-test discipline, AI-agent test generation, and LLM/agent evaluation into one self-contained body of practice. Examples are language-agnostic pseudo-code so the doctrine transfers to any stack.

Iron Laws

1. Test the behavior, never the mock.
2. Push every test to the lowest layer that can detect the failure.
3. When a test fails, fix production first — change the test only after writing why.
4. Real systems gate the merge. Mocks isolate; they do not validate.
5. Coverage is a flashlight. Mutation score is a quality probe. Neither is a target.
6. No test-only methods, branches, or flags leak into production code.
These six laws subsume every named anti-pattern in this skill. When two of them disagree, the lower-numbered one wins.
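Law 5 is easier to see in miniature. The sketch below is an illustration under stated assumptions, not a mutation-testing tool: `is_adult`, its hand-written mutant, and both suites are hypothetical names invented here. Both suites execute every line of `is_adult`, so line coverage cannot tell them apart; only the suite that probes the boundary kills the mutant.

```python
from typing import Callable, Tuple

Case = Tuple[int, bool]  # (input age, expected result)


def is_adult(age: int) -> bool:
    """Hypothetical production rule: 18 and over counts as adult."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Hand-written mutant: the boundary operator flipped from >= to >."""
    return age > 18


# Both suites reach 100% line coverage of is_adult.
WEAK_SUITE: Tuple[Case, ...] = ((30, True), (5, False))
STRONG_SUITE: Tuple[Case, ...] = ((30, True), (5, False), (18, True), (17, False))


def suite_passes(impl: Callable[[int], bool], cases: Tuple[Case, ...]) -> bool:
    """True when every case in the suite agrees with the implementation."""
    return all(impl(age) == expected for age, expected in cases)


def kills_mutant(cases: Tuple[Case, ...]) -> bool:
    """A suite kills the mutant when it is green on the original and red on the mutant."""
    return suite_passes(is_adult, cases) and not suite_passes(is_adult_mutant, cases)
```

A real mutation tool generates and runs mutants automatically; the point here is only why the coverage number and the mutation score can disagree about the same suite.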

Required Reading Router

Match the task to the row. Read the listed file(s) in full before producing output. The inline content in this SKILL.md is a tripwire, not the contract.
| Task | MUST read |
| --- | --- |
| Deciding where a new test belongs (layer, file, owner) | `references/foundations.md` |
| Writing a new test (any layer, any framework) | `references/patterns.md` |
| Reviewing a test, smelling a problem, or fixing a brittle suite | `references/antipatterns.md` |
| Generating tests via a coding agent (Claude Code, Codex, Cursor) | `references/ai-writes-tests.md` + `references/antipatterns.md` |
| Triaging flaky tests, designing CI gates, or picking contract/property/mutation patterns | `references/ci-automation.md` |
| Designing evals for LLM/agent systems (RAG, tool use, prompt regression) | `references/llm-eval.md` |
| Looking up the original source for any claim in this skill | `references/sources.md` |

Reference Index

  • `references/foundations.md`: placement doctrine (invariant + owning layer), pyramid vs trophy debate resolution, risk-based prioritization, coverage philosophy, test-boundary contracts.
  • `references/patterns.md`: 12 cross-framework positive patterns with agnostic pseudo-code: selector hierarchy, condition-based waits, per-test isolation, table-driven, builders/factories, behavior-first assertions, boundary-only mocking.
  • `references/antipatterns.md`: 25 anti-patterns across five families (Brittleness, Flakiness, Mock misuse, Process, AI-specific). Each entry: violation, why wrong, fix, gate question, evidence URL.
  • `references/ai-writes-tests.md`: seven mandatory gates with verbatim prompt blocks for any agent that generates tests: invariant first, owning layer, real execution, failure→fix production, no snapshot without contract, no assertion on self-set mock, negative companion.
  • `references/ci-automation.md`: flaky-test taxonomy, quarantine-plus-owner workflow, CI stage pyramid, contract / property / mutation / accessibility testing patterns, deterministic test architecture.
  • `references/llm-eval.md`: eval-driven development primer, oracle ladder, LLM-as-judge biases and calibration, RAG metrics, agent trajectory vs outcome eval, benchmark pitfalls.
  • `references/sources.md`: consolidated bibliography (all URLs grouped by axis) for citation and audit.

Decide before the first line of test code

Most bad tests are placement failures, not assertion failures. A test in the wrong layer is brittle, slow, and duplicates work — or worse, it locks in implementation under the disguise of correctness.
Gist tripwires:
  • Name the invariant in one sentence before opening any test file. If the sentence is fuzzy, the invariant is not clear enough to test.
  • Place the test at the lowest layer that can fail when the invariant breaks. A higher-layer test is justified only when the invariant requires real integration the lower layer cannot prove.
  • Reject the test entirely when (likelihood-of-bug × blast-radius) is below the threshold worth ten minutes of maintenance. Not every line deserves a test.
STOP. Read `references/foundations.md` in full before placing a new test, splitting a test across layers, debating pyramid vs trophy, or arguing about coverage targets.
That file contains the placement decision tree, the explicit pyramid/trophy reconciliation, the test-boundary contract template, and the risk-based filter. The three tripwires above are detection cues, not the contract.
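The risk filter from the third tripwire can be sketched as plain arithmetic. This is a minimal illustration, not the filter defined in the reference: the `Candidate` type, the 0-to-1 scales, and the default `threshold` of 0.1 are all assumptions made here, since the skill names the product (likelihood-of-bug × blast-radius) but gives no numeric cut-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A proposed test, described before any test code exists (hypothetical type)."""
    invariant: str       # the one-sentence invariant the test would protect
    likelihood: float    # estimated chance of a bug here, 0.0 to 1.0 (assumed scale)
    blast_radius: float  # estimated damage if it breaks, 0.0 to 1.0 (assumed scale)


def risk_score(candidate: Candidate) -> float:
    """likelihood-of-bug multiplied by blast-radius."""
    return candidate.likelihood * candidate.blast_radius


def worth_testing(candidate: Candidate, threshold: float = 0.1) -> bool:
    """Reject candidates with no named invariant, then apply the risk threshold."""
    if not candidate.invariant.strip():
        raise ValueError("name the invariant in one sentence before scoring the test")
    return risk_score(candidate) >= threshold
```

For example, `Candidate("a refund never exceeds the captured amount", 0.4, 0.9)` clears the assumed threshold, while `Candidate("the footer shows the current year", 0.05, 0.1)` does not. The estimates are judgment calls; the value of writing them down is that the rejection becomes explicit.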

Pattern catalog (write tests that survive refactor)

Twelve patterns recur across Playwright, Testing Library, Cypress, Jest, pytest, Go testing, and Pact. The framework is evidence; the principle is universal.
Named patterns (one-liners — full pseudo-code in the reference):
  1. Query by behavior and accessible role, never by CSS selector or DOM index.
  2. Selector hierarchy: role → label → text → test-id → structural. Stop at the highest rung that disambiguates.
  3. Wait on observable conditions, never on wall-clock sleeps.
  4. Each test is independent and order-free; setup beats teardown.
  5. One behavior per test, but as many assertions as that behavior needs.
  6. Test names read as specifications: `should <outcome> when <condition> given <state>`.
  7. Table-driven / parameterized when only the inputs vary.
  8. Build test data via factories or builders; literal blobs only for the field under test.
  9. Mock at boundaries you do not control; real wiring for what you own.
  10. Real systems gate the final merge; contract tests bridge unit and E2E.
  11. Mutation score, not coverage percentage, measures suite strength.
  12. Page Object Model is a tool, not a religion — collapse it for small suites.
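Patterns 6, 7, and 8 combine naturally. The sketch below is one possible shape in Python, not the pseudo-code from the reference: `Order`, `an_order`, `shipping_cents`, and the free-shipping rule are hypothetical, invented only to show a builder with valid defaults feeding a table whose rows differ in a single field and are named as specifications.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Order:
    """Hypothetical domain object with valid defaults for every field."""
    subtotal_cents: int = 10_000
    country: str = "US"
    coupon: str = ""


def an_order(**overrides) -> Order:
    """Builder: start from a valid order and override only the field under test."""
    return replace(Order(), **overrides)


def shipping_cents(order: Order) -> int:
    """Hypothetical rule under test: free shipping from 50.00, otherwise a 5.00 flat fee."""
    return 0 if order.subtotal_cents >= 5_000 else 500


# One behavior, three rows: only the input varies, and each name reads as a specification.
SHIPPING_CASES: List[Tuple[str, Order, int]] = [
    ("should ship free when subtotal is above the threshold", an_order(subtotal_cents=5_001), 0),
    ("should ship free when subtotal equals the threshold", an_order(subtotal_cents=5_000), 0),
    ("should charge the flat fee when subtotal is below the threshold", an_order(subtotal_cents=4_999), 500),
]


def failing_shipping_cases() -> List[str]:
    """Return the names of failing rows; an empty list means the table passed."""
    return [name for name, order, expected in SHIPPING_CASES if shipping_cents(order) != expected]
```

With pytest, the same table would typically be handed to `pytest.mark.parametrize` so each row reports as its own test; the plain loop here keeps the sketch dependency-free.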
STOP. Read `references/patterns.md` in full before writing any non-trivial test, choosing a selector strategy, designing test data, or deciding what to mock.
That file contains the pseudo-code, the cross-framework evidence, and the explicit "when to break this rule" carve-out for each pattern. The twelve one-liners above are a vocabulary index, not the contract — the operational rule for each pattern lives only in the reference.
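Pattern 3 (wait on observable conditions) can be sketched in a few lines for stacks that lack a built-in auto-wait. `wait_until` and its parameters are hypothetical names for this illustration. The short `time.sleep` inside is only a polling interval: the verdict comes from the condition, and the timeout is an upper bound rather than a fixed delay.

```python
import threading
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout_s: float = 2.0, interval_s: float = 0.01) -> None:
    """Poll `condition` until it returns True; raise TimeoutError once the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s}s")
        time.sleep(interval_s)  # polling interval only; never the thing being asserted


def demo_background_work_completes() -> bool:
    """Example use: wait for background work to signal completion instead of sleeping a guess."""
    done = threading.Event()
    threading.Timer(0.05, done.set).start()  # stands in for real asynchronous work
    wait_until(done.is_set, timeout_s=2.0)
    return done.is_set()
```

The test returns as soon as the condition holds, so a fast machine is not slowed down and a slow one is not failed by an arbitrary pause.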

Anti-pattern families (do not do these)

Twenty-five anti-patterns cluster into five families. The top seven (bolded below) cause the most damage in modern codebases — especially when AI agents write the tests.
Brittleness: tests bound to internals.
  1. Brittle/implementation-detail selectors.
  2. Testing internal structure instead of observable behavior.
  3. Testing private methods directly.
  4. Snapshot-as-test (a snapshot replacing real assertions).
  5. Vague existence assertions (`.should('exist')`, `toBeTruthy`).
  6. Action without assertion.
Flakiness: tests that randomize their own verdicts.
  7. Static `sleep` / fixed-timeout waits.
  8. Test order dependency / hidden shared state.
  9. Non-deterministic inputs (real clock, RNG, locale).
Mock misuse: tests that test the test setup.
  10. Asserting the mock exists (absorbed from the previous `test-antipatterns` skill).
  11. Mock drift (mock no longer matches real API).
  12. Over-mocking child components.
  13. Incomplete mocks (missing fields the system consumes downstream).
  14. Mocking at the wrong level (mocks slow-and-safe methods test logic depends on).
Process: pathologies of the team and the suite over time.
  15. Coverage as a vanity metric.
  16. Happy-path-only coverage.
  17. Eternal `beforeAll` / shared setup that hides dependencies.
  18. Cleanup in `afterEach` (use `beforeEach` instead).
  19. Magic strings and logic in tests.
  20. Testing against third-party sites you do not control.
  21. Quarantine-as-cemetery (skip without owner or deadline).
  22. Retry-as-fix (auto-retry hiding real bugs).
  23. Duplicate tests across pyramid layers.
  24. Weakening tests to make them pass.
  25. Mock-driven confidence (the test sets up a mock and then asserts on its own setup).
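Anti-pattern 9 (non-deterministic inputs) usually has the same cure regardless of stack: pass the clock in. The sketch below assumes a hypothetical `Session` class and `fixed_clock` helper; it is one common way to do this, not the fix prescribed in the reference.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

Clock = Callable[[], datetime]


def system_clock() -> datetime:
    """Production default: the real wall clock, in UTC."""
    return datetime.now(timezone.utc)


class Session:
    """Hypothetical unit under test: a session that expires after a time-to-live."""

    def __init__(self, issued_at: datetime, ttl: timedelta, clock: Clock = system_clock):
        self._expires_at = issued_at + ttl
        self._clock = clock  # injected so tests never read the real clock

    def is_expired(self) -> bool:
        return self._clock() >= self._expires_at


def fixed_clock(moment: datetime) -> Clock:
    """Test helper: a clock frozen at a chosen instant."""
    return lambda: moment
```

A test can now pin the instant one minute before expiry and exactly at expiry, and get the same verdict on every run and in every time zone.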
The old `test-antipatterns` skill covered five of these (Asserting the mock exists, Test-only methods in production, Mocking without understanding, Incomplete mocks, Integration tests as afterthought). All five survive here, folded into Mock misuse and Process, and the framing question "Are we testing the behavior of a mock?" still applies.
STOP. Read `references/antipatterns.md` in full when reviewing any test, debugging a flaky suite, refactoring tests after a refactor broke them, evaluating AI-generated tests, or deciding whether to delete or rescue a struggling test file.
That file has the full 25-entry catalog with violation pattern, why-wrong, fix, gate question, and citation per entry. The family list above is a topic index, not the contract — flagging a pattern by family is not the same as knowing the fix.
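The framing question "Are we testing the behavior of a mock?" can be made concrete. In the sketch below, `PriceService`, `BrokenPriceService`, and both checks are hypothetical names invented for illustration. The setup-only check reads back a value the test itself configured, so it stays green even against the broken implementation; the behavior check asserts on what the unit computes.

```python
from unittest.mock import Mock


class PriceService:
    """Hypothetical unit under test: applies a percentage discount to a catalog price."""

    def __init__(self, catalog):
        self._catalog = catalog

    def discounted_price(self, sku: str, percent: int) -> float:
        base = self._catalog.price_of(sku)
        return round(base * (100 - percent) / 100, 2)


class BrokenPriceService(PriceService):
    """Deliberately buggy variant: the discount is never applied."""

    def discounted_price(self, sku: str, percent: int) -> float:
        return self._catalog.price_of(sku)


def stub_catalog() -> Mock:
    """Boundary stub: the catalog is outside the unit, so it is the only thing mocked."""
    catalog = Mock()
    catalog.price_of.return_value = 200.0
    return catalog


def setup_only_check(service_cls) -> bool:
    """Anti-pattern: asserts the value the test wrote into the mock, never the unit's output."""
    catalog = stub_catalog()
    service_cls(catalog)  # constructed but never exercised
    return catalog.price_of("sku-1") == 200.0


def behavior_check(service_cls) -> bool:
    """Behavior-first: asserts the observable result of the unit."""
    service = service_cls(stub_catalog())
    return service.discounted_price("sku-1", 25) == 150.0
```

Running both checks against `BrokenPriceService` shows the difference: the setup-only check still passes, the behavior check fails.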

AI agents writing tests

Coding agents will mock everything to green by default. They produce long, linear tests with high assertion density and no edge cases — and they will happily patch the test instead of fixing the code. The skill ships seven gates to block that.
Gate names (the prompt blocks live in the reference):
  1. Invariant first: the agent prints `INVARIANT: …`, `OWNING_LAYER: …`, `EXISTING_SUITE: …` before any test code.
  2. Owning layer: extend an existing suite; reject new files without a named invariant.
  3. Real execution: every new test must run against a real DB / route / external integration at least once.
  4. Failure → fix production: on red, the next tool call reads the production code, not the test.
  5. No snapshot without contract: classify the artifact as `PRODUCT_CONTRACT` or `IMPLEMENTATION_DETAIL`; the latter forbids snapshots.
  6. No assertion on self-set mock: a test cannot assert on a value the same test body wrote into a mock.
  7. Negative companion: every positive assertion ships with a negative test for invalid input or failure mode.
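Gate 7 is the easiest to show in code. The sketch below uses a hypothetical `parse_quantity` function; the positive test and its negative companion ship together, so a regression that starts accepting bad input turns the suite red.

```python
def parse_quantity(raw: str) -> int:
    """Hypothetical unit under test: parse a strictly positive integer quantity."""
    value = int(raw)  # raises ValueError for non-numeric input
    if value <= 0:
        raise ValueError(f"quantity must be positive, got {value}")
    return value


def test_should_return_integer_when_quantity_is_valid():
    assert parse_quantity("3") == 3


def test_should_reject_when_quantity_is_invalid():
    # Negative companion: zero, a negative number, and non-numeric text must all be refused.
    for raw in ("0", "-2", "abc"):
        try:
            parse_quantity(raw)
        except ValueError:
            continue
        raise AssertionError(f"{raw!r} should have been rejected")
```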
STOP. Read `references/ai-writes-tests.md` in full before letting any agent generate, modify, or "fix" tests in this repository.
That file contains the verbatim prompt blocks to paste into CLAUDE.md, the failure-protocol template, the mock-budget rule, and the evidence base (Anthropic's eval guide, the Yoshimoto study on AI test smells, the Stanford vibe-coding vulnerability finding). The seven gate names above are not enforceable on their own — the agent-runnable prompt block lives only in the reference.
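Gate 1 can also be checked mechanically on the agent's output. The sketch below is an assumption-heavy illustration, not the prompt block from the reference: it assumes the three fields appear as `FIELD: value` lines before the first fenced code block, and `parse_preamble` and `missing_fields` are hypothetical helper names.

```python
import re
from typing import Dict, List

REQUIRED_FIELDS = ("INVARIANT", "OWNING_LAYER", "EXISTING_SUITE")
FENCE = "`" * 3  # a Markdown code fence, built this way to keep the literal out of the sketch
FIELD_LINE = re.compile(r"^\s*([A-Z_]+):\s*(.*\S)\s*$")


def parse_preamble(agent_output: str) -> Dict[str, str]:
    """Collect required `FIELD: value` lines that appear before the first code fence."""
    head = agent_output.split(FENCE, 1)[0]
    fields: Dict[str, str] = {}
    for line in head.splitlines():
        match = FIELD_LINE.match(line)
        if match and match.group(1) in REQUIRED_FIELDS:
            fields[match.group(1)] = match.group(2)
    return fields


def missing_fields(agent_output: str) -> List[str]:
    """Names of required fields that are absent or empty; an empty list means the gate passes."""
    found = parse_preamble(agent_output)
    return [name for name in REQUIRED_FIELDS if name not in found]
```

A field that only shows up inside or after the first code block counts as missing, which matches the gate's wording: the declaration comes before any test code.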

CI & flaky discipline

Tests that pass intermittently are worse than tests that always fail. They train the team to ignore signal.
Gist tripwires:
  • Quarantine a flaky test the same hour it is detected. Assign a named human owner within 24 hours with a fix-by date. No anonymous quarantines.
  • Track `flaky_rate` as a first-class operational metric. SLO < 1–2%; alert at > 5%. Retry without telemetry is debt accrual, not stability.
  • Real systems at the final gate. Mock at unit; contract-test the boundary; real DB / queue / route at integration; near-zero mocks at E2E.
STOP. Read `references/ci-automation.md` in full before designing CI stages, introducing retries, choosing between integration and contract tests, adding property or mutation testing, or building a regression pack.
That file contains the flaky taxonomy (Async 45%, Concurrency 20%, Order 12%, …), the quarantine workflow, the CI stage pyramid, and the contract/property/mutation/accessibility patterns. The three tripwires above are detection cues, not the contract.
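The skill names `flaky_rate` without defining it inline, so the sketch below picks one plausible definition and labels it as an assumption: the share of (test, commit) pairs that produced both a pass and a fail. The 2% and 5% defaults come from the tripwire above (taking the upper end of the 1–2% SLO); the function names and the `Run` tuple shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Run = Tuple[str, str, bool]  # (test_id, commit_sha, passed)


def flaky_rate(runs: Iterable[Run]) -> float:
    """Share of (test, commit) pairs with mixed outcomes. This definition is an assumption."""
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for test_id, commit_sha, passed in runs:
        outcomes[(test_id, commit_sha)].add(passed)
    if not outcomes:
        return 0.0
    flaky = sum(1 for seen in outcomes.values() if len(seen) == 2)
    return flaky / len(outcomes)


def classify(rate: float, slo: float = 0.02, alert: float = 0.05) -> str:
    """Map a rate onto the SLO and alert thresholds."""
    if rate > alert:
        return "alert"
    if rate >= slo:
        return "over_slo"
    return "ok"
```

Counting by (test, commit) pair matters: a test that fails consistently on one commit is a real failure, not flakiness, and should not inflate the metric.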

LLM and agent evaluation (Part 6, condensed)

Testing an LLM feature is conventional testing with one twist: the oracle is probabilistic. Everything else — invariants, regression, traceability, real-system validation, repair-production — applies harder, not less.
Gist tripwires:
  • Start small. Twenty unambiguous tasks drawn from real failures beat two hundred synthetic ones.
  • Climb the oracle ladder: exact / schema / outcome-state checks before LLM-as-judge before human review. Use the cheapest oracle that catches the failure.
  • LLM-as-judge needs calibration. Validate against humans (target ≥ 0.80 Spearman) before trusting any judge in CI. Always use a different model than the system under test.
  • Agents need outcome checks. Trajectory grading punishes valid creativity; outcome-only grading misses ghost actions where the transcript claims success and nothing changed.
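The oracle ladder in the second tripwire can be sketched as an ordered list of checks where each rung either decides or passes the output upward. This is an illustrative shape, not the ladder as specified in the reference: the `None` convention for "cannot decide", the helper names, and the escalation to `"human_review"` are assumptions made here, and the judge rung is left as a caller-supplied callable.

```python
import json
from typing import Callable, List, Optional, Tuple

Verdict = Optional[bool]  # True = pass, False = fail, None = this rung cannot decide
Oracle = Tuple[str, Callable[[str], Verdict]]


def exact_match(expected: str) -> Callable[[str], Verdict]:
    """Cheapest rung: pass on an exact match, otherwise defer to the next rung."""
    def check(output: str) -> Verdict:
        return True if output.strip() == expected else None
    return check


def has_json_keys(required: List[str]) -> Callable[[str], Verdict]:
    """Schema rung: fail fast on malformed output, defer when the shape is fine."""
    def check(output: str) -> Verdict:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        if not isinstance(data, dict) or any(key not in data for key in required):
            return False
        return None
    return check


def grade(output: str, ladder: List[Oracle]) -> Tuple[str, Verdict]:
    """Return the first rung that reaches a verdict; escalate to a human if none does."""
    for name, oracle in ladder:
        verdict = oracle(output)
        if verdict is not None:
            return name, verdict
    return "human_review", None
```

Ordering the list from cheapest to most expensive is what implements "use the cheapest oracle that catches the failure": a malformed response never reaches the judge.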
STOP. Read `references/llm-eval.md` in full before designing an eval suite, introducing LLM-as-judge into a pipeline, debating SWE-bench / τ-bench numbers, building RAG faithfulness checks, or rolling out trace-based observability.
That file contains the oracle ladder in full, the LLM-as-judge bias list, the RAG metric decomposition, agent trajectory-vs-outcome guidance, and the benchmark-pitfalls evidence (Anthropic eval guide, the 2507.02825 benchmark-validity paper, Galileo and Braintrust frameworks). The four tripwires above are detection cues, not the contract.
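The calibration tripwire (target ≥ 0.80 Spearman against humans) is a small computation. The sketch below implements Spearman's rank correlation in plain Python as the Pearson correlation of average ranks, which handles ties; `judge_is_calibrated` and its `target` default are hypothetical names wrapping the threshold quoted above.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def spearman(human: Sequence[float], judge: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(human) != len(judge) or len(human) < 2:
        raise ValueError("need two equally long score lists with at least two items")
    return pearson(average_ranks(human), average_ranks(judge))


def judge_is_calibrated(human: Sequence[float], judge: Sequence[float], target: float = 0.80) -> bool:
    """True when the judge's ranking agrees with the human ranking at or above the target."""
    return spearman(human, judge) >= target
```

The inputs are paired scores for the same items, one list from human reviewers and one from the judge model. A handful of items gives a very noisy estimate, so the check is only as trustworthy as the size of the labeled sample.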

Red flags (cross-cutting)

These signals should always trigger "stop and think" — no matter the layer or framework.
  • Mock setup is larger than the test logic.
  • Test breaks when an internal method is renamed (not a public contract).
  • Removing the assertion body leaves the test still green.
  • Test fails when run with `.only` in isolation.
  • `sleep`, `Thread.sleep`, or `cy.wait(<number>)` appears anywhere.
  • Selector contains a CSS class, an index, or `xpath`.
  • Test asserts a third-party site is reachable.
  • Snapshot diffs are accepted in code review without reading them.
  • Coverage percentage is the only quoted quality metric.
  • Failing tests are auto-retried until green; nobody investigates.
  • Skipped or quarantined tests have no named owner and no fix-by date.
  • Test depends on `new Date()`, `Math.random()`, or system locale.
  • `afterEach` resets the database (move it to `beforeEach`).
  • An AI-written test has six+ assertions and zero edge cases.
  • The phrase "I'll mock this to be safe" appears anywhere in the diff.
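A few of these red flags are textual and can be caught by a crude scan of a test file or diff. The sketch below is a heuristic under stated assumptions, not a linter the skill ships: the `RED_FLAGS` table, its labels, and the regular expressions are invented here, cover only four of the signals above, and will produce false positives. A hit means "stop and think", not "fail the build".

```python
import re
from typing import List, Tuple

# (label, pattern) pairs for the signals that can be spotted line by line.
RED_FLAGS = (
    ("static sleep", re.compile(r"\b(?:Thread\.)?sleep\s*\(")),
    ("fixed cy.wait", re.compile(r"\bcy\.wait\(\s*\d+\s*\)")),
    ("real clock or RNG", re.compile(r"\bnew Date\(\)|\bMath\.random\(\)")),
    ("mock to be safe", re.compile(r"mock this to be safe", re.IGNORECASE)),
)


def scan(source: str) -> List[Tuple[int, str]]:
    """Return (line number, label) for every red flag found, in file order."""
    hits: List[Tuple[int, str]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in RED_FLAGS:
            if pattern.search(line):
                hits.append((lineno, label))
    return hits
```

Note that `cy.wait('@save')` is left alone: waiting on a named alias is condition-based, and only the numeric form is flagged.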

When NOT to use this skill

  • General code review unrelated to tests — use a code-review skill.
  • Library-specific debugging where the test is just the reproduction — use the library's own debugging skill.
  • Non-testing CI pipeline design (deploys, artifact promotion, secrets management).
  • Production observability and alerting design — those are runtime concerns, not test concerns.
  • Single-line typo fixes in existing tests — the doctrine is for non-trivial work.

Bottom line

A test that cannot fail is decorative. A test that fails for the wrong
reason is misleading. Build tests that fail for exactly one reason —
the reason the invariant was violated — and trust them when they do.

Mocks isolate. Real systems validate. Coverage shines a light. Mutation
score grades the suite. Agents will reach for the mock and the snapshot;
the gates here make them put both down.

Tests reveal bugs, not just pass.