test-flakiness

Test Flakiness Detection

A flaky test is one that sometimes passes and sometimes fails without any code change. Flaky tests are worse than no tests in some ways — they train the team to ignore red CI runs, masking genuine failures. This skill identifies them, explains likely causes, and recommends whether to quarantine or fix each one.
Output: Updated `tests/regression-suite.md` quarantine section, plus an optional `production/qa/flakiness-report-[date].md`.
When to run:
  • Polish phase (tests have had many runs; statistical signal is reliable)
  • When developers start dismissing CI failures as "probably flaky"
  • After `/regression-suite` identifies quarantined tests that need diagnosis

1. Parse Arguments

Modes:
  • `/test-flakiness [ci-log-path]`: analyse a specific CI run log file
  • `/test-flakiness scan`: scan all available CI logs in `.github/` or standard log output directories
  • `/test-flakiness registry`: read the existing `regression-suite.md` quarantine section and provide remediation guidance for already-known flaky tests
  • No argument: auto-detect. Run `scan` if CI logs are accessible, else `registry`
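
The mode selection above can be sketched as a small dispatch function. This is only an illustration in Python: the function name `resolve_mode`, the `"log"` label for the path mode, and the `logs_available` flag are assumptions, not part of the skill itself.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_mode(argument: Optional[str], logs_available: bool) -> Tuple[str, Optional[Path]]:
    """Map the skill argument to (mode, log_path). Hypothetical helper."""
    if argument is None or argument.strip() == "":
        # No argument: auto-detect between scan and registry.
        return ("scan" if logs_available else "registry", None)
    if argument in ("scan", "registry"):
        return (argument, None)
    # Anything else is treated as a path to a specific CI log file.
    return ("log", Path(argument))
```

Treating every unrecognised argument as a path keeps the dispatch simple; a stricter version might check that the file exists before accepting it.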

2. Locate CI Log Data

Option A — GitHub Actions (preferred)

Check for test result artifacts:

```bash
ls -t .github/ 2>/dev/null
ls -t test-results/ 2>/dev/null
```

For Godot projects: GdUnit4 outputs XML results compatible with JUnit format. Check `test-results/` for `.xml` files.
For Unity projects: the game-ci test runner outputs NUnit XML to `test-results/` by default.
For Unreal projects: automation logs go to `Saved/Logs/`. Grep for `Result: Success` and `Result: Fail` patterns.
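
The same lookup could be scripted rather than run by hand. The sketch below is a rough Python equivalent, assuming the result directories named above; the `*.xml` and `*.log` file patterns and the helper name `find_result_files` are illustrative guesses, not something the skill prescribes.

```python
from pathlib import Path

# Directories come from the text above; the glob patterns are assumptions.
CANDIDATES = [("test-results", "*.xml"), ("Saved/Logs", "*.log")]


def find_result_files(project_root):
    """Return candidate result files, newest first (similar to `ls -t`)."""
    root = Path(project_root)
    found = []
    for directory, pattern in CANDIDATES:
        folder = root / directory
        if folder.is_dir():
            found.extend(folder.glob(pattern))
    return sorted(found, key=lambda p: p.stat().st_mtime, reverse=True)
```

An empty return value corresponds to the "no log data available" case handled in Option C.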

Option B — Local log files

If a path argument is provided, read that file directly.

Option C — No log data available

If no logs are found:
"No CI log data found. To detect flaky tests, this skill needs test result history from multiple runs. Options:
  1. Run the test suite at least 3 times and collect the output logs
  2. Check CI pipeline output and save a log to `test-results/`
  3. Run `/test-flakiness registry` to review tests already flagged as flaky in `tests/regression-suite.md`"
Stop and ask the user which option to pursue.

3. Parse Test Results

For each CI log or result file found, parse:
JUnit XML format (GdUnit4 / Unity):
  • Grep for `<testcase name=` to get test names
  • Grep for `<failure` or `<error` to identify failures
  • Parse the `classname` and `name` attributes for full test identifiers
  • Unity's NUnit XML differs from JUnit: it uses `<test-case>` elements with a `result` attribute, so adapt the patterns when parsing it directly
Plain text logs:
  • Grep for pass/fail patterns:
    • Godot: `PASSED` / `FAILED` adjacent to test names
    • Unreal: `Result: Success` / `Result: Fail`
    • Unity: `Test passed` / `Test failed`
Build a table:
`test_id → [run1_result, run2_result, run3_result, ...]`
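
As a rough illustration of the table-building step, the sketch below parses JUnit-style XML with Python's standard `xml.etree.ElementTree` and merges several runs into the `test_id → [results]` mapping. The function names are hypothetical, and joining `classname` and `name` with a dot is an assumed identifier format.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def parse_junit_run(xml_text):
    """Return {test_id: "PASS" | "FAIL"} for one JUnit-style XML report."""
    results = {}
    root = ET.fromstring(xml_text)
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            continue  # skipped tests carry no pass/fail signal
        test_id = f"{case.get('classname', '')}.{case.get('name', '')}"
        failed = case.find("failure") is not None or case.find("error") is not None
        results[test_id] = "FAIL" if failed else "PASS"
    return results


def build_history(run_reports):
    """Merge per-run results into test_id -> [run1_result, run2_result, ...]."""
    history = defaultdict(list)
    for xml_text in run_reports:
        for test_id, outcome in parse_junit_run(xml_text).items():
            history[test_id].append(outcome)
    return dict(history)
```

Note that a test missing from some runs ends up with a shorter list, so positions in the list do not always line up with run numbers.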

4. Identify Flaky Tests

A test is flaky if its result history contains both PASS and FAIL outcomes across runs with no code changes between them.
Flakiness thresholds:
  • High flakiness: fails in >25% of runs. Quarantine immediately.
  • Moderate flakiness: fails in 5–25% of runs. Investigate and fix soon.
  • Low/suspected flakiness: fails in 1–5% of runs. Monitor; it may be a genuinely rare failure.
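
The thresholds can be expressed as a small classifier. This Python sketch makes two assumptions the text leaves open: a fail rate of exactly 5% counts as moderate, and any mixed history below 5% (including below 1%) counts as low. The function name and the `"consistent"` label are illustrative.

```python
def classify_flakiness(outcomes):
    """Classify one test's history, e.g. ["PASS", "FAIL", "PASS"]."""
    if len(set(outcomes)) < 2:
        # Always passing or always failing: not flaky by the definition above.
        return "consistent"
    rate = outcomes.count("FAIL") / len(outcomes)
    if rate > 0.25:
        return "high"
    if rate >= 0.05:  # assumption: the 5% boundary belongs to "moderate"
        return "moderate"
    return "low"  # assumption: also covers mixed results below 1%
```

A consistently failing test is a real failure rather than a flaky one, which is why it is reported separately from the three flakiness levels.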
For each flaky test, classify the likely cause:

Cause classification

| Cause | Symptoms | Fix direction |
| --- | --- | --- |
| Timing / async | Fails after awaiting signals or timers; pass rate correlates with system load | Add explicit await/synchronisation; avoid time-based delays |
| Order dependency | Fails when run after specific other tests; passes in isolation | Add proper setup/teardown; ensure test isolation |
| Random seed | Fails intermittently with no pattern; involves RNG | Pass explicit seed; don't use `randf()` in tests |
| Resource leak | Fails more often later in a test run | Fix cleanup in teardown; check orphan nodes (Godot) or object disposal (Unity) |
| External state | Fails when a file, scene, or global exists from a prior test | Isolate test from file system; use in-memory mocks |
| Floating point | Fails on comparisons like `== 0.5` | Use epsilon comparison (`is_equal_approx`, `Assert.AreApproximatelyEqual`) |
| Scene/prefab load race | Fails when scenes are not yet ready | Await one frame after instantiation; use `await get_tree().process_frame` |

Use Grep to check the test file for timing calls, randf, global state access, or equality comparisons on floats to narrow down the cause.
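
A scripted version of this check might look like the sketch below. The regular expressions are illustrative guesses at what timing, RNG, and float-equality code tends to look like in Godot and Unity tests; they are not an exhaustive or authoritative list, and a match is only a hint to inspect that line.

```python
import re

# Hypothetical patterns; extend them for the project's engine and style.
CAUSE_HINTS = {
    "Timing / async": re.compile(r"create_timer|Thread\.Sleep|WaitForSeconds"),
    "Random seed": re.compile(r"\brandf\(|\brandi\(|Random\.Range"),
    "Floating point": re.compile(r"==\s*-?\d+\.\d+"),
}


def suggest_causes(test_source):
    """Return (line_number, cause) pairs for lines matching a hint pattern."""
    hints = []
    for lineno, line in enumerate(test_source.splitlines(), start=1):
        for cause, pattern in CAUSE_HINTS.items():
            if pattern.search(line):
                hints.append((lineno, cause))
    return hints
```

Order dependency, resource leaks, and external state rarely show up as a single recognisable line, so those causes still need manual reading of setup and teardown code.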

5. Recommend Action

For each flaky test:
Quarantine (High flakiness):
"Quarantine this test immediately. Disable it in CI by adding a `@pytest.mark.skip` / `[Ignore]` / `GdUnitSkip` annotation. Log it in the `tests/regression-suite.md` quarantine section. The test is now opt-in only. Fix the root cause before removing quarantine."
Investigate and fix soon (Moderate):
"This test is intermittently unreliable. Root cause appears to be [cause]. Suggested fix: [specific fix based on cause classification]. Do not quarantine yet; fix the test directly."
Monitor (Low/suspected):
"This test shows suspected flakiness. Collect more run data before quarantining. Note it as 'suspected' in the regression suite."

6. Generate Reports

In-conversation summary

Flakiness Detection Results

Runs analysed: [N] Tests tracked: [N]

Flaky Tests Found

| Test | System | Fail Rate | Likely Cause | Recommendation |
| --- | --- | --- | --- | --- |
| [test_name] | [system] | [N]% | Timing | Quarantine + fix async |
| [test_name] | [system] | [N]% | Float comparison | Fix: use epsilon compare |
| [test_name] | [system] | [N]% | Order dependency | Investigate teardown |

Clean Tests (no flakiness detected)

[N] tests ran across [N] runs with consistent results — no flakiness detected.

Data Limitations

[Note if fewer than 5 runs were available — fewer runs = less statistical confidence]

---
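
The summary template above could be filled in programmatically. The following Python sketch assumes each finding is a dict with the hypothetical keys `test`, `system`, `fail_rate`, `cause`, and `recommendation`; the exact wording and layout are illustrative rather than a fixed output format.

```python
def render_summary(runs_analysed, tests_tracked, findings):
    """Render the in-conversation summary as plain text (illustrative layout)."""
    lines = [
        "Flakiness Detection Results",
        f"Runs analysed: {runs_analysed} Tests tracked: {tests_tracked}",
        "",
        "| Test | System | Fail Rate | Likely Cause | Recommendation |",
        "| --- | --- | --- | --- | --- |",
    ]
    for f in findings:
        lines.append(
            f"| {f['test']} | {f['system']} | {f['fail_rate']:.0%} "
            f"| {f['cause']} | {f['recommendation']} |"
        )
    clean = tests_tracked - len(findings)
    lines += ["", f"{clean} tests ran across {runs_analysed} runs with consistent results."]
    if runs_analysed < 5:
        # Mirrors the Data Limitations note in the template.
        lines += ["", "Data limitation: fewer than 5 runs were available, so confidence is low."]
    return "\n".join(lines)
```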

7. Update Regression Suite + Optional Report File

Ask: "May I update the quarantine section of `tests/regression-suite.md` with the flaky tests found?"
If yes: use `Edit` to append entries to the Quarantined Tests table. Never remove existing quarantine entries; only add new ones.
Ask (separately): "May I write a full flakiness report to `production/qa/flakiness-report-[date].md`?"
The full report includes per-test analysis with cause details and engine-specific fix snippets.
After writing:
  • For each quarantined test: "Add the engine-specific skip annotation to disable this test in CI. Re-enable after the root cause is fixed."
  • For fix-eligible tests: "The fix for [test] is straightforward: change the equality comparison on line [N] to use `is_equal_approx`."
  • Summary: "Once all quarantine annotations are applied, CI should run green. Schedule fix work for the [N] quarantined tests before the release gate."
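
The append-only rule for the quarantine table can be illustrated as follows. This Python sketch assumes the Quarantined Tests table is a pipe-delimited table with the test identifier in the first column; the real layout of `tests/regression-suite.md` is not shown in this document, so the two-column row format is hypothetical.

```python
def append_quarantine_rows(table_lines, new_entries):
    """Return table_lines plus rows for tests not already listed.

    Existing rows are never removed or rewritten.
    """
    existing = {
        line.split("|")[1].strip() for line in table_lines if line.startswith("|")
    }
    updated = list(table_lines)
    for test_id, cause in new_entries:
        if test_id not in existing:
            updated.append(f"| {test_id} | {cause} |")
            existing.add(test_id)
    return updated
```

Skipping identifiers that are already present keeps repeated runs of the skill from duplicating rows.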

Collaborative Protocol

  • Never delete test files — quarantine means annotate + list, not remove
  • Statistical confidence matters — with < 3 runs, flag findings as "suspected" not "confirmed"; ask if more run data is available
  • Fix is always the goal — quarantine is temporary; surface the fix direction even when recommending quarantine
  • Ask before writing — both the regression-suite update and the report file require explicit approval. On write: Verdict: COMPLETE — flakiness report written. On decline: Verdict: BLOCKED — user declined write.
  • Flakiness in CI is a team problem — surface the list and recommended actions clearly; do not just silently quarantine without the team knowing