test-flakiness
Test Flakiness Detection
A flaky test is one that sometimes passes and sometimes fails without any code
change. Flaky tests are worse than no tests in some ways — they train the team
to ignore red CI runs, masking genuine failures. This skill identifies them,
explains likely causes, and recommends whether to quarantine or fix each one.
Output: Updated `tests/regression-suite.md` quarantine section + optional `production/qa/flakiness-report-[date].md`
When to run:
- Polish phase (tests have had many runs; statistical signal is reliable)
- When developers start dismissing CI failures as "probably flaky"
- After `/regression-suite` identifies quarantined tests that need diagnosis
1. Parse Arguments
Modes:
- `/test-flakiness [ci-log-path]`: analyse a specific CI run log file
- `/test-flakiness scan`: scan all available CI logs in `.github/` or standard log output directories
- `/test-flakiness registry`: read the existing regression-suite.md quarantine section and provide remediation guidance for already-known flaky tests
- No argument: auto-detect (run `scan` if CI logs are accessible, else `registry`)
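The mode selection above can be sketched in a few lines of Python. This is only an illustration: the function name `resolve_mode`, the `"log"` label for a path argument, and the rule "a directory with at least one file counts as accessible logs" are assumptions, not part of the skill definition.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional


def resolve_mode(args: list[str], log_dirs: list[Path]) -> tuple[str, Optional[str]]:
    """Return (mode, log_path) for a /test-flakiness invocation (hypothetical helper)."""
    if args:
        if args[0] in ("scan", "registry"):
            return args[0], None
        # Any other argument is treated as the path to a single CI log file.
        return "log", args[0]
    # No argument: fall back to scan only if some candidate directory holds a file.
    has_logs = any(
        d.is_dir() and any(p.is_file() for p in d.rglob("*")) for d in log_dirs
    )
    return ("scan" if has_logs else "registry"), None
```

For example, `resolve_mode([], [Path(".github"), Path("test-results")])` would pick `scan` when either directory contains files and `registry` otherwise.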
2. Locate CI Log Data
Option A — GitHub Actions (preferred)
Check for test result artifacts:
```bash
ls -t .github/ 2>/dev/null
ls -t test-results/ 2>/dev/null
```
For Godot projects: GdUnit4 outputs XML results compatible with JUnit format. Check for `.xml` files in `test-results/`.
For Unity projects: game-ci test runner outputs NUnit XML to `test-results/` by default.
For Unreal projects: automation logs go to `Saved/Logs/`. Grep for `Result: Success` and `Result: Fail` patterns.
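As a rough sketch, the directory checks above could be automated as follows. The helper name `find_result_files` is hypothetical, and the `*.log` extension for Unreal automation logs is an assumption; adjust the patterns to the project's actual layout.

```python
from __future__ import annotations

from pathlib import Path


def find_result_files(root: Path) -> list[Path]:
    """Collect candidate test result files under a project root, newest first."""
    candidates: list[Path] = []
    results_dir = root / "test-results"
    if results_dir.is_dir():
        # GdUnit4 (JUnit-compatible) and game-ci (NUnit) XML output.
        candidates.extend(results_dir.glob("*.xml"))
    unreal_logs = root / "Saved" / "Logs"
    if unreal_logs.is_dir():
        # Assumed extension for Unreal automation logs.
        candidates.extend(unreal_logs.glob("*.log"))
    # Mirror `ls -t`: most recently modified first.
    return sorted(candidates, key=lambda p: p.stat().st_mtime, reverse=True)
```

An empty list corresponds to Option C (no log data available).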
Option B — Local log files
If a path argument is provided, read that file directly.
Option C — No log data available
If no logs found:
"No CI log data found. To detect flaky tests, this skill needs test result history from multiple runs. Options:
- Run the test suite at least 3 times and collect the output logs
- Check CI pipeline output and save a log to `test-results/`
- Run `/test-flakiness registry` to review tests already flagged as flaky in `tests/regression-suite.md`"
Stop and ask the user which option to pursue.
3. Parse Test Results
For each CI log or result file found, parse:
JUnit XML format (GdUnit4 / Unity):
- Grep for `<testcase name=` to get test names
- Grep for `<failure` or `<error` to identify failures
- Parse `classname` and `name` attributes for full test identifiers
Plain text logs:
- Grep for pass/fail patterns:
  - Godot: `PASSED` / `FAILED` adjacent to test names
  - Unreal: `Result: Success` / `Result: Fail`
  - Unity: `Test passed` / `Test failed`
Build a table:
test_id → [run1_result, run2_result, run3_result, ...]
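For the JUnit XML case, the parsing steps above might look like the sketch below. It uses Python's standard `xml.etree.ElementTree`; the function names `parse_junit` and `build_history`, the `classname.name` identifier format, and the choice to ignore skipped test cases are assumptions made for illustration.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from collections import defaultdict


def parse_junit(xml_text: str) -> dict[str, str]:
    """Map each test identifier in one JUnit XML report to PASS or FAIL."""
    results: dict[str, str] = {}
    root = ET.fromstring(xml_text)
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            continue  # Assumption: skipped tests carry no pass/fail signal.
        test_id = f"{case.get('classname', '')}.{case.get('name', '')}"
        failed = case.find("failure") is not None or case.find("error") is not None
        results[test_id] = "FAIL" if failed else "PASS"
    return results


def build_history(runs: list[str]) -> dict[str, list[str]]:
    """Build test_id -> [run1_result, run2_result, ...] from several reports."""
    history: dict[str, list[str]] = defaultdict(list)
    for xml_text in runs:
        for test_id, outcome in parse_junit(xml_text).items():
            history[test_id].append(outcome)
    return dict(history)
```

Reports should be passed in chronological order so that each list position corresponds to the same run across tests.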
4. Identify Flaky Tests
A test is flaky if it appears in the result history with both PASS and
FAIL outcomes across runs with no code changes between them.
Flakiness thresholds:
- High flakiness: fails in >25% of runs; quarantine immediately
- Moderate flakiness: fails in 5–25% of runs; investigate and fix soon
- Low/suspected flakiness: fails in 1–5% of runs; monitor, since it may be a genuinely rare failure
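The thresholds can be applied to a test's result history with a small classifier. This is a sketch under stated assumptions: the function name `classify` is hypothetical, the 5% and 25% boundaries are assigned to the moderate tier, any mixed history below 5% is treated as low/suspected, and a test that fails in every run is treated as a consistent failure rather than a flaky one.

```python
from __future__ import annotations


def classify(results: list[str]) -> str:
    """Classify one test's PASS/FAIL history into a flakiness tier."""
    fails = results.count("FAIL")
    if fails == 0 or fails == len(results):
        # Consistent outcome across runs: not flaky by the definition above.
        return "not flaky"
    rate = fails / len(results)
    if rate > 0.25:
        return "high"
    if rate >= 0.05:
        return "moderate"
    return "low/suspected"
```

With only a handful of runs the computed rate is coarse (one failure in four runs is already 25%), so the tier should be read together with the number of runs analysed.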
For each flaky test, classify the likely cause:
Cause classification
| Cause | Symptoms | Fix direction |
|---|---|---|
| Timing / async | Fails after awaiting signals or timers; pass rate correlates with system load | Add explicit await/synchronisation; avoid time-based delays |
| Order dependency | Fails when run after specific other tests; passes in isolation | Add proper setup/teardown; ensure test isolation |
| Random seed | Fails intermittently with no pattern; involves RNG | Pass an explicit seed; avoid unseeded random calls in tests |
| Resource leak | Fails more often later in a test run | Fix cleanup in teardown; check orphan nodes (Godot) or object disposal (Unity) |
| External state | Fails when a file, scene, or global exists from a prior test | Isolate test from file system; use in-memory mocks |
| Floating point | Fails on exact equality comparisons of floats | Use epsilon comparison (e.g. `is_equal_approx`) |
| Scene/prefab load race | Fails when scenes are not yet ready | Await one frame after instantiation |
Use Grep to check the test file for timing calls, randf, global state access,
or equality comparisons on floats to narrow down the cause.
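The grep step could be approximated with a few regular expressions over the test file's source text. The patterns below are illustrative assumptions rather than a definitive list: only `randf` comes from the text above, while `create_timer`, `WaitForSeconds`, `randi`, and the float-literal pattern are examples chosen for this sketch. A match is a hint to inspect, not a diagnosis.

```python
from __future__ import annotations

import re

# Hypothetical heuristics: cause label -> pattern that hints at it.
CAUSE_PATTERNS = {
    "Timing / async": re.compile(r"\b(await|sleep|create_timer|WaitForSeconds)\b"),
    "Random seed": re.compile(r"\b(randf|randi|random\.)"),
    "Floating point": re.compile(r"==\s*-?\d+\.\d+"),
}


def suspect_causes(source: str) -> list[str]:
    """Return the cause labels whose pattern appears in the test source."""
    return [cause for cause, pattern in CAUSE_PATTERNS.items() if pattern.search(source)]
```

Order dependency, resource leaks, and external state rarely show up in a single file's text, so they are left to the run-history symptoms in the table.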
5. Recommend Action
For each flaky test:
Quarantine (High flakiness):
"Quarantine this test immediately. Disable it in CI by adding a `@pytest.mark.skip` / `[Ignore]` / `GdUnitSkip` annotation. Log it in the `tests/regression-suite.md` quarantine section. The test is now opt-in only. Fix the root cause before removing quarantine."
Investigate and fix soon (Moderate):
"This test is intermittently unreliable. Root cause appears to be [cause]. Suggested fix: [specific fix based on cause classification]. Do not quarantine yet — fix the test directly."
Monitor (Low/suspected):
"This test shows suspected flakiness. Collect more run data before quarantining. Note it as 'suspected' in the regression suite."
6. Generate Reports
In-conversation summary
Flakiness Detection Results
Runs analysed: [N]
Tests tracked: [N]
Flaky Tests Found
| Test | System | Fail Rate | Likely Cause | Recommendation |
|---|---|---|---|---|
| [test_name] | [system] | [N]% | Timing | Quarantine + fix async |
| [test_name] | [system] | [N]% | Float comparison | Fix: use epsilon compare |
| [test_name] | [system] | [N]% | Order dependency | Investigate teardown |
Clean Tests (no flakiness detected)
[N] tests ran across [N] runs with consistent results — no flakiness detected.
Data Limitations
[Note if fewer than 5 runs were available — fewer runs = less statistical confidence]
---
7. Update Regression Suite + Optional Report File
Ask: "May I update the quarantine section of `tests/regression-suite.md` with the flaky tests found?"
If yes: use `Edit` to append entries to the Quarantined Tests table. Never remove existing quarantine entries; only add new ones.
Ask (separately): "May I write a full flakiness report to `production/qa/flakiness-report-[date].md`?"
The full report includes per-test analysis with cause details and engine-specific fix snippets.
After writing:
- For each quarantined test: "Add the engine-specific skip annotation to disable this test in CI. Re-enable after the root cause is fixed."
- For fix-eligible tests: "The fix for [test] is straightforward: change the equality comparison on line [N] to use `is_equal_approx`."
- Summary: "Once all quarantine annotations are applied, CI should run green. Schedule fix work for the [N] quarantined tests before the release gate."
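The append-only rule for the Quarantined Tests table can be illustrated with a short helper. The three-column layout (test, fail rate, cause) and the name `append_quarantine_rows` are assumptions; the source does not specify the table's columns.

```python
from __future__ import annotations


def append_quarantine_rows(
    table_lines: list[str], new_rows: list[tuple[str, str, str]]
) -> list[str]:
    """Append new quarantine entries; never remove or rewrite existing ones."""
    # Skip the header and separator rows, then read the first cell of each entry.
    existing = {
        line.split("|")[1].strip() for line in table_lines[2:] if line.startswith("|")
    }
    out = list(table_lines)
    for test_id, fail_rate, cause in new_rows:
        if test_id in existing:
            continue  # Already quarantined: leave the original entry untouched.
        out.append(f"| {test_id} | {fail_rate} | {cause} |")
        existing.add(test_id)
    return out
```

Returning a new list rather than mutating the input keeps the original table available for a diff before asking the user to approve the write.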
Collaborative Protocol
- Never delete test files — quarantine means annotate + list, not remove
- Statistical confidence matters — with < 3 runs, flag findings as "suspected" not "confirmed"; ask if more run data is available
- Fix is always the goal — quarantine is temporary; surface the fix direction even when recommending quarantine
- Ask before writing — both the regression-suite update and the report file require explicit approval. On write: Verdict: COMPLETE — flakiness report written. On decline: Verdict: BLOCKED — user declined write.
- Flakiness in CI is a team problem — surface the list and recommended actions clearly; do not just silently quarantine without the team knowing