test-flakiness

Test Flakiness Detection

A flaky test is one that sometimes passes and sometimes fails without any code change. Flaky tests are worse than no tests in some ways — they train the team to ignore red CI runs, masking genuine failures. This skill identifies them, explains likely causes, and recommends whether to quarantine or fix each one.

Output: Updated `tests/regression-suite.md` quarantine section + optional `production/qa/flakiness-report-[date].md`

When to run:

- Polish phase (tests have had many runs; statistical signal is reliable)
- When developers start dismissing CI failures as "probably flaky"
- After `/regression-suite` identifies quarantined tests that need diagnosis

1. Parse Arguments

Modes:

- `/test-flakiness [ci-log-path]`: analyse a specific CI run log file
- `/test-flakiness scan`: scan all available CI logs in `.github/` or standard log output directories
- `/test-flakiness registry`: read the existing `regression-suite.md` quarantine section and provide remediation guidance for already-known flaky tests
- No argument: auto-detect. Run `scan` if CI logs are accessible, else `registry`
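The mode selection above can be sketched as a small dispatch function. This is only an illustration: `resolve_mode` and the `logs_available` flag are hypothetical names, and how log accessibility is actually determined is left to the caller.

```python
from pathlib import Path


def resolve_mode(argument, logs_available):
    """Map the skill argument to a (mode, path) pair.

    mode is one of 'file', 'scan', or 'registry'; path is set only for 'file'.
    """
    if argument is None or argument.strip() == "":
        # No argument: auto-detect based on whether CI logs are accessible.
        return ("scan", None) if logs_available else ("registry", None)
    if argument in ("scan", "registry"):
        return (argument, None)
    # Anything else is treated as a path to a specific CI run log file.
    return ("file", Path(argument))
```

Returning a tuple keeps the path and the mode together, so later steps do not need to re-parse the argument.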

2. Locate CI Log Data

Option A — GitHub Actions (preferred)

Check for test result artifacts:

```bash
ls -t .github/ 2>/dev/null
ls -t test-results/ 2>/dev/null
```

For Godot projects: GdUnit4 outputs XML results compatible with JUnit format. Check `test-results/` for `.xml` files.

For Unity projects: game-ci test runner outputs NUnit XML to `test-results/` by default.

For Unreal projects: automation logs go to `Saved/Logs/`. Grep for `Result: Success` and `Result: Fail` patterns.
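A minimal sketch of the lookup described above, assuming the two directories named in the text and assuming Unreal logs use a `.log` extension. `find_result_files` and `SEARCH_LOCATIONS` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# (directory, glob pattern) pairs taken from the locations named above.
# The "*.log" pattern for Unreal is an assumption about the file extension.
SEARCH_LOCATIONS = [
    ("test-results", "*.xml"),
    ("Saved/Logs", "*.log"),
]


def find_result_files(project_root):
    """Return candidate result files under project_root, newest first."""
    root = Path(project_root)
    candidates = []
    for folder, pattern in SEARCH_LOCATIONS:
        directory = root / folder
        if directory.is_dir():  # skip locations that do not exist
            candidates.extend(directory.glob(pattern))
    # Mirror `ls -t`: sort by modification time, most recent first.
    return sorted(candidates, key=lambda p: p.stat().st_mtime, reverse=True)
```

An empty return value corresponds to the "no log data available" case handled in Option C.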

Option B — Local log files

If a path argument is provided, read that file directly.

Option C — No log data available

If no logs are found:

"No CI log data found. To detect flaky tests, this skill needs test result history from multiple runs. Options:

- Run the test suite at least 3 times and collect the output logs
- Check CI pipeline output and save a log to `test-results/`
- Run `/test-flakiness registry` to review tests already flagged as flaky in `tests/regression-suite.md`"

Stop and ask the user which option to pursue.

3. Parse Test Results

For each CI log or result file found, parse:

JUnit XML format (GdUnit4 / Unity):

- Grep for `<testcase name=` to get test names
- Grep for `<failure` or `<error` to identify failures
- Parse `classname` and `name` attributes for full test identifiers

Plain text logs:

Grep for pass/fail patterns:
- Godot: `PASSED` / `FAILED` adjacent to test names
- Unreal: `Result: Success` / `Result: Fail`
- Unity: `Test passed` / `Test failed`

Build a table:

test_id → [run1_result, run2_result, run3_result, ...]
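For JUnit-style XML, the table above could be built roughly as follows. This is a sketch under assumptions: each run is available as one XML string, the test identifier is `classname.name`, and a `<skipped>` child marks a skipped test (a common JUnit convention that the text above does not mention). `parse_junit_xml` and `build_history` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def parse_junit_xml(xml_text):
    """Return {test_id: 'PASS' | 'FAIL' | 'SKIP'} for one run's XML report."""
    results = {}
    root = ET.fromstring(xml_text)
    for case in root.iter("testcase"):
        # Full identifier from the classname and name attributes.
        test_id = f"{case.get('classname', '')}.{case.get('name', '')}"
        if case.find("failure") is not None or case.find("error") is not None:
            results[test_id] = "FAIL"
        elif case.find("skipped") is not None:  # assumption: JUnit skip marker
            results[test_id] = "SKIP"
        else:
            results[test_id] = "PASS"
    return results


def build_history(runs):
    """Build test_id -> [run1_result, run2_result, ...] from XML strings."""
    history = defaultdict(list)
    for xml_text in runs:
        for test_id, outcome in parse_junit_xml(xml_text).items():
            history[test_id].append(outcome)
    return dict(history)
```

A test that is missing from some runs simply ends up with a shorter list, so fail rates should be computed per test rather than against the total run count.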

4. Identify Flaky Tests

A test is flaky if it appears in the result history with both PASS and FAIL outcomes across runs with no code changes between them.

Flakiness thresholds:

- High flakiness: fails in >25% of runs; quarantine immediately
- Moderate flakiness: fails in 5–25% of runs; investigate and fix soon
- Low/suspected flakiness: fails in 1–5% of runs; monitor, since it may be a genuinely rare failure
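The definition and thresholds above can be expressed as a short classifier. Two points are assumptions rather than something the text specifies: a fail rate of exactly 5% is treated as moderate, and rates below 1% are still reported as low. `fail_rate` and `classify` are hypothetical names.

```python
def fail_rate(outcomes):
    """Fraction of FAIL results among runs that actually executed the test."""
    counted = [o for o in outcomes if o in ("PASS", "FAIL")]
    if not counted:
        return 0.0
    return counted.count("FAIL") / len(counted)


def classify(outcomes):
    """Return 'not flaky', 'low', 'moderate', or 'high' for one test's history."""
    # Flaky means both outcomes appear; a test that always fails is a
    # consistent failure, not a flaky one.
    if "PASS" not in outcomes or "FAIL" not in outcomes:
        return "not flaky"
    rate = fail_rate(outcomes)
    if rate > 0.25:
        return "high"      # quarantine immediately
    if rate >= 0.05:       # assumption: the 5% boundary counts as moderate
        return "moderate"  # investigate and fix soon
    return "low"           # monitor; may be a genuinely rare failure
```

For example, one failure in four runs gives a rate of exactly 25%, which falls in the moderate band because the high band is strictly greater than 25%.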

For each flaky test, classify the likely cause:

Cause classification

| Cause | Symptoms | Fix direction |
|---|---|---|
| Timing / async | Fails after awaiting signals or timers; pass rate correlates with system load | Add explicit await/synchronisation; avoid time-based delays |
| Order dependency | Fails when run after specific other tests; passes in isolation | Add proper setup/teardown; ensure test isolation |
| Random seed | Fails intermittently with no pattern; involves RNG | Pass explicit seed; don't use `randf()` in tests |
| Resource leak | Fails more often later in a test run | Fix cleanup in teardown; check orphan nodes (Godot) or object disposal (Unity) |
| External state | Fails when a file, scene, or global exists from a prior test | Isolate test from file system; use in-memory mocks |
| Floating point | Fails on comparisons like `== 0.5` | Use epsilon comparison (`is_equal_approx`, `Assert.AreApproximatelyEqual`) |
| Scene/prefab load race | Fails when scenes are not yet ready | Await one frame after instantiation; use `await get_tree().process_frame` |

Use Grep to check the test file for timing calls, randf, global state access, or equality comparisons on floats to narrow down the cause.
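The grep pass described above might look like the following when applied to GDScript test source. The patterns are illustrative guesses at what each cause tends to look like in code, not a definitive rule set, and a match is only a hint. `CAUSE_PATTERNS` and `suggest_causes` are hypothetical names.

```python
import re

# Illustrative patterns only; each maps a cause from the table to text
# that often appears in tests affected by it.
CAUSE_PATTERNS = {
    "Timing / async": re.compile(r"\bawait\b|create_timer|\bTimer\b"),
    "Random seed": re.compile(r"\brandf\(|\brandi\(|\brandomize\("),
    "Floating point": re.compile(r"==\s*-?\d+\.\d+"),
    "Scene/prefab load race": re.compile(r"\binstantiate\(|\bload\("),
}


def suggest_causes(source):
    """Return the causes whose pattern appears anywhere in the test source."""
    return [
        cause
        for cause, pattern in CAUSE_PATTERNS.items()
        if pattern.search(source)
    ]
```

Order dependency and resource leaks rarely show up in a single file's text, so those causes are better judged from the run history (for example, whether the test passes in isolation).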

5. Recommend Action

For each flaky test:

Quarantine (High flakiness):

"Quarantine this test immediately. Disable it in CI by adding a `@pytest.mark.skip` / `[Ignore]` / `GdUnitSkip` annotation. Log it in the quarantine section of `tests/regression-suite.md`. The test is now opt-in only. Fix the root cause before removing quarantine."

Investigate and fix soon (Moderate):

"This test is intermittently unreliable. Root cause appears to be [cause]. Suggested fix: [specific fix based on cause classification]. Do not quarantine yet — fix the test directly."

Monitor (Low/suspected):

"This test shows suspected flakiness. Collect more run data before quarantining. Note it as 'suspected' in the regression suite."

6. Generate Reports

In-conversation summary

Flakiness Detection Results

Runs analysed: [N] Tests tracked: [N]

Flaky Tests Found

| Test | System | Fail Rate | Likely Cause | Recommendation |
|---|---|---|---|---|
| [test_name] | [system] | [N]% | Timing | Quarantine + fix async |
| [test_name] | [system] | [N]% | Float comparison | Fix: use epsilon compare |
| [test_name] | [system] | [N]% | Order dependency | Investigate teardown |

Clean Tests (no flakiness detected)

[N] tests ran across [N] runs with consistent results — no flakiness detected.

Data Limitations

[Note if fewer than 5 runs were available — fewer runs = less statistical confidence]

---
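As a rough illustration of filling in the summary template above, the sketch below renders the flaky-test table and the data-limitations note. The row dictionary keys and both function names are hypothetical; the 5-run cutoff comes from the template's own note.

```python
def render_flaky_table(rows):
    """Render the 'Flaky Tests Found' table as markdown text.

    Each row is a dict with keys: test, system, rate (0.0-1.0), cause, action.
    """
    lines = [
        "| Test | System | Fail Rate | Likely Cause | Recommendation |",
        "|---|---|---|---|---|",
    ]
    for row in rows:
        lines.append(
            "| {test} | {system} | {rate:.0%} | {cause} | {action} |".format(**row)
        )
    return "\n".join(lines)


def data_limitations_note(run_count):
    """Return a caveat when fewer than 5 runs were analysed, else ''."""
    if run_count < 5:
        return (
            f"Only {run_count} runs were available; "
            "fewer runs means less statistical confidence."
        )
    return ""
```

Keeping the rendering separate from the analysis makes it easy to reuse the same rows for the in-conversation summary and the optional report file.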

7. Update Regression Suite + Optional Report File

Ask: "May I update the quarantine section of `tests/regression-suite.md` with the flaky tests found?"

If yes: use `Edit` to append entries to the Quarantined Tests table. Never remove existing quarantine entries; only add new ones.
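The append-only rule can be illustrated with a small helper. The actual column layout of the Quarantined Tests table is not given here, so the three-column row format below is an assumption, as is the name `append_quarantine_rows`.

```python
def append_quarantine_rows(existing_rows, new_entries):
    """Return the table rows with new entries appended and nothing removed.

    existing_rows: markdown table lines already in the quarantine section.
    new_entries: (test_id, fail_rate, cause) tuples; fail_rate is 0.0-1.0.
    Assumed row layout (hypothetical): | test_id | fail rate | cause |
    """
    # First cell of every existing row; header and divider cells are
    # harmless extras in this set.
    known = {
        row.split("|")[1].strip()
        for row in existing_rows
        if row.startswith("|")
    }
    merged = list(existing_rows)  # existing entries are always preserved
    for test_id, rate, cause in new_entries:
        if test_id in known:
            continue  # already quarantined; do not duplicate
        merged.append(f"| {test_id} | {rate:.0%} | {cause} |")
        known.add(test_id)
    return merged
```

Because the function only ever appends, applying it twice with the same entries leaves the table unchanged after the first call.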

Ask (separately): "May I write a full flakiness report to `production/qa/flakiness-report-[date].md`?"

The full report includes per-test analysis with cause details and engine-specific fix snippets.

After writing:

- For each quarantined test: "Add the engine-specific skip annotation to disable this test in CI. Re-enable after the root cause is fixed."
- For fix-eligible tests: "The fix for [test] is straightforward: change the equality comparison on line [N] to use `is_equal_approx`."
- Summary: "Once all quarantine annotations are applied, CI should run green. Schedule fix work for the [N] quarantined tests before the release gate."

Collaborative Protocol

- Never delete test files: quarantine means annotate + list, not remove
- Statistical confidence matters: with < 3 runs, flag findings as "suspected" not "confirmed"; ask if more run data is available
- Fix is always the goal: quarantine is temporary; surface the fix direction even when recommending quarantine
- Ask before writing: both the regression-suite update and the report file require explicit approval. On write: `Verdict: COMPLETE — flakiness report written.` On decline: `Verdict: BLOCKED — user declined write.`
- Flakiness in CI is a team problem: surface the list and recommended actions clearly; do not silently quarantine without the team knowing
