openclaw-test-heap-leaks

OpenClaw Test Heap Leaks

Use this skill for test-memory investigations. Do not guess from RSS alone when heap snapshots are available. Treat snapshot-name deltas as triage evidence, not proof, until retainers or dominators support the call.

For runtime fixes (e.g., closure leaks in long-running services like the gateway), see "Validating runtime fixes" below. That workflow uses a dedicated harness, not the test-parallel snapshot machinery.

Workflow

Reproduce the failing shape first.
- Match the real entrypoint if possible. For Linux CI-style unit failures, start with:
  ```
  pnpm canvas:a2ui:bundle && \
    OPENCLAW_TEST_MEMORY_TRACE=1 \
    OPENCLAW_TEST_HEAPSNAPSHOT_INTERVAL_MS=60000 \
    OPENCLAW_TEST_HEAPSNAPSHOT_DIR=.tmp/heapsnap \
    OPENCLAW_TEST_WORKERS=2 \
    OPENCLAW_TEST_MAX_OLD_SPACE_SIZE_MB=6144 \
    pnpm test
  ```
- Keep `OPENCLAW_TEST_MEMORY_TRACE=1` enabled so the wrapper prints per-file RSS summaries alongside the snapshots.
- If the report is about a specific shard or worker budget, preserve that shape.
- Before you analyze snapshots, identify the real lane names from `[test-parallel] start ...` lines or `pnpm test --plan`. Do not assume a single `unit-fast` lane; local plans often split into `unit-fast-batch-*`.
Wait for repeated snapshots before concluding anything.
- Take at least two intervals from the same lane.
- Compare snapshots from the same PID inside the real lane directory, such as `.tmp/heapsnap/unit-fast-batch-2/`.
- Use `.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs` to compare either two files directly or the earliest/latest pair per PID in one lane directory.
- If the helper suggests transformed-module retention, confirm the top entries in DevTools retainers/dominators before calling it solved.
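The "earliest/latest pair per PID" selection can be sketched as follows. This is a minimal illustration, not the logic of `heapsnapshot-delta.mjs`: the record shape (`pid`, `takenAtMs`, `file`) and the name `pickSnapshotPairs` are hypothetical, and nothing is assumed about how the wrapper encodes the PID or timestamp in snapshot file names.

```javascript
// Hypothetical record shape: { pid: number, takenAtMs: number, file: string }.
// Groups snapshot records by PID and returns the earliest/latest pair for
// every PID that has at least two snapshots.
function pickSnapshotPairs(records) {
  const byPid = new Map();
  for (const record of records) {
    const group = byPid.get(record.pid) ?? [];
    group.push(record);
    byPid.set(record.pid, group);
  }

  const pairs = [];
  for (const [pid, group] of byPid) {
    if (group.length < 2) continue; // a single snapshot cannot show a trend
    const sorted = [...group].sort((a, b) => a.takenAtMs - b.takenAtMs);
    pairs.push({
      pid,
      before: sorted[0].file,
      after: sorted[sorted.length - 1].file,
    });
  }
  return pairs;
}

const pairs = pickSnapshotPairs([
  { pid: 16133, takenAtMs: 60_000, file: "a.heapsnapshot" },
  { pid: 16133, takenAtMs: 180_000, file: "c.heapsnapshot" },
  { pid: 16133, takenAtMs: 120_000, file: "b.heapsnapshot" },
  { pid: 20001, takenAtMs: 60_000, file: "only.heapsnapshot" },
]);
console.log(pairs);
```

PIDs with a single snapshot are skipped on purpose, which matches the rule of waiting for repeated snapshots from the same worker before drawing conclusions.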
Classify the growth before choosing a fix.
- If growth is dominated by Vite/Vitest transformed source strings, `Module`, `system / Context`, bytecode, descriptor arrays, or property maps, treat it as likely retained module graph growth in long-lived workers.
- If growth is dominated by app objects, caches, buffers, server handles, timers, mock state, sqlite state, or similar runtime objects, treat it as a likely cleanup or lifecycle leak.
- If the names are ambiguous, stop short of a confident label and inspect retainers/dominators in DevTools for the top deltas.
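The classification rule above can be approximated in code as a first-pass triage. The sketch below is only a heuristic under stated assumptions: the row shape (`name`, `deltaBytes`), the substring hints, the 0.6 dominance threshold, and the name `classifyGrowth` are all hypothetical and would need tuning against real delta output. It never replaces retainer/dominator inspection.

```javascript
// Hypothetical substring hints (lowercase); real snapshot names may differ.
const MODULE_GRAPH_HINTS = [
  "module",
  "system / context",
  "bytecode",
  "descriptor array",
  "property map",
  "transformed source",
];
const RUNTIME_HINTS = ["cache", "buffer", "server", "timer", "mock", "sqlite"];

// rows: [{ name: string, deltaBytes: number }] taken from a snapshot delta.
function classifyGrowth(rows, dominanceRatio = 0.6) {
  let moduleBytes = 0;
  let runtimeBytes = 0;
  let totalBytes = 0;

  for (const { name, deltaBytes } of rows) {
    if (deltaBytes <= 0) continue; // only positive growth is interesting
    totalBytes += deltaBytes;
    const lower = name.toLowerCase();
    if (MODULE_GRAPH_HINTS.some((hint) => lower.includes(hint))) {
      moduleBytes += deltaBytes;
    } else if (RUNTIME_HINTS.some((hint) => lower.includes(hint))) {
      runtimeBytes += deltaBytes;
    }
  }

  if (totalBytes === 0) return "no-growth";
  if (moduleBytes / totalBytes >= dominanceRatio) return "likely-module-graph-retention";
  if (runtimeBytes / totalBytes >= dominanceRatio) return "likely-lifecycle-leak";
  return "ambiguous-inspect-retainers";
}
```

The third return value is the important one: when neither family dominates, the sketch refuses to label the growth and points back to DevTools.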
Fix the right layer.
- For likely retained transformed-module growth in shared workers:
  - Prefer timing- and hotspot-driven scheduling fixes first. Check whether the file is already represented in `test/fixtures/test-timings.unit.json` and whether `scripts/test-update-memory-hotspots.mjs` should refresh the measured hotspot manifest before hand-editing behavior overrides.
  - Move hotspot files out of the real shared lane by updating `test/fixtures/test-parallel.behavior.json` only when timing-driven peeling is insufficient.
  - Prefer `singletonIsolated` for files that are safe alone but inflate shared worker heaps.
  - If the file should already have been peeled out by timings but is absent from `test/fixtures/test-timings.unit.json`, call that out explicitly. Missing timings are a scheduling blind spot.
- For real leaks:
  - Patch the implicated test or runtime cleanup path.
  - Look for missing `afterEach`/`afterAll`, module-reset gaps, retained global state, unreleased DB handles, or listeners/timers that survive the file.
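For the real-leak case, one common cleanup pattern is to register a teardown callback whenever a resource is created and drain all of them from a per-test hook. The sketch below is framework-free so it runs standalone; `createCleanupStack` is a hypothetical helper, not something this repo is known to provide, and it handles synchronous teardown only.

```javascript
// Collects teardown callbacks and runs them in reverse registration order,
// so resources created later are released before the ones they depend on.
function createCleanupStack() {
  const callbacks = [];
  return {
    add(callback) {
      callbacks.push(callback);
    },
    runAll() {
      while (callbacks.length > 0) {
        const callback = callbacks.pop();
        callback();
      }
    },
    get pending() {
      return callbacks.length;
    },
  };
}

const cleanup = createCleanupStack();

// Example resources that would otherwise survive the test file.
const timer = setInterval(() => {}, 1000);
cleanup.add(() => clearInterval(timer));

const listeners = new Set();
const onEvent = () => {};
listeners.add(onEvent);
cleanup.add(() => listeners.delete(onEvent));

// In a Vitest file this call would typically live inside an afterEach hook.
cleanup.runAll();
console.log(cleanup.pending, listeners.size); // prints "0 0"
```

Registering the teardown next to the allocation makes a missing cleanup visible in review, which is easier than auditing every `afterEach` after the fact.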
Verify with the most direct proof.
- Re-run the targeted lane or file with heap snapshots enabled if the suite still finishes in reasonable time.
- If snapshot overhead pushes tests over Vitest timeouts, fall back to the same lane without snapshots and confirm the RSS trend or OOM is reduced.
- For wrapper-only changes, at minimum verify the expected lanes start and the snapshot files are written.

Heuristics

启发式规则

- Do not call everything a leak. In this repo, large `unit-fast` or `unit-fast-batch-*` growth can be a worker-lifetime problem rather than an application object leak.
- `scripts/test-parallel.mjs` and `scripts/test-parallel-memory.mjs` are the primary control points for wrapper diagnostics.
- The lane names printed by `[test-parallel] start ...` and `[test-parallel][mem] summary ...` tell you where to focus.
- When one or two files account for most of the delta and they are missing from timings, reducing impact by isolating them is usually the first pragmatic fix.
- When the same retained object families grow across multiple intervals in the same worker PID, trust the snapshots over intuition, then confirm ambiguous calls with retainer evidence.
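The "same families grow across multiple intervals" check can be made mechanical. A minimal sketch, assuming each snapshot from one worker PID has already been reduced to a plain object mapping family name to bytes; that input shape and the name `findSteadyGrowers` are hypothetical.

```javascript
// snapshots: array of { [familyName]: bytes }, oldest first, all taken from
// the same worker PID. Returns the families that grew in every interval.
function findSteadyGrowers(snapshots, minStepBytes = 1) {
  // Fewer than three snapshots means fewer than two intervals.
  if (snapshots.length < 3) return [];

  const names = Object.keys(snapshots[snapshots.length - 1]);
  return names.filter((name) => {
    for (let i = 1; i < snapshots.length; i += 1) {
      const previous = snapshots[i - 1][name] ?? 0;
      const current = snapshots[i][name] ?? 0;
      if (current - previous < minStepBytes) return false;
    }
    return true;
  });
}
```

A family that grows once and then plateaus is filtered out, which helps separate one-time warm-up cost from growth that keeps accumulating.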

Snapshot Comparison

Direct comparison:

```
node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs before.heapsnapshot after.heapsnapshot
```

Auto-select earliest/latest snapshots per PID within one lane:

```
node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs --lane-dir .tmp/heapsnap/unit-fast-batch-2
```

Useful flags:
- `--top 40`
- `--min-kb 32`
- `--pid 16133`

Read the top positive deltas first. Large positive growth in module-transform artifacts suggests lane isolation; large positive growth in runtime objects suggests a real leak. If the names alone do not settle it, open the same snapshot pair in DevTools and inspect retainers/dominators for the top rows before declaring root cause.
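To make "snapshot-name deltas" concrete, here is a rough sketch of how such a delta could be computed from two parsed `.heapsnapshot` files. It is not the implementation of `heapsnapshot-delta.mjs`; the function names are hypothetical, and the `top` and `minKb` options only mirror the spirit of `--top` and `--min-kb`. It relies on the V8 snapshot layout in which `nodes` is a flat integer array described by `snapshot.meta.node_fields`, and it sums self size, not retained size.

```javascript
// Sums self_size per node name for one parsed .heapsnapshot object.
function selfSizeByName(snapshot) {
  const fields = snapshot.snapshot.meta.node_fields;
  const stride = fields.length;
  const nameOffset = fields.indexOf("name");
  const sizeOffset = fields.indexOf("self_size");

  const totals = new Map();
  for (let i = 0; i < snapshot.nodes.length; i += stride) {
    const name = snapshot.strings[snapshot.nodes[i + nameOffset]];
    const size = snapshot.nodes[i + sizeOffset];
    totals.set(name, (totals.get(name) ?? 0) + size);
  }
  return totals;
}

// Returns the largest positive per-name deltas, biggest first.
function topPositiveDeltas(before, after, { top = 40, minKb = 32 } = {}) {
  const beforeTotals = selfSizeByName(before);
  const afterTotals = selfSizeByName(after);

  const rows = [];
  for (const [name, afterBytes] of afterTotals) {
    const deltaBytes = afterBytes - (beforeTotals.get(name) ?? 0);
    if (deltaBytes >= minKb * 1024) rows.push({ name, deltaBytes });
  }
  return rows.sort((a, b) => b.deltaBytes - a.deltaBytes).slice(0, top);
}
```

Because this groups by name and self size only, two unrelated object families with the same name are merged. That is one reason name deltas count as triage evidence until retainers or dominators back them up.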

Validating runtime fixes (not test-memory)

The workflow above is for diagnosing Vitest worker memory growth. For validating that a runtime/closure fix actually releases captured state, use the dedicated harness:

`pnpm leak:embedded-run` runs `scripts/embedded-run-abort-leak.ts`. It loops N aborted runs in a function-shaped scope mimicking `runEmbeddedAttempt`, writes heap snapshots, and reports a PASS/FAIL verdict on retention growth using `FinalizationRegistry` for tracked-instance counting plus RSS delta.

Modes:
- `closure-extracted` (default): production fix shape (helper at module scope).
- `closure-inline`: pre-fix shape (closure inside the runner scope). Use as a sensitivity check: if it passes, you've broken the harness, not fixed a bug.
- `synthetic-leak`: deliberately retains via a module-level bucket. Use to confirm the harness can detect leaks before trusting a PASS on a real fix.

Snapshots land in `.tmp/embedded-run-abort-leak/`. Diff with the same script as above:

```
node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs \
  .tmp/embedded-run-abort-leak/baseline-*.heapsnapshot \
  .tmp/embedded-run-abort-leak/batch-N-*.heapsnapshot --top 30
```

When fixing a different runtime leak, add a new harness alongside this one rather than retrofitting it. The fixture function should mimic the lexical scope of the function where the leak lives, not be a generic abort-loop.
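The tracked-instance idea behind such a harness can be sketched in a few lines. This is an illustration only, not the contents of `scripts/embedded-run-abort-leak.ts`; `createInstanceTracker`, `runAttempts`, and `leakBucket` are hypothetical names. It shows the synthetic-leak shape, where a module-level bucket keeps every tracked object strongly reachable.

```javascript
// Counts tracked instances and how many of them the GC has finalized.
function createInstanceTracker() {
  let created = 0;
  let finalized = 0;
  const registry = new FinalizationRegistry(() => {
    finalized += 1;
  });
  return {
    track(instance) {
      created += 1;
      registry.register(instance, created);
      return instance;
    },
    stats() {
      return { created, finalized, retained: created - finalized };
    },
  };
}

// Module-level bucket: anything pushed here stays strongly reachable.
const leakBucket = [];

function runAttempts(tracker, count, { leak }) {
  for (let i = 0; i < count; i += 1) {
    const state = tracker.track({ payload: new Array(1000).fill(i) });
    if (leak) leakBucket.push(state); // deliberate retention
  }
}

const leaky = createInstanceTracker();
runAttempts(leaky, 50, { leak: true });
console.log(leaky.stats()); // { created: 50, finalized: 0, retained: 50 }
```

In the leaking shape the retained count can never drop, because finalizers do not run for reachable objects. For the non-leaking shape, finalization timing is up to the garbage collector, so a real harness would normally force collection (for example by running Node with `--expose-gc` and calling `globalThis.gc()`) and yield to the event loop before reading the counts.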

Output Expectations

When using this skill, report:

- The exact reproduce command.
- Which lane and PID were compared.
- The dominant retained object families from the snapshot delta.
- Whether the issue is a likely real leak or likely shared-worker retained module growth, plus whether retainers/dominators confirmed it.
- The concrete fix or impact-reduction patch.
- What you verified, and what snapshot overhead prevented you from verifying.
