openclaw-test-heap-leaks
OpenClaw Test Heap Leaks
Use this skill for test-memory investigations. Do not guess from RSS alone when heap snapshots are available. Treat snapshot-name deltas as triage evidence, not proof, until retainers or dominators support the call.
For runtime fixes (e.g., closure leaks in long-running services like the gateway), see Validating runtime fixes below — that uses a dedicated harness, not the test-parallel snapshot machinery.
Workflow
- Reproduce the failing shape first.
  - Match the real entrypoint if possible. For Linux CI-style unit failures, start with:
    `pnpm canvas:a2ui:bundle && OPENCLAW_TEST_MEMORY_TRACE=1 OPENCLAW_TEST_HEAPSNAPSHOT_INTERVAL_MS=60000 OPENCLAW_TEST_HEAPSNAPSHOT_DIR=.tmp/heapsnap OPENCLAW_TEST_WORKERS=2 OPENCLAW_TEST_MAX_OLD_SPACE_SIZE_MB=6144 pnpm test`
  - Keep `OPENCLAW_TEST_MEMORY_TRACE=1` enabled so the wrapper prints per-file RSS summaries alongside the snapshots.
  - If the report is about a specific shard or worker budget, preserve that shape.
  - Before you analyze snapshots, identify the real lane names from `[test-parallel] start ...` lines or `pnpm test --plan`. Do not assume a single `unit-fast` lane; local plans often split into `unit-fast-batch-*`.
- Wait for repeated snapshots before concluding anything.
  - Take at least two intervals from the same lane.
  - Compare snapshots from the same PID inside the real lane directory, such as `.tmp/heapsnap/unit-fast-batch-2/`.
  - Use `.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs` to compare either two files directly or the earliest/latest pair per PID in one lane directory.
  - If the helper suggests transformed-module retention, confirm the top entries in DevTools retainers/dominators before calling it solved.
- Classify the growth before choosing a fix.
  - If growth is dominated by Vite/Vitest transformed source strings, `Module`, `system / Context`, bytecode, descriptor arrays, or property maps, treat it as likely retained module graph growth in long-lived workers.
  - If growth is dominated by app objects, caches, buffers, server handles, timers, mock state, sqlite state, or similar runtime objects, treat it as a likely cleanup or lifecycle leak.
  - If the names are ambiguous, stop short of a confident label and inspect retainers/dominators in DevTools for the top deltas.
- Fix the right layer.
  - For likely retained transformed-module growth in shared workers:
    - Prefer timing and hotspot-driven scheduling fixes first. Check whether the file is already represented in `test/fixtures/test-timings.unit.json` and whether `scripts/test-update-memory-hotspots.mjs` should refresh the measured hotspot manifest before hand-editing behavior overrides.
    - Move hotspot files out of the real shared lane by updating `test/fixtures/test-parallel.behavior.json` only when timing-driven peeling is insufficient.
    - Prefer `singletonIsolated` for files that are safe alone but inflate shared worker heaps.
    - If the file should already have been peeled out by timings but is absent from `test/fixtures/test-timings.unit.json`, call that out explicitly. Missing timings are a scheduling blind spot.
  - For real leaks:
    - Patch the implicated test or runtime cleanup path.
    - Look for missing `afterEach`/`afterAll`, module-reset gaps, retained global state, unreleased DB handles, or listeners/timers that survive the file.
- Verify with the most direct proof.
  - Re-run the targeted lane or file with heap snapshots enabled if the suite still finishes in reasonable time.
  - If snapshot overhead pushes tests over Vitest timeouts, fall back to the same lane without snapshots and confirm the RSS trend or OOM is reduced.
  - For wrapper-only changes, at minimum verify the expected lanes start and the snapshot files are written.
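
The classification step in the workflow above can be sketched as a small triage function. This is only an illustration: the row shape (`name`, `deltaKb`), the hint lists, and the 0.6 dominance threshold are assumptions made for the example, not the output format or logic of `heapsnapshot-delta.mjs`. Name matching like this is triage evidence at best; retainers/dominators still decide the call.

```javascript
// Hypothetical row shape: { name: string, deltaKb: number }.
// Hint lists are illustrative guesses based on the families named in the workflow.
const MODULE_GRAPH_HINTS = ["module", "system / context", "bytecode", "descriptor array", "property map"];
const RUNTIME_HINTS = ["cache", "buffer", "server", "timeout", "timer", "mock", "sqlite"];

function classifyGrowth(rows, { dominance = 0.6 } = {}) {
  const totals = { moduleGraph: 0, runtime: 0, unknown: 0 };
  for (const { name, deltaKb } of rows) {
    if (deltaKb <= 0) continue; // only positive growth is counted
    const lower = name.toLowerCase();
    if (MODULE_GRAPH_HINTS.some((hint) => lower.includes(hint))) {
      totals.moduleGraph += deltaKb;
    } else if (RUNTIME_HINTS.some((hint) => lower.includes(hint))) {
      totals.runtime += deltaKb;
    } else {
      totals.unknown += deltaKb;
    }
  }
  const growth = totals.moduleGraph + totals.runtime + totals.unknown;
  if (growth === 0) return { label: "no-growth", totals };
  if (totals.moduleGraph / growth >= dominance) {
    return { label: "likely-retained-module-graph", totals };
  }
  if (totals.runtime / growth >= dominance) {
    return { label: "likely-lifecycle-leak", totals };
  }
  // Neither family dominates: stop short of a label and inspect retainers.
  return { label: "ambiguous-inspect-retainers", totals };
}
```

The third label is deliberate: when neither family dominates, the sketch refuses to guess, which mirrors the instruction to inspect retainers/dominators before naming a cause.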
Heuristics
- Do not call everything a leak. In this repo, large `unit-fast` or `unit-fast-batch-*` growth can be a worker-lifetime problem rather than an application object leak.
- `scripts/test-parallel.mjs` and `scripts/test-parallel-memory.mjs` are the primary control points for wrapper diagnostics.
- The lane names printed by `[test-parallel] start ...` and `[test-parallel][mem] summary ...` tell you where to focus.
- When one or two files account for most of the delta and they are missing from timings, reducing impact by isolating them is usually the first pragmatic fix.
- When the same retained object families grow across multiple intervals in the same worker PID, trust the snapshots over intuition, then confirm ambiguous calls with retainer evidence.
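
The last heuristic (same family, same PID, growth across multiple intervals) can be expressed as a simple check. A minimal sketch, assuming a hypothetical sample shape `{ pid, sizeKb }` ordered oldest to newest for one object family, and assuming that "multiple intervals" means at least two consecutive deltas (three snapshots):

```javascript
// Hypothetical input: per-snapshot totals for ONE object family, oldest first.
// Returns true only if every consecutive interval grew by at least minStepKb.
function growsAcrossIntervals(samples, minStepKb = 1) {
  const pids = new Set(samples.map((sample) => sample.pid));
  if (pids.size > 1) {
    // Mixing workers makes the comparison meaningless.
    throw new Error("compare samples from a single worker PID");
  }
  if (samples.length < 3) return false; // fewer than two intervals: not enough evidence
  for (let i = 1; i < samples.length; i++) {
    if (samples[i].sizeKb - samples[i - 1].sizeKb < minStepKb) return false;
  }
  return true;
}
```

A single flat or shrinking interval makes the check return `false`, which matches the intent of waiting for repeated snapshots rather than reacting to one large delta.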
Snapshot Comparison
- Direct comparison:
node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs before.heapsnapshot after.heapsnapshot
- Auto-select earliest/latest snapshots per PID within one lane:
node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs --lane-dir .tmp/heapsnap/unit-fast-batch-2
- Useful flags: `--top 40`, `--min-kb 32`, `--pid 16133`
Read the top positive deltas first. Large positive growth in module-transform artifacts suggests lane isolation; large positive growth in runtime objects suggests a real leak. If the names alone do not settle it, open the same snapshot pair in DevTools and inspect retainers/dominators for the top rows before declaring root cause.
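
For intuition about what a name-level delta measures, here is a minimal sketch that sums `self_size` per node name from two parsed `.heapsnapshot` objects and lists the largest positive differences. It relies only on the V8 snapshot layout (`snapshot.meta.node_fields`, the flat `nodes` array, and `strings`); it is not the implementation of `heapsnapshot-delta.mjs`, and the function names are made up for the example.

```javascript
// Sum self_size (bytes) per node name for one parsed .heapsnapshot object.
function selfSizeByName(snapshot) {
  const fields = snapshot.snapshot.meta.node_fields;
  const stride = fields.length; // each node occupies `stride` slots in `nodes`
  const nameIdx = fields.indexOf("name");
  const sizeIdx = fields.indexOf("self_size");
  const totals = new Map();
  for (let i = 0; i < snapshot.nodes.length; i += stride) {
    const name = snapshot.strings[snapshot.nodes[i + nameIdx]];
    totals.set(name, (totals.get(name) ?? 0) + snapshot.nodes[i + sizeIdx]);
  }
  return totals;
}

// Return the largest positive per-name deltas between two parsed snapshots.
function topPositiveDeltas(before, after, top = 10) {
  const beforeTotals = selfSizeByName(before);
  const afterTotals = selfSizeByName(after);
  const rows = [];
  for (const [name, size] of afterTotals) {
    const deltaBytes = size - (beforeTotals.get(name) ?? 0);
    if (deltaBytes > 0) rows.push({ name, deltaBytes });
  }
  rows.sort((a, b) => b.deltaBytes - a.deltaBytes);
  return rows.slice(0, top);
}
```

Each input would come from something like `JSON.parse(fs.readFileSync(path, "utf8"))`; very large snapshots may be too big to load that way. Note that self-size totals say nothing about who retains the objects, which is why the DevTools retainer/dominator check remains the deciding evidence.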
Validating runtime fixes (not test-memory)
The workflow above is for diagnosing Vitest worker memory growth. For validating that a runtime/closure fix actually releases captured state, use the dedicated harness:
- `pnpm leak:embedded-run`: runs `scripts/embedded-run-abort-leak.ts`. Loops N aborted runs in a function-shaped scope mimicking `runEmbeddedAttempt`, writes heap snapshots, and reports a PASS/FAIL verdict on retention growth using `FinalizationRegistry` for tracked-instance counting plus RSS delta.

Modes:
- `closure-extracted` (default): production fix shape (helper at module scope).
- `closure-inline`: pre-fix shape (closure inside the runner scope). Use as a sensitivity check: if it passes you've broken the harness, not fixed a bug.
- `synthetic-leak`: deliberately retains via a module-level bucket. Use to confirm the harness can detect leaks before trusting a PASS on a real fix.

Snapshots land in `.tmp/embedded-run-abort-leak/`. Diff with the same script as above:

    node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs \
      .tmp/embedded-run-abort-leak/baseline-*.heapsnapshot \
      .tmp/embedded-run-abort-leak/batch-N-*.heapsnapshot --top 30

When fixing a different runtime leak, add a new harness alongside this one rather than retrofitting it. The fixture function should mimic the lexical scope of the function where the leak lives, not be a generic abort-loop.
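
To illustrate the tracked-instance counting idea, here is a minimal sketch built on `FinalizationRegistry`. It is not the code in `scripts/embedded-run-abort-leak.ts`; `InstanceTracker`, `runBatch`, and `bucket` are hypothetical names, and the module-level bucket only imitates the idea behind the `synthetic-leak` mode.

```javascript
// Counts tracked objects that have not yet been reported as collected.
class InstanceTracker {
  #created = 0;
  #collected = 0;
  // The callback runs some time after a tracked object is garbage collected.
  #registry = new FinalizationRegistry(() => {
    this.#collected += 1;
  });

  track(obj) {
    this.#created += 1;
    this.#registry.register(obj, undefined);
    return obj;
  }

  get live() {
    return this.#created - this.#collected;
  }
}

// Module-level bucket: anything pushed here stays strongly reachable.
const bucket = [];

// Hypothetical batch loop: allocate n tracked objects, optionally leaking them.
function runBatch(tracker, n, { leak }) {
  for (let i = 0; i < n; i++) {
    const state = tracker.track({ payload: new Array(1000).fill(i) });
    if (leak) bucket.push(state);
  }
}
```

With `leak: true`, `tracker.live` can never drop, because the bucket keeps every object reachable. With `leak: false`, the count should fall only after a collection has happened and finalization callbacks have run; one way to encourage that is to start Node with `--expose-gc`, call `globalThis.gc()`, and yield to the event loop before reading `live`. Finalization timing is not guaranteed, which is presumably why the harness pairs instance counts with RSS delta and snapshots rather than relying on a single signal.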
Output Expectations
When using this skill, report:
- The exact reproduce command.
- Which lane and PID were compared.
- The dominant retained object families from the snapshot delta.
- Whether the issue is a likely real leak or likely shared-worker retained module growth, plus whether retainers/dominators confirmed it.
- The concrete fix or impact-reduction patch.
- What you verified, and what snapshot overhead prevented you from verifying.