agent-eval
CodeGraph Quality Audit
Measures how much CodeGraph helps an agent versus plain grep/read, for a chosen
codegraph version on a chosen real-world repo. Drives the harness in
`scripts/agent-eval/`.
Prerequisites
- `tmux` 3+, a logged-in `claude` CLI, `node`, `git` (macOS/Linux).
- Run from the codegraph repo root.
Workflow
Copy this checklist:
- [ ] 1. Pick version (local or npm)
- [ ] 2. Pick language
- [ ] 3. Pick repo by size
- [ ] 4. Pick harness (headless / tmux / both)
- [ ] 5. Run audit.sh in the background
- [ ] 6. Report results
Step 1: version. Ask with `AskUserQuestion` which codegraph version to test.
Offer "Local dev build" and "Latest published"; the free-text "Other" lets the
user type a specific version (e.g. `0.7.10`). Map the answer to a VERSION token:
- "Local dev build" → `local`
- "Latest published" → `latest`
- a typed version → that string (e.g. `0.7.10`)
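The Step 1 mapping can be sketched as a small shell function. This is only an illustration: `version_token` is a hypothetical helper name, not something the harness is known to define, and it assumes the answer arrives as the exact option label or the typed version string.

```bash
# Hypothetical helper: turn the Step 1 answer into a VERSION token.
version_token() {
  case "$1" in
    "Local dev build")  printf '%s\n' "local" ;;
    "Latest published") printf '%s\n' "latest" ;;
    # Anything else is treated as a typed version and passed through unchanged.
    *)                  printf '%s\n' "$1" ;;
  esac
}
```

For example, `VERSION="$(version_token "Latest published")"` would set `VERSION` to `latest`, while a typed `0.7.10` passes through as-is.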
Step 2: language. Read `.claude/skills/agent-eval/corpus.json`. Ask with
`AskUserQuestion` which language to test, listing the languages that have entries.
Step 3: repo. From the chosen language's entries, ask which repo. Label each
option with its size and file count, e.g. `excalidraw — Medium (~600 files)`.
Each entry carries the `repo` URL and a representative `question`.
Step 4: harness. Ask with `AskUserQuestion` which harness to run, and map
the answer to a MODE token:
- "Headless" → `headless`: `claude -p` with stream-json, giving exact tokens/cost and a clean tool sequence (2 runs, fast, no TTY).
- "Interactive (tmux)" → `tmux`: drives the real Claude TUI in tmux, giving faithful Explore-subagent behavior and metrics from session logs (2 runs, slower).
- "Both" → `all`: headless + interactive (4 runs).
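As a rough sketch of the Step 4 mapping, the two functions below turn the harness answer into a MODE token and report how many runs that mode implies, using the run counts listed above. The names `mode_token` and `expected_runs` are hypothetical and are not taken from the harness.

```bash
# Hypothetical helper: turn the Step 4 answer into a MODE token.
mode_token() {
  case "$1" in
    "Headless")           printf '%s\n' "headless" ;;
    "Interactive (tmux)") printf '%s\n' "tmux" ;;
    "Both")               printf '%s\n' "all" ;;
    *) printf 'unknown harness answer: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: number of runs each MODE implies (2, 2, or 4).
expected_runs() {
  case "$1" in
    headless|tmux) printf '%s\n' 2 ;;
    all)           printf '%s\n' 4 ;;
    *)             return 1 ;;
  esac
}
```

Rejecting unknown answers with a non-zero status is a design choice made for this sketch, so that a mistyped option fails before a multi-minute run starts.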
Step 5: run. Launch in the background (sets the version, clones if missing,
wipes + re-indexes, runs the chosen arms; takes several minutes):
```bash
scripts/agent-eval/audit.sh <VERSION> <repo-name> <repo-url> "<question>" <MODE>
```
Step 6: report. When the job finishes, read the log and report per arm:
- Headless (`parse-run.mjs`): total tool calls, file `Read`s, Grep/Bash, codegraph-tool calls, duration, total cost.
- Interactive (`parse-session.mjs`): the `VERDICT: codegraph_explore used Nx | Read N | Grep/Bash N` and `TOKENS:` lines.
Lead with cost + tool/Read counts, which are the reliable signals; raw input/output
token counts are confounded by subagent delegation and prompt caching. State whether
codegraph reduced effort and whether both arms reached a correct answer.
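When reporting the interactive arm, the three counts can be pulled out of the `VERDICT:` line with `sed`. The sketch below assumes the line has exactly the shape quoted in Step 6; `parse_verdict` is a hypothetical helper, and the sample line in the usage note is a made-up placeholder, not a real result.

```bash
# Hypothetical helper: print "<explore> <read> <grep_bash>" for a VERDICT line.
# Prints nothing if the line does not match the expected shape.
parse_verdict() {
  printf '%s\n' "$1" | sed -n -E \
    's#^VERDICT: codegraph_explore used ([0-9]+)x [|] Read ([0-9]+) [|] Grep/Bash ([0-9]+).*$#\1 \2 \3#p'
}
```

For instance, `parse_verdict 'VERDICT: codegraph_explore used 3x | Read 5 | Grep/Bash 2'` prints `3 5 2`, which can then be read into variables with `read explore reads greps`.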
Notes
- The index is rebuilt every run (`audit.sh` wipes `.codegraph`): different versions extract differently, so an index must be served by the same binary that built it.
- `audit.sh` temporarily mutates the global `codegraph` install for the test, then restores your dev link via `local-install.sh`.
- Corpus repos are cloned to `/tmp/codegraph-corpus` (reused if already present).
- Add or edit repos in `corpus.json` (fields: `name`, `repo`, `size`, `files`, `question`).
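The clone-or-reuse behavior described for corpus repos could look roughly like the following. This is a sketch under assumptions, not the actual `audit.sh` logic: `ensure_corpus_repo` and the `CORPUS_DIR` override are hypothetical, and checking for a `.git` directory is only one plausible way to detect an existing clone.

```bash
# Sketch only: defaults to the corpus path named in the notes above.
CORPUS_DIR="${CORPUS_DIR:-/tmp/codegraph-corpus}"

# Hypothetical helper: clone <url> into $CORPUS_DIR/<name> unless it is already there.
ensure_corpus_repo() {
  local name="$1" url="$2"
  local dest="$CORPUS_DIR/$name"
  if [ -d "$dest/.git" ]; then
    echo "reusing $dest"        # already cloned: leave it in place
  else
    mkdir -p "$CORPUS_DIR"
    git clone "$url" "$dest"    # first use: clone from the entry's repo URL
  fi
}
```

Because the directory is reused across runs, a stale checkout stays stale until it is deleted by hand; the per-run re-indexing still applies to whatever is on disk.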