agent-eval

CodeGraph Quality Audit

Measures how much CodeGraph helps an agent versus plain grep/read, for a chosen codegraph version on a chosen real-world repo. Drives the harness in `scripts/agent-eval/`.

Prerequisites

`tmux` 3+, a logged-in `claude` CLI, `node`, `git` (macOS/Linux). Run from the codegraph repo root.

Workflow

Copy this checklist:

- [ ] 1. Pick version (local or npm)
- [ ] 2. Pick language
- [ ] 3. Pick repo by size
- [ ] 4. Pick harness (headless / tmux / both)
- [ ] 5. Run audit.sh in the background
- [ ] 6. Report results

Step 1: version. Ask with `AskUserQuestion` which codegraph version to test. Offer "Local dev build" and "Latest published"; the free-text "Other" lets the user type a specific version (e.g. `0.7.10`). Map the answer to a VERSION token:

- "Local dev build" → `local`
- "Latest published" → `latest`
- a typed version → that string (e.g. `0.7.10`)
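
The Step 1 mapping can be sketched as a small lookup. This is only an illustration: the function name `versionToken` is hypothetical, and rejecting an empty answer is an assumption that the source does not state.

```javascript
// Hypothetical helper: maps the Step 1 answer to a VERSION token.
const FIXED_VERSIONS = new Map([
  ["Local dev build", "local"],
  ["Latest published", "latest"],
]);

function versionToken(answer) {
  if (FIXED_VERSIONS.has(answer)) return FIXED_VERSIONS.get(answer);

  // Any other answer is treated as a typed version and passed through,
  // e.g. "0.7.10". Rejecting an empty string is an assumption.
  const typed = String(answer).trim();
  if (typed.length === 0) {
    throw new Error("No version given");
  }
  return typed;
}
```

A typed answer is passed through unchanged apart from trimming, which matches "a typed version → that string".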

Step 2: language. Read `.claude/skills/agent-eval/corpus.json`. Ask with `AskUserQuestion` which language to test, listing the languages that have entries.

Step 3: repo. From the chosen language's entries, ask which repo. Label each option with its size and file count, e.g. `excalidraw — Medium (~600 files)`. Each entry carries the `repo` URL and a representative `question`.
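
As a rough sketch of Steps 2 and 3, the helpers below list the languages that have entries and build one option label per repo. The shape of `corpus.json` is assumed here (an object keyed by language, each holding an array of entries); the sample data, the `typescript` key, the URL, and both function names are placeholders, not taken from the real file.

```javascript
// Assumed corpus shape (not confirmed by the source):
//   { "<language>": [ { name, repo, size, files, question }, ... ] }
const sampleCorpus = {
  typescript: [
    {
      name: "excalidraw",
      repo: "https://example.com/excalidraw.git", // placeholder URL
      size: "Medium",
      files: 600,
      question: "<representative question>", // placeholder
    },
  ],
  go: [], // a language without entries is not offered
};

// Step 2: languages that have at least one entry.
function languagesWithEntries(corpus) {
  return Object.keys(corpus).filter(
    (lang) => Array.isArray(corpus[lang]) && corpus[lang].length > 0
  );
}

// Step 3: one label per repo, following the example label format.
// "\u2014" is the dash character used in that example.
function repoLabel(entry) {
  return `${entry.name} \u2014 ${entry.size} (~${entry.files} files)`;
}
```

If the real file is laid out differently (for example a flat array with a language field), only `languagesWithEntries` would need to change.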

Step 4: harness. Ask with `AskUserQuestion` which harness to run, and map the answer to a MODE token:

- "Headless" → `headless`: `claude -p` with stream-json, giving exact tokens/cost and a clean tool sequence (2 runs, fast, no TTY).
- "Interactive (tmux)" → `tmux`: drives the real Claude TUI in tmux, giving faithful Explore-subagent behavior and metrics from session logs (2 runs, slower).
- "Both" → `all`: headless + interactive (4 runs).
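
The harness choice can be modeled the same way as the version choice. In this sketch the `HARNESS_MODES` table and `modeToken` function are hypothetical names; the tokens and run counts are the ones listed in Step 4.

```javascript
// Hypothetical lookup for Step 4: answer label -> MODE token and run count.
const HARNESS_MODES = new Map([
  ["Headless", { mode: "headless", runs: 2 }],
  ["Interactive (tmux)", { mode: "tmux", runs: 2 }],
  ["Both", { mode: "all", runs: 4 }],
]);

function modeToken(answer) {
  const entry = HARNESS_MODES.get(answer);
  if (!entry) {
    throw new Error(`Unknown harness answer: ${answer}`);
  }
  return entry;
}
```

Carrying the run count alongside the token makes it easy to tell the user up front how many runs to expect.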

Step 5: run. Launch in the background (sets the version, clones if missing, wipes and re-indexes, runs the chosen arms; takes several minutes):

```bash
scripts/agent-eval/audit.sh <VERSION> <repo-name> <repo-url> "<question>" <MODE>
```
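
One way to assemble that command from the answers gathered in Steps 1 to 4 is to build an argument array in the positional order shown above. The function name `auditCommand` and its parameter names are hypothetical; only the script path and the argument order come from the text.

```javascript
// Builds [script, VERSION, repo-name, repo-url, question, MODE].
function auditCommand({ version, repoName, repoUrl, question, mode }) {
  const args = [version, repoName, repoUrl, question, mode];
  // Every positional argument is required; an empty one would shift the rest.
  if (args.some((a) => typeof a !== "string" || a.length === 0)) {
    throw new Error("All five audit.sh arguments must be non-empty strings");
  }
  return ["scripts/agent-eval/audit.sh", ...args];
}
```

Keeping the question as a single array element means it needs no shell quoting if the array is handed to something like `spawn` from `node:child_process`; when typing the command by hand, the double quotes around `<question>` serve the same purpose.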

Step 6: report. When the job finishes, read the log and report per arm:

- Headless (`parse-run.mjs`): total tool calls, file `Read`s, Grep/Bash, codegraph-tool calls, duration, total cost.
- Interactive (`parse-session.mjs`): the `VERDICT: codegraph_explore used Nx | Read N | Grep/Bash N` and `TOKENS:` lines.

Lead with cost and tool/Read counts, which are the reliable signals; raw input/output token counts are confounded by subagent delegation and prompt caching. State whether codegraph reduced effort and whether both arms reached a correct answer.
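
When summarizing the interactive arm, the counts in the `VERDICT:` line can be pulled out with a regular expression. This is a sketch that assumes the line matches the format quoted above exactly, with each `N` an integer; the real `parse-session.mjs` output may differ, and `parseVerdict` is a hypothetical name.

```javascript
// Pattern mirrors: VERDICT: codegraph_explore used Nx | Read N | Grep/Bash N
const VERDICT_RE =
  /^VERDICT: codegraph_explore used (\d+)x \| Read (\d+) \| Grep\/Bash (\d+)$/;

// Returns the three counts, or null if the line is not a VERDICT line.
function parseVerdict(line) {
  const match = VERDICT_RE.exec(line.trim());
  if (!match) return null;
  return {
    exploreCalls: Number(match[1]),
    reads: Number(match[2]),
    grepBash: Number(match[3]),
  };
}
```

Returning `null` for non-matching lines lets a caller scan the whole log and keep only the line that parses.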

Notes

- The index is rebuilt every run (`audit.sh` wipes `.codegraph`): different versions extract differently, so an index must be served by the same binary that built it.
- `audit.sh` temporarily mutates the global `codegraph` install for the test, then restores your dev link via `local-install.sh`.
- Corpus repos are cloned to `/tmp/codegraph-corpus` (reused if already present).
- Add or edit repos in `corpus.json` (fields: `name`, `repo`, `size`, `files`, `question`).
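
Before adding a repo to `corpus.json`, a quick check that the entry carries all five fields can catch omissions early. The helper below is a hypothetical sketch; the field names are the ones listed above, and treating an empty string as missing is an assumption.

```javascript
// Fields every corpus entry is expected to carry, per the note above.
const CORPUS_FIELDS = ["name", "repo", "size", "files", "question"];

// Returns the names of any fields that are absent, null, or empty.
function missingFields(entry) {
  return CORPUS_FIELDS.filter(
    (field) =>
      entry[field] === undefined || entry[field] === null || entry[field] === ""
  );
}
```

An empty result means the entry has every field the workflow reads: `repo` and `question` for the audit command, `size` and `files` for the Step 3 label.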
