agent-eval

CodeGraph Quality Audit

Measures how much CodeGraph helps an agent versus plain grep/read, for a chosen codegraph version on a chosen real-world repo. Drives the harness in `scripts/agent-eval/`.

Prerequisites

  • `tmux` 3+, a logged-in `claude` CLI, `node`, `git` (macOS/Linux).
  • Run from the codegraph repo root.

Workflow

Copy this checklist:
- [ ] 1. Pick version (local or npm)
- [ ] 2. Pick language
- [ ] 3. Pick repo by size
- [ ] 4. Pick harness (headless / tmux / both)
- [ ] 5. Run audit.sh in the background
- [ ] 6. Report results
Step 1 — version. Ask with `AskUserQuestion`: which codegraph version to test. Offer "Local dev build" and "Latest published"; the free-text "Other" lets the user type a specific version (e.g. `0.7.10`). Map the answer to a VERSION token:
  • "Local dev build" → `local`
  • "Latest published" → `latest`
  • a typed version → that string (e.g. `0.7.10`)
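
The Step 1 mapping can be sketched as a small function. This is only an illustration: `toVersionToken` is a hypothetical helper name, not part of the harness, and it assumes the answer arrives as a plain string.

```javascript
// Hypothetical helper: maps the Step 1 answer to the VERSION token.
function toVersionToken(answer) {
  const text = String(answer).trim();
  if (text === "Local dev build") return "local";
  if (text === "Latest published") return "latest";
  // Anything else is treated as a version typed via "Other", e.g. "0.7.10".
  if (text === "") throw new Error("empty version answer");
  return text;
}

console.log(toVersionToken("Latest published")); // latest
console.log(toVersionToken("0.7.10")); // 0.7.10
```
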
Step 2 — language. Read `.claude/skills/agent-eval/corpus.json`. Ask with `AskUserQuestion` which language to test, listing the languages that have entries.
Step 3 — repo. From the chosen language's entries, ask which repo. Label each option with its size and file count, e.g. `excalidraw — Medium (~600 files)`. Each entry carries the `repo` URL and a representative `question`.
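
A rough sketch of how the option lists for Steps 2 and 3 could be derived. The corpus layout here is an assumption (an object keyed by language, each value an array of entries); the real `corpus.json` may be organized differently. `languagesWithEntries`, `repoLabels`, and the sample data are hypothetical, with placeholder `repo` and `question` values.

```javascript
// Assumed corpus shape (not confirmed by the source): { language: [entry, ...] }.
const sampleCorpus = {
  typescript: [
    { name: "excalidraw", repo: "<repo-url>", size: "Medium", files: 600, question: "<question>" },
  ],
  go: [],
};

// Step 2: only languages that actually have entries are offered.
function languagesWithEntries(corpus) {
  return Object.keys(corpus).filter(
    (lang) => Array.isArray(corpus[lang]) && corpus[lang].length > 0
  );
}

// Step 3: one label per repo: name, dash separator, size, approximate file count.
function repoLabels(corpus, language) {
  return (corpus[language] ?? []).map(
    (entry) => `${entry.name} \u2014 ${entry.size} (~${entry.files} files)`
  );
}
```
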
Step 4 — harness. Ask with `AskUserQuestion` which harness to run, and map the answer to a MODE token:
  • "Headless" → `headless` — `claude -p` with stream-json: exact tokens/cost and a clean tool sequence (2 runs, fast, no TTY).
  • "Interactive (tmux)" → `tmux` — drives the real Claude TUI in tmux: faithful Explore-subagent behavior, metrics from session logs (2 runs, slower).
  • "Both" → `all` — headless + interactive (4 runs).
Step 5 — run. Launch in the background (sets the version, clones if missing, wipes + re-indexes, runs the chosen arms — several minutes):
scripts/agent-eval/audit.sh <VERSION> <repo-name> <repo-url> "<question>" <MODE>
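
If the launch is scripted from Node rather than typed into a shell, it might look like the sketch below. `buildAuditArgs`, `launchAudit`, and the `audit.log` path are hypothetical names chosen for illustration; only the argument order and the three MODE tokens come from this document. It assumes the process is started from the codegraph repo root.

```javascript
const MODES = ["headless", "tmux", "all"];

// Hypothetical helper: argument list in the order audit.sh expects.
function buildAuditArgs({ version, repoName, repoUrl, question, mode }) {
  if (!MODES.includes(mode)) throw new Error(`unknown MODE: ${mode}`);
  // Passing an array (no shell) keeps the question as a single argument.
  return ["scripts/agent-eval/audit.sh", version, repoName, repoUrl, question, mode];
}

// Hypothetical launcher: starts the audit detached and appends output to a log file.
async function launchAudit(params, logPath = "audit.log") {
  // Dynamic imports keep the sketch usable from both CommonJS and ES modules.
  const { spawn } = await import("node:child_process");
  const { openSync } = await import("node:fs");
  const log = openSync(logPath, "a");
  const child = spawn("bash", buildAuditArgs(params), {
    detached: true,
    stdio: ["ignore", log, log],
  });
  child.unref(); // let the parent exit while the audit keeps running
  return child.pid;
}
```

Building the arguments as an array avoids shell quoting problems when the question contains spaces or quotes.
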
Step 6 — report. When the job finishes, read the log and report per arm:
  • Headless (`parse-run.mjs`): total tool calls, file `Read`s, Grep/Bash, codegraph-tool calls, duration, total cost.
  • Interactive (`parse-session.mjs`): the `VERDICT: codegraph_explore used Nx | Read N | Grep/Bash N` and `TOKENS:` lines.
Lead with cost + tool/Read counts — they are the reliable signals; raw token in/out are confounded by subagent delegation and prompt caching. State whether codegraph reduced effort and whether both arms reached a correct answer.
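
To pull the interactive counts out of a log programmatically, a pattern match on the `VERDICT:` line is one option. This sketch assumes each `N` in that line is a non-negative integer; `parseVerdict` is a hypothetical helper, and the sample line is made up for illustration rather than taken from a real run.

```javascript
// Assumes N in the VERDICT line is a non-negative integer.
const VERDICT_RE = /VERDICT: codegraph_explore used (\d+)x \| Read (\d+) \| Grep\/Bash (\d+)/;

// Returns the three counts, or null when the log has no VERDICT line.
function parseVerdict(logText) {
  const match = VERDICT_RE.exec(logText);
  if (!match) return null;
  return {
    explore: Number(match[1]),
    read: Number(match[2]),
    grepBash: Number(match[3]),
  };
}

// Made-up sample line, not real audit output.
console.log(parseVerdict("VERDICT: codegraph_explore used 3x | Read 2 | Grep/Bash 0"));
// { explore: 3, read: 2, grepBash: 0 }
```
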

Notes

  • The index is rebuilt every run (`audit.sh` wipes `.codegraph`) — different versions extract differently, so an index must be served by the same binary that built it.
  • `audit.sh` temporarily mutates the global `codegraph` install for the test, then restores your dev link via `local-install.sh`.
  • Corpus repos are cloned to `/tmp/codegraph-corpus` (reused if already present).
  • Add or edit repos in `corpus.json` (fields: `name`, `repo`, `size`, `files`, `question`).
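
When adding an entry to `corpus.json`, a quick field check can catch omissions before a run. The sketch below assumes all five listed fields are required, which the source does not state explicitly; `missingFields` is a hypothetical helper.

```javascript
const REQUIRED_FIELDS = ["name", "repo", "size", "files", "question"];

// Hypothetical check: returns the names of fields that are missing or empty.
function missingFields(entry) {
  return REQUIRED_FIELDS.filter(
    (field) => entry[field] === undefined || entry[field] === null || entry[field] === ""
  );
}

console.log(missingFields({ name: "excalidraw", size: "Medium", files: 600 }));
// [ 'repo', 'question' ]
```
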