add-lang

Add a language to CodeGraph

Wire a new tree-sitter language into codegraph's extraction pipeline, prove it extracts real symbols on popular repos, and prove it beats no-codegraph for an agent. Runs fully autonomously: pick repos, benchmark, update docs, then report. Never commit, push, publish, or tag (house rule); leave all changes for the user to review.

The argument is the language token used throughout the `Language` union, e.g. `lua`, `elixir`, `zig`. If none was given, ask which language. Use the lowercase single-token form everywhere (`csharp`, not `c#`).
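
The token rule above can be sketched as a small normalizer. This is a hypothetical helper, not part of codegraph: the name `normalizeLanguageToken` and the alias table are assumptions made for illustration, and only the `c#` to `csharp` pair comes from the text.

```typescript
// Hypothetical helper (not in codegraph): turn user input into the
// lowercase single-token form used in the Language union.
const TOKEN_ALIASES: Record<string, string> = {
  'c#': 'csharp', // the only alias taken from the text; extend as needed
};

function normalizeLanguageToken(raw: string): string {
  const lowered = raw.trim().toLowerCase();
  const token = TOKEN_ALIASES[lowered] ?? lowered;
  // Reject anything that is not a single lowercase alphanumeric token.
  if (!/^[a-z][a-z0-9]*$/.test(token)) {
    throw new Error(`not a single lowercase token: "${raw}"`);
  }
  return token;
}
```

Rejecting unknown spellings, rather than guessing, matches the instruction to ask the user when the language is unclear.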

Prerequisites

  • Run from the codegraph repo root. Requires `node`, `git`, `gh`, and a logged-in `claude` CLI (the benchmark spawns real `claude -p` runs).
  • The benchmark uses the local dev build — Step 8 builds + links it on PATH.

Workflow

Copy this checklist and work through it in order:
- [ ] 1. Resolve language; bail early if already supported (just benchmark)
- [ ] 2. Find a grammar + health-check it (ABI / heap corruption)
- [ ] 3. Discover the grammar's AST node types (dump-ast.mjs)
- [ ] 4. Wire the language (4 files; sometimes a 5th core touch)
- [ ] 5. Build + verify-extraction loop until PASS
- [ ] 6. Add extraction tests; make them green
- [ ] 7. Auto-pick 3 popular repos by size tier; add to corpus.json
- [ ] 8. Benchmark all 3: extraction + with/without A/B
- [ ] 9. Update README + CHANGELOG
- [ ] 10. Report; do NOT commit

Step 1 — Resolve + short-circuit

Check whether the language is already wired: look for the token in the `LANGUAGES` const (`src/types.ts`) and the `EXTRACTORS` map (`src/extraction/languages/index.ts`). If it is already supported (e.g. `typescript`, `rust`), skip Steps 2–6 and go straight to benchmarking (Steps 7–8) to validate and measure it, and note in the report that no code changed.
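
A minimal sketch of that check, assuming stand-in data: the `LANGUAGES` and `EXTRACTORS` values below are illustrative copies, not the real contents of the two files, and the `partial` state (present in one place but not the other) is an added assumption for spotting half-wired languages.

```typescript
// Illustrative stand-ins for the real const and map.
const LANGUAGES: readonly string[] = ['typescript', 'rust', 'unknown'];
const EXTRACTORS: Record<string, object> = { typescript: {}, rust: {} };

type WiringState = 'supported' | 'partial' | 'missing';

function wiringState(token: string): WiringState {
  const inUnion = LANGUAGES.indexOf(token) !== -1;
  const hasExtractor = Object.prototype.hasOwnProperty.call(EXTRACTORS, token);
  if (inUnion && hasExtractor) return 'supported'; // skip to Steps 7-8
  if (inUnion || hasExtractor) return 'partial';   // half-wired: needs the full workflow
  return 'missing';                                // run Steps 2-6
}
```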

Step 2 — Find a grammar, then health-check it

```bash
ls node_modules/tree-sitter-wasms/out/ | grep -i <lang>   # csharp -> c_sharp
```
  • Present → likely off-the-shelf; `grammars.ts` resolves it from `tree-sitter-wasms` automatically. (Many languages: elixir, zig, ocaml, solidity, toml, yaml, …)
  • Absent → vendor a `.wasm` into `src/extraction/wasm/` (like `pascal` / `scala` / `lua`) and add the token to the vendored branch in Step 4.

Always health-check before writing an extractor; a present grammar can still be unusable:
```bash
node scripts/add-lang/check-grammar.mjs <lang> path/to/valid-sample.<ext>
```
It prints the grammar's ABI version and parses a valid sample many times in a multi-grammar runtime. If it FAILs (ERROR trees on valid code: an old ABI corrupting the shared WASM heap, which silently drops nested calls/imports on every file after the first; e.g. the tree-sitter-wasms Lua grammar is ABI 13 and fails), do NOT use that wasm. Vendor a newer (ABI 14/15) build instead:
```bash
npm pack @tree-sitter-grammars/tree-sitter-<lang>   # often ships a prebuilt *.wasm
# or build one: npx tree-sitter build --wasm (needs Docker/emscripten)
cp <the>.wasm src/extraction/wasm/tree-sitter-<lang>.wasm
```
Then add the token to the vendored branch in Step 4 and re-run check-grammar on the vendored path until it PASSes. **If you cannot obtain a healthy wasm, STOP and tell the user.**
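
The decision in this step can be summarized as a small function. This is a sketch under assumptions: `GrammarCheckResult` is a hypothetical record of what the `ls` lookup and `check-grammar.mjs` reported, not a type from the repo, and the script's real output format is not shown here.

```typescript
// Hypothetical summary of the Step 2 findings for one grammar.
interface GrammarCheckResult {
  presentInTreeSitterWasms: boolean; // found by the `ls ... | grep` lookup
  abiVersion: number | null;         // as printed by check-grammar, if known
  passed: boolean;                   // check-grammar PASS / FAIL
}

type NextAction = 'use-off-the-shelf' | 'vendor-newer-build';

function nextGrammarAction(result: GrammarCheckResult): NextAction {
  // The health check is the gate: a present but failing grammar is unusable.
  return result.presentInTreeSitterWasms && result.passed
    ? 'use-off-the-shelf'
    : 'vendor-newer-build';
}
```

The ABI number is carried for the report only; the sketch deliberately keys on the PASS/FAIL result, since that is what the text treats as decisive.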

Step 3 — Discover AST node types

Get a representative source file (write a small sample covering functions, classes/structs, imports, enums; or `curl` a raw file from a known repo), then:
```bash
node scripts/add-lang/dump-ast.mjs <lang> path/to/sample.<ext>
# vendored grammar: pass the wasm path instead of the token
node scripts/add-lang/dump-ast.mjs src/extraction/wasm/tree-sitter-<lang>.wasm sample.<ext>
```
The frequency table + field names (`name:`, `parameters:`, `body:`, `return_type:`) tell you what to map. Open the existing extractor closest to the language's paradigm as a model: `rust.ts`/`scala.ts` (functional, traits), `java.ts`/`csharp.ts` (OO), `python.ts`/`ruby.ts` (scripting), `go.ts` (top-level methods + receivers).
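
To make the idea of a node-type frequency table concrete, here is a minimal sketch of how one could be computed. It does not reproduce `dump-ast.mjs`; `AstNode` is a simplified stand-in for a parsed tree, and the node type names in the sample are placeholders.

```typescript
// Simplified stand-in for a syntax tree node.
interface AstNode {
  type: string;
  children: AstNode[];
}

// Count how often each node type occurs, most frequent first.
function countNodeTypes(root: AstNode): Array<[string, number]> {
  const counts: Record<string, number> = Object.create(null);
  const stack: AstNode[] = [root];
  while (stack.length > 0) {
    const node = stack.pop() as AstNode;
    counts[node.type] = (counts[node.type] ?? 0) + 1;
    for (const child of node.children) stack.push(child);
  }
  return Object.keys(counts)
    .map((type): [string, number] => [type, counts[type] ?? 0])
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1));
}

// Placeholder tree: two function declarations, each with a name.
const sampleTree: AstNode = {
  type: 'chunk',
  children: [
    { type: 'function_declaration', children: [{ type: 'identifier', children: [] }] },
    { type: 'function_declaration', children: [{ type: 'identifier', children: [] }] },
  ],
};
```

Types that appear often and carry a `name:` field are usually the first candidates for the extractor's type lists.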

Step 4 — Wire the language (4 files)

This wiring is exact and fragile; match the existing style precisely:
  1. `src/types.ts` (TWO edits):
    • add `'<lang>',` to the `LANGUAGES` const (before `'unknown'`);
    • add `'**/*.<ext>',` to `DEFAULT_CONFIG.include`. Don't skip this: it's the file-scan allowlist; without the glob, `codegraph init` finds 0 files even though detection/extraction are wired.
  2. `src/extraction/grammars.ts` (three maps):
    • `WASM_GRAMMAR_FILES`: `<lang>: 'tree-sitter-<lang>.wasm',`
    • `EXTENSION_MAP`: each file extension → `'<lang>'` (e.g. `'.lua': 'lua',`)
    • `getLanguageDisplayName`: `<lang>: '<Display Name>',`
    • vendored only: add `<lang>` to the `(lang === 'pascal' || lang === 'scala' || …)` wasm-path branch.
  3. `src/extraction/languages/<lang>.ts`: new file exporting `export const <lang>Extractor: LanguageExtractor = { … }`. Map the node types from Step 3. Required fields: `functionTypes`, `classTypes`, `methodTypes`, `interfaceTypes`, `structTypes`, `enumTypes`, `typeAliasTypes`, `importTypes`, `callTypes`, `variableTypes`, `nameField`, `bodyField`, `paramsField`. Add hooks as the grammar needs them (`getSignature`, `getVisibility`, `isExported`, `extractImport`, `visitNode`, `getReceiverType`, `interfaceKind`, `enumMemberTypes`, etc.; see `src/extraction/tree-sitter-types.ts`).
  4. `src/extraction/languages/index.ts`: add `import { <lang>Extractor } from './<lang>';` and add `<lang>: <lang>Extractor,` to `EXTRACTORS`.
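
A rough sketch of what the new extractor file from item 3 might look like. The `LanguageExtractor` interface here is an assumed local shape built from the required field names listed above (the real one lives in `src/extraction/tree-sitter-types.ts` and may differ), and the Lua node type names are placeholders to be replaced with what the Step 3 dump actually shows.

```typescript
// Assumed shape: field names from the list above, field types guessed.
interface LanguageExtractor {
  functionTypes: string[];
  classTypes: string[];
  methodTypes: string[];
  interfaceTypes: string[];
  structTypes: string[];
  enumTypes: string[];
  typeAliasTypes: string[];
  importTypes: string[];
  callTypes: string[];
  variableTypes: string[];
  nameField: string;
  bodyField: string;
  paramsField: string;
}

// Placeholder mapping for a Lua-like grammar; verify every name against the dump.
const luaExtractor: LanguageExtractor = {
  functionTypes: ['function_declaration'],
  classTypes: [],
  methodTypes: [],
  interfaceTypes: [],
  structTypes: [],
  enumTypes: [],
  typeAliasTypes: [],
  importTypes: [], // `require` is a call, so it has no distinct import node
  callTypes: ['function_call'],
  variableTypes: ['variable_declaration'],
  nameField: 'name',
  bodyField: 'body',
  paramsField: 'parameters',
};

// Hypothetical helper: list categories still unmapped, useful for the final report.
function unmappedCategories(e: LanguageExtractor): string[] {
  const lists: Array<[string, string[]]> = [
    ['functionTypes', e.functionTypes], ['classTypes', e.classTypes],
    ['methodTypes', e.methodTypes], ['interfaceTypes', e.interfaceTypes],
    ['structTypes', e.structTypes], ['enumTypes', e.enumTypes],
    ['typeAliasTypes', e.typeAliasTypes], ['importTypes', e.importTypes],
    ['callTypes', e.callTypes], ['variableTypes', e.variableTypes],
  ];
  return lists.filter(([, types]) => types.length === 0).map(([name]) => name);
}
```

Empty lists are legitimate for languages without that construct; the helper only surfaces them so they can be reviewed rather than forgotten.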
Sometimes a 5th, core touch in `src/extraction/tree-sitter.ts` is needed: variable extraction has per-language branches in `extractVariable` (the generic fallback only finds direct `identifier` / `variable_declarator` children). If the grammar nests declared names (e.g. Lua's `variable_declaration → variable_list`), add a `} else if (this.language === '<lang>')` branch there, mirroring the existing ts/python/go ones. Import forms that aren't a distinct node (Lua/Ruby `require` is a call) are handled in the extractor's `visitNode` hook instead.
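
The nesting problem can be illustrated with a standalone sketch. It does not reproduce `extractVariable`; `SyntaxNodeLike` is a simplified stand-in for a tree-sitter node, and the function shows only why a direct-children lookup misses names wrapped in a `variable_list`.

```typescript
// Simplified stand-in for a tree-sitter node.
interface SyntaxNodeLike {
  type: string;
  text: string;
  children: SyntaxNodeLike[];
}

// Collect declared names, looking one level deeper for a nested variable_list.
function declaredNames(declaration: SyntaxNodeLike): string[] {
  const names: string[] = [];
  for (const child of declaration.children) {
    if (child.type === 'identifier') {
      names.push(child.text); // what a direct-children fallback would find
    } else if (child.type === 'variable_list') {
      for (const inner of child.children) {
        if (inner.type === 'identifier') names.push(inner.text);
      }
    }
  }
  return names;
}

// Roughly `local a, b = 1, 2`, with an assumed tree shape.
const luaDeclaration: SyntaxNodeLike = {
  type: 'variable_declaration',
  text: 'local a, b = 1, 2',
  children: [{
    type: 'variable_list',
    text: 'a, b',
    children: [
      { type: 'identifier', text: 'a', children: [] },
      { type: 'identifier', text: 'b', children: [] },
    ],
  }],
};
```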

Step 5 — Build + verify loop

```bash
npm run build            # tsc + copy-assets (copies any vendored *.wasm into dist/)
```
Index a small sample repo and check extraction:
```bash
( cd <sample-repo> && codegraph init -i )
node scripts/add-lang/verify-extraction.mjs <sample-repo> <lang>
```
`verify-extraction.mjs` fails (exit 1) if the language isn't detected or only `file` / `import` nodes were produced, the classic symptom of wrong node-type names. On FAIL or a thin WARN: re-run `dump-ast.mjs` on a richer file, fix the mappings in `<lang>.ts`, `npm run build`, re-index, re-verify. Repeat until PASS.
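
The pass/fail rule described above can be sketched as follows. This is not the script's actual code: the function name, the `kindCounts` input shape, and the `minSymbols` threshold for a "thin" WARN are all assumptions; only the FAIL conditions come from the text.

```typescript
type Verdict = 'PASS' | 'WARN' | 'FAIL';

// kindCounts maps a node kind (e.g. 'file', 'import', 'function') to its count.
function extractionVerdict(
  languageDetected: boolean,
  kindCounts: Record<string, number>,
  minSymbols = 5, // assumed threshold for a "thin" result
): Verdict {
  let symbols = 0;
  for (const kind of Object.keys(kindCounts)) {
    // file and import nodes appear even when every node-type mapping is wrong
    if (kind !== 'file' && kind !== 'import') symbols += kindCounts[kind] ?? 0;
  }
  if (!languageDetected || symbols === 0) return 'FAIL';
  return symbols < minSymbols ? 'WARN' : 'PASS';
}
```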

Step 6 — Tests

Add to `__tests__/extraction.test.ts`, modeled on the `Rust Extraction` block:
  • a `detectLanguage` assertion in `describe('Language Detection')`
  • a `describe('<Lang> Extraction')` block asserting functions/classes/imports are extracted from an inline source string.
```bash
npx vitest run __tests__/extraction.test.ts
```
Green before continuing.
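
What the detection assertion exercises can be sketched with a stand-in. The `detectLanguage` below is a hypothetical reimplementation over a local `EXTENSION_MAP`; the real function's signature and behavior in codegraph may differ, so treat this as an illustration of the extension lookup only.

```typescript
// Illustrative subset; the real map lives in src/extraction/grammars.ts.
const EXTENSION_MAP: Record<string, string> = {
  '.lua': 'lua',
  '.rs': 'rust',
};

// Stand-in: map a file path to a language token by its extension.
function detectLanguage(filePath: string): string {
  const dot = filePath.lastIndexOf('.');
  if (dot < 0) return 'unknown';
  const ext = filePath.slice(dot).toLowerCase();
  return EXTENSION_MAP[ext] ?? 'unknown';
}
```

A new language's test would assert that each of its extensions resolves to the new token, alongside the existing cases.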

Step 7 — Auto-pick 3 repos + corpus

Pick without asking. Find candidates, then curate 3 that are genuinely `<lang>`-dominant, one per size tier:
```bash
gh search repos --language=<lang> --sort=stars --limit 40 \
  --json fullName,stargazersCount,description
```
Tiers (match `corpus.json`): Small <~150 files · Medium ~150–1500 · Large >~1500. Skip repos that are tagged `<lang>` but mostly another language. Write one cross-file architecture question per repo (the kind that needs tracing across files). Add a `"<Language>"` block to `.claude/skills/agent-eval/corpus.json` (fields: `name`, `repo`, `size`, `files`, `question`) so `/agent-eval` can reuse them.
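
The tiering and the corpus entry can be sketched like this. The thresholds in the text are approximate ("~"), so the hard cut-offs below are a simplification; the field types in `CorpusEntry` and the `toCorpusEntry` helper are assumptions, since only the field names are given, and the existing `corpus.json` should be checked for the exact value formats.

```typescript
type SizeTier = 'Small' | 'Medium' | 'Large';

// Hard cut-offs standing in for the approximate tiers in the text.
function sizeTier(fileCount: number): SizeTier {
  if (fileCount < 150) return 'Small';
  if (fileCount <= 1500) return 'Medium';
  return 'Large';
}

// Assumed shape: field names from the text, value types guessed.
interface CorpusEntry {
  name: string;
  repo: string;
  size: SizeTier;
  files: number;
  question: string;
}

// Hypothetical helper to build one entry; inputs are supplied by the curator.
function toCorpusEntry(name: string, repo: string, files: number, question: string): CorpusEntry {
  return { name, repo, size: sizeTier(files), files, question };
}
```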

Step 8 — Benchmark all 3 (extraction + A/B)

Put the dev build on PATH as `codegraph` once, then loop over the three repos:
```bash
npm run build && ./scripts/local-install.sh
scripts/add-lang/bench.sh <lang> <name> <url> "<question>" headless   # ×3
```
`bench.sh` clones (shared `/tmp/codegraph-corpus`), wipes + indexes, runs `verify-extraction.mjs`, then the with/without retrieval A/B via `scripts/agent-eval/run-all.sh` (skips the paid A/B if extraction is broken). Read each `parse-run.mjs` summary printed by `run-all.sh`: tool calls, file `Read`s, Grep/Bash, codegraph-tool calls, duration, and cost, for both the `with` and `without` arms. After the loop, restore the dev link if needed: `./scripts/local-install.sh`.
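
Comparing the two arms can be sketched as below. The `ArmSummary` shape is an assumption loosely based on the metrics listed above (the actual `parse-run.mjs` output format is not shown), `correct` is a judgment the reviewer supplies rather than something the summary reports, and the `reducedEffort` rule is an illustrative heuristic.

```typescript
// Assumed per-arm numbers, transcribed by hand from a run summary.
interface ArmSummary {
  toolCalls: number;
  fileReads: number;
  costUsd: number;
  correct: boolean; // reviewer's judgment of the answer
}

interface Comparison {
  toolCallDelta: number; // negative means the `with` arm used fewer
  fileReadDelta: number;
  costDeltaUsd: number;
  reducedEffort: boolean;
  bothCorrect: boolean;
}

function compareArms(withArm: ArmSummary, withoutArm: ArmSummary): Comparison {
  const toolCallDelta = withArm.toolCalls - withoutArm.toolCalls;
  const fileReadDelta = withArm.fileReads - withoutArm.fileReads;
  return {
    toolCallDelta,
    fileReadDelta,
    costDeltaUsd: withArm.costUsd - withoutArm.costUsd,
    // Heuristic: fewer tool calls and no extra file reads.
    reducedEffort: toolCallDelta < 0 && fileReadDelta <= 0,
    bothCorrect: withArm.correct && withoutArm.correct,
  };
}
```

A lower effort number only counts as a win when `bothCorrect` is true, which is why the two are reported together.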
Step 9 — Docs + CHANGELOG

  • README.md: add `<Lang>` to the "19+ Languages" feature bullet, and add a row to the Supported Languages table: ``| <Lang> | `.ext` | Full support (classes, methods, …) |``.
  • CHANGELOG.md: add an `## [Unreleased]` section at the top (above the latest version) with `### Added` → a user-perspective bullet, e.g. "CodeGraph now indexes **<Lang>** (`.ext`): functions, classes, imports, and call edges." If `## [Unreleased]` already exists, append under it. (It's folded into the next versioned block at release time.)
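
The CHANGELOG rule (create `## [Unreleased]` above the latest version, or append under it if it exists) can be sketched as a pure string transform. This is a hypothetical helper, not a script from the repo; it assumes version headings start with `## [` and bullets are single lines starting with `- `.

```typescript
// Hypothetical helper: add one bullet under "## [Unreleased]" / "### Added".
function addUnreleasedBullet(changelog: string, bullet: string): string {
  const lines = changelog.split('\n');
  const entry = `- ${bullet}`;
  const start = lines.indexOf('## [Unreleased]');

  if (start === -1) {
    // No Unreleased section yet: create it above the first versioned heading.
    let firstVersion = lines.length;
    for (let i = 0; i < lines.length; i++) {
      if ((lines[i] ?? '').indexOf('## [') === 0) { firstVersion = i; break; }
    }
    lines.splice(firstVersion, 0, '## [Unreleased]', '', '### Added', '', entry, '');
    return lines.join('\n');
  }

  // Find where the Unreleased section ends (next level-2 heading).
  let end = lines.length;
  for (let i = start + 1; i < lines.length; i++) {
    if ((lines[i] ?? '').indexOf('## ') === 0) { end = i; break; }
  }
  let added = -1;
  for (let i = start + 1; i < end; i++) {
    if (lines[i] === '### Added') { added = i; break; }
  }
  if (added === -1) {
    lines.splice(start + 1, 0, '', '### Added', '', entry);
    return lines.join('\n');
  }
  // Append after the last existing bullet under "### Added".
  let insertAt = added + 1;
  for (let i = added + 1; i < end; i++) {
    const line = lines[i] ?? '';
    if (line.indexOf('### ') === 0) break;
    if (line.indexOf('- ') === 0) insertAt = i + 1;
  }
  lines.splice(insertAt, 0, entry);
  return lines.join('\n');
}
```

Multi-line bullets are not handled; for a real CHANGELOG it is worth eyeballing the result before handing it over.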

Step 10 — Report (do NOT commit)

Summarize for review:
  • Files changed: the 4 wiring edits + new extractor + tests + README + CHANGELOG + corpus.json (+ any vendored `.wasm`).
  • Extraction per repo: files / nodes / edges / `verify-extraction` result.
  • A/B per repo: `with` vs `without` (tool calls, file Reads, cost) and a one-line verdict: did codegraph reduce effort, and did both arms reach a correct answer?
  • Gaps / follow-ups (node types not yet mapped, resolution edges missing, framework routes, etc.).
Hand the changes to the user. Do not run `git commit` / `push` or publish; releases go through the GitHub Actions Release workflow.

Notes

  • The A/B spawns real paid `claude -p` runs (opus, `--max-budget-usd`), 2 arms × 3 repos. The corpus dir `/tmp/codegraph-corpus` is shared with `/agent-eval`, so clones are reused across runs.
  • Any new `*.wasm` must live in `src/extraction/wasm/` so that `copy-assets` (run by `npm run build`) ships it; otherwise it won't be in `dist/`.
  • An index must be served by the same binary that built it. Step 8 builds + links the dev build first, so this holds.
  • If a grammar can't be obtained, or extraction can't reach PASS, STOP and report — don't ship a half-wired language.
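
Since the A/B runs are paid, a quick upper bound on spend may be useful before starting. The sketch below assumes the `--max-budget-usd` value caps each run separately; that interpretation, the function names, and any budget figure passed in are assumptions, not documented behavior.

```typescript
// 2 arms × 3 repos in the workflow above.
function plannedRuns(repos: number, arms: number): number {
  return repos * arms;
}

// Upper bound, assuming the budget flag applies per run.
function worstCaseSpendUsd(repos: number, arms: number, maxBudgetUsdPerRun: number): number {
  return plannedRuns(repos, arms) * maxBudgetUsdPerRun;
}
```

For the workflow here, `plannedRuns(3, 2)` gives 6 runs, so the bound is six times whatever per-run budget is configured.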