add-lang

Add a language to CodeGraph

Wire a new tree-sitter language into codegraph's extraction pipeline, prove it extracts real symbols on popular repos, and prove it beats no-codegraph for an agent. Runs fully autonomously — pick repos, benchmark, update docs, then report. Never commit, push, publish, or tag (house rule); leave all changes for the user to review.

The argument is the language token used throughout the `Language` union, e.g. `lua`, `elixir`, `zig`. If none was given, ask which language. Use the lowercase single-token form everywhere (`csharp`, not `c#`).

Prerequisites

- Run from the codegraph repo root.
- Requires `node`, `git`, `gh`, and a logged-in `claude` CLI (the benchmark spawns real `claude -p` runs).
- The benchmark uses the local dev build; Step 8 builds it and links it onto PATH.

Workflow

Copy this checklist and work through it in order:

- [ ] 1. Resolve language; bail early if already supported (just benchmark)
- [ ] 2. Find a grammar + health-check it (ABI / heap corruption)
- [ ] 3. Discover the grammar's AST node types (dump-ast.mjs)
- [ ] 4. Wire the language (4 files; sometimes a 5th core touch)
- [ ] 5. Build + verify-extraction loop until PASS
- [ ] 6. Add extraction tests; make them green
- [ ] 7. Auto-pick 3 popular repos by size tier; add to corpus.json
- [ ] 8. Benchmark all 3: extraction + with/without A/B
- [ ] 9. Update README + CHANGELOG
- [ ] 10. Report; do NOT commit

Step 1 — Resolve + short-circuit

Check whether the language is already wired: look for the token in the `LANGUAGES` const (`src/types.ts`) and the `EXTRACTORS` map (`src/extraction/languages/index.ts`). If it is already supported (e.g. `typescript`, `rust`), skip Steps 2–6 and go straight to benchmarking (Steps 7–8) to validate and measure it, and note in the report that no code changed.
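
The short-circuit check can be pictured as a lookup in both registries. The sketch below is a minimal illustration only: the `LANGUAGES` and `EXTRACTORS` values are hypothetical stand-ins (their real contents are not shown here), and `wiringStatus` is an invented helper name, not part of the repo.

```typescript
// Hypothetical stand-ins for the real exports in src/types.ts and
// src/extraction/languages/index.ts; the actual contents may differ.
const LANGUAGES: readonly string[] = ['typescript', 'rust', 'unknown'];
const EXTRACTORS: Record<string, object> = { typescript: {}, rust: {} };

type WiringStatus = 'wired' | 'partial' | 'missing';

// A token counts as wired only when both registries know about it.
function wiringStatus(token: string): WiringStatus {
  const inLanguages = LANGUAGES.indexOf(token) !== -1;
  const inExtractors = Object.prototype.hasOwnProperty.call(EXTRACTORS, token);
  if (inLanguages && inExtractors) return 'wired';
  if (inLanguages || inExtractors) return 'partial';
  return 'missing';
}

console.log('rust:', wiringStatus('rust'));
console.log('lua:', wiringStatus('lua'));
```

A `partial` result (present in one registry but not the other) would suggest a half-finished earlier attempt, which is worth mentioning in the report before continuing.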

Step 2 — Find a grammar, then health-check it

```bash
ls node_modules/tree-sitter-wasms/out/ | grep -i <lang>   # csharp -> c_sharp
```

- Present → likely off-the-shelf; `grammars.ts` resolves it from `tree-sitter-wasms` automatically. (Many languages: elixir, zig, ocaml, solidity, toml, yaml, …)
- Absent → vendor a `.wasm` into `src/extraction/wasm/` (like `pascal`/`scala`/`lua`) and add the token to the vendored branch in Step 4.

Always health-check before writing an extractor; a present grammar can still be unusable:

```bash
node scripts/add-lang/check-grammar.mjs <lang> path/to/valid-sample.<ext>
```

It prints the grammar's ABI version and parses a valid sample many times in a multi-grammar runtime. A FAIL means ERROR trees on valid code: an old ABI corrupting the shared WASM heap, which silently drops nested calls/imports on every file after the first (e.g. the tree-sitter-wasms Lua grammar is ABI 13 and fails). If it FAILs, do NOT use that wasm. Vendor a newer (ABI 14/15) build instead:

```bash
npm pack @tree-sitter-grammars/tree-sitter-<lang>   # often ships a prebuilt *.wasm
# or build one: npx tree-sitter build --wasm (needs Docker/emscripten)
cp <the>.wasm src/extraction/wasm/tree-sitter-<lang>.wasm
```

Then add the token to the vendored branch in Step 4 and re-run check-grammar on the vendored path until it PASSes. **If you cannot obtain a healthy wasm, STOP and tell the user.**

Step 3 — Discover AST node types

Get a representative source file (write a small sample covering functions, classes/structs, imports, enums; or `curl` a raw file from a known repo), then:

```bash
node scripts/add-lang/dump-ast.mjs <lang> path/to/sample.<ext>
# vendored grammar: pass the wasm path instead of the token
node scripts/add-lang/dump-ast.mjs src/extraction/wasm/tree-sitter-<lang>.wasm sample.<ext>
```

The frequency table + field names (`name:`, `parameters:`, `body:`, `return_type:`) tell you what to map. Open the existing extractor closest to the language's paradigm as a model: `rust.ts`/`scala.ts` (functional, traits), `java.ts`/`csharp.ts` (OO), `python.ts`/`ruby.ts` (scripting), `go.ts` (top-level methods + receivers).
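
To make the idea of a node-type frequency table concrete, here is a small sketch of how such a table could be computed. It is not the implementation of `dump-ast.mjs`: the `NodeLike` shape is an assumed, reduced stand-in for a real tree-sitter node, and the node-type names in the toy tree are illustrative only.

```typescript
// Assumed minimal node shape; real tree-sitter nodes carry more fields.
interface NodeLike {
  type: string;
  children: NodeLike[];
}

// Walk the tree iteratively and count how often each node type appears.
function nodeTypeFrequencies(root: NodeLike): Map<string, number> {
  const counts = new Map<string, number>();
  const stack: NodeLike[] = [root];
  while (stack.length > 0) {
    const node = stack.pop() as NodeLike;
    counts.set(node.type, (counts.get(node.type) || 0) + 1);
    for (const child of node.children) stack.push(child);
  }
  return counts;
}

// Hand-built toy tree; the type names are placeholders, not a real grammar.
const leaf = (type: string): NodeLike => ({ type, children: [] });
const sample: NodeLike = {
  type: 'chunk',
  children: [
    { type: 'function_declaration', children: [leaf('identifier'), leaf('block')] },
    { type: 'function_declaration', children: [leaf('identifier'), leaf('block')] },
    { type: 'function_call', children: [leaf('identifier')] },
  ],
};

// Most frequent types first, as in a frequency table.
const table = Array.from(nodeTypeFrequencies(sample).entries()).sort((a, b) => b[1] - a[1]);
for (const [type, count] of table) console.log(count + '\t' + type);
```

Types that appear often and carry a `name:` field are usually the first candidates for `functionTypes`, `classTypes`, and similar lists.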

Step 4 — Wire the language (4 files)

These are exact, fragile wiring edits; match the existing style precisely:

`src/types.ts`, TWO edits:
- add `'<lang>',` to the `LANGUAGES` const (before `'unknown'`);
- add `'**/*.<ext>',` to `DEFAULT_CONFIG.include`. Don't skip this: it's the file-scan allowlist; without the glob, `codegraph init` finds 0 files even though detection/extraction are wired.

`src/extraction/grammars.ts`, three maps:
- `WASM_GRAMMAR_FILES`: `<lang>: 'tree-sitter-<lang>.wasm',`
- `EXTENSION_MAP`: each file extension → `'<lang>'` (e.g. `'.lua': 'lua',`)
- `getLanguageDisplayName`: `<lang>: '<Display Name>',`
- vendored only: add `<lang>` to the `(lang === 'pascal' || lang === 'scala' || …)` wasm-path branch.

`src/extraction/languages/<lang>.ts`: new file exporting `export const <lang>Extractor: LanguageExtractor = { … }`. Map the node types from Step 3. Required fields: `functionTypes`, `classTypes`, `methodTypes`, `interfaceTypes`, `structTypes`, `enumTypes`, `typeAliasTypes`, `importTypes`, `callTypes`, `variableTypes`, `nameField`, `bodyField`, `paramsField`. Add hooks as the grammar needs them (`getSignature`, `getVisibility`, `isExported`, `extractImport`, `visitNode`, `getReceiverType`, `interfaceKind`, `enumMemberTypes`, etc.; see `src/extraction/tree-sitter-types.ts`).

`src/extraction/languages/index.ts`: `import { <lang>Extractor } from './<lang>';` and add `<lang>: <lang>Extractor,` to `EXTRACTORS`.

Sometimes a 5th, core touch in `src/extraction/tree-sitter.ts`: variable extraction has per-language branches in `extractVariable` (the generic fallback only finds direct `identifier`/`variable_declarator` children). If the grammar nests declared names (e.g. Lua's `variable_declaration → variable_list`), add a `} else if (this.language === '<lang>')` branch there, mirroring the existing ts/python/go ones. Import forms that aren't a distinct node (Lua/Ruby `require` is a call) are handled in the extractor's `visitNode` hook instead.
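
As a rough picture of what the new extractor file might look like, the sketch below fills in only the required fields listed above. Everything here is an assumption for illustration: the local `LanguageExtractor` interface is a reduced guess at the real one in `src/extraction/tree-sitter-types.ts`, the field value types (string arrays and strings) are assumed, and every node-type string is a placeholder to be replaced with the names from the `dump-ast.mjs` output. `unmappedCategories` is an invented helper.

```typescript
// Assumed, reduced shape; the real interface may differ and has optional hooks.
interface LanguageExtractor {
  functionTypes: string[];
  classTypes: string[];
  methodTypes: string[];
  interfaceTypes: string[];
  structTypes: string[];
  enumTypes: string[];
  typeAliasTypes: string[];
  importTypes: string[];
  callTypes: string[];
  variableTypes: string[];
  nameField: string;
  bodyField: string;
  paramsField: string;
}

// Illustrative Lua-like mapping (in the repo this const would be exported).
// All node-type strings are placeholders, not verified grammar names.
const luaExtractor: LanguageExtractor = {
  functionTypes: ['function_declaration'],
  classTypes: [],
  methodTypes: [],
  interfaceTypes: [],
  structTypes: [],
  enumTypes: [],
  typeAliasTypes: [],
  importTypes: [], // require() is a call, so it would go through a visitNode hook
  callTypes: ['function_call'],
  variableTypes: ['variable_declaration'],
  nameField: 'name',
  bodyField: 'body',
  paramsField: 'parameters',
};

// List the categories left empty, handy for the "gaps" part of the report.
function unmappedCategories(ex: LanguageExtractor): string[] {
  const keys = Object.keys(ex) as (keyof LanguageExtractor)[];
  return keys.filter((k) => {
    const value = ex[k];
    return Array.isArray(value) && value.length === 0;
  });
}

console.log(unmappedCategories(luaExtractor).join(', '));
```

Leaving a category as an empty list is reasonable when the language has no such construct; the point of the helper is only to make those decisions visible.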

Step 5 — Build + verify loop

```bash
npm run build            # tsc + copy-assets (copies any vendored *.wasm into dist/)
```

Index a small sample repo and check extraction:

```bash
( cd <sample-repo> && codegraph init -i )
node scripts/add-lang/verify-extraction.mjs <sample-repo> <lang>
```

`verify-extraction.mjs` fails (exit 1) if the language isn't detected or only `file`/`import` nodes were produced, the classic symptom of wrong node-type names. On FAIL or a thin WARN: re-run `dump-ast.mjs` on a richer file, fix the mappings in `<lang>.ts`, `npm run build`, re-index, re-verify. Repeat until PASS.
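
The failure rule described above can be sketched as a small function. This is not the code of `verify-extraction.mjs`; it only restates the stated rule under the assumption that the index can be summarized as a count of nodes per kind for the target language. The WARN case is left out because its thresholds are not given here.

```typescript
type Verdict = 'PASS' | 'FAIL';

// kindCounts maps a node kind to how many nodes of that kind were extracted
// for the target language. An empty record stands for "language not detected".
function verifyExtraction(kindCounts: Record<string, number>): Verdict {
  const kinds = Object.keys(kindCounts).filter((k) => kindCounts[k] > 0);
  if (kinds.length === 0) return 'FAIL';
  // Only file/import nodes means no real symbols were mapped.
  const structuralOnly = kinds.every((k) => k === 'file' || k === 'import');
  return structuralOnly ? 'FAIL' : 'PASS';
}

console.log(verifyExtraction({ file: 12, import: 30 }));
console.log(verifyExtraction({ file: 12, import: 30, function: 85 }));
```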

Step 6 — Tests

Add to `__tests__/extraction.test.ts`, modeled on the `Rust Extraction` block:

- a `detectLanguage` assertion in `describe('Language Detection')`
- a `describe('<Lang> Extraction')` block asserting functions/classes/imports are extracted from an inline source string.

```bash
npx vitest run __tests__/extraction.test.ts
```

Green before continuing.

Step 7 — Auto-pick 3 repos + corpus

Pick without asking. Find candidates, then curate 3 that are genuinely `<lang>`-dominant, one per size tier:

```bash
gh search repos --language=<lang> --sort=stars --limit 40 \
  --json fullName,stargazersCount,description
```

Tiers (match `corpus.json`): Small <~150 files · Medium ~150–1500 · Large >~1500. Skip repos that are tagged `<lang>` but mostly another language. Write one cross-file architecture question per repo (the kind that needs tracing across files). Add a `"<Language>"` block to `.claude/skills/agent-eval/corpus.json` (fields: `name`, `repo`, `size`, `files`, `question`) so `/agent-eval` can reuse them.
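
The tier rule and the corpus fields can be sketched as follows. Several things here are assumptions: the approximate cutoffs (~150, ~1500) are treated as hard boundaries, the exact value formats stored in `corpus.json` (for example how `size` and `repo` are written) are not shown in this document, and the three entries are invented placeholders rather than real repositories.

```typescript
type SizeTier = 'small' | 'medium' | 'large';

// Field names come from the text above; the value formats are assumed.
interface CorpusEntry {
  name: string;
  repo: string;
  size: SizeTier;
  files: number;
  question: string;
}

// Treats the approximate cutoffs as hard boundaries, for illustration.
function sizeTier(files: number): SizeTier {
  if (files < 150) return 'small';
  if (files <= 1500) return 'medium';
  return 'large';
}

// True when exactly three picks cover all three tiers.
function coversAllTiers(entries: CorpusEntry[]): boolean {
  const tiers = entries.map((e) => sizeTier(e.files));
  const all: SizeTier[] = ['small', 'medium', 'large'];
  return entries.length === 3 && all.every((t) => tiers.indexOf(t) !== -1);
}

// Placeholder entries; names, URLs, counts, and questions are hypothetical.
const picks: CorpusEntry[] = [
  { name: 'tiny-lib', repo: 'https://example.com/tiny-lib', size: 'small', files: 40, question: 'How does a request reach the handler?' },
  { name: 'mid-app', repo: 'https://example.com/mid-app', size: 'medium', files: 600, question: 'How is configuration loaded and passed to plugins?' },
  { name: 'big-framework', repo: 'https://example.com/big-framework', size: 'large', files: 4000, question: 'How does the scheduler dispatch jobs to workers?' },
];

console.log(coversAllTiers(picks));
```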

Step 8 — Benchmark all 3 (extraction + A/B)

Make the dev build the codegraph on PATH once, then loop:

```bash
npm run build && ./scripts/local-install.sh
scripts/add-lang/bench.sh <lang> <name> <url> "<question>" headless   # ×3
```

`bench.sh` clones (shared `/tmp/codegraph-corpus`), wipes + indexes, runs `verify-extraction.mjs`, then the with/without retrieval A/B via `scripts/agent-eval/run-all.sh` (skips the paid A/B if extraction is broken). Read each `parse-run.mjs` summary printed by `run-all.sh`: tool calls, file `Read`s, Grep/Bash, codegraph-tool calls, duration, and cost, for both the `with` and `without` arms. After the loop, restore the dev link if needed: `./scripts/local-install.sh`.
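
When reading the two arms side by side, it can help to reduce them to a few deltas. The sketch below is one possible way to do that; the `ArmSummary` shape is an assumption (the actual `parse-run.mjs` output format is not shown here), the "reduced effort" rule is an invented heuristic, and the numbers are made-up placeholders, not benchmark results.

```typescript
// Assumed summary shape, using the metrics named in the text above.
interface ArmSummary {
  toolCalls: number;
  fileReads: number;
  costUsd: number;
}

interface ArmDelta {
  toolCallsSaved: number;
  fileReadsSaved: number;
  costSavedUsd: number;
  reducedEffort: boolean;
}

// Positive "saved" values mean the with-codegraph arm did less work.
function compareArms(withArm: ArmSummary, withoutArm: ArmSummary): ArmDelta {
  return {
    toolCallsSaved: withoutArm.toolCalls - withArm.toolCalls,
    fileReadsSaved: withoutArm.fileReads - withArm.fileReads,
    costSavedUsd: withoutArm.costUsd - withArm.costUsd,
    // Invented heuristic: fewer tool calls and no more file reads.
    reducedEffort:
      withArm.toolCalls < withoutArm.toolCalls && withArm.fileReads <= withoutArm.fileReads,
  };
}

// Placeholder numbers for illustration only.
const delta = compareArms(
  { toolCalls: 9, fileReads: 4, costUsd: 0.5 },
  { toolCalls: 21, fileReads: 13, costUsd: 0.75 },
);
console.log(delta);
```

The deltas say nothing about correctness, so the one-line verdict still needs a manual check that both arms reached a correct answer.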

Step 9 — Docs + CHANGELOG

- README.md: add `<Lang>` to the "19+ Languages" feature bullet, and add a row to the Supported Languages table: ``| <Lang> | `.ext` | Full support (classes, methods, …) |``.
- CHANGELOG.md: add an `## [Unreleased]` section at the top (above the latest version) with `### Added` → a user-perspective bullet, e.g. "CodeGraph now indexes **<Lang>** (`.ext`): functions, classes, imports, and call edges." If `## [Unreleased]` already exists, append under it. (It's folded into the next versioned block at release time.)

Step 10 — Report (do NOT commit)

Summarize for review:

- Files changed: the 4 wiring edits + new extractor + tests + README + CHANGELOG + corpus.json (+ any vendored `.wasm`).
- Extraction per repo: files / nodes / edges / `verify-extraction` result.
- A/B per repo: `with` vs `without` (tool calls, file Reads, cost) and a one-line verdict: did codegraph reduce effort, and did both arms reach a correct answer?
- Gaps / follow-ups (node types not yet mapped, resolution edges missing, framework routes, etc.).

Hand the changes to the user. Do not run `git commit`, `push`, or publish; releases go through the GitHub Actions Release workflow.

Notes

- The A/B spawns real paid `claude -p` runs (opus, `--max-budget-usd`), 2 arms × 3 repos. The corpus dir `/tmp/codegraph-corpus` is shared with `/agent-eval`, so clones are reused across runs.
- Any new `*.wasm` must live in `src/extraction/wasm/`: `copy-assets` (run by `npm run build`) ships it; otherwise it won't be in `dist/`.
- An index must be served by the same binary that built it. Step 8 builds + links the dev build first, so this holds.
- If a grammar can't be obtained, or extraction can't reach PASS, STOP and report; don't ship a half-wired language.
