add-lang
Add a language to CodeGraph
Wire a new tree-sitter language into codegraph's extraction pipeline, prove it
extracts real symbols on popular repos, and prove it beats no-codegraph for an
agent. Runs fully autonomously — pick repos, benchmark, update docs, then
report. Never commit, push, publish, or tag (house rule); leave all changes
for the user to review.
The argument is the language token used throughout the `Language` union, e.g. `lua`, `elixir`, `zig`. If none was given, ask which language. Use the lowercase single-token form everywhere (`csharp`, not `c#`).
Prerequisites
- Run from the codegraph repo root. Requires `node`, `git`, `gh`, and a logged-in `claude` CLI (the benchmark spawns real `claude -p` runs).
- The benchmark uses the local dev build; Step 8 builds + links it on PATH.
Workflow
Copy this checklist and work through it in order:
- [ ] 1. Resolve language; bail early if already supported (just benchmark)
- [ ] 2. Find a grammar + health-check it (ABI / heap corruption)
- [ ] 3. Discover the grammar's AST node types (dump-ast.mjs)
- [ ] 4. Wire the language (4 files; sometimes a 5th core touch)
- [ ] 5. Build + verify-extraction loop until PASS
- [ ] 6. Add extraction tests; make them green
- [ ] 7. Auto-pick 3 popular repos by size tier; add to corpus.json
- [ ] 8. Benchmark all 3: extraction + with/without A/B
- [ ] 9. Update README + CHANGELOG
- [ ] 10. Report; do NOT commit
Step 1 — Resolve + short-circuit
Check whether the language is already wired: look for the token in the `LANGUAGES` const (`src/types.ts`) and the `EXTRACTORS` map (`src/extraction/languages/index.ts`). If it is already supported (e.g. `typescript`, `rust`), skip Steps 2–6 and go straight to benchmarking (Steps 7–8) to validate/measure it, and note in the report that no code changed.
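The short-circuit check can be sketched as below. This is a minimal illustration, not the repo's code: `KNOWN_LANGUAGES` and `KNOWN_EXTRACTORS` are hypothetical stand-ins for the `LANGUAGES` const and the `EXTRACTORS` map, and `resolveLanguage` is an invented helper name.

```typescript
// Hypothetical stand-ins for the LANGUAGES const (src/types.ts) and the
// EXTRACTORS map (src/extraction/languages/index.ts); the real contents differ.
const KNOWN_LANGUAGES: readonly string[] = ["typescript", "rust", "unknown"];
const KNOWN_EXTRACTORS: Record<string, object> = { typescript: {}, rust: {} };

interface ResolveResult {
  alreadySupported: boolean;
  skipSteps: number[]; // workflow steps that can be skipped
}

// A language counts as wired only if it appears in BOTH places.
function resolveLanguage(token: string): ResolveResult {
  const inLanguages = KNOWN_LANGUAGES.indexOf(token) !== -1;
  const hasExtractor = Object.prototype.hasOwnProperty.call(KNOWN_EXTRACTORS, token);
  const alreadySupported = inLanguages && hasExtractor;
  // Already wired: skip Steps 2-6 and go straight to benchmarking.
  return { alreadySupported, skipSteps: alreadySupported ? [2, 3, 4, 5, 6] : [] };
}
```

Requiring both lookups to succeed avoids treating a half-wired language (listed in one place but not the other) as supported.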
Step 2 — Find a grammar, then health-check it
```bash
ls node_modules/tree-sitter-wasms/out/ | grep -i <lang>   # csharp -> c_sharp
```
- Present → likely off-the-shelf; `grammars.ts` resolves it from `tree-sitter-wasms` automatically. (Many languages: elixir, zig, ocaml, solidity, toml, yaml, …)
- Absent → vendor a `.wasm` into `src/extraction/wasm/` (like `pascal`/`scala`/`lua`) and add the token to the vendored branch in Step 4.
Always health-check before writing an extractor, because a present grammar can still be unusable:
```bash
node scripts/add-lang/check-grammar.mjs <lang> path/to/valid-sample.<ext>
```
It prints the grammar's ABI version and parses a valid sample many times in a multi-grammar runtime. If it FAILs, do NOT use that wasm. A FAIL means ERROR trees on valid code: an old ABI corrupts the shared WASM heap, which silently drops nested calls/imports on every file after the first (e.g. the tree-sitter-wasms Lua grammar is ABI 13 and fails). Vendor a newer (ABI 14/15) build instead:
```bash
npm pack @tree-sitter-grammars/tree-sitter-<lang>   # often ships a prebuilt *.wasm
# or build one: npx tree-sitter build --wasm   (needs Docker/emscripten)
cp <the>.wasm src/extraction/wasm/tree-sitter-<lang>.wasm
```
Then add the token to the vendored branch in Step 4 and re-run check-grammar on the vendored path until it PASSes. **If you cannot obtain a healthy wasm, STOP and tell the user.**
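The decision rule for the health check can be sketched as follows. It is an illustration under stated assumptions: `GrammarCheck` and `assessGrammar` are hypothetical names, and the real `check-grammar.mjs` may report and evaluate its results differently.

```typescript
// Hypothetical summary of one health-check run; all field names are assumptions.
interface GrammarCheck {
  abiVersion: number; // ABI version reported for the grammar
  parses: number;     // how many times the valid sample was parsed
  errorTrees: number; // parses that produced an ERROR tree
}

interface Verdict {
  status: "PASS" | "FAIL";
  reason: string;
}

function assessGrammar(check: GrammarCheck): Verdict {
  if (check.parses <= 0) {
    return { status: "FAIL", reason: "sample was never parsed" };
  }
  // Any ERROR tree on known-valid code means the wasm is unusable.
  if (check.errorTrees > 0) {
    const hint = check.abiVersion < 14 ? " (old ABI; vendor an ABI 14/15 build)" : "";
    return {
      status: "FAIL",
      reason: `${check.errorTrees}/${check.parses} parses had ERROR trees${hint}`,
    };
  }
  return { status: "PASS", reason: `ABI ${check.abiVersion}, ${check.parses} clean parses` };
}
```

Parsing the sample many times matters because the corruption described above shows up after the first file, so a single clean parse proves little.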
Step 3 — Discover AST node types
Get a representative source file (write a small sample covering functions, classes/structs, imports, enums; or `curl` a raw file from a known repo), then:
```bash
node scripts/add-lang/dump-ast.mjs <lang> path/to/sample.<ext>
# vendored grammar: pass the wasm path instead of the token
node scripts/add-lang/dump-ast.mjs src/extraction/wasm/tree-sitter-<lang>.wasm sample.<ext>
```
The frequency table + field names (`name:`, `parameters:`, `body:`, `return_type:`) tell you what to map. Open the existing extractor closest to the language's paradigm as a model: `rust.ts`/`scala.ts` (functional, traits), `java.ts`/`csharp.ts` (OO), `python.ts`/`ruby.ts` (scripting), `go.ts` (top-level methods + receivers).
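To make the idea of a node-type frequency table concrete, here is a small self-contained sketch. It does not reproduce `dump-ast.mjs`; `AstNode` is a hypothetical minimal node shape (real tree-sitter nodes expose far more), and `nodeTypeFrequencies` is an invented helper.

```typescript
// Minimal stand-in for a syntax-tree node; an assumption, not the tree-sitter API.
interface AstNode {
  type: string;
  children: AstNode[];
}

// Count how often each node type occurs, most frequent first (ties by name).
function nodeTypeFrequencies(root: AstNode): Array<[string, number]> {
  const counts = new Map<string, number>();
  const stack: AstNode[] = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (node === undefined) break;
    counts.set(node.type, (counts.get(node.type) ?? 0) + 1);
    for (const child of node.children) stack.push(child);
  }
  return Array.from(counts.entries()).sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0]),
  );
}
```

The types near the top of such a table are the ones most worth mapping first; rare types can usually wait for the follow-ups list in the final report.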
Step 4 — Wire the language (4 files)
These edits are exact and fragile, so match the existing style precisely:
- `src/types.ts` (TWO edits):
  - add `'<lang>',` to the `LANGUAGES` const (before `'unknown'`);
  - add `'**/*.<ext>',` to `DEFAULT_CONFIG.include`. Don't skip this: it's the file-scan allowlist, and without the glob `codegraph init` finds 0 files even though detection/extraction are wired.
- `src/extraction/grammars.ts` (three maps):
  - `WASM_GRAMMAR_FILES`: `<lang>: 'tree-sitter-<lang>.wasm',`
  - `EXTENSION_MAP`: each file extension → `'<lang>'` (e.g. `'.lua': 'lua',`)
  - `getLanguageDisplayName`: `<lang>: '<Display Name>',`
  - vendored only: add `<lang>` to the `(lang === 'pascal' || lang === 'scala' || …)` wasm-path branch.
- `src/extraction/languages/<lang>.ts`: new file exporting `export const <lang>Extractor: LanguageExtractor = { … }`. Map the node types from Step 3. Required fields: `functionTypes`, `classTypes`, `methodTypes`, `interfaceTypes`, `structTypes`, `enumTypes`, `typeAliasTypes`, `importTypes`, `callTypes`, `variableTypes`, `nameField`, `bodyField`, `paramsField`. Add hooks as the grammar needs them (`getSignature`, `getVisibility`, `isExported`, `extractImport`, `visitNode`, `getReceiverType`, `interfaceKind`, `enumMemberTypes`, etc.; see `src/extraction/tree-sitter-types.ts`).
- `src/extraction/languages/index.ts`: `import { <lang>Extractor } from './<lang>';` and add `<lang>: <lang>Extractor,` to `EXTRACTORS`.
Sometimes a 5th, core touch is needed in `src/extraction/tree-sitter.ts`: variable extraction has per-language branches in `extractVariable` (the generic fallback only finds direct `identifier` / `variable_declarator` children). If the grammar nests declared names (e.g. Lua's `variable_declaration → variable_list`), add a `} else if (this.language === '<lang>')` branch there, mirroring the existing ts/python/go ones. Import forms that aren't a distinct node (Lua/Ruby `require` is a call) are handled in the extractor's `visitNode` hook instead.
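The shape of a new extractor file can be sketched as below. Everything here is hedged: `ExtractorSketch` is a reduced, hypothetical mirror of `LanguageExtractor` (the real interface lives in `src/extraction/tree-sitter-types.ts` and may type these fields differently), the node-type names are placeholders for a Lua-like grammar that have not been checked against any real grammar, and `emptyTypeLists` is an invented review helper.

```typescript
// Reduced, hypothetical mirror of LanguageExtractor: only the required fields
// named in the text, all typed as plain strings / string arrays (an assumption).
interface ExtractorSketch {
  functionTypes: string[];
  classTypes: string[];
  methodTypes: string[];
  interfaceTypes: string[];
  structTypes: string[];
  enumTypes: string[];
  typeAliasTypes: string[];
  importTypes: string[];
  callTypes: string[];
  variableTypes: string[];
  nameField: string;
  bodyField: string;
  paramsField: string;
}

// Placeholder node-type names; replace them with what dump-ast.mjs prints in Step 3.
const luaExtractorSketch: ExtractorSketch = {
  functionTypes: ["function_declaration"],
  classTypes: [],
  methodTypes: [],
  interfaceTypes: [],
  structTypes: [],
  enumTypes: [],
  typeAliasTypes: [],
  importTypes: [], // `require` is a call, so there is no distinct import node
  callTypes: ["function_call"],
  variableTypes: ["variable_declaration"],
  nameField: "name",
  bodyField: "body",
  paramsField: "parameters",
};

// List the type arrays that are still empty, as a quick wiring review aid.
function emptyTypeLists(extractor: ExtractorSketch): string[] {
  const listKeys = [
    "functionTypes", "classTypes", "methodTypes", "interfaceTypes", "structTypes",
    "enumTypes", "typeAliasTypes", "importTypes", "callTypes", "variableTypes",
  ] as const;
  return listKeys.filter((key) => extractor[key].length === 0);
}
```

An empty list is not necessarily a bug (a language without classes has no `classTypes`), but an empty `functionTypes` or `callTypes` usually means the Step 3 names were not carried over.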
Step 5 — Build + verify loop
```bash
npm run build      # tsc + copy-assets (copies any vendored *.wasm into dist/)
```
Index a small sample repo and check extraction:
```bash
( cd <sample-repo> && codegraph init -i )
node scripts/add-lang/verify-extraction.mjs <sample-repo> <lang>
```
`verify-extraction.mjs` FAILs (exit code 1) if the language was not detected or if only `file`/`import` nodes were produced, which is the classic symptom of wrong node-type names. On FAIL or WARN: re-run `dump-ast.mjs` on a richer file, fix the mapping in `<lang>.ts`, run `npm run build`, re-index, and verify again. Repeat until PASS.
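The FAIL condition of the verification step can be illustrated with a short sketch. `NodeCounts` and `verifyVerdict` are hypothetical; the real script's inputs, output format, and WARN conditions are not modelled here.

```typescript
// Hypothetical input: node kind -> number of nodes extracted for the target language.
type NodeCounts = Record<string, number>;

// FAIL when nothing was extracted, or when only file/import nodes exist.
function verifyVerdict(counts: NodeCounts): "PASS" | "FAIL" {
  const kinds = Object.keys(counts).filter((kind) => (counts[kind] ?? 0) > 0);
  if (kinds.length === 0) return "FAIL"; // language not detected at all
  const onlyTrivial = kinds.every((kind) => kind === "file" || kind === "import");
  return onlyTrivial ? "FAIL" : "PASS";
}
```

A result with only `file` and `import` nodes means files were scanned but no declaration node types matched, which points back at the mapping rather than at the grammar.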
Step 6 — Tests
Add to `__tests__/extraction.test.ts`, modeled on the `Rust Extraction` block:
- a `detectLanguage` assertion in `describe('Language Detection')`
- a `describe('<Lang> Extraction')` block asserting functions/classes/imports are extracted from an inline source string.
```bash
npx vitest run __tests__/extraction.test.ts
```
Green before continuing.
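What the language-detection assertion exercises can be shown with a stand-alone sketch. `EXTENSION_MAP_SKETCH` and `detectLanguageSketch` are hypothetical stand-ins; the real `detectLanguage` and `EXTENSION_MAP` may behave differently (for example around case or multi-part extensions).

```typescript
// Hypothetical map in the shape of EXTENSION_MAP (file extension -> language token).
const EXTENSION_MAP_SKETCH: Record<string, string> = {
  ".lua": "lua",
  ".rs": "rust",
  ".ts": "typescript",
};

// Stand-in for detectLanguage: look up the final extension, else 'unknown'.
function detectLanguageSketch(filePath: string): string {
  const base = filePath.slice(filePath.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  if (dot <= 0) return "unknown"; // no extension, or a dotfile such as ".gitignore"
  const ext = base.slice(dot);
  return Object.prototype.hasOwnProperty.call(EXTENSION_MAP_SKETCH, ext)
    ? (EXTENSION_MAP_SKETCH[ext] as string)
    : "unknown";
}
```

A detection test for a new language then boils down to one positive case per registered extension plus a check that unrelated files still resolve to `unknown`.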
Step 7 — Auto-pick 3 repos + corpus
Pick without asking. Find candidates, then curate 3 that are genuinely `<lang>`-dominant, one per size tier:
```bash
gh search repos --language=<lang> --sort=stars --limit 40 \
  --json fullName,stargazerCount,description
```
Tiers (match `corpus.json`): Small <~150 files · Medium ~150–1500 · Large >~1500. Skip repos that are tagged `<lang>` but mostly another language. Write one cross-file architecture question per repo (the kind that needs tracing across files). Add a `"<Language>"` block to `.claude/skills/agent-eval/corpus.json` (fields: `name`, `repo`, `size`, `files`, `question`) so `/agent-eval` can reuse them.
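The tier rule and the corpus entry shape can be sketched as follows. The field names come from the text above, but their value formats are assumptions: the real `corpus.json` may spell `size` differently, and `sizeTier` / `coversAllTiers` are invented helpers that treat the approximate boundaries as hard cut-offs.

```typescript
type SizeTier = "small" | "medium" | "large";

// Hypothetical shape of one corpus entry; value formats are assumptions.
interface CorpusEntry {
  name: string;
  repo: string;
  size: SizeTier;
  files: number;
  question: string;
}

// Approximate boundaries from the text: <~150, ~150-1500, >~1500 files.
function sizeTier(files: number): SizeTier {
  if (files < 150) return "small";
  if (files <= 1500) return "medium";
  return "large";
}

// True when the picks cover each tier exactly once.
function coversAllTiers(entries: CorpusEntry[]): boolean {
  const tiers = entries.map((entry) => sizeTier(entry.files)).sort();
  return tiers.join(",") === "large,medium,small";
}
```

Checking tier coverage before benchmarking is cheap insurance: three repos from the same tier would make the A/B numbers look more uniform than they are.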
Step 8 — Benchmark all 3 (extraction + A/B)
Make the dev build the codegraph on PATH once, then loop:
```bash
npm run build && ./scripts/local-install.sh
scripts/add-lang/bench.sh <lang> <name> <url> "<question>" headless   # ×3
```
[The explanatory paragraph that followed this block did not survive extraction; only its code identifiers remain: `bench.sh`, `/tmp/codegraph-corpus`, `verify-extraction.mjs`, `scripts/agent-eval/run-all.sh`, `parse-run.mjs`, `run-all.sh`, `Read`, `with`, `without`, `./scripts/local-install.sh`.]
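The three-repo loop can be expressed as data, following the command shape shown above. `BenchTarget` and `benchArgs` are hypothetical names; only the argument order is taken from the command line in the text, and nothing here says how `bench.sh` itself behaves.

```typescript
// Hypothetical description of one benchmark target (one corpus repo).
interface BenchTarget {
  name: string;
  url: string;
  question: string;
}

// Build the argument vector for one bench.sh run:
// bench.sh <lang> <name> <url> "<question>" headless
function benchArgs(lang: string, target: BenchTarget): string[] {
  return [
    "scripts/add-lang/bench.sh",
    lang,
    target.name,
    target.url,
    target.question,
    "headless",
  ];
}

// One argument vector per repo, in the order given.
function benchPlan(lang: string, targets: BenchTarget[]): string[][] {
  return targets.map((target) => benchArgs(lang, target));
}
```

Keeping the question as a single array element, rather than splicing it into a shell string, sidesteps quoting problems when a question contains quotes or `$`.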
Step 9 — Docs + CHANGELOG
- README.md: add `<Lang>` to the "19+ Languages" feature bullet, and add a row to the Supported Languages table: ``| <Lang> | `.ext` | Full support (classes, methods, …) |``.
- CHANGELOG.md: add an `## [Unreleased]` section at the top (above the latest version) with `### Added` → a user-perspective bullet, e.g. "CodeGraph now indexes **<Lang>** (`.ext`): functions, classes, imports, and call edges." If `## [Unreleased]` already exists, append under it. (It's folded into the next versioned block at release time.)
Step 10 — Report (do NOT commit)
Summarize for review:
- Files changed: the 4 wiring edits + new extractor + tests + README + CHANGELOG + corpus.json (+ any vendored `.wasm`).
- Extraction per repo: files / nodes / edges / `verify-extraction` result.
- A/B per repo: `with` vs `without` (tool calls, file Reads, cost) and a one-line verdict: did codegraph reduce effort, and did both arms reach a correct answer?
- Gaps / follow-ups (node types not yet mapped, resolution edges missing, framework routes, etc.).
Hand the changes to the user. Do not run `git commit`/`push` or publish; releases go through the GitHub Actions Release workflow.
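The per-repo A/B line of the report can be derived mechanically from the two arms' metrics. The sketch below is one possible way to do it: `ArmMetrics`, `AbSummary`, and `summarizeAb` are hypothetical, and the rule "effort was reduced when both tool calls and file Reads went down" is an assumption, not something the benchmark scripts define.

```typescript
// Hypothetical metrics for one arm (with or without codegraph) on one repo.
interface ArmMetrics {
  toolCalls: number;
  fileReads: number;
  costUsd: number;
  correct: boolean; // did this arm reach a correct answer?
}

interface AbSummary {
  toolCallDelta: number; // with minus without; negative means fewer with codegraph
  fileReadDelta: number;
  costDeltaUsd: number;
  bothCorrect: boolean;
  verdict: string;
}

function summarizeAb(withArm: ArmMetrics, withoutArm: ArmMetrics): AbSummary {
  const toolCallDelta = withArm.toolCalls - withoutArm.toolCalls;
  const fileReadDelta = withArm.fileReads - withoutArm.fileReads;
  const costDeltaUsd = withArm.costUsd - withoutArm.costUsd;
  const bothCorrect = withArm.correct && withoutArm.correct;
  // Assumed rule: effort is "reduced" only if both counts dropped.
  const reduced = toolCallDelta < 0 && fileReadDelta < 0;
  const effort = reduced ? "codegraph reduced effort" : "no clear effort reduction";
  const answers = bothCorrect ? "both arms correct" : "at least one arm incorrect";
  return { toolCallDelta, fileReadDelta, costDeltaUsd, bothCorrect, verdict: `${effort}; ${answers}` };
}
```

Reporting correctness alongside the deltas keeps a cheaper but wrong run from being read as a win.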
Notes
- The A/B spawns real paid `claude -p` runs (opus, `--max-budget-usd`), 2 arms × 3 repos. The corpus dir `/tmp/codegraph-corpus` is shared with `/agent-eval`, so clones are reused across runs.
- Any new `*.wasm` must live in `src/extraction/wasm/`: `copy-assets` (run by `npm run build`) ships it; otherwise it won't be in `dist/`.
- An index must be served by the same binary that built it. Step 8 builds + links the dev build first, so this holds.
- If a grammar can't be obtained, or extraction can't reach PASS, STOP and report; don't ship a half-wired language.