
Finding Duplicate-Intent Functions


Overview


LLM-generated codebases accumulate semantic duplicates: functions that serve the same purpose but were implemented independently. Classical copy-paste detectors (jscpd) find syntactic duplicates but miss "same intent, different implementation."
This skill uses a two-stage approach: classical extraction followed by LLM-powered intent clustering.
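A concrete (invented) example of what "same intent, different implementation" looks like, written as shell for brevity: both functions trim surrounding whitespace, share almost no tokens, and would sail past a copy-paste detector:

```shell
# Two invented same-intent functions: a token-based detector sees no overlap,
# but an intent-based pass should pair them as consolidation candidates.
trim_v1() {   # sed-based: strip leading and trailing whitespace
  printf '%s' "$1" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}
trim_v2() {   # pure parameter expansion, no subprocess
  local s=$1
  s=${s#"${s%%[![:space:]]*}"}   # drop leading whitespace
  s=${s%"${s##*[![:space:]]}"}   # drop trailing whitespace
  printf '%s' "$s"
}
trim_v1 '  hello  '; echo   # prints "hello"
trim_v2 '  hello  '; echo   # prints "hello"
```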

When to Use


- Codebase has grown organically with multiple contributors (human or LLM)
- You suspect utility functions have been reimplemented multiple times
- Before major refactoring, to identify consolidation opportunities
- After jscpd has been run and syntactic duplicates are already handled

Quick Reference


| Phase | Tool | Model | Output |
| --- | --- | --- | --- |
| 1. Extract | `scripts/extract-functions.sh` | - | `catalog.json` |
| 2. Categorize | `scripts/categorize-prompt.md` | haiku | `categorized.json` |
| 3. Split | `scripts/prepare-category-analysis.sh` | - | `categories/*.json` |
| 4. Detect | `scripts/find-duplicates-prompt.md` | opus | `duplicates/*.json` |
| 5. Report | `scripts/generate-report.sh` | - | `report.md` |

Process


```dot
digraph duplicate_detection {
  rankdir=TB;
  node [shape=box];

  extract [label="1. Extract function catalog\n./scripts/extract-functions.sh"];
  categorize [label="2. Categorize by domain\n(haiku subagent)"];
  split [label="3. Split into categories\n./scripts/prepare-category-analysis.sh"];
  detect [label="4. Find duplicates per category\n(opus subagent per category)"];
  report [label="5. Generate report\n./scripts/generate-report.sh"];
  review [label="6. Human review & consolidate"];

  extract -> categorize -> split -> detect -> report -> review;
}
```

Phase 1: Extract Function Catalog


```bash
./scripts/extract-functions.sh src/ -o catalog.json
```

Options:

- `-o FILE`: Output file (default: stdout)
- `-c N`: Lines of context to capture (default: 15)
- `-t GLOB`: File types (default: `*.ts,*.tsx,*.js,*.jsx`)
- `--include-tests`: Include test files (excluded by default)

Test files (`*.test.*`, `*.spec.*`, `__tests__/**`) are excluded by default since test utilities are less likely to be consolidation candidates.
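The script itself is not reproduced here, but a minimal sketch of what Phase 1 boils down to might look like the following (assuming functions are declared with a grep-able `export function` convention; the temp path and file contents are illustrative, and the real script additionally captures context lines and emits JSON):

```shell
# Simplified stand-in for extract-functions.sh: list exported function names
# with their file and line number. The real script also records context.
mkdir -p /tmp/dupscan/src
cat > /tmp/dupscan/src/util.ts <<'EOF'
export function formatDate(d: Date): string { return d.toISOString(); }
export function slugify(s: string): string { return s.toLowerCase(); }
EOF
catalog=$(grep -rnoE 'export (async )?function [A-Za-z0-9_]+' /tmp/dupscan/src \
            --include='*.ts' --include='*.tsx' \
          | sed -E 's/export (async )?function //')
printf '%s\n' "$catalog"
```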

Phase 2: Categorize by Domain


Dispatch a haiku subagent using the prompt in `scripts/categorize-prompt.md`. Insert the contents of `catalog.json` where indicated in the prompt template, and save the output as `categorized.json`.
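How to "insert where indicated" depends on the template's marker. Assuming a hypothetical `{{CATALOG}}` placeholder, the splice can be done with awk:

```shell
# The {{CATALOG}} marker name is an assumption; use whatever placeholder
# scripts/categorize-prompt.md actually contains.
mkdir -p /tmp/dupscan
echo '[{"name":"formatDate","file":"src/util.ts"}]' > /tmp/dupscan/catalog.json
printf 'Categorize these functions by domain:\n{{CATALOG}}\n' > /tmp/dupscan/prompt.md
awk '{
  if (index($0, "{{CATALOG}}") > 0) {
    while ((getline line < cf) > 0) print line   # splice in the catalog
  } else {
    print
  }
}' cf=/tmp/dupscan/catalog.json /tmp/dupscan/prompt.md > /tmp/dupscan/filled.md
```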

Phase 3: Split into Categories


```bash
./scripts/prepare-category-analysis.sh categorized.json ./categories
```

This creates one JSON file per category. Only categories with three or more functions are worth analyzing.
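The three-function threshold can be applied before dispatching anything. A sketch, assuming one function record per line in each category file (an assumption; if the files are JSON arrays, count elements with `jq 'length'` instead):

```shell
# Skip categories too small to contain meaningful duplicates (< 3 functions).
mkdir -p /tmp/dupscan/categories
printf '{"name":"a"}\n{"name":"b"}\n' > /tmp/dupscan/categories/date.json
printf '{"name":"w"}\n{"name":"x"}\n{"name":"y"}\n{"name":"z"}\n' \
  > /tmp/dupscan/categories/string.json
worth=""
for f in /tmp/dupscan/categories/*.json; do
  n=$(wc -l < "$f")
  if [ "$n" -ge 3 ]; then
    worth="$worth $(basename "$f")"
  fi
done
echo "worth analyzing:$worth"   # date.json (2 functions) is skipped
```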

Phase 4: Find Duplicates (Per Category)


For each category file in `./categories/`, dispatch an opus subagent using the prompt in `scripts/find-duplicates-prompt.md`. Save each output as `./duplicates/{category}.json`.
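The dispatch itself depends on your model tooling, so `run_opus_agent` below is a stub; only the loop shape (one opus call per category file, one output file each) mirrors the process above:

```shell
# 'run_opus_agent' stands in for however you send the prompt plus a category
# file to the model; here it just emits a placeholder JSON reply.
run_opus_agent() {
  echo '{"category":"'"$(basename "$2" .json)"'","duplicates":[]}'
}
mkdir -p /tmp/dupscan/categories /tmp/dupscan/duplicates
echo '{}' > /tmp/dupscan/categories/string.json
echo '{}' > /tmp/dupscan/categories/date.json
for f in /tmp/dupscan/categories/*.json; do
  name=$(basename "$f" .json)
  run_opus_agent scripts/find-duplicates-prompt.md "$f" \
    > "/tmp/dupscan/duplicates/$name.json"
done
ls /tmp/dupscan/duplicates
```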

Phase 5: Generate Report


```bash
./scripts/generate-report.sh ./duplicates ./duplicates-report.md
```

This produces a prioritized markdown report grouped by confidence level.
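The grouping the report performs can be sketched as a sort-and-group pass, assuming the findings have first been flattened to tab-separated `CONFIDENCE<TAB>description` lines (an assumption; the real script reads the JSON files directly):

```shell
# Group flattened findings under one heading per confidence level.
mkdir -p /tmp/dupscan
printf 'HIGH\tformatDate vs fmtDate\nLOW\tslugify vs toKebab\nHIGH\ttrim vs stripWs\n' \
  > /tmp/dupscan/findings.tsv
report=$(sort /tmp/dupscan/findings.tsv \
  | awk -F'\t' '$1 != prev { print "## " $1 " confidence"; prev = $1 }
                { print "- " $2 }')
printf '%s\n' "$report"
```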

Phase 6: Human Review


Review the report. For HIGH confidence duplicates:
  1. Verify the recommended survivor has tests
  2. Update callers to use the survivor
  3. Delete the duplicates
  4. Run tests
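Step 2 (updating callers) is often a mechanical rename. A sketch using an invented duplicate name, `formatDateAlt`, and its survivor `formatDate`:

```shell
# Repoint every caller of the duplicate at the survivor.
mkdir -p /tmp/consol/src
cat > /tmp/consol/src/report.ts <<'EOF'
import { formatDateAlt } from './dates';
console.log(formatDateAlt(new Date()));
EOF
grep -rl 'formatDateAlt' /tmp/consol/src --include='*.ts' --include='*.tsx' \
  | xargs sed -i 's/formatDateAlt/formatDate/g'   # BSD/macOS sed needs -i ''
cat /tmp/consol/src/report.ts
```

A rename pass like this is only a starting point: import paths may also need updating, and the type checker plus the test suite (steps 1 and 4) are what actually confirm the consolidation is safe.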

High-Risk Duplicate Zones


Focus extraction on these areas first; they accumulate duplicates fastest:

| Zone | Common Duplicates |
| --- | --- |
| `utils/`, `helpers/`, `lib/` | General utilities reimplemented |
| Validation code | Same checks written multiple ways |
| Error formatting | Error-to-string conversions |
| Path manipulation | Joining, resolving, normalizing paths |
| String formatting | Case conversion, truncation, escaping |
| Date formatting | Same formats implemented repeatedly |
| API response shaping | Similar transformations for different endpoints |
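To make the `utils/` row concrete, here are two invented path-join helpers of exactly the kind this table warns about: same intent, no shared syntax:

```shell
# Both join two path segments while collapsing the doubled slash at the seam.
join_path_v1() {            # parameter expansion: trim the adjoining slashes
  printf '%s/%s' "${1%/}" "${2#/}"
}
join_path_v2() {            # sed: squash any run of slashes after joining
  echo "$1/$2" | sed 's#//*#/#g'
}
join_path_v1 'src/' '/lib/a.ts'; echo   # prints "src/lib/a.ts"
join_path_v2 'src/' '/lib/a.ts'; echo   # prints "src/lib/a.ts"
```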

Common Mistakes


**Extracting too much:** Focus on exported functions and public methods. Internal helpers are less likely to be duplicated across files.

**Skipping the categorization step:** Going straight to duplicate detection on the full catalog produces noise. Categories focus the comparison.

**Using haiku for duplicate detection:** Haiku is cost-effective for categorization but misses subtle semantic duplicates. Use opus for the actual duplicate analysis.

**Consolidating without tests:** Before deleting duplicates, ensure the survivor has tests covering all use cases of the deleted functions.