
Finding Duplicate-Intent Functions


Overview


LLM-generated codebases accumulate semantic duplicates: functions that serve the same purpose but were implemented independently. Classical copy-paste detectors (jscpd) find syntactic duplicates but miss "same intent, different implementation."
This skill uses a two-stage approach: classical extraction followed by LLM-powered intent clustering.
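A concrete (invented) example of what "same intent, different implementation" looks like, written as shell for brevity: both functions trim surrounding whitespace, share almost no tokens, and would sail past a copy-paste detector:

```shell
# Two invented same-intent functions: a token-based detector sees no overlap,
# but an intent-based pass should pair them as consolidation candidates.
trim_v1() {   # sed-based: strip leading and trailing whitespace
  printf '%s' "$1" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}
trim_v2() {   # pure parameter expansion, no subprocess
  local s=$1
  s=${s#"${s%%[![:space:]]*}"}   # drop leading whitespace
  s=${s%"${s##*[![:space:]]}"}   # drop trailing whitespace
  printf '%s' "$s"
}
trim_v1 '  hello  '; echo   # prints "hello"
trim_v2 '  hello  '; echo   # prints "hello"
```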

When to Use


- Codebase has grown organically with multiple contributors (human or LLM)
- You suspect utility functions have been reimplemented multiple times
- Before major refactoring, to identify consolidation opportunities
- After jscpd has been run and syntactic duplicates are already handled

Quick Reference


| Phase | Tool | Model | Output |
| --- | --- | --- | --- |
| 1. Extract | `scripts/extract-functions.sh` | - | `catalog.json` |
| 2. Categorize | `scripts/categorize-prompt.md` | haiku | `categorized.json` |
| 3. Split | `scripts/prepare-category-analysis.sh` | - | `categories/*.json` |
| 4. Detect | `scripts/find-duplicates-prompt.md` | opus | `duplicates/*.json` |
| 5. Report | `scripts/generate-report.sh` | - | `report.md` |

Process


```dot
digraph duplicate_detection {
  rankdir=TB;
  node [shape=box];

  extract [label="1. Extract function catalog\n./scripts/extract-functions.sh"];
  categorize [label="2. Categorize by domain\n(haiku subagent)"];
  split [label="3. Split into categories\n./scripts/prepare-category-analysis.sh"];
  detect [label="4. Find duplicates per category\n(opus subagent per category)"];
  report [label="5. Generate report\n./scripts/generate-report.sh"];
  review [label="6. Human review & consolidate"];

  extract -> categorize -> split -> detect -> report -> review;
}
```

Phase 1: Extract Function Catalog


```bash
./scripts/extract-functions.sh src/ -o catalog.json
```

Options:

- `-o FILE`: Output file (default: stdout)
- `-c N`: Lines of context to capture (default: 15)
- `-t GLOB`: File types (default: `*.ts,*.tsx,*.js,*.jsx`)
- `--include-tests`: Include test files (excluded by default)

Test files (`*.test.*`, `*.spec.*`, `__tests__/**`) are excluded by default since test utilities are less likely to be consolidation candidates.
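The script itself is not reproduced here, but a minimal sketch of what Phase 1 boils down to might look like the following (assuming functions are declared with a grep-able `export function` convention; the temp path and file contents are illustrative, and the real script additionally captures context lines and emits JSON):

```shell
# Simplified stand-in for extract-functions.sh: list exported function names
# with their file and line number. The real script also records context.
mkdir -p /tmp/dupscan/src
cat > /tmp/dupscan/src/util.ts <<'EOF'
export function formatDate(d: Date): string { return d.toISOString(); }
export function slugify(s: string): string { return s.toLowerCase(); }
EOF
catalog=$(grep -rnoE 'export (async )?function [A-Za-z0-9_]+' /tmp/dupscan/src \
            --include='*.ts' --include='*.tsx' \
          | sed -E 's/export (async )?function //')
printf '%s\n' "$catalog"
```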

Phase 2: Categorize by Domain


Dispatch a haiku subagent using the prompt in `scripts/categorize-prompt.md`. Insert the contents of `catalog.json` where indicated in the prompt template, and save the output as `categorized.json`.
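How to "insert where indicated" depends on the template's marker. Assuming a hypothetical `{{CATALOG}}` placeholder, the splice can be done with awk:

```shell
# The {{CATALOG}} marker name is an assumption; use whatever placeholder
# scripts/categorize-prompt.md actually contains.
mkdir -p /tmp/dupscan
echo '[{"name":"formatDate","file":"src/util.ts"}]' > /tmp/dupscan/catalog.json
printf 'Categorize these functions by domain:\n{{CATALOG}}\n' > /tmp/dupscan/prompt.md
awk '{
  if (index($0, "{{CATALOG}}") > 0) {
    while ((getline line < cf) > 0) print line   # splice in the catalog
  } else {
    print
  }
}' cf=/tmp/dupscan/catalog.json /tmp/dupscan/prompt.md > /tmp/dupscan/filled.md
```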

Phase 3: Split into Categories


```bash
./scripts/prepare-category-analysis.sh categorized.json ./categories
```

This creates one JSON file per category. Only categories with three or more functions are worth analyzing.
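The three-function threshold can be applied before dispatching anything. A sketch, assuming one function record per line in each category file (an assumption; if the files are JSON arrays, count elements with `jq 'length'` instead):

```shell
# Skip categories too small to contain meaningful duplicates (< 3 functions).
mkdir -p /tmp/dupscan/categories
printf '{"name":"a"}\n{"name":"b"}\n' > /tmp/dupscan/categories/date.json
printf '{"name":"w"}\n{"name":"x"}\n{"name":"y"}\n{"name":"z"}\n' \
  > /tmp/dupscan/categories/string.json
worth=""
for f in /tmp/dupscan/categories/*.json; do
  n=$(wc -l < "$f")
  if [ "$n" -ge 3 ]; then
    worth="$worth $(basename "$f")"
  fi
done
echo "worth analyzing:$worth"   # date.json (2 functions) is skipped
```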

Phase 4: Find Duplicates (Per Category)


For each category file in `./categories/`, dispatch an opus subagent using the prompt in `scripts/find-duplicates-prompt.md`. Save each output as `./duplicates/{category}.json`.
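The dispatch itself depends on your model tooling, so `run_opus_agent` below is a stub; only the loop shape (one opus call per category file, one output file each) mirrors the process above:

```shell
# 'run_opus_agent' stands in for however you send the prompt plus a category
# file to the model; here it just emits a placeholder JSON reply.
run_opus_agent() {
  echo '{"category":"'"$(basename "$2" .json)"'","duplicates":[]}'
}
mkdir -p /tmp/dupscan/categories /tmp/dupscan/duplicates
echo '{}' > /tmp/dupscan/categories/string.json
echo '{}' > /tmp/dupscan/categories/date.json
for f in /tmp/dupscan/categories/*.json; do
  name=$(basename "$f" .json)
  run_opus_agent scripts/find-duplicates-prompt.md "$f" \
    > "/tmp/dupscan/duplicates/$name.json"
done
ls /tmp/dupscan/duplicates
```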

Phase 5: Generate Report


```bash
./scripts/generate-report.sh ./duplicates ./duplicates-report.md
```

This produces a prioritized markdown report grouped by confidence level.
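The grouping the report performs can be sketched as a sort-and-group pass, assuming the findings have first been flattened to tab-separated `CONFIDENCE<TAB>description` lines (an assumption; the real script reads the JSON files directly):

```shell
# Group flattened findings under one heading per confidence level.
mkdir -p /tmp/dupscan
printf 'HIGH\tformatDate vs fmtDate\nLOW\tslugify vs toKebab\nHIGH\ttrim vs stripWs\n' \
  > /tmp/dupscan/findings.tsv
report=$(sort /tmp/dupscan/findings.tsv \
  | awk -F'\t' '$1 != prev { print "## " $1 " confidence"; prev = $1 }
                { print "- " $2 }')
printf '%s\n' "$report"
```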

Phase 6: Human Review


Review the report. For HIGH confidence duplicates:
  1. Verify the recommended survivor has tests
  2. Update callers to use the survivor
  3. Delete the duplicates
  4. Run tests
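Step 2 (updating callers) is often a mechanical rename. A sketch using an invented duplicate name, `formatDateAlt`, and its survivor `formatDate`:

```shell
# Repoint every caller of the duplicate at the survivor.
mkdir -p /tmp/consol/src
cat > /tmp/consol/src/report.ts <<'EOF'
import { formatDateAlt } from './dates';
console.log(formatDateAlt(new Date()));
EOF
grep -rl 'formatDateAlt' /tmp/consol/src --include='*.ts' --include='*.tsx' \
  | xargs sed -i 's/formatDateAlt/formatDate/g'   # BSD/macOS sed needs -i ''
cat /tmp/consol/src/report.ts
```

A rename pass like this is only a starting point: import paths may also need updating, and the type checker plus the test suite (steps 1 and 4) are what actually confirm the consolidation is safe.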

High-Risk Duplicate Zones


Focus extraction on these areas first; they accumulate duplicates fastest:

| Zone | Common Duplicates |
| --- | --- |
| `utils/`, `helpers/`, `lib/` | General utilities reimplemented |
| Validation code | Same checks written multiple ways |
| Error formatting | Error-to-string conversions |
| Path manipulation | Joining, resolving, normalizing paths |
| String formatting | Case conversion, truncation, escaping |
| Date formatting | Same formats implemented repeatedly |
| API response shaping | Similar transformations for different endpoints |
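To make the `utils/` row concrete, here are two invented path-join helpers of exactly the kind this table warns about: same intent, no shared syntax:

```shell
# Both join two path segments while collapsing the doubled slash at the seam.
join_path_v1() {            # parameter expansion: trim the adjoining slashes
  printf '%s/%s' "${1%/}" "${2#/}"
}
join_path_v2() {            # sed: squash any run of slashes after joining
  echo "$1/$2" | sed 's#//*#/#g'
}
join_path_v1 'src/' '/lib/a.ts'; echo   # prints "src/lib/a.ts"
join_path_v2 'src/' '/lib/a.ts'; echo   # prints "src/lib/a.ts"
```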

Common Mistakes


**Extracting too much:** Focus on exported functions and public methods. Internal helpers are less likely to be duplicated across files.

**Skipping the categorization step:** Going straight to duplicate detection on the full catalog produces noise. Categories focus the comparison.

**Using haiku for duplicate detection:** Haiku is cost-effective for categorization but misses subtle semantic duplicates. Use opus for the actual duplicate analysis.

**Consolidating without tests:** Before deleting duplicates, ensure the survivor has tests covering all use cases of the deleted functions.