wiki-dedup

Wiki Dedup — Identity Resolution and Page-Level Deduplication

You are finding and merging wiki pages that cover the same concept under different names. This is a write-heavy, potentially destructive skill — page merges cannot be automatically undone. Work carefully and confirm before acting in merge mode.

Follow the Retrieval Primitives table in `llm-wiki/SKILL.md`. The candidate-detection pass uses only frontmatter and titles (cheap). Only open full page bodies for confirmed candidate pairs.

Before You Start

1. Resolve config: follow the Config Resolution Protocol in `llm-wiki/SKILL.md` (walk up from CWD for `.env` → `~/.obsidian-wiki/config` → prompt setup). This gives `OBSIDIAN_VAULT_PATH` and `OBSIDIAN_LINK_FORMAT`.
2. Read `index.md` to get the full page inventory with one-line descriptions and tags.
3. Read `log.md` briefly. If a dedup run just happened, note what was already merged.

Modes

Mode	Flag	Behavior
Audit	(default)	Report candidates only — no writes
Merge	`--merge`	Show each confirmed pair, ask for confirmation before merging
Auto-merge	`--auto`	Merge all high-confidence pairs ( `score ≥ 0.90` ) non-interactively

If the user doesn't specify, run in Audit mode and present findings before asking whether to proceed.

Step 1: Build the Page Registry

Glob all `.md` files in the vault (excluding `_archives/`, `_raw/`, `.obsidian/`, `index.md`, `log.md`, `hot.md`, `_insights.md`, and any file that contains `redirects_to:` in its frontmatter, since those are already merged redirect stubs).

For each remaining page, extract from frontmatter:

- `node_id`: relative path from vault root, without `.md`
- `title`: frontmatter `title` field
- `aliases`: frontmatter `aliases` list (may be absent)
- `tags`: frontmatter `tags` list
- `category`: directory prefix

Build a lookup table: `node_id → {title, aliases, tags, category, summary}`
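The registry rules above can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: the function names are hypothetical, the frontmatter is assumed to be already parsed into a `dict`, `category` is assumed to be the first path component, the excluded file names are matched anywhere in the vault, and `summary` is assumed to come from a frontmatter field of the same name.

```python
from pathlib import PurePosixPath

EXCLUDED_DIRS = {"_archives", "_raw", ".obsidian"}
EXCLUDED_FILES = {"index.md", "log.md", "hot.md", "_insights.md"}

def is_registry_page(rel_path: str, frontmatter: dict) -> bool:
    """Return True if the page belongs in the dedup registry."""
    path = PurePosixPath(rel_path)
    if path.suffix != ".md":
        return False
    if EXCLUDED_DIRS & set(path.parts[:-1]):
        return False
    if path.name in EXCLUDED_FILES:
        return False
    # Redirect stubs are already-merged pages, so they are skipped.
    return "redirects_to" not in frontmatter

def registry_entry(rel_path: str, frontmatter: dict):
    """Build one (node_id, record) pair for the lookup table."""
    path = PurePosixPath(rel_path)
    node_id = str(path.with_suffix(""))  # vault-relative path without .md
    category = path.parts[0] if len(path.parts) > 1 else ""  # assumed meaning of "directory prefix"
    return node_id, {
        "title": frontmatter.get("title", ""),
        "aliases": frontmatter.get("aliases") or [],  # may be absent
        "tags": frontmatter.get("tags") or [],
        "category": category,
        "summary": frontmatter.get("summary", ""),  # assumed frontmatter field
    }
```

Collecting `dict(registry_entry(p, fm) for p, fm in pages if is_registry_page(p, fm))` then yields the lookup table.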

Step 2: Detect Candidate Pairs

For every pair of pages in the registry, compute a similarity score using these signals:

2a. Title similarity signals

Signal	How to assess	Max contribution
Token overlap	Jaccard similarity of lowercased title word-tokens (split on spaces, hyphens, underscores, punctuation)	0.65
Edit distance	Normalized edit distance on lowercased titles: `1 - (edits / max(len_a, len_b))`	0.40
Substring containment	One title is a substring of the other (e.g. "RSC" ⊂ "React Server Components")	0.50
Alias cross-match	Page A's title appears in page B's `aliases` , or vice versa	0.65

Composite title score = `min(max(token_overlap, edit_distance, substring), 0.65) + alias_cross_bonus`

You don't need exact arithmetic — make a confident judgement about degree of similarity.
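For readers who want to see the arithmetic, here is one possible reading of the title signals as Python. It rests on several assumptions: each signal is scaled linearly to its maximum contribution, "substring containment" also counts an initialism match (the "RSC" example is not a literal substring), and alias matching is case-insensitive. All function names are illustrative.

```python
import re

def _words(title: str) -> list:
    # Split on spaces, hyphens, underscores, and punctuation.
    return [w for w in re.split(r"[\W_]+", title.lower()) if w]

def jaccard(a: str, b: str) -> float:
    ta, tb = set(_words(a)), set(_words(b))
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def edit_similarity(a: str, b: str) -> float:
    """1 - (edits / max(len_a, len_b)), using Levenshtein distance."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 1.0 if a == b else 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1 - prev[-1] / max(len(a), len(b))

def contains(a: str, b: str) -> bool:
    """Literal substring, or one title equals the other's initials (assumption)."""
    la, lb = a.lower(), b.lower()
    if not la or not lb:
        return False
    initials_a = "".join(w[0] for w in _words(a))
    initials_b = "".join(w[0] for w in _words(b))
    return la in lb or lb in la or la == initials_b or lb == initials_a

def title_score(a, b, aliases_a=(), aliases_b=()) -> float:
    token = 0.65 * jaccard(a, b)
    edit = 0.40 * edit_similarity(a, b)
    substring = 0.50 if contains(a, b) else 0.0
    alias_hit = (a.lower() in {x.lower() for x in aliases_b}
                 or b.lower() in {x.lower() for x in aliases_a})
    alias_cross_bonus = 0.65 if alias_hit else 0.0
    return min(max(token, edit, substring), 0.65) + alias_cross_bonus
```

Since the skill asks for a judgement rather than exact arithmetic, treat these numbers as a calibration aid, not a gate.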

Title extraction note: Some pages use YAML block scalars (`title: >-` or `title: |`). When the `title:` value is `>-`, `|`, or `|-`, the actual title is on the next indented line; read it from there. Never compare the literal string `>-` as a title.
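A small sketch of that extraction rule, assuming the frontmatter has already been split into lines and that titles are effectively single-line. `extract_title` is a hypothetical helper, and it joins continuation lines with spaces for every block style, which is a simplification of real YAML folding rules.

```python
BLOCK_INDICATORS = {">", ">-", ">+", "|", "|-", "|+"}

def extract_title(frontmatter_lines: list) -> str:
    """Return the title, reading past YAML block-scalar indicators."""
    for i, line in enumerate(frontmatter_lines):
        if not line.startswith("title:"):
            continue
        value = line[len("title:"):].strip()
        if value not in BLOCK_INDICATORS:
            return value.strip("'\"")
        # Block scalar: the real title sits on the indented lines that follow.
        parts = []
        for cont in frontmatter_lines[i + 1:]:
            if cont.strip() and not cont.startswith((" ", "\t")):
                break  # next top-level key reached
            if cont.strip():
                parts.append(cont.strip())
        return " ".join(parts)
    return ""
```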

2b. Semantic signals (cheap pass)

Signal	Points
Same `category` directory	+0.10
Tag overlap ≥ 3 shared tags	+0.15
Tag overlap ≥ 2 shared tags	+0.05
Same first tag (dominant tag)	+0.05
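The table above could be applied as in the sketch below. The table does not say whether the two tag-overlap tiers stack, so this version assumes they are exclusive (three or more shared tags earns 0.15, exactly two earns 0.05). The record shape follows the Step 1 lookup table, and `semantic_bonus` is an illustrative name.

```python
def semantic_bonus(page_a: dict, page_b: dict) -> float:
    """Cheap-pass bonus from category and tags only."""
    bonus = 0.0
    if page_a["category"] and page_a["category"] == page_b["category"]:
        bonus += 0.10
    shared = set(page_a["tags"]) & set(page_b["tags"])
    # Assumption: the tiers do not stack; only the highest matching tier applies.
    if len(shared) >= 3:
        bonus += 0.15
    elif len(shared) >= 2:
        bonus += 0.05
    # Same first (dominant) tag.
    if page_a["tags"] and page_b["tags"] and page_a["tags"][0] == page_b["tags"][0]:
        bonus += 0.05
    return round(bonus, 2)
```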

2c. Threshold

Flag pairs with composite score ≥ 0.75 as candidates. Pairs scoring 0.90+ are high-confidence.

Score ranges → confidence labels:

Score	Label
≥ 0.90	HIGH — almost certainly the same concept
0.75–0.89	MEDIUM — likely the same, verify
0.60–0.74	LOW — possible abbreviation or specialisation; skip unless user asks

Only carry HIGH and MEDIUM candidates into Step 3.
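As a sketch, the score bands map to labels like this. The tuple layout `(page_a, page_b, score)` and both function names are assumptions made for illustration.

```python
from typing import Optional

def confidence_label(score: float) -> Optional[str]:
    """Map a composite score to the confidence label from the table."""
    if score >= 0.90:
        return "HIGH"
    if score >= 0.75:
        return "MEDIUM"
    if score >= 0.60:
        return "LOW"
    return None  # below every band: not a candidate

def candidates_for_step3(scored_pairs):
    """Keep HIGH and MEDIUM pairs, highest score first."""
    kept = [p for p in scored_pairs if confidence_label(p[2]) in ("HIGH", "MEDIUM")]
    return sorted(kept, key=lambda p: p[2], reverse=True)
```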

2d. Quick exit rule

If the vault has fewer than 10 pages, skip the pair loop and report "vault too small to have meaningful duplicates". If the vault has more than 500 pages, process candidates in batches of 50 pairs — pause and report progress between batches.
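The size rule can be expressed as a small dispatcher plus a batching generator. This is only a sketch: the names and the returned string labels are hypothetical, and reporting progress between batches is left to the caller.

```python
from itertools import islice

def pair_loop_plan(page_count: int) -> str:
    if page_count < 10:
        return "skip"       # vault too small to have meaningful duplicates
    if page_count > 500:
        return "batched"    # process candidates 50 pairs at a time
    return "single-pass"

def batches(pairs, size=50):
    """Yield successive lists of at most `size` pairs."""
    it = iter(pairs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```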

Step 3: Semantic Verdict

For each candidate pair (sorted by score descending):

1. Read both pages in full (a full page read is justified because the candidate pool is small).
2. Ask: are these pages covering the same concept, or are they distinct?

Assign one of three verdicts:

Verdict	Meaning
`merge`	Same concept — different name, abbreviation, alias, or accidental duplicate. Safe to merge.
`keep-separate`	Related but distinct — e.g. "Server Actions" vs "Server Components" are related React features, not duplicates.
`needs-review`	Ambiguous — substantial overlap but also meaningful differences. Flag for the user to decide.

Attach a short reason to each verdict (one sentence). This appears in the report and the log.

Step 4: Audit Report

Always produce this report, even in merge/auto-merge mode (so the user sees what will happen):

Wiki Dedup Report

High-Confidence Candidates (score ≥ 0.90): N pairs

Score	Page A	Page B	Verdict	Reason
0.95	`concepts/rsc.md`	`concepts/react-server-components.md`	merge	"RSC" is the abbreviation; both pages cover identical material
0.91	`entities/vaswani-2017.md`	`references/attention-is-all-you-need.md`	keep-separate	One is a person stub, one is a paper reference

Medium-Confidence Candidates (score 0.75–0.89): N pairs

Score	Page A	Page B	Verdict	Reason
0.82	`concepts/fine-tuning.md`	`concepts/finetuning.md`	merge	Same concept, hyphenation variant

Needs Human Review: N pairs

Score	Page A	Page B	Reason
0.78	`concepts/agents.md`	`concepts/autonomous-agents.md`	Substantial overlap but "agents" may intentionally be broader

Summary

Pages scanned: N
Candidate pairs found: M
Recommended merges: X
Keep separate: Y
Needs review: Z


In **Audit mode**, stop here and ask: "Run `--merge` to interactively merge the recommended pairs, or `--auto` to merge all high-confidence ones automatically?"

Step 5: Merge

For each `merge` verdict pair (in merge or auto-merge mode):

- In merge mode: show the pair and verdict, then ask: "Merge `[Page A]` into `[Page B]`? (yes/skip/review)". Skip on anything other than yes.
- In auto-merge mode: only process HIGH-confidence (`score ≥ 0.90`) merges without prompting.

5a: Pick the canonical page

Apply these tiebreakers in order until one wins:

1. More incoming wikilinks: grep the vault for `[[node_id]]` references; higher count wins
2. Richer content: longer page body (more lines) wins
3. More sources: larger `sources:` list wins
4. Title length: longer, more descriptive title wins (e.g. "React Server Components" beats "RSC")
5. Alphabetical: earlier title wins

The canonical page is the survivor. The other page becomes the secondary (to be merged in, then replaced with a redirect stub).
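One compact way to express the ordered tiebreakers is a sort key, as in this sketch. `PageStats` and its fields are hypothetical; the counts are assumed to have been gathered beforehand (for example by the grep described above).

```python
from dataclasses import dataclass

@dataclass
class PageStats:
    node_id: str
    title: str
    incoming_links: int
    body_lines: int
    source_count: int

def pick_canonical(a: PageStats, b: PageStats):
    """Return (canonical, secondary) using the tiebreakers in order."""
    def rank(p: PageStats):
        # Negated counts so that "more" sorts first; the title breaks the final tie.
        return (-p.incoming_links, -p.body_lines, -p.source_count,
                -len(p.title), p.title.lower())
    canonical, secondary = sorted([a, b], key=rank)
    return canonical, secondary
```

Because tuples compare element by element, a later tiebreaker is only consulted when all earlier ones are equal.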

5b: Merge content into the canonical page

Read both pages. Update the canonical page:

- `aliases:` add the secondary page's title and all its aliases (no duplicates)
- `tags:` merge both tag lists (deduplicate, cap at 5 domain tags + system tags)
- `sources:` merge both source lists (deduplicate)
- `relationships:` merge both relationship lists (deduplicate by target, prefer typed entries over untyped)
- `base_confidence`: recompute using the union of sources and the formula from `llm-wiki/SKILL.md`
- `updated`: set to now
- `summary:` rewrite to cover the merged scope if the secondary page added new ground
- Body content: merge unique sections and bullets from the secondary page. Do not blindly append; integrate the content. Avoid duplicating claims already present in the canonical page. Use `^[inferred]` markers where synthesis is needed.
- `provenance:` recompute after merging
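The list-valued frontmatter merges can be sketched as below. The relationship shape (`{"target": ..., "type": ...}`) is an assumption, as are the helper names. The tag cap, `base_confidence`, and `provenance` are deliberately left out because their rules live in `llm-wiki/SKILL.md`.

```python
def merge_unique(primary, secondary):
    """Concatenate two lists, dropping duplicates and keeping first-seen order."""
    return list(dict.fromkeys([*primary, *secondary]))

def merge_aliases(canonical: dict, secondary: dict):
    """Canonical aliases plus the secondary title and aliases, minus the canonical title."""
    merged = merge_unique(canonical.get("aliases", []),
                          [secondary["title"], *secondary.get("aliases", [])])
    return [a for a in merged if a != canonical["title"]]

def merge_relationships(*rel_lists):
    """Deduplicate by target, preferring an entry that carries a type."""
    by_target = {}
    for rel in (r for rels in rel_lists for r in rels):
        current = by_target.get(rel["target"])
        if current is None or (not current.get("type") and rel.get("type")):
            by_target[rel["target"]] = rel
    return list(by_target.values())
```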

5c: Write a redirect stub at the secondary page path

---
title: <secondary page title>
redirects_to: "[[<canonical node_id>]]"
aliases: [<secondary aliases>]
category: <secondary category>
tags: []
created: <secondary original created>
updated: <ISO timestamp now>
---

This page has been merged into [[<canonical page title>]].

The `redirects_to:` field tells any skill reading this page to follow the redirect rather than treat it as content.
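The stub template could be rendered as in this sketch. `render_redirect_stub` is a hypothetical helper; it assumes the secondary page's metadata is available as a `dict` and does not attempt YAML quoting of titles or aliases that contain special characters.

```python
from datetime import datetime, timezone

def render_redirect_stub(secondary: dict, canonical_node_id: str,
                         canonical_title: str, now=None) -> str:
    """Build the redirect stub text for the secondary page path."""
    now = now or datetime.now(timezone.utc)
    aliases = ", ".join(secondary.get("aliases", []))
    lines = [
        "---",
        f"title: {secondary['title']}",
        f'redirects_to: "[[{canonical_node_id}]]"',
        f"aliases: [{aliases}]",
        f"category: {secondary['category']}",
        "tags: []",
        f"created: {secondary['created']}",  # original creation date is preserved
        f"updated: {now.isoformat(timespec='seconds')}",
        "---",
        "",
        f"This page has been merged into [[{canonical_title}]].",
    ]
    return "\n".join(lines) + "\n"
```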

5d: Rewrite wikilinks vault-wide

Grep the entire vault for any link pointing at the secondary slug:

- `[[secondary-slug]]` → `[[canonical-slug]]`
- `[[secondary-slug|display text]]` → `[[canonical-slug|display text]]`
- If `OBSIDIAN_LINK_FORMAT=markdown`: `[text](../path/to/secondary.md)` → `[text](../path/to/canonical.md)`

Safety rules:

- Never rewrite inside code blocks (triple-backtick fences or `inline code`)
- Never rewrite inside the redirect stub itself (that's the one place the old slug should remain legible)
- Never use `rm` or destructive shell ops; only Edit/Write tools
- Rewrite one file at a time, verifying each before moving on
- If a file has zero occurrences, skip it
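The first safety rule is the easiest one to get wrong, so here is a minimal sketch of a rewrite that leaves fenced blocks and inline code untouched. It covers only the wikilink forms (plain and with display text), not markdown-style links, and `rewrite_wikilinks` is an illustrative name. It operates on a file's text and returns the new text with a count, so the caller can skip files with zero occurrences and verify each file before writing.

```python
import re

FENCE = "`" * 3  # marker that opens or closes a fenced code block

def rewrite_wikilinks(text: str, old: str, new: str):
    """Return (new_text, replacements), skipping fenced and inline code."""
    link = re.compile(r"\[\[" + re.escape(old) + r"(\|[^\]]*)?\]\]")
    out, count, in_fence = [], 0, False
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            out.append(line)
            continue
        if in_fence:
            out.append(line)
            continue
        # Odd-indexed parts are inline code spans; only even-indexed parts are rewritten.
        parts = re.split(r"(`[^`]*`)", line)
        for i in range(0, len(parts), 2):
            parts[i], n = link.subn(lambda m: f"[[{new}{m.group(1) or ''}]]", parts[i])
            count += n
        out.append("".join(parts))
    return "".join(out), count
```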

5e: Update tracking files

- `index.md`: Remove the secondary page's entry. Update the canonical page's entry with the merged summary.
- `.manifest.json`: For the secondary page's source entries, add `"merged_into": "<canonical node_id>"` to each. For the canonical page, merge in the secondary's `pages_created` and `pages_updated` lists.
- `hot.md`: Update Recent Activity: "Merged N duplicate pairs; canonical pages updated."

5f: Final check

After all merges, grep the vault for any remaining `[[secondary-slug]]` references (in non-stub files). If any survive, report them; the rewrite step may have missed a non-standard link format.

Step 6: Log

Append to `log.md`:

- [TIMESTAMP] DEDUP mode=audit|merge|auto-merge pages_scanned=N pairs_found=M merged=X kept_separate=Y needs_review=Z wikilinks_rewritten=W
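A formatter for that line might look like the following sketch. The ISO-8601 UTC timestamp is an assumption, since the source only says `TIMESTAMP`, and `format_log_entry` is a hypothetical name.

```python
from datetime import datetime, timezone

VALID_MODES = {"audit", "merge", "auto-merge"}

def format_log_entry(mode, pages_scanned, pairs_found, merged,
                     kept_separate, needs_review, wikilinks_rewritten, now=None):
    """Build the single DEDUP line to append to log.md."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    ts = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    return (f"- [{ts}] DEDUP mode={mode} pages_scanned={pages_scanned} "
            f"pairs_found={pairs_found} merged={merged} "
            f"kept_separate={kept_separate} needs_review={needs_review} "
            f"wikilinks_rewritten={wikilinks_rewritten}")
```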

Redirect Stub Handling

Other skills should handle redirect stubs as follows:

- `wiki-export`: skip pages with `redirects_to:` in frontmatter; they are not content nodes
- `wiki-query`: if a search hits a redirect stub, follow `redirects_to:` and read the canonical page instead
- `wiki-lint`: validate that every `redirects_to:` wikilink resolves to an existing, non-stub page (a redirect chain, i.e. a stub pointing to a stub, is an error)
- `cross-linker`: treat redirect stubs as non-targets; never add a new `[[wikilink]]` pointing at a stub page
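The follow-and-validate behavior described for `wiki-query` and `wiki-lint` can be sketched in a few lines. This assumes a `pages` mapping from `node_id` to parsed frontmatter; `resolve_redirect` is a hypothetical helper, not an existing function in any of those skills.

```python
def resolve_redirect(node_id: str, pages: dict) -> str:
    """Return the node_id to read: the page itself, or its redirect target."""
    target = pages[node_id].get("redirects_to")
    if not target:
        return node_id  # ordinary content page
    canonical = target.strip()
    if canonical.startswith("[[") and canonical.endswith("]]"):
        canonical = canonical[2:-2]
    if canonical not in pages:
        raise KeyError(f"{node_id} redirects to missing page {canonical}")
    if pages[canonical].get("redirects_to"):
        # A stub pointing at another stub is treated as an error, not followed.
        raise ValueError(f"redirect chain: {node_id} -> {canonical}")
    return canonical
```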

Tips

- **Audit first, always.** Even in auto-merge mode, the audit report is shown. Read it before trusting the results.
- **Check `needs-review` last.** These are the hard cases; don't batch them with obvious merges.
- **Abbreviations are the most common case.** "GPT" / "GPT-4" / "GPT4", "RSC" / "React Server Components", "LLM" / "Large Language Models": these score high on substring containment and are almost always safe to merge.
- **Different versions are not duplicates.** "GPT-3" and "GPT-4" are related but distinct. "fine-tuning" and "fine-tuning-llms" may be distinct (technique vs. specific application).
- **Run `cross-linker` after dedup.** The redirect stubs leave the graph in a slightly inconsistent state. Cross-linker will tighten it up.
