wiki-dedup
Wiki Dedup: Identity Resolution and Page-Level Deduplication
You are finding and merging wiki pages that cover the same concept under different names. This is a write-heavy, potentially destructive skill — page merges cannot be automatically undone. Work carefully and confirm before acting in merge mode.
Follow the Retrieval Primitives table in `llm-wiki/SKILL.md`. The candidate-detection pass uses only frontmatter and titles (cheap). Only open full page bodies for confirmed candidate pairs.
Before You Start
- Resolve config: follow the Config Resolution Protocol in `llm-wiki/SKILL.md` (walk up CWD for `.env` → `~/.obsidian-wiki/config` → prompt setup). This gives `OBSIDIAN_VAULT_PATH` and `OBSIDIAN_LINK_FORMAT`.
- Read `index.md` to get the full page inventory with one-line descriptions and tags.
- Read `log.md` briefly. If a dedup run just happened, note what was already merged.
Modes
| Mode | Flag | Behavior |
|---|---|---|
| Audit | (default) | Report candidates only — no writes |
| Merge | `--merge` | Show each confirmed pair, ask for confirmation before merging |
| Auto-merge | `--auto` | Merge all high-confidence pairs (`score ≥ 0.90`) without prompting |
If the user doesn't specify, run in Audit mode and present findings before asking whether to proceed.
Step 1: Build the Page Registry
Glob all `.md` files in the vault (excluding `_archives/`, `_raw/`, `.obsidian/`, `index.md`, `log.md`, `hot.md`, `_insights.md`, and any file that contains `redirects_to:` in its frontmatter; those are already-merged redirect stubs).
For each remaining page, extract from frontmatter:
- `node_id`: relative path from vault root, without `.md`
- `title`: frontmatter `title` field
- `aliases`: frontmatter `aliases` list (may be absent)
- `tags`: frontmatter `tags` list
- `category`: directory prefix
Build a lookup table: `node_id → {title, aliases, tags, category, summary}`.
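The registry step can be sketched as follows. This is a minimal illustration, assuming frontmatter has already been parsed into dictionaries by whatever YAML reader is available; the function name `build_registry`, the input shape, and the choice to treat the excluded files as living at the vault root are all assumptions, not part of the skill.

```python
from pathlib import PurePosixPath

# Directories and files excluded from the registry, per the rules above.
EXCLUDED_DIRS = {"_archives", "_raw", ".obsidian"}
# Assumption: these tracking files live at the vault root.
EXCLUDED_FILES = {"index.md", "log.md", "hot.md", "_insights.md"}


def build_registry(pages: dict[str, dict]) -> dict[str, dict]:
    """Map node_id -> metadata for every non-excluded, non-stub page.

    `pages` maps a vault-relative path (e.g. "concepts/rsc.md") to its
    already-parsed frontmatter dict. This input shape is hypothetical.
    """
    registry = {}
    for rel_path, fm in pages.items():
        path = PurePosixPath(rel_path)
        if path.suffix != ".md":
            continue
        if path.parts[0] in EXCLUDED_DIRS or str(path) in EXCLUDED_FILES:
            continue
        if "redirects_to" in fm:  # already-merged redirect stub
            continue
        node_id = str(path.with_suffix(""))
        registry[node_id] = {
            "title": fm.get("title", ""),
            "aliases": fm.get("aliases") or [],  # may be absent
            "tags": fm.get("tags") or [],
            "category": path.parts[0] if len(path.parts) > 1 else "",
            "summary": fm.get("summary", ""),
        }
    return registry
```

Only frontmatter is touched here, which keeps the pass cheap; page bodies are not read until a pair is confirmed as a candidate.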
Step 2: Detect Candidate Pairs
For every pair of pages in the registry, compute a similarity score using these signals:
2a. Title similarity signals
| Signal | How to assess | Max contribution |
|---|---|---|
| Token overlap | Jaccard similarity of lowercased title word-tokens (split on spaces, hyphens, underscores, punctuation) | 0.65 |
| Edit distance | Normalized edit distance on lowercased titles: | 0.40 |
| Substring containment | One title is a substring of the other (e.g. "RSC" ⊂ "React Server Components") | 0.50 |
| Alias cross-match | Page A's title appears in page B's `aliases` | 0.65 |
Composite title score = `min(max(token_overlap, edit_distance, substring), 0.65) + alias_cross_bonus`.
You don't need exact arithmetic; make a confident judgement about the degree of similarity.
Title extraction note: Some pages use YAML block scalars (`title: >-` or `title: |`). When the `title:` value is `>`, `>-`, `|`, or `|-`, the actual title is on the next indented line, so read it from there. Never compare the literal string `>-` as a title.
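For readers who want to see the composite formula spelled out, here is a rough sketch. Several choices are assumptions rather than anything the skill specifies: each raw signal is taken to lie in [0, 1] and is scaled by its maximum contribution from the table; the edit-distance normalization is `1 - distance / longer_length`; an acronym match (initials of the longer title) counts as containment so that the "RSC" example qualifies; and the alias check runs in both directions. All function names are hypothetical.

```python
import re


def words(title: str) -> list[str]:
    """Lowercased word-tokens, split on spaces, hyphens, underscores, punctuation."""
    return [t for t in re.split(r"[\W_]+", title.lower()) if t]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def is_contained(ta: str, tb: str) -> bool:
    """True if the shorter title is a substring of the longer one, or its initials."""
    short, long_ = sorted((ta, tb), key=len)
    if not short:
        return False
    initials = "".join(w[0] for w in words(long_))  # assumption: acronyms count
    return short in long_ or short == initials


def title_score(a: dict, b: dict) -> float:
    """Composite title score for two registry entries with 'title' and 'aliases'."""
    ta, tb = a["title"].lower(), b["title"].lower()
    wa, wb = set(words(ta)), set(words(tb))

    jaccard = len(wa & wb) / len(wa | wb) if wa | wb else 0.0
    longest = max(len(ta), len(tb))
    edit_sim = 1 - levenshtein(ta, tb) / longest if longest else 0.0  # assumed form
    substring = 1.0 if is_contained(ta, tb) else 0.0

    # Assumption: scale each raw signal by its max contribution, then cap at 0.65.
    base = min(max(0.65 * jaccard, 0.40 * edit_sim, 0.50 * substring), 0.65)

    aliases_a = {x.lower() for x in a.get("aliases", [])}
    aliases_b = {x.lower() for x in b.get("aliases", [])}
    alias_bonus = 0.65 if ta in aliases_b or tb in aliases_a else 0.0
    return base + alias_bonus
```

Since exact arithmetic is not required, a function like this is best read as a way to calibrate judgement, not as a scoring engine to run verbatim.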
2b. Semantic signals (cheap pass)
| Signal | Points |
|---|---|
| Same `category` | +0.10 |
| Tag overlap ≥ 3 shared tags | +0.15 |
| Tag overlap ≥ 2 shared tags | +0.05 |
| Same first tag (dominant tag) | +0.05 |
2c. Threshold
Flag pairs with composite score ≥ 0.75 as candidates. Pairs scoring 0.90+ are high-confidence.
Score ranges → confidence labels:
| Score | Label |
|---|---|
| ≥ 0.90 | HIGH — almost certainly the same concept |
| 0.75–0.89 | MEDIUM — likely the same, verify |
| 0.60–0.74 | LOW — possible abbreviation or specialisation; skip unless user asks |
Only carry HIGH and MEDIUM candidates into Step 3.
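The semantic bonus and the score-to-label mapping can be sketched like this. Two readings here are assumptions: the first signal row is interpreted as "same `category`", and the two tag-overlap tiers are treated as non-stacking (only the highest tier that applies is counted). Function names are hypothetical.

```python
from typing import Optional


def semantic_bonus(a: dict, b: dict) -> float:
    """Bonus points from the cheap semantic signals in 2b."""
    bonus = 0.0
    # Assumption: the first row of the table means "same category".
    if a["category"] and a["category"] == b["category"]:
        bonus += 0.10
    shared = len(set(a["tags"]) & set(b["tags"]))
    # Assumption: tag-overlap tiers do not stack; only the highest applies.
    if shared >= 3:
        bonus += 0.15
    elif shared >= 2:
        bonus += 0.05
    # Dominant tag: the first tag in each list.
    if a["tags"] and b["tags"] and a["tags"][0] == b["tags"][0]:
        bonus += 0.05
    return bonus


def confidence_label(score: float) -> Optional[str]:
    """Map a composite score to a label; None means the pair is not a candidate."""
    if score >= 0.90:
        return "HIGH"
    if score >= 0.75:
        return "MEDIUM"
    if score >= 0.60:
        return "LOW"
    return None
```

A caller would add `semantic_bonus` to the title score, then keep only pairs labelled `HIGH` or `MEDIUM` for the full-page read in Step 3.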
2d. Quick exit rule
If the vault has fewer than 10 pages, skip the pair loop and report "vault too small to have meaningful duplicates". If the vault has more than 500 pages, process candidates in batches of 50 pairs — pause and report progress between batches.
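One way to express the quick-exit rule is a small generator, sketched below. The name `plan_batches` and the constants are hypothetical; pausing and progress reporting are left to the caller.

```python
from itertools import islice
from typing import Iterable, Iterator

MIN_PAGES = 10      # below this, skip the pair loop entirely
LARGE_VAULT = 500   # above this, process candidates in batches
BATCH_SIZE = 50     # candidate pairs per batch


def plan_batches(page_count: int, candidates: Iterable[tuple]) -> Iterator[list[tuple]]:
    """Yield candidate pairs in batches according to the quick-exit rule."""
    if page_count < MIN_PAGES:
        return  # caller reports "vault too small to have meaningful duplicates"
    it = iter(candidates)
    if page_count <= LARGE_VAULT:
        batch = list(it)
        if batch:
            yield batch  # small vault: everything in a single batch
        return
    while batch := list(islice(it, BATCH_SIZE)):
        yield batch  # caller pauses and reports progress between batches
```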
Step 3: Semantic Verdict
For each candidate pair (sorted by score descending):
- Read both pages in full (full page read — justified because candidate pool is small).
- Ask: are these pages covering the same concept, or are they distinct?
Assign one of three verdicts:
| Verdict | Meaning |
|---|---|
| `merge` | Same concept: different name, abbreviation, alias, or accidental duplicate. Safe to merge. |
| `keep-separate` | Related but distinct, e.g. "Server Actions" vs "Server Components" are related React features, not duplicates. |
| `needs-review` | Ambiguous: substantial overlap but also meaningful differences. Flag for the user to decide. |
Attach a short reason to each verdict (one sentence). This appears in the report and the log.
Step 4: Audit Report
Always produce this report, even in merge/auto-merge mode (so the user sees what will happen):
```markdown
Wiki Dedup Report
High-Confidence Candidates (score ≥ 0.90): N pairs
| Score | Page A | Page B | Verdict | Reason |
|---|---|---|---|---|
| 0.95 | | | merge | "RSC" is the abbreviation; both pages cover identical material |
| 0.91 | | | keep-separate | One is a person stub, one is a paper reference |
Medium-Confidence Candidates (score 0.75–0.89): N pairs
| Score | Page A | Page B | Verdict | Reason |
|---|---|---|---|---|
| 0.82 | | | merge | Same concept, hyphenation variant |
Needs Human Review: N pairs
| Score | Page A | Page B | Reason |
|---|---|---|---|
| 0.78 | | | Substantial overlap but "agents" may intentionally be broader |
Summary
- Pages scanned: N
- Candidate pairs found: M
- Recommended merges: X
- Keep separate: Y
- Needs review: Z
```
In **Audit mode**, stop here and ask: "Run `--merge` to interactively merge the recommended pairs, or `--auto` to merge all high-confidence ones automatically?"
Step 5: Merge
For each `merge` verdict pair (in merge or auto-merge mode):
In merge mode: show the pair and verdict, then ask: "Merge `[Page A]` into `[Page B]`? (yes/skip/review)". Skip on anything other than yes.
In auto-merge mode: only process HIGH-confidence (`score ≥ 0.90`) merges, without prompting.
5a: Pick the canonical page
Apply these tiebreakers in order until one wins:
- More incoming wikilinks: grep the vault for `[[node_id]]` references; higher count wins
- Richer content: longer page body (more lines) wins
- More sources: larger `sources:` list wins
- Title length: longer, more descriptive title wins (e.g. "React Server Components" beats "RSC")
- Alphabetical: earlier title wins
The canonical page is the survivor. The other page becomes the secondary (to be merged in, then replaced with a redirect stub).
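The ordered tiebreakers map naturally onto tuple comparison. The sketch below assumes each page dict already carries precomputed `incoming_links` and `body_lines` counts alongside `sources` and `title`; those field names and `pick_canonical` itself are hypothetical.

```python
def pick_canonical(a: dict, b: dict) -> tuple[dict, dict]:
    """Return (canonical, secondary) using the ordered tiebreakers.

    Each page dict is assumed to carry 'incoming_links' (int),
    'body_lines' (int), 'sources' (list) and 'title' (str).
    """
    def rank(page: dict) -> tuple:
        # Tuples compare left to right, which mirrors "apply in order".
        # Counts are negated so that larger values sort first.
        return (
            -page["incoming_links"],   # more incoming wikilinks wins
            -page["body_lines"],       # longer body wins
            -len(page["sources"]),     # more sources wins
            -len(page["title"]),       # longer title wins
            page["title"].lower(),     # alphabetical: earlier title wins
        )

    canonical, secondary = sorted((a, b), key=rank)
    return canonical, secondary
```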
5b: Merge content into the canonical page
Read both pages. Update the canonical page:
- `aliases:` add secondary page's title and all its aliases (no duplicates)
- `tags:` merge both tag lists (deduplicate, cap at 5 domain tags + system tags)
- `sources:` merge both source lists (deduplicate)
- `relationships:` merge both relationship lists (deduplicate by target, prefer typed entries over untyped)
- `base_confidence`: recompute using the union of sources and the formula from `llm-wiki/SKILL.md`
- `updated`: set to now
- `summary:` rewrite to cover the merged scope if the secondary page added new ground
- Body content: merge unique sections and bullets from the secondary page. Do not blindly append; integrate the content. Avoid duplicating claims already present in the canonical page. Use `^[inferred]` markers where synthesis is needed.
- `provenance:` recompute after merging
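The list-valued frontmatter fields could be merged along these lines. This is only a sketch: relationship entries are assumed to be dicts with a `target` key and an optional `type` key, and dropping the canonical page's own title from its aliases is an added assumption. The tag cap, `base_confidence`, `summary`, `provenance`, and the body merge are deliberately left out because they depend on rules defined elsewhere or on editorial judgement.

```python
def merge_unique(primary: list, extra: list) -> list:
    """Concatenate two lists, dropping duplicates while keeping first-seen order."""
    seen, out = set(), []
    for item in [*primary, *extra]:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def merge_relationships(primary: list[dict], extra: list[dict]) -> list[dict]:
    """Deduplicate by 'target', preferring entries that carry a 'type'."""
    by_target: dict[str, dict] = {}
    for rel in [*primary, *extra]:
        current = by_target.get(rel["target"])
        if current is None or (not current.get("type") and rel.get("type")):
            by_target[rel["target"]] = rel
    return list(by_target.values())


def merge_frontmatter(canonical: dict, secondary: dict) -> dict:
    """Return a copy of the canonical frontmatter with list fields merged in."""
    merged = dict(canonical)
    aliases = merge_unique(
        canonical.get("aliases", []),
        [secondary["title"], *secondary.get("aliases", [])],
    )
    # Assumption: a page should not list its own title as an alias.
    merged["aliases"] = [a for a in aliases if a != canonical["title"]]
    merged["tags"] = merge_unique(canonical.get("tags", []), secondary.get("tags", []))
    merged["sources"] = merge_unique(
        canonical.get("sources", []), secondary.get("sources", [])
    )
    merged["relationships"] = merge_relationships(
        canonical.get("relationships", []), secondary.get("relationships", [])
    )
    return merged
```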
5c: Write a redirect stub at the secondary page path
```markdown
---
title: <secondary page title>
redirects_to: "[[<canonical node_id>]]"
aliases: [<secondary aliases>]
category: <secondary category>
tags: []
created: <secondary original created>
updated: <ISO timestamp now>
---
This page has been merged into [[<canonical page title>]].
```
The `redirects_to:` field tells any skill reading this page to follow the redirect rather than treat it as content.
5d: Rewrite wikilinks vault-wide
Grep the entire vault for any link pointing at the secondary slug:
- `[[secondary-slug]]` → `[[canonical-slug]]`
- `[[secondary-slug|display text]]` → `[[canonical-slug|display text]]`
- If `OBSIDIAN_LINK_FORMAT=markdown`: `[text](../path/to/secondary.md)` → `[text](../path/to/canonical.md)`
Safety rules:
- Never rewrite inside code blocks (``` fences or `inline code`)
- Never rewrite inside the redirect stub itself (that's the one place the old slug should remain legible)
- Never use `rm` or destructive shell ops; use only Edit/Write tools
- Rewrite one file at a time, verifying each before moving on
- If a file has zero occurrences, skip it
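The wikilink rewrite with the code-block safety rule might look like the sketch below. It covers only the two `[[...]]` forms; markdown-format links and the "skip the stub itself" rule are left to the caller. The function name is hypothetical, and the regex treats any triple-backtick span or single-backtick span as code, which is a simplification of real Markdown parsing.

```python
import re

FENCE = "`" * 3  # triple-backtick fence marker
# Fenced blocks or inline code spans; these segments are left untouched.
CODE_RE = re.compile("(" + FENCE + ".*?" + FENCE + r"|`[^`\n]*`)", re.DOTALL)


def rewrite_wikilinks(text: str, old_slug: str, new_slug: str) -> tuple[str, int]:
    """Replace [[old]] and [[old|text]] with the new slug outside code.

    Returns the rewritten text and the number of links changed, so the
    caller can skip files with zero occurrences.
    """
    link_re = re.compile(r"\[\[" + re.escape(old_slug) + r"(\|[^\]]*)?\]\]")
    count = 0

    def swap(match: re.Match) -> str:
        nonlocal count
        count += 1
        return f"[[{new_slug}{match.group(1) or ''}]]"

    parts = CODE_RE.split(text)
    # With one capturing group, odd indexes hold the code segments.
    for i in range(0, len(parts), 2):
        parts[i] = link_re.sub(swap, parts[i])
    return "".join(parts), count
```

The returned count can feed the `wikilinks_rewritten` figure in the log and doubles as a per-file verification aid.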
5e: Update tracking files
`index.md`, `.manifest.json`, `"merged_into": "<canonical node_id>"`, `pages_created`, `pages_updated`, `hot.md`
5f: Final check
After all merges, grep the vault for any remaining `[[secondary-slug]]` references (in non-stub files). If any survive, report them; the rewrite step may have missed a non-standard link format.
Step 6: Log
Append to `log.md`:
- [TIMESTAMP] DEDUP mode=audit|merge|auto-merge pages_scanned=N pairs_found=M merged=X kept_separate=Y needs_review=Z wikilinks_rewritten=W
Redirect Stub Handling
Other skills should handle redirect stubs as follows:
- `wiki-export`: skip pages with `redirects_to:` in frontmatter; they are not content nodes
- `wiki-query`: if a search hits a redirect stub, follow `redirects_to:` and read the canonical page instead
- `wiki-lint`: validate that every `redirects_to:` wikilink resolves to an existing, non-stub page (a redirect chain, i.e. a stub pointing to a stub, is an error)
- `cross-linker`: treat redirect stubs as non-targets; never add a new `[[wikilink]]` pointing at a stub page
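A consuming skill could follow and validate redirects roughly as follows. This is a sketch under the assumption that pages are available as a `node_id → frontmatter` mapping; `redirect_target` and `resolve` are hypothetical helper names, and the choice of exception types is illustrative.

```python
import re
from typing import Optional

# Captures the target inside "[[target]]" or "[[target|label]]".
WIKILINK_RE = re.compile(r"\[\[([^\]|]+)")


def redirect_target(frontmatter: dict) -> Optional[str]:
    """Return the node_id a stub points at, or None for a content page."""
    value = frontmatter.get("redirects_to")
    if not value:
        return None
    match = WIKILINK_RE.search(value)
    return match.group(1) if match else None


def resolve(node_id: str, pages: dict[str, dict]) -> str:
    """Follow one redirect hop; raise on a missing target or a stub-to-stub chain."""
    target = redirect_target(pages[node_id])
    if target is None:
        return node_id  # ordinary content page
    if target not in pages:
        raise LookupError(f"{node_id} redirects to missing page {target}")
    if redirect_target(pages[target]) is not None:
        raise ValueError(f"redirect chain: {node_id} -> {target} is also a stub")
    return target
```

`wiki-query` would call `resolve` before reading a page, while `wiki-lint` could run it over every stub and report the raised errors.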
Tips
- Audit first, always. Even in auto-merge mode, the audit report is shown. Read it before trusting the results.
- Check `needs-review` last. These are the hard cases; don't batch them with obvious merges.
- Abbreviations are the most common case. "GPT" / "GPT-4" / "GPT4", "RSC" / "React Server Components", "LLM" / "Large Language Models": these score high on substring containment and are almost always safe to merge.
- Different versions are not duplicates. "GPT-3" and "GPT-4" are related but distinct. "fine-tuning" and "fine-tuning-llms" may be distinct (technique vs. specific application).
- Run `cross-linker` after dedup. The redirect stubs leave the graph in a slightly inconsistent state. Cross-linker will tighten it up.