data-ingest

# Data Ingest — Universal Text Source Handler

You are ingesting arbitrary text data into an Obsidian wiki. The source could be anything — conversation exports, log files, transcripts, data dumps. Your job is to figure out the format, extract knowledge, and distill it into wiki pages.

## Before You Start

1. Read `.env` to get `OBSIDIAN_VAULT_PATH`
2. Read `.manifest.json` at the vault root — check if this source has been ingested before
3. Read `index.md` at the vault root to know what already exists

If the source path is already in `.manifest.json` and the file hasn't been modified since `ingested_at`, tell the user it's already been ingested. Ask if they want to re-ingest anyway.

## Step 1: Identify the Source Format

Read the file(s) the user points you at. Common formats you'll encounter:

| Format | How to identify | How to read |
| --- | --- | --- |
| JSON / JSONL | `.json` / `.jsonl` extension, starts with `{` or `[` | Parse with Read tool, look for message/content fields |
| Markdown | `.md` extension | Read directly |
| Plain text | `.txt` extension or no extension | Read directly |
| CSV / TSV | `.csv` / `.tsv`, comma or tab separated | Parse rows, identify columns |
| HTML | `.html`, starts with `<` | Extract text content, ignore markup |
| Chat export | Varies — look for turn-taking patterns (user/assistant, human/ai, timestamps) | Extract the dialogue turns |
| Images | `.png` / `.jpg` / `.jpeg` / `.webp` / `.gif` | Requires a vision-capable model. Use the Read tool — it renders images into your context. Screenshots, whiteboards, diagrams all qualify. Models without vision support should skip and report which files were skipped. |

### Common Chat Export Formats

**ChatGPT export** (`conversations.json`):

```json
[{"title": "...", "mapping": {"node-id": {"message": {"role": "user", "content": {"parts": ["text"]}}}}}]
```

**Slack export** (directory of JSON files per channel):

```json
[{"user": "U123", "text": "message", "ts": "1234567890.123456"}]
```

**Generic chat log** (timestamped text):

```
[2024-03-15 10:30] User: message here
[2024-03-15 10:31] Bot: response here
```

Don't try to handle every format upfront — read the actual data, figure out the structure, and adapt.

### Images and visual sources

When the user dumps a folder of screenshots, whiteboard photos, or diagram exports, treat each image as a source:

- Use the Read tool on the image path — it will render the image into context.
- Transcribe any visible text verbatim (this is the only extracted content from an image).
- Describe structure: for diagrams, list nodes/edges; for screenshots, name the app and what's on screen.
- Extract the concepts the image conveys — what's it about? Most of this is `^[inferred]`.
- Flag anything you can't read, can't identify, or are guessing at with `^[ambiguous]`.

Image-derived pages will skew heavily inferred — that's expected and the provenance markers will reflect it. Set `source_type: "image"` in the manifest entry. Skip files with EXIF-only changes (re-saved with no visual diff) — compare via the standard delta logic.

For folders of mixed images (e.g. a screenshot timeline of a debugging session), cluster by visible topic rather than per-file. Twenty screenshots of the same UI bug should produce one wiki page, not twenty.

## Step 2: Extract Knowledge

Regardless of format, extract the same things:

- Topics discussed — what subjects come up?
- Decisions made — what was concluded or decided?
- Facts learned — what concrete information is stated?
- Procedures described — how-to knowledge, workflows, steps
- Entities mentioned — people, tools, projects, organizations
- Connections — how do topics relate to each other and to existing wiki content?

For conversation data specifically:

Focus on the substance, not the dialogue. A 50-message debugging session might yield one skills page about the fix. A long brainstorming chat might yield three concept pages.
Skip:

- Greetings, pleasantries, meta-conversation ("can you help me with...")
- Repetitive back-and-forth that doesn't add new information
- Raw code dumps (unless they illustrate a reusable pattern)

## Step 3: Cluster and Deduplicate

Before creating pages:

- Group extracted knowledge by topic (not by source file or conversation)
- Check existing wiki pages — does this knowledge belong on an existing page?
- Merge overlapping information from multiple sources
- Note contradictions between sources

## Step 4: Distill into Wiki Pages

Follow the wiki-ingest skill's process for creating/updating pages:

- Use correct category directories (`concepts/`, `entities/`, `skills/`, etc.)
- Add YAML frontmatter with title, category, tags, sources
- Use `[[wikilinks]]` to connect to existing pages
- Attribute claims to their source
- Write a `summary:` frontmatter field on every new page (1–2 sentences, ≤200 characters) answering "what is this page about?" — this is what downstream skills read to avoid opening the page body.
- Apply provenance markers per the convention in llm-wiki. Conversation, log, and chat data tend to be high-inference — you're often reading between the turns to extract a coherent claim. Be liberal with `^[inferred]` for synthesized patterns and with `^[ambiguous]` when speakers contradict each other or you're unsure who's right. Write a `provenance:` frontmatter block on each new/updated page.
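A finished page's frontmatter might look like the sketch below. The field names come from the list above; the page title, tags, and the exact shape of the `provenance:` block are illustrative guesses, so follow the llm-wiki convention where it differs:

```yaml
---
title: Retry Backoff Strategy        # hypothetical page
category: concepts
tags: [networking, reliability]
sources:
  - chat-export-2024-03-15.json
summary: "Why the team settled on exponential backoff with jitter for API retries."
provenance:                          # block shape is a guess; see llm-wiki
  inferred: 3
  ambiguous: 1
---
```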

## Step 5: Update Manifest and Special Files

`.manifest.json` — Add an entry for each source file processed:

```json
{
  "ingested_at": "TIMESTAMP",
  "size_bytes": FILE_SIZE,
  "modified_at": FILE_MTIME,
  "source_type": "data",  // or "image" for png/jpg/webp/gif sources
  "project": "project-name-or-null",
  "pages_created": ["list/of/pages.md"],
  "pages_updated": ["list/of/pages.md"]
}
```

`index.md` and `log.md`:

```
- [TIMESTAMP] DATA_INGEST source="path/to/data" format=FORMAT pages_updated=X pages_created=Y
```

## Tips

- **When in doubt about format, just read it.** The Read tool will show you what you're dealing with.
- **Large files:** Read in chunks using offset/limit. Don't try to load a 10MB JSON in one go.
- **Multiple files:** Process them in order, building up wiki pages incrementally.
- **Binary files:** Skip them, except images — those are first-class sources via the Read tool's vision support.
- **Encoding issues:** If you see garbled text, mention it to the user and move on.