data-ingest
Data Ingest — Universal Text Source Handler
You are ingesting arbitrary text data into an Obsidian wiki. The source could be anything — conversation exports, log files, transcripts, data dumps. Your job is to figure out the format, extract knowledge, and distill it into wiki pages.
Before You Start
- Read `.env` to get `OBSIDIAN_VAULT_PATH`
- Read `.manifest.json` at the vault root — check if this source has been ingested before
- Read `index.md` at the vault root to know what already exists

If the source path is already in `.manifest.json` and the file hasn't been modified since `ingested_at`, tell the user it's already been ingested. Ask if they want to re-ingest anyway.
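The re-ingest check can be sketched as follows. It assumes `.manifest.json` maps source paths to their entries, which is one plausible layout, not a confirmed schema:

```python
import json
import os

def needs_ingest(manifest_path: str, source_path: str) -> bool:
    """Return True if the source is new or has changed since the recorded ingest.

    Assumes .manifest.json maps source paths to entries that carry a
    modified_at mtime, matching the Step 5 entry shape.
    """
    try:
        with open(manifest_path) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        return True  # no manifest yet: everything is new
    entry = manifest.get(source_path)
    if entry is None:
        return True  # never ingested before
    # Re-ingest only if the file changed after the recorded modification time
    return os.path.getmtime(source_path) > entry.get("modified_at", 0)
```

If this returns `False`, report that the source was already ingested and ask before re-ingesting.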
Step 1: Identify the Source Format
Read the file(s) the user points you at. Common formats you'll encounter:
| Format | How to identify | How to read |
|---|---|---|
| JSON / JSONL | `.json` / `.jsonl` extension | Parse with Read tool, look for message/content fields |
| Markdown | `.md` extension | Read directly |
| Plain text | `.txt` extension | Read directly |
| CSV / TSV | `.csv` / `.tsv` extension | Parse rows, identify columns |
| HTML | `.html` extension | Extract text content, ignore markup |
| Chat export | Varies — look for turn-taking patterns (user/assistant, human/ai, timestamps) | Extract the dialogue turns |
| Images | `.png` / `.jpg` / `.webp` / `.gif` extension | Requires a vision-capable model. Use the Read tool — it renders images into your context. Screenshots, whiteboards, diagrams all qualify. Models without vision support should skip and report which files were skipped. |
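A minimal sniffer matching the table, trying the extension first and falling back to content. The extension map is an assumption drawn from the format names, not an exhaustive list:

```python
from pathlib import Path

def sniff_format(path: str) -> str:
    """Best-effort format guess: extension first, then a peek at the content."""
    ext = Path(path).suffix.lower()
    by_ext = {
        ".json": "json", ".jsonl": "jsonl", ".md": "markdown",
        ".txt": "text", ".csv": "csv", ".tsv": "tsv",
        ".html": "html", ".htm": "html",
        ".png": "image", ".jpg": "image", ".jpeg": "image",
        ".webp": "image", ".gif": "image",
    }
    if ext in by_ext:
        return by_ext[ext]
    # Unknown extension: sniff the first couple of KB of text
    head = Path(path).read_text(errors="replace")[:2048].lstrip()
    if head.startswith(("{", "[")):
        return "json"
    if head.startswith(("<!DOCTYPE", "<html")):
        return "html"
    return "text"
```

When the guess is wrong, the content itself settles it: read the file and adapt.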
Common Chat Export Formats
ChatGPT export (`conversations.json`):

```json
[{"title": "...", "mapping": {"node-id": {"message": {"role": "user", "content": {"parts": ["text"]}}}}}]
```

Slack export (directory of JSON files per channel):

```json
[{"user": "U123", "text": "message", "ts": "1234567890.123456"}]
```

Generic chat log (timestamped text):

```
[2024-03-15 10:30] User: message here
[2024-03-15 10:31] Bot: response here
```

Don't try to handle every format upfront — read the actual data, figure out the structure, and adapt.
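For the ChatGPT shape shown above, a sketch that flattens the `mapping` into dialogue turns. Real exports vary (roles may sit under an `author` key, for instance), so treat this as a starting point and adapt after reading the actual data:

```python
import json

def chatgpt_turns(path: str):
    """Yield (title, role, text) turns from a conversations.json export.

    Assumes the mapping structure shown above; skips empty nodes.
    """
    with open(path) as f:
        conversations = json.load(f)
    for convo in conversations:
        for node in convo.get("mapping", {}).values():
            msg = node.get("message")
            if not msg:
                continue  # root/system nodes often have no message
            parts = (msg.get("content") or {}).get("parts") or []
            text = "\n".join(p for p in parts if isinstance(p, str)).strip()
            if text:
                yield convo.get("title", ""), msg["role"], text
```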
Images and visual sources
When the user dumps a folder of screenshots, whiteboard photos, or diagram exports, treat each image as a source:
- Use the Read tool on the image path — it will render the image into context.
- Transcribe any visible text verbatim (this is the only extracted content from an image).
- Describe structure: for diagrams, list nodes/edges; for screenshots, name the app and what's on screen.
- Extract the concepts the image conveys — what's it about? Most of this is `^[inferred]`.
- Flag anything you can't read, can't identify, or are guessing at with `^[ambiguous]`.
Image-derived pages will skew heavily inferred — that's expected and the provenance markers will reflect it. Set `source_type: "image"` in the manifest entry. Skip files with EXIF-only changes (re-saved with no visual diff) — compare via the standard delta logic.

For folders of mixed images (e.g. a screenshot timeline of a debugging session), cluster by visible topic rather than per-file. Twenty screenshots of the same UI bug should produce one wiki page, not twenty.
Step 2: Extract Knowledge
Regardless of format, extract the same things:
- Topics discussed — what subjects come up?
- Decisions made — what was concluded or decided?
- Facts learned — what concrete information is stated?
- Procedures described — how-to knowledge, workflows, steps
- Entities mentioned — people, tools, projects, organizations
- Connections — how do topics relate to each other and to existing wiki content?
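A lightweight container for holding these extractions before clustering. The field names are illustrative, not a required schema:

```python
from dataclasses import dataclass, field

@dataclass
class Extraction:
    """One unit of knowledge pulled from a source, prior to clustering."""
    kind: str                # "topic", "decision", "fact", "procedure", "entity", "connection"
    topic: str               # subject this belongs under
    claim: str               # the distilled statement itself
    source: str              # file or conversation it came from
    inferred: bool = False   # True when read between the lines rather than stated
    links: list = field(default_factory=list)  # related topics or existing wiki pages
```

Keeping `source` and `inferred` on every record makes the Step 4 attribution and provenance markers mechanical rather than reconstructive.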
For conversation data specifically:
Focus on the substance, not the dialogue. A 50-message debugging session might yield one skills page about the fix. A long brainstorming chat might yield three concept pages.
Skip:
- Greetings, pleasantries, meta-conversation ("can you help me with...")
- Repetitive back-and-forth that doesn't add new information
- Raw code dumps (unless they illustrate a reusable pattern)
Step 3: Cluster and Deduplicate
Before creating pages:
- Group extracted knowledge by topic (not by source file or conversation)
- Check existing wiki pages — does this knowledge belong on an existing page?
- Merge overlapping information from multiple sources
- Note contradictions between sources
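The grouping step can be sketched as follows, assuming extractions arrive as `(topic, claim, source)` tuples (a simplification of whatever richer records you actually carry):

```python
from collections import defaultdict

def cluster_by_topic(items):
    """Group (topic, claim, source) tuples by normalized topic.

    Merges duplicate claims and accumulates every contributing source,
    so multi-source pages keep full attribution.
    """
    clusters = defaultdict(lambda: {"claims": [], "sources": set()})
    for topic, claim, source in items:
        bucket = clusters[topic.lower().strip()]
        if claim not in bucket["claims"]:
            bucket["claims"].append(claim)  # skip exact duplicates
        bucket["sources"].add(source)
    return clusters
```

Contradiction detection stays a judgment call: claims in the same bucket that disagree should be kept side by side and flagged, not silently merged.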
Step 4: Distill into Wiki Pages
Follow the `wiki-ingest` skill's process for creating/updating pages:

- Use correct category directories (`concepts/`, `skills/`, `entities/`, etc.)
- Add YAML frontmatter with title, category, tags, sources
- Use `[[wikilinks]]` to connect to existing pages
- Attribute claims to their source
- Write a `summary:` frontmatter field on every new page (1–2 sentences, ≤200 characters) answering "what is this page about?" — this is what downstream skills read to avoid opening the page body.
- Apply provenance markers per the convention in `llm-wiki`. Conversation, log, and chat data tend to be high-inference — you're often reading between the turns to extract a coherent claim. Be liberal with `^[inferred]` for synthesized patterns and with `^[ambiguous]` when speakers contradict each other or you're unsure who's right. Write a `provenance:` frontmatter block on each new/updated page.
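A sketch of rendering a page in this shape. The exact provenance block layout should follow the `llm-wiki` convention; the keys used here are placeholders:

```python
def render_page(title, category, tags, sources, summary, body, provenance):
    """Render a wiki page with YAML frontmatter in the Step 4 shape.

    provenance is a flat dict; its keys here are illustrative, not the
    canonical llm-wiki schema.
    """
    lines = ["---", f"title: {title}", f"category: {category}"]
    lines.append("tags: [" + ", ".join(tags) + "]")
    lines.append("sources: [" + ", ".join(sources) + "]")
    lines.append(f"summary: {summary}")
    lines.append("provenance:")
    for key, value in provenance.items():
        lines.append(f"  {key}: {value}")
    lines += ["---", "", body, ""]
    return "\n".join(lines)
```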
Step 5: Update Manifest and Special Files
Update `.manifest.json`:

```json
{
  "ingested_at": "TIMESTAMP",
  "size_bytes": FILE_SIZE,
  "modified_at": FILE_MTIME,
  "source_type": "data", // or "image" for png/jpg/webp/gif sources
  "project": "project-name-or-null",
  "pages_created": ["list/of/pages.md"],
  "pages_updated": ["list/of/pages.md"]
}
```

Update `index.md`, and append a line to `log.md`:

- [TIMESTAMP] DATA_INGEST source="path/to/data" format=FORMAT pages_updated=X pages_created=Y
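The manifest upsert for this step, as a sketch. It assumes the path-to-entry layout used earlier and renders TIMESTAMP as local ISO time:

```python
import json
import os
import time

def record_ingest(manifest_path, source_path, source_type, project, created, updated):
    """Upsert the manifest entry for a source after ingesting it (Step 5 shape)."""
    try:
        with open(manifest_path) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        manifest = {}  # first ingest into this vault
    manifest[source_path] = {
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "size_bytes": os.path.getsize(source_path),
        "modified_at": os.path.getmtime(source_path),
        "source_type": source_type,  # "data", or "image" for png/jpg/webp/gif
        "project": project,
        "pages_created": created,
        "pages_updated": updated,
    }
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
```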
Tips
- When in doubt about format, just read it. The Read tool will show you what you're dealing with.
- Large files: Read in chunks using offset/limit. Don't try to load a 10MB JSON in one go.
- Multiple files: Process them in order, building up wiki pages incrementally.
- Binary files: Skip them, except images — those are first-class sources via the Read tool's vision support.
- Encoding issues: If you see garbled text, mention it to the user and move on.
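The chunked-reading tip, sketched (the chunk size is arbitrary; pick whatever fits your context budget):

```python
def read_in_chunks(path, chunk_chars=50_000):
    """Yield a large text file in fixed-size character chunks instead of one read."""
    with open(path, errors="replace") as f:  # "replace" surfaces encoding issues as U+FFFD
        while True:
            chunk = f.read(chunk_chars)
            if not chunk:
                break
            yield chunk
```

Seeing replacement characters in a chunk is the cue to mention the encoding problem to the user and move on.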