data-ingest

# Data Ingest — Universal Text Source Handler

You are ingesting arbitrary text data into an Obsidian wiki. The source could be anything — conversation exports, log files, transcripts, data dumps. Your job is to figure out the format, extract knowledge, and distill it into wiki pages.

## Before You Start

1. Read `.env` to get `OBSIDIAN_VAULT_PATH`
2. Read `.manifest.json` at the vault root — check if this source has been ingested before
3. Read `index.md` at the vault root to know what already exists

If the source path is already in `.manifest.json` and the file hasn't been modified since `ingested_at`, tell the user it's already been ingested. Ask if they want to re-ingest anyway.

## Step 1: Identify the Source Format

Read the file(s) the user points you at. Common formats you'll encounter:

| Format | How to identify | How to read |
| --- | --- | --- |
| JSON / JSONL | `.json` / `.jsonl` extension, starts with `{` or `[` | Parse with Read tool, look for message/content fields |
| Markdown | `.md` extension | Read directly |
| Plain text | `.txt` extension or no extension | Read directly |
| CSV / TSV | `.csv` / `.tsv`, comma or tab separated | Parse rows, identify columns |
| HTML | `.html`, starts with `<` | Extract text content, ignore markup |
| Chat export | Varies — look for turn-taking patterns (user/assistant, human/ai, timestamps) | Extract the dialogue turns |
| Images | `.png` / `.jpg` / `.jpeg` / `.webp` / `.gif` | Requires a vision-capable model. Use the Read tool — it renders images into your context. Screenshots, whiteboards, diagrams all qualify. Models without vision support should skip and report which files were skipped. |

### Common Chat Export Formats

**ChatGPT export** (`conversations.json`):

```json
[{"title": "...", "mapping": {"node-id": {"message": {"role": "user", "content": {"parts": ["text"]}}}}}]
```

**Slack export** (directory of JSON files per channel):

```json
[{"user": "U123", "text": "message", "ts": "1234567890.123456"}]
```

**Generic chat log** (timestamped text):

```
[2024-03-15 10:30] User: message here
[2024-03-15 10:31] Bot: response here
```

Don't try to handle every format upfront — read the actual data, figure out the structure, and adapt.

### Images and visual sources

When the user dumps a folder of screenshots, whiteboard photos, or diagram exports, treat each image as a source:

- Use the Read tool on the image path — it will render the image into context.
- Transcribe any visible text verbatim (this is the only extracted content from an image).
- Describe structure: for diagrams, list nodes/edges; for screenshots, name the app and what's on screen.
- Extract the concepts the image conveys — what's it about? Most of this is `^[inferred]`.
- Flag anything you can't read, can't identify, or are guessing at with `^[ambiguous]`.

Image-derived pages will skew heavily inferred — that's expected and the provenance markers will reflect it. Set `source_type: "image"` in the manifest entry. Skip files with EXIF-only changes (re-saved with no visual diff) — compare via the standard delta logic.

For folders of mixed images (e.g. a screenshot timeline of a debugging session), cluster by visible topic rather than per-file. Twenty screenshots of the same UI bug should produce one wiki page, not twenty.

## Step 2: Extract Knowledge

Regardless of format, extract the same things:

- Topics discussed — what subjects come up?
- Decisions made — what was concluded or decided?
- Facts learned — what concrete information is stated?
- Procedures described — how-to knowledge, workflows, steps
- Entities mentioned — people, tools, projects, organizations
- Connections — how do topics relate to each other and to existing wiki content?

For conversation data specifically:

Focus on the substance, not the dialogue. A 50-message debugging session might yield one skills page about the fix. A long brainstorming chat might yield three concept pages.
Skip:

- Greetings, pleasantries, meta-conversation ("can you help me with...")
- Repetitive back-and-forth that doesn't add new information
- Raw code dumps (unless they illustrate a reusable pattern)

## Step 3: Cluster and Deduplicate

Before creating pages:

- Group extracted knowledge by topic (not by source file or conversation)
- Check existing wiki pages — does this knowledge belong on an existing page?
- Merge overlapping information from multiple sources
- Note contradictions between sources

## Step 4: Distill into Wiki Pages

Follow the wiki-ingest skill's process for creating/updating pages:

- Use correct category directories (`concepts/`, `entities/`, `skills/`, etc.)
- Add YAML frontmatter with title, category, tags, sources
- Use `[[wikilinks]]` to connect to existing pages
- Attribute claims to their source
- Write a `summary:` frontmatter field on every new page (1–2 sentences, ≤200 characters) answering "what is this page about?" — this is what downstream skills read to avoid opening the page body.
- Apply provenance markers per the convention in llm-wiki. Conversation, log, and chat data tend to be high-inference — you're often reading between the turns to extract a coherent claim. Be liberal with `^[inferred]` for synthesized patterns and with `^[ambiguous]` when speakers contradict each other or you're unsure who's right. Write a `provenance:` frontmatter block on each new/updated page.
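A finished page's frontmatter might look like the sketch below. The field names come from the list above; the page title, tags, and the exact shape of the `provenance:` block are illustrative guesses, so follow the llm-wiki convention where it differs:

```yaml
---
title: Retry Backoff Strategy        # hypothetical page
category: concepts
tags: [networking, reliability]
sources:
  - chat-export-2024-03-15.json
summary: "Why the team settled on exponential backoff with jitter for API retries."
provenance:                          # block shape is a guess; see llm-wiki
  inferred: 3
  ambiguous: 1
---
```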

## Step 5: Update Manifest and Special Files

`.manifest.json` — Add an entry for each source file processed:

```json
{
  "ingested_at": "TIMESTAMP",
  "size_bytes": FILE_SIZE,
  "modified_at": FILE_MTIME,
  "source_type": "data",  // or "image" for png/jpg/webp/gif sources
  "project": "project-name-or-null",
  "pages_created": ["list/of/pages.md"],
  "pages_updated": ["list/of/pages.md"]
}
```

`index.md` and `log.md`:

```
- [TIMESTAMP] DATA_INGEST source="path/to/data" format=FORMAT pages_updated=X pages_created=Y
```

## Tips

- **When in doubt about format, just read it.** The Read tool will show you what you're dealing with.
- **Large files:** Read in chunks using offset/limit. Don't try to load a 10MB JSON in one go.
- **Multiple files:** Process them in order, building up wiki pages incrementally.
- **Binary files:** Skip them, except images — those are first-class sources via the Read tool's vision support.
- **Encoding issues:** If you see garbled text, mention it to the user and move on.