addon-rag-ingestion-pipeline


Add-on: Multi-Format RAG Ingestion Pipeline

Use this skill when an existing project needs RAG ingestion/retrieval across multiple document formats.

Compatibility

  • Works with `architect-python-uv-batch`.
  • Works with `architect-python-uv-fastapi-sqlalchemy`.
  • Can back a Next.js app via a Python worker service.

Inputs

Collect:
  • `SOURCE_FORMATS`: one or more of `pdf`, `markdown`, `txt`, `html`, `csv`.
  • `EMBED_PROVIDER`: `openai` or `sentence-transformers`.
  • `VECTOR_STORE`: `pgvector`, `chroma`, or an existing vector layer.
  • `CHUNK_SIZE`: default `1000`.
  • `CHUNK_OVERLAP`: default `150`.
  • `TOP_K`: default `5`.
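The collected inputs can be carried as one small config object; a minimal sketch (the class and field names are illustrative, not prescribed by this skill):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagConfig:
    """Configuration collected before integrating the RAG pipeline."""

    source_formats: tuple[str, ...]  # subset of: pdf, markdown, txt, html, csv
    embed_provider: str = "openai"  # or "sentence-transformers"
    vector_store: str = "pgvector"  # or "chroma", or an existing vector layer
    chunk_size: int = 1000
    chunk_overlap: int = 150
    top_k: int = 5

    def __post_init__(self) -> None:
        # Fail fast on formats the loaders below do not cover.
        allowed = {"pdf", "markdown", "txt", "html", "csv"}
        unknown = set(self.source_formats) - allowed
        if unknown:
            raise ValueError(f"unsupported formats: {sorted(unknown)}")
```

Freezing the dataclass keeps the defaults auditable: a run's chunking and retrieval parameters cannot drift after startup.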

Integration Workflow

  1. Add dependencies (Python worker path):

     ```bash
     uv add pypdf markdown-it-py beautifulsoup4 pandas langchain-text-splitters
     ```

     • If `EMBED_PROVIDER=openai`: `uv add openai`
     • If `EMBED_PROVIDER=sentence-transformers`: `uv add sentence-transformers`
     • If `VECTOR_STORE=chroma`: `uv add chromadb`

  2. Add modules:

     ```text
     src/{{MODULE_NAME}}/rag/
       loaders/pdf_loader.py
       loaders/markdown_loader.py
       loaders/text_loader.py
       loaders/html_loader.py
       loaders/csv_loader.py
       normalize.py
       chunking.py
       embeddings.py
       indexer.py
       retriever.py
     ```

  3. Use a normalized document contract:

     • `document_id`
     • `source_path`
     • `source_type`
     • `content`
     • `metadata` (filename/page/section/checksum/ingested_at/model_version)

  4. Implement the ingestion entrypoint:

     ```bash
     uv run {{PROJECT_NAME}} rag-ingest --source ./data/inbox --formats pdf,markdown,txt
     ```

  5. Implement the retrieval entrypoint:

     ```bash
     uv run {{PROJECT_NAME}} rag-query --q "question" --top-k 5
     ```

     • Ensure both commands are wired into the project CLI/script entrypoint.
     • `rag-query` depends on an existing index from `rag-ingest`; do not run these validation commands in parallel.
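The normalized document contract from the workflow above can be sketched as a frozen dataclass; the field names follow the contract, while the `make_document` helper and its ID derivation are illustrative assumptions:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedDocument:
    """Format-agnostic document that every loader must produce."""

    document_id: str
    source_path: str
    source_type: str  # pdf, markdown, txt, html, or csv
    content: str
    metadata: dict  # filename/page/section/checksum/ingested_at/model_version


def make_document(source_path: str, source_type: str, content: str) -> NormalizedDocument:
    """Build a contract-conforming document from loader output."""
    # Content checksum doubles as a stable ID, which keeps re-ingestion idempotent
    # (see Guardrails); real code would also record ingested_at and model_version.
    checksum = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return NormalizedDocument(
        document_id=checksum[:16],
        source_path=source_path,
        source_type=source_type,
        content=content,
        metadata={"filename": source_path.rsplit("/", 1)[-1], "checksum": checksum},
    )
```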

Loader Notes

  • PDF: extract per page and keep `page_number` in metadata.
  • Markdown: keep heading hierarchy and section anchors in metadata.
  • Text: read with an encoding fallback (`utf-8` first, then `latin-1`).
  • HTML: strip script/style tags and preserve title/headings where possible.
  • CSV: convert rows into stable textual records with row identifiers.

Minimal Defaults

`normalize.py`

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Normalize raw loader output: NFKC, unified newlines, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

`chunking.py`

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split normalized text into overlapping chunks sized for embedding."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)
```
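The module list also names `embeddings.py`, for which this skill gives no default. One possible shape dispatches on `EMBED_PROVIDER`; the model names are illustrative defaults, and the client calls follow the public APIs of `openai` (>=1.0) and `sentence-transformers` — verify against the installed versions:

```python
from collections.abc import Callable


def make_embedder(provider: str) -> Callable[[list[str]], list[list[float]]]:
    """Return a batch-embedding function for the configured EMBED_PROVIDER."""
    if provider == "openai":
        # Imported lazily so the module loads when only one provider is installed.
        from openai import OpenAI

        client = OpenAI()

        def embed(texts: list[str]) -> list[list[float]]:
            resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
            return [item.embedding for item in resp.data]

    elif provider == "sentence-transformers":
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer("all-MiniLM-L6-v2")

        def embed(texts: list[str]) -> list[list[float]]:
            return model.encode(texts).tolist()

    else:
        raise ValueError(f"unknown EMBED_PROVIDER: {provider}")

    return embed
```

Returning a plain callable keeps `indexer.py` and `retriever.py` provider-agnostic, and makes the guardrail about storing the embedding model/version a one-line addition next to each branch.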

Guardrails

  • Documentation contract for generated code:
     • Python: write module docstrings and docstrings for public classes, methods, and functions.
     • Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
     • Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
     • Apply this contract even when using the template snippets above; expand templates as needed.
  • Deduplicate ingestion by checksum to keep re-runs idempotent.
  • Store the embedding model/version so re-indexing can be reasoned about.
  • Never interpolate user queries into raw SQL vector search.
  • Keep ingestion async/offline for large corpora; do not block request-response paths.
  • Preserve citation metadata for retrieval (`source_path`, section, page, row id).
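The checksum-dedup guardrail reduces to a membership check before indexing; a minimal sketch, assuming the set of already-indexed checksums is loaded from the vector store at startup:

```python
import hashlib


def should_ingest(content: str, seen_checksums: set[str]) -> bool:
    """Return True once per unique content blob, keeping re-runs idempotent."""
    checksum = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if checksum in seen_checksums:
        return False  # already indexed; skip embedding and writing
    seen_checksums.add(checksum)
    return True
```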

Validation Checklist

  • Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.

```bash
uv run {{PROJECT_NAME}} rag-ingest --source ./data/inbox --formats pdf,markdown
uv run {{PROJECT_NAME}} rag-query --q "smoke test" --top-k 5
uv run pytest -q
```

Decision Justification Rule

  • Every non-trivial decision must include a concrete justification.
  • Capture the alternatives considered and why they were rejected.
  • State tradeoffs and residual risks for the chosen option.
  • If justification is missing, treat the task as incomplete and surface it as a blocker.