addon-rag-ingestion-pipeline


Add-on: Multi-Format RAG Ingestion Pipeline

Use this skill when an existing project needs RAG ingestion/retrieval across multiple document formats.

Compatibility

  • Works with `architect-python-uv-batch`.
  • Works with `architect-python-uv-fastapi-sqlalchemy`.
  • Can back a Next.js app via a Python worker service.

Inputs

Collect:
  • `SOURCE_FORMATS`: one or more of `pdf`, `markdown`, `txt`, `html`, `csv`.
  • `EMBED_PROVIDER`: `openai` or `sentence-transformers`.
  • `VECTOR_STORE`: `pgvector`, `chroma`, or an existing vector layer.
  • `CHUNK_SIZE`: default `1000`.
  • `CHUNK_OVERLAP`: default `150`.
  • `TOP_K`: default `5`.
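The collected inputs can be carried as one small config object; a minimal sketch (the class and field names are illustrative, not prescribed by this skill):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagConfig:
    """Configuration collected before integrating the RAG pipeline."""

    source_formats: tuple[str, ...]  # subset of: pdf, markdown, txt, html, csv
    embed_provider: str = "openai"  # or "sentence-transformers"
    vector_store: str = "pgvector"  # or "chroma", or an existing vector layer
    chunk_size: int = 1000
    chunk_overlap: int = 150
    top_k: int = 5

    def __post_init__(self) -> None:
        # Fail fast on formats the loaders below do not cover.
        allowed = {"pdf", "markdown", "txt", "html", "csv"}
        unknown = set(self.source_formats) - allowed
        if unknown:
            raise ValueError(f"unsupported formats: {sorted(unknown)}")
```

Freezing the dataclass keeps the defaults auditable: a run's chunking and retrieval parameters cannot drift after startup.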

Integration Workflow

  1. Add dependencies (Python worker path):

     ```bash
     uv add pypdf markdown-it-py beautifulsoup4 pandas langchain-text-splitters
     ```

     • If `EMBED_PROVIDER=openai`: `uv add openai`
     • If `EMBED_PROVIDER=sentence-transformers`: `uv add sentence-transformers`
     • If `VECTOR_STORE=chroma`: `uv add chromadb`

  2. Add modules:

     ```text
     src/{{MODULE_NAME}}/rag/
       loaders/pdf_loader.py
       loaders/markdown_loader.py
       loaders/text_loader.py
       loaders/html_loader.py
       loaders/csv_loader.py
       normalize.py
       chunking.py
       embeddings.py
       indexer.py
       retriever.py
     ```

  3. Use a normalized document contract:

     • `document_id`
     • `source_path`
     • `source_type`
     • `content`
     • `metadata` (filename/page/section/checksum/ingested_at/model_version)

  4. Implement the ingestion entrypoint:

     ```bash
     uv run {{PROJECT_NAME}} rag-ingest --source ./data/inbox --formats pdf,markdown,txt
     ```

  5. Implement the retrieval entrypoint:

     ```bash
     uv run {{PROJECT_NAME}} rag-query --q "question" --top-k 5
     ```

     • Ensure both commands are wired into the project CLI/script entrypoint.
     • `rag-query` depends on an existing index from `rag-ingest`; do not run these validation commands in parallel.
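The normalized document contract from the workflow above can be sketched as a frozen dataclass; the field names follow the contract, while the `make_document` helper and its ID derivation are illustrative assumptions:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedDocument:
    """Format-agnostic document that every loader must produce."""

    document_id: str
    source_path: str
    source_type: str  # pdf, markdown, txt, html, or csv
    content: str
    metadata: dict  # filename/page/section/checksum/ingested_at/model_version


def make_document(source_path: str, source_type: str, content: str) -> NormalizedDocument:
    """Build a contract-conforming document from loader output."""
    # Content checksum doubles as a stable ID, which keeps re-ingestion idempotent
    # (see Guardrails); real code would also record ingested_at and model_version.
    checksum = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return NormalizedDocument(
        document_id=checksum[:16],
        source_path=source_path,
        source_type=source_type,
        content=content,
        metadata={"filename": source_path.rsplit("/", 1)[-1], "checksum": checksum},
    )
```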

Loader Notes

  • PDF: extract per page and keep `page_number` in metadata.
  • Markdown: keep heading hierarchy and section anchors in metadata.
  • Text: read with an encoding fallback (`utf-8` first, then `latin-1`).
  • HTML: strip script/style tags and preserve title/headings where possible.
  • CSV: convert rows into stable textual records with row identifiers.

Minimal Defaults

`normalize.py`

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Normalize raw loader output: NFKC, unified newlines, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

`chunking.py`

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split normalized text into overlapping chunks sized for embedding."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)
```
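The module list also names `embeddings.py`, for which this skill gives no default. One possible shape dispatches on `EMBED_PROVIDER`; the model names are illustrative defaults, and the client calls follow the public APIs of `openai` (>=1.0) and `sentence-transformers` — verify against the installed versions:

```python
from collections.abc import Callable


def make_embedder(provider: str) -> Callable[[list[str]], list[list[float]]]:
    """Return a batch-embedding function for the configured EMBED_PROVIDER."""
    if provider == "openai":
        # Imported lazily so the module loads when only one provider is installed.
        from openai import OpenAI

        client = OpenAI()

        def embed(texts: list[str]) -> list[list[float]]:
            resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
            return [item.embedding for item in resp.data]

    elif provider == "sentence-transformers":
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer("all-MiniLM-L6-v2")

        def embed(texts: list[str]) -> list[list[float]]:
            return model.encode(texts).tolist()

    else:
        raise ValueError(f"unknown EMBED_PROVIDER: {provider}")

    return embed
```

Returning a plain callable keeps `indexer.py` and `retriever.py` provider-agnostic, and makes the guardrail about storing the embedding model/version a one-line addition next to each branch.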

Guardrails

  • Documentation contract for generated code:
     • Python: write module docstrings and docstrings for public classes, methods, and functions.
     • Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
     • Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
     • Apply this contract even when using the template snippets above; expand templates as needed.
  • Deduplicate ingestion by checksum to keep re-runs idempotent.
  • Store the embedding model/version so re-indexing can be reasoned about.
  • Never interpolate user queries into raw SQL vector search.
  • Keep ingestion async/offline for large corpora; do not block request-response paths.
  • Preserve citation metadata for retrieval (`source_path`, section, page, row id).
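The checksum-dedup guardrail reduces to a membership check before indexing; a minimal sketch, assuming the set of already-indexed checksums is loaded from the vector store at startup:

```python
import hashlib


def should_ingest(content: str, seen_checksums: set[str]) -> bool:
    """Return True once per unique content blob, keeping re-runs idempotent."""
    checksum = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if checksum in seen_checksums:
        return False  # already indexed; skip embedding and writing
    seen_checksums.add(checksum)
    return True
```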

Validation Checklist

  • Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.

```bash
uv run {{PROJECT_NAME}} rag-ingest --source ./data/inbox --formats pdf,markdown
uv run {{PROJECT_NAME}} rag-query --q "smoke test" --top-k 5
uv run pytest -q
```

Decision Justification Rule

  • Every non-trivial decision must include a concrete justification.
  • Capture the alternatives considered and why they were rejected.
  • State tradeoffs and residual risks for the chosen option.
  • If justification is missing, treat the task as incomplete and surface it as a blocker.