addon-rag-ingestion-pipeline
Original:🇺🇸 English
Translated
Use when adding multi-format RAG ingest, chunk, embed, and retrieval pipelines; pair with architect-python-uv-batch or architect-python-uv-fastapi-sqlalchemy.
3installs
Sourceajrlewis/ai-skills
Added on
NPX Install
npx skill4agent add ajrlewis/ai-skills addon-rag-ingestion-pipelineTags
Translated version includes tags in frontmatterSKILL.md Content
View Translation Comparison →Add-on: Multi-Format RAG Ingestion Pipeline
Use this skill when an existing project needs RAG ingestion/retrieval across multiple document formats.
Compatibility
- Works with .
architect-python-uv-batch - Works with .
architect-python-uv-fastapi-sqlalchemy - Can back a Next.js app via a Python worker service.
Inputs
Collect:
- : one or more of
SOURCE_FORMATS,pdf,markdown,txt,html.csv - :
EMBED_PROVIDERoropenai.sentence-transformers - :
VECTOR_STORE,pgvector, or existing vector layer.chroma - : default
CHUNK_SIZE.1000 - : default
CHUNK_OVERLAP.150 - : default
TOP_K.5
Integration Workflow
- Add dependencies (Python worker path):
bash
uv add pypdf markdown-it-py beautifulsoup4 pandas langchain-text-splitters- If :
EMBED_PROVIDER=openaiuv add openai - If :
EMBED_PROVIDER=sentence-transformersuv add sentence-transformers - If :
VECTOR_STORE=chromauv add chromadb
- Add modules:
text
src/{{MODULE_NAME}}/rag/
loaders/pdf_loader.py
loaders/markdown_loader.py
loaders/text_loader.py
loaders/html_loader.py
loaders/csv_loader.py
normalize.py
chunking.py
embeddings.py
indexer.py
retriever.py- Use a normalized document contract:
document_idsource_pathsource_typecontent- (filename/page/section/checksum/ingested_at/model_version)
metadata
- Implement ingestion entrypoint:
bash
uv run {{PROJECT_NAME}} rag-ingest --source ./data/inbox --formats pdf,markdown,txt- Implement retrieval entrypoint:
bash
uv run {{PROJECT_NAME}} rag-query --q "question" --top-k 5- Ensure both commands are wired in the project CLI/script entrypoint.
- depends on an existing index from
rag-query; do not run these validation commands in parallel.rag-ingest
Loader Notes
- PDF: extract per page and keep in metadata.
page_number - Markdown: keep heading hierarchy and section anchors in metadata.
- Text: detect encoding fallback (, then
utf-8).latin-1 - HTML: strip script/style tags and preserve title/headings where possible.
- CSV: convert rows into stable textual records with row identifiers.
Minimal Defaults
normalize.py
normalize.pypython
import re
import unicodedata
def normalize_text(raw: str) -> str:
text = unicodedata.normalize("NFKC", raw)
text = text.replace("\r\n", "\n")
text = re.sub(r"[ \t]+", " ", text)
text = re.sub(r"\n{3,}", "\n\n", text)
return text.strip()chunking.py
chunking.pypython
from langchain_text_splitters import RecursiveCharacterTextSplitter
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=overlap,
separators=["\n\n", "\n", ". ", " ", ""],
)
return splitter.split_text(text)Guardrails
-
Documentation contract for generated code:
- Python: write module docstrings and docstrings for public classes, methods, and functions.
- Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
- Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
- Apply this contract even when using template snippets below; expand templates as needed.
-
Deduplicate ingestion by checksum to keep re-runs idempotent.
-
Store embedding model/version so re-indexing can be reasoned about.
-
Never interpolate user queries into raw SQL vector search.
-
Keep ingestion async/offline for large corpora; do not block request-response paths.
-
Preserve citation metadata for retrieval (, section, page, row id).
source_path
Validation Checklist
- Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.
bash
uv run {{PROJECT_NAME}} rag-ingest --source ./data/inbox --formats pdf,markdown
uv run {{PROJECT_NAME}} rag-query --q "smoke test" --top-k 5
uv run pytest -qDecision Justification Rule
- Every non-trivial decision must include a concrete justification.
- Capture the alternatives considered and why they were rejected.
- State tradeoffs and residual risks for the chosen option.
- If justification is missing, treat the task as incomplete and surface it as a blocker.