RAG Architect

The agent designs, implements, and optimizes production-grade Retrieval-Augmented Generation pipelines, covering the full lifecycle from document chunking through evaluation.

Workflow

Analyse corpus -- Profile the document collection: count, average length, format mix (PDF, HTML, Markdown), language(s), and domain. Validate that sample documents are accessible before proceeding.
Select chunking strategy -- Choose from the Chunking Strategy Matrix based on corpus characteristics. Set chunk size, overlap, and boundary rules. Run a test split on 100 sample documents.
Choose embedding model -- Select an embedding model from the Embedding Model table based on domain, latency budget, and cost constraints. Verify dimension compatibility with the target vector database.
Select vector database -- Pick a vector store from the Vector Database Comparison based on scale, query patterns, and operational requirements.
Design retrieval pipeline -- Configure retrieval strategy (dense, sparse, or hybrid). Add reranking if precision requirements exceed 0.85. Set the top-K parameter and similarity threshold.
Implement query transformations -- If query-document style mismatch exists, enable HyDE. If queries are ambiguous, enable multi-query generation. Validate each transformation improves retrieval metrics on a held-out set.
Configure guardrails -- Enable PII detection, toxicity filtering, hallucination detection, and source attribution. Set confidence score thresholds.
Evaluate end-to-end -- Run the RAGAS evaluation framework. Verify faithfulness > 0.90, context relevance > 0.80, answer relevance > 0.85. Iterate on weak components.

Chunking Strategy Matrix

Strategy	Best For	Chunk Size	Overlap	Pros	Cons
Fixed-size (token)	Uniform docs, consistent sizing	512-2048 tokens	10-20%	Predictable, simple	Breaks semantic units
Sentence-based	Narrative text, articles	3-8 sentences	1 sentence	Preserves language boundaries	Variable sizes
Paragraph-based	Structured docs, technical manuals	1-3 paragraphs	0-1 paragraph	Preserves topic coherence	Highly variable sizes
Semantic	Long-form, research papers	Dynamic	Topic-shift detection	Best coherence	Computationally expensive
Recursive	Mixed content types	Dynamic, multi-level	Per-level	Optimal utilization	Complex implementation
Document-aware	Multi-format collections	Format-specific	Section-level	Preserves metadata	Format-specific code required

Embedding Model Comparison

Model	Dimensions	Speed	Quality	Cost	Best For
all-MiniLM-L6-v2	384	~14K tok/s	Good	Free (local)	Prototyping, low-latency
all-mpnet-base-v2	768	~2.8K tok/s	Better	Free (local)	Balanced production use
text-embedding-3-small	1536	API	High	$0.02/1M tokens	Cost-effective production
text-embedding-3-large	3072	API	Highest	$0.13/1M tokens	Maximum quality
Domain fine-tuned	Varies	Varies	Domain-best	Training cost	Specialized domains (legal, medical)

Vector Database Comparison

Database	Type	Scaling	Key Feature	Best For
Pinecone	Managed	Auto-scaling	Metadata filtering, hybrid search	Production, managed preference
Weaviate	Open source	Horizontal	GraphQL API, multi-modal	Complex data types
Qdrant	Open source	Distributed	High perf, low memory (Rust)	Performance-critical
Chroma	Embedded	Limited	Simple API, SQLite-backed	Prototyping, small-scale
pgvector	PostgreSQL ext	PostgreSQL scaling	ACID, SQL joins	Existing PostgreSQL infra

Retrieval Strategies

Strategy	When to Use	Implementation
Dense (vector similarity)	Default for semantic search	Cosine similarity with k-NN/ANN
Sparse (BM25/TF-IDF)	Exact keyword matching needed	Elasticsearch or inverted index
Hybrid (dense + sparse)	Best of both needed	Reciprocal Rank Fusion (RRF) with tuned weights
+ Reranking	Precision must exceed 0.85	Cross-encoder reranker after initial retrieval

Query Transformation Techniques

Technique	When to Use	How It Works
HyDE	Query/document style mismatch	LLM generates hypothetical answer; embed that instead of query
Multi-query	Ambiguous queries	Generate 3-5 query variations; retrieve for each; deduplicate
Step-back	Specific questions needing general context	Transform to broader query; retrieve general + specific

Context Window Optimization

Relevance ordering: Most relevant chunks first in the context window
Diversity: Deduplicate semantically similar chunks
Token budget: Fit within model context limit; reserve tokens for system prompt and answer
Hierarchical inclusion: Include section summary before detailed chunks when available
Compression: Summarize low-relevance chunks; extract key facts from verbose passages

Evaluation Metrics (RAGAS Framework)

Metric	Target	What It Measures
Faithfulness	> 0.90	Answers grounded in retrieved context
Context Relevance	> 0.80	Retrieved chunks relevant to query
Answer Relevance	> 0.85	Answer addresses the original question
Precision@K	> 0.70	% of top-K results that are relevant
Recall@K	> 0.80	% of relevant docs found in top-K
MRR	> 0.75	Reciprocal rank of first relevant result

Guardrails

PII detection: Scan retrieved chunks and generated responses for PII; redact or block
Hallucination detection: Compare generated claims against source documents via NLI
Source attribution: Every factual claim must cite a retrieved chunk
Confidence scoring: Return confidence level; if below threshold, return "I don't have enough information"
Injection prevention: Sanitize user queries; reject prompt injection attempts

Example: Internal Knowledge Base RAG Pipeline

yaml

corpus:
  documents: 12,000 Confluence pages + 3,000 PDFs
  avg_length: 2,400 tokens
  languages: [English]
  domain: internal engineering docs

pipeline:
  chunking:
    strategy: recursive
    max_tokens: 512
    overlap: 50 tokens
    boundary: paragraph
  embedding:
    model: text-embedding-3-small
    dimensions: 1536
    batch_size: 100
  vector_db:
    engine: pgvector
    index: HNSW (ef_construction=128, m=16)
    reason: "Existing PostgreSQL infra; ACID compliance for audit"
  retrieval:
    strategy: hybrid
    dense_weight: 0.7
    sparse_weight: 0.3
    top_k: 10
    reranker: cross-encoder/ms-marco-MiniLM-L-12-v2
    final_k: 5

evaluation_results:
  faithfulness: 0.93
  context_relevance: 0.84
  answer_relevance: 0.88
  precision_at_5: 0.76
  recall_at_10: 0.85

Production Patterns

Caching: Query-level (exact match), semantic (similar queries via embedding distance < 0.05), chunk-level (embedding cache)
Streaming: Stream generation tokens while retrieval completes; show sources after generation
Fallbacks: If primary vector DB is unavailable, serve from read-replica; if retrieval returns no results above threshold, say so explicitly
Document refresh: Incremental re-embedding on change detection; full re-index weekly
Cost control: Batch embeddings, cache aggressively, route simple queries to BM25 only

Common Pitfalls

Problem	Solution
Chunks break mid-sentence	Use boundary-aware chunking with sentence/paragraph overlap
Low retrieval precision	Add cross-encoder reranker; tune similarity threshold
High latency (> 2s)	Cache embeddings; use faster model; reduce top-K
Inconsistent quality	Implement RAGAS evaluation in CI; add quality scoring
Scalability bottleneck	Shard vector DB; implement auto-scaling; add caching layer

Scripts

Chunking Optimizer

Analyses corpus and recommends optimal chunking strategy with parameters.

Retrieval Evaluator

Runs evaluation suite (precision, recall, MRR, NDCG) against a test query set.

Pipeline Benchmarker

Measures end-to-end latency, throughput, and cost per query across configurations.

Troubleshooting

Problem	Cause	Solution
Chunks contain incomplete sentences or broken code blocks	Fixed-size chunking ignoring semantic boundaries	Switch to sentence-based or semantic (heading-aware) chunking; enable boundary detection in `chunking_optimizer.py`
Retrieved context is relevant but answer is wrong	LLM hallucinating beyond retrieved chunks	Enable faithfulness evaluation via RAGAS; add source attribution guardrails; lower confidence threshold to surface "I don't know" responses
Precision@K below 0.50 despite relevant documents existing	Embedding model does not capture domain vocabulary	Fine-tune embedding model on domain data or switch to a domain-specific model; add cross-encoder reranking stage
Query latency exceeds 2 seconds	Large top-K, no caching, or unoptimized HNSW index	Reduce top-K, enable query-level and semantic caching, tune HNSW parameters (ef_search, m)
Recall drops after adding new documents	Stale embeddings or index fragmentation after incremental inserts	Trigger full re-index; verify new documents pass chunking pipeline; check embedding model version consistency
Hybrid retrieval returns duplicate chunks	Dense and sparse retrievers returning overlapping results without deduplication	Apply Reciprocal Rank Fusion (RRF) with deduplication before reranking; tune dense/sparse weight ratio
Evaluation metrics fluctuate across runs	Non-deterministic embedding batching or insufficient test query set	Fix random seeds, increase evaluation sample size, run evaluations on a frozen ground-truth set

Success Criteria

Faithfulness > 0.90 -- Generated answers are grounded in retrieved context as measured by the RAGAS faithfulness metric.
Context Relevance > 0.80 -- At least 80% of retrieved chunks are relevant to the user query.
Precision@5 > 0.70 -- Seven out of ten top-5 result sets contain only relevant documents.
End-to-end latency < 500ms -- P95 query-to-response latency stays under 500 milliseconds for interactive workloads.
Recall@10 > 0.85 -- The system retrieves at least 85% of relevant documents within the top 10 results.
Chunk boundary quality > 0.80 -- At least 80% of chunks end on clean sentence or paragraph boundaries as reported by
```
chunking_optimizer.py
```
.
Monthly cost within budget -- Total embedding, vector DB, and reranking costs stay within the budget ceiling defined in requirements.

Scope & Limitations

This skill covers:

End-to-end RAG pipeline architecture design: chunking, embedding, vector storage, retrieval, reranking, and evaluation.
Quantitative chunking analysis across four strategy families (fixed-size, sentence, paragraph, semantic).
Retrieval quality evaluation using standard IR metrics (Precision@K, Recall@K, MRR, NDCG) with a built-in TF-IDF baseline.
Automated pipeline design with component selection, cost projection, and Mermaid architecture diagrams.

This skill does NOT cover:

LLM prompt engineering or generation-side optimization -- see
```
engineering/prompt-engineer-toolkit
```
.
Database schema design for metadata stores alongside vector databases -- see
```
engineering/database-designer
```
.
Production observability, alerting, and SLO dashboards for deployed pipelines -- see
```
engineering/observability-designer
```
.
Agent orchestration or multi-step reasoning workflows that sit on top of RAG retrieval -- see
```
engineering/agent-workflow-designer
```
.

Integration Points

Skill	Integration	Data Flow
`engineering/prompt-engineer-toolkit`	Optimize system prompts and few-shot examples fed alongside retrieved chunks	Pipeline design output --> prompt templates that reference chunk format and metadata
`engineering/database-designer`	Design relational metadata stores (tags, access control, source tracking) paired with the vector database	Vector DB recommendation --> metadata schema for hybrid storage
`engineering/observability-designer`	Set up latency, throughput, and accuracy monitoring for the deployed RAG pipeline	Evaluation metrics and SLO targets --> dashboards and alerting rules
`engineering/agent-workflow-designer`	Embed the RAG retrieval step inside multi-agent reasoning workflows	Retrieval config --> agent tool definition with top-K and threshold parameters
`engineering/ci-cd-pipeline-builder`	Automate embedding re-indexing, evaluation regression tests, and deployment on document changes	Evaluation thresholds --> CI gate that blocks deploys when metrics regress
`engineering/api-design-reviewer`	Review the query and ingestion API surface exposed by the RAG service	Pipeline config --> OpenAPI spec review for search and ingest endpoints

Tool Reference

chunking_optimizer.py

Purpose: Analyzes a document corpus and evaluates multiple chunking strategies (fixed-size, sentence-based, paragraph-based, semantic/heading-aware) to recommend the optimal approach with configuration parameters.

Usage:

bash

python chunking_optimizer.py <directory> [options]

Flags / Parameters:

Flag	Type	Default	Description
`directory`	positional, required	--	Directory containing text/markdown documents to analyze
`--output` , `-o`	string	None	Output file path for results in JSON format
`--config` , `-c`	string	None	JSON configuration file to customize strategy parameters (fixed_sizes, overlaps, sentence_max_sizes, paragraph_max_sizes, semantic_max_sizes)
`--extensions`	string list	`.txt .md .markdown`	File extensions to include when scanning the corpus
`--verbose` , `-v`	flag	off	Print all strategy scores in addition to the recommendation

Example:

bash

python chunking_optimizer.py ./docs --output results.json --extensions .txt .md --verbose

Output Formats:

Console -- Corpus summary, recommended strategy name, performance score, reasoning text, and two sample chunks. With
```
--verbose
```
, all strategy scores are listed.
JSON (
```
--output
```
) -- Full results object containing
```
corpus_info
```
,
```
strategy_results
```
(per-strategy size statistics, boundary quality, semantic coherence, vocabulary statistics, performance score),
```
recommendation
```
(best strategy, all scores, reasoning), and
```
sample_chunks
```
.

retrieval_evaluator.py

Purpose: Evaluates retrieval system performance using a built-in TF-IDF baseline retriever and standard information retrieval metrics: Precision@K, Recall@K, MRR, and NDCG. Includes failure analysis and improvement recommendations.

Usage:

bash

python retrieval_evaluator.py <queries> <corpus> <ground_truth> [options]

Flags / Parameters:

Flag	Type	Default	Description
`queries`	positional, required	--	JSON file containing queries (list of `{"id": ..., "query": ...}` objects, or `{"queries": [...]}` )
`corpus`	positional, required	--	Directory containing the document corpus
`ground_truth`	positional, required	--	JSON file mapping query IDs to lists of relevant document IDs
`--output` , `-o`	string	None	Output file path for results in JSON format
`--k-values`	int list	`1 3 5 10`	K values used when computing Precision@K, Recall@K, and NDCG@K
`--extensions`	string list	`.txt .md .markdown`	File extensions to include from the corpus directory
`--verbose` , `-v`	flag	off	Print detailed per-metric values and failure analysis counts

Example:

bash

python retrieval_evaluator.py queries.json ./corpus ground_truth.json --output eval.json --k-values 1 5 10 --verbose

Output Formats:

Console -- Evaluation summary table (Precision@1, Precision@5, Recall@5, MRR, NDCG@5) with performance assessment and numbered improvement recommendations. With
```
--verbose
```
, all aggregate metrics and failure analysis counts are printed.
JSON (
```
--output
```
) -- Full results object containing
```
aggregate_metrics
```
,
```
query_results
```
(per-query metrics, retrieved docs, relevant docs),
```
failure_analysis
```
(poor precision/recall counts, zero-result counts, query length analysis, failure patterns),
```
evaluation_summary
```
, and
```
recommendations
```
.

rag_pipeline_designer.py

Purpose: Accepts a system requirements specification and generates a complete RAG pipeline design including component recommendations (chunking, embedding, vector DB, retrieval, reranking, evaluation), cost projections, a Mermaid architecture diagram, and deployment configuration templates.

Usage:

bash

python rag_pipeline_designer.py <requirements> [options]

Flags / Parameters:

Flag	Type	Default	Description
`requirements`	positional, required	--	JSON file containing system requirements (document_types, document_count, avg_document_size, queries_per_day, query_patterns, latency_requirement, budget_monthly, accuracy_priority, cost_priority, maintenance_complexity)
`--output` , `-o`	string	None	Output file path for the pipeline design in JSON format
`--verbose` , `-v`	flag	off	Print full configuration templates for each component

Example:

bash

python rag_pipeline_designer.py requirements.json --output pipeline_design.json --verbose

Output Formats:

Console -- Design summary with total monthly cost, per-component recommendations (name, rationale, cost), and a Mermaid architecture diagram. With
```
--verbose
```
, full JSON configuration templates for each component are printed.
JSON (
```
--output
```
) -- Complete pipeline design object containing per-component
```
ComponentRecommendation
```
fields (name, type, config, rationale, pros, cons, cost_monthly),
```
total_cost
```
,
```
architecture_diagram
```
(Mermaid markup), and
```
config_templates
```
(per-component configs plus deployment/scaling/monitoring settings).

rag-architect

NPX Install

Tags

SKILL.md Content

RAG Architect

Workflow

Chunking Strategy Matrix

Embedding Model Comparison

Vector Database Comparison

Retrieval Strategies

Query Transformation Techniques

Context Window Optimization

Evaluation Metrics (RAGAS Framework)

Guardrails

Example: Internal Knowledge Base RAG Pipeline

Production Patterns

Common Pitfalls

Scripts

Chunking Optimizer

Retrieval Evaluator

Pipeline Benchmarker

Troubleshooting

Success Criteria

Scope & Limitations

Integration Points

Tool Reference

chunking_optimizer.py

retrieval_evaluator.py

rag_pipeline_designer.py