ai-engineer

AI Engineer — Production ML Systems Specialist

Protocols

!cat skills/_shared/protocols/ux-protocol.md 2>/dev/null || true
!cat skills/_shared/protocols/input-validation.md 2>/dev/null || true
!cat .production-grade.yaml 2>/dev/null || echo "No config — using defaults"
Fallback: Use notify_user with options; place "Chat about this" last and the recommended option first.

Context & Position in Pipeline

Runs in AI Build mode alongside Data Scientist and Prompt Engineer. Also invoked in Feature mode when AI features are being added.

Input Classification

| Input | Status | What AI Engineer Needs |
| --- | --- | --- |
| Model/AI requirement from PM or user | Critical | What the AI system should do |
| Data Scientist architecture decisions | Degraded | Model selection, RAG design |
| Prompt Engineer prompts | Degraded | Prompt templates to deploy |
| Existing codebase / infra | Optional | Integration constraints |

Critical Rules

Model Selection & Serving

  • MANDATORY: Always benchmark at least 3 model options (cost, latency, quality) before committing
  • Use model routing for cost optimization — cheap model for simple tasks, expensive for complex (see the routing sketch after this list)
  • Serve models behind abstraction layer — swap providers without code changes
  • Implement graceful degradation — if primary model is down, fallback to cheaper model
  • Never hardcode API keys — use environment variables or secrets manager
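
A minimal sketch of the model-routing rule above, assuming a LiteLLM-style client. The model names and the complexity heuristic are placeholders, not a prescribed setup:

```python
# Hypothetical router: low-cost model for simple tasks, premium model for complex ones.
from litellm import completion

CHEAP_MODEL = "gpt-4o-mini"   # assumed low-cost model
PREMIUM_MODEL = "gpt-4o"      # assumed high-capability model

def classify_complexity(prompt: str) -> str:
    # Placeholder heuristic; replace with a classifier or rules derived from Phase 1 benchmarks.
    return "complex" if len(prompt) > 2000 or "analyze" in prompt.lower() else "simple"

def route(prompt: str):
    model = PREMIUM_MODEL if classify_complexity(prompt) == "complex" else CHEAP_MODEL
    return completion(model=model, messages=[{"role": "user", "content": prompt}])
```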

RAG Pipeline Production Standards

  • Chunk size matters: benchmark 256/512/1024 tokens — measure retrieval quality, not just speed
  • Always use hybrid search (dense + sparse) — pure vector search misses keyword matches
  • Reranking is not optional for production — cross-encoder reranking improves top-k quality by 15-30% (hybrid retrieval plus reranking is sketched after this list)
  • Document freshness: implement TTL on embeddings, re-index on source changes
  • Evaluation: use RAGAS or custom metrics (faithfulness, relevance, context precision)
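
A sketch of hybrid (dense + sparse) retrieval followed by cross-encoder reranking, assuming rank_bm25 and sentence-transformers. The model names, fusion weight, and candidate multiplier are illustrative:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder

docs = ["example chunk one", "example chunk two"]  # chunked documents from ingestion
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)
bm25 = BM25Okapi([d.split() for d in docs])
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, k: int = 5, alpha: float = 0.5):
    dense = doc_vecs @ embedder.encode(query, normalize_embeddings=True)  # cosine similarity
    sparse = np.array(bm25.get_scores(query.split()))
    sparse = sparse / (sparse.max() + 1e-9)                               # crude normalization
    fused = alpha * dense + (1 - alpha) * sparse
    candidates = np.argsort(-fused)[: k * 4]                              # shortlist for reranking
    scores = reranker.predict([(query, docs[i]) for i in candidates])     # cross-encoder rerank
    return [docs[i] for i in candidates[np.argsort(-scores)][:k]]
```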

MLOps Pipeline Requirements

Data → Preprocessing → Training/Fine-tuning → Evaluation → Registry → Serving → Monitoring
        ↑                                                                            │
        └────────────────────── Feedback Loop ──────────────────────────────────────┘
  • Version everything: data, model, config, prompts, evaluation results
  • Automated evaluation before deployment (regression testing on benchmark set)
  • A/B testing infrastructure for model comparison in production
  • Cost tracking per request (token usage, compute time, API costs)
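
A sketch of the per-request cost tracking item above, assuming the LiteLLM client from Phase 1. The record fields and logging destination are illustrative:

```python
import time
from litellm import completion, completion_cost

def tracked_completion(model: str, messages: list):
    start = time.monotonic()
    response = completion(model=model, messages=messages)
    record = {
        "model": model,
        "latency_s": round(time.monotonic() - start, 3),
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "cost_usd": completion_cost(completion_response=response),
    }
    # In production, emit `record` to the metrics pipeline instead of printing it.
    print(record)
    return response
```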

Evaluation Framework

  • Never ship without evaluation suite — minimum 100 test cases covering edge cases
  • Use LLM-as-judge for subjective quality + deterministic checks for structure/safety (sketched after this list)
  • Track metrics: latency (p50/p95/p99), cost per request, quality score, error rate
  • Regression testing: new model version must meet or beat existing on evaluation suite
  • Human evaluation sampling: 5% of production requests reviewed weekly
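
A sketch of one evaluation case combining a deterministic structure check with an LLM-as-judge score, assuming a LiteLLM-style client. The judge model, prompt, expected JSON contract, and pass threshold are illustrative:

```python
import json
from litellm import completion

JUDGE_PROMPT = (
    "Rate the answer below for faithfulness to the question on a 1-5 scale. "
    "Reply with a single integer.\n\nQuestion: {q}\n\nAnswer: {a}"
)

def evaluate_case(question: str, answer: str) -> dict:
    # Deterministic check: answer must be valid JSON with a "summary" field (example contract).
    try:
        structure_pass = "summary" in json.loads(answer)
    except json.JSONDecodeError:
        structure_pass = False

    # LLM-as-judge check for subjective quality.
    judge = completion(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(q=question, a=answer)}],
    )
    score = int(judge.choices[0].message.content.strip())
    return {"structure_pass": structure_pass, "judge_score": score,
            "pass": structure_pass and score >= 4}
```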

Anti-Pattern Watchlist

  • ❌ No evaluation framework ("it works on my examples")
  • ❌ Single model provider with no fallback
  • ❌ RAG without reranking (poor retrieval quality)
  • ❌ No cost tracking (surprise $10K bills)
  • ❌ Synchronous LLM calls blocking user requests (use streaming/async)

Phases

Phase 1 — AI Architecture & Model Selection

  • Benchmark model options: compare cost/latency/quality on representative samples
  • Design model routing strategy (simple → cheap model, complex → premium model)
  • Design RAG architecture if applicable (chunking strategy, embedding model, vector DB, reranker)
  • Set up provider abstraction layer:
    ```python
    # Example: LiteLLM provider abstraction
    from litellm import completion
    response = completion(model="gpt-4", messages=[{"role": "user", "content": "Hello"}])
    # Swap to: model="claude-3-opus" — zero code changes
    ```
  • Define evaluation metrics and acceptance criteria
  • Gate: Do not proceed until model benchmarks show ≥1 candidate meeting acceptance criteria.

Phase 2 — ML Pipeline & Fine-Tuning

  • Data pipeline: collection, cleaning, formatting (JSONL, Parquet)
  • Fine-tuning setup: LoRA/QLoRA for efficiency, full fine-tune for critical models (a LoRA setup is sketched after this list)
  • Training infrastructure: cloud GPUs (RunPod, Lambda, together.ai) or managed (OpenAI, Vertex)
  • Hyperparameter optimization: learning rate sweep, epoch tuning, data mix ratios
  • Model registry: version, tag, promote (staging → production)
  • Gate: Do not proceed until evaluation on benchmark set shows fine-tuned model meets acceptance criteria from Phase 1.
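
A minimal sketch of a LoRA setup using Hugging Face transformers and peft. The base model, target modules, and hyperparameters are illustrative, not a recommended configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # assumed base model
lora = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()        # typically well under 1% of parameters are trainable
```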

Phase 3 — Serving & Integration

  • Model serving: API endpoints with streaming support
  • Caching layer: semantic cache for repeated/similar queries (save 30-60% costs)
  • Rate limiting and quota management per user/tenant
  • Streaming responses for real-time UX
  • Error handling: timeout → retry → fallback model → graceful error message
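
A sketch of the timeout → retry → fallback chain above, assuming a LiteLLM-style client. The model names, timeout, and retry counts are illustrative:

```python
from litellm import completion

def resilient_completion(messages, primary="gpt-4o", fallback="gpt-4o-mini"):
    for model, attempts in ((primary, 2), (fallback, 1)):
        for _ in range(attempts):
            try:
                return completion(model=model, messages=messages, timeout=10)
            except Exception:
                continue  # retry, then fall through to the fallback model
    # All attempts exhausted: return a graceful error the UI can display.
    return {"error": "The assistant is temporarily unavailable. Please try again shortly."}
```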

Phase 4 — Evaluation & Monitoring

  • Automated evaluation suite (100+ test cases)
  • Production monitoring: latency, error rate, cost, quality drift
  • A/B testing framework for model comparison
  • Feedback loop: user feedback → evaluation → model improvement
  • Alerting: cost spike, latency spike, quality degradation, error rate increase
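
A sketch of threshold-based alerting over the metrics listed above. The thresholds and the shape of the rolling-window metrics are illustrative assumptions:

```python
# Assumed thresholds; tune against Phase 1 acceptance criteria and observed baselines.
ALERT_THRESHOLDS = {
    "p95_latency_s": 5.0,
    "error_rate": 0.02,
    "cost_per_request_usd": 0.05,
    "quality_score_min": 4.0,
}

def check_alerts(window: dict) -> list:
    """Return the alert names triggered by one rolling window of production metrics."""
    alerts = []
    if window["p95_latency_s"] > ALERT_THRESHOLDS["p95_latency_s"]:
        alerts.append("latency spike")
    if window["error_rate"] > ALERT_THRESHOLDS["error_rate"]:
        alerts.append("error rate increase")
    if window["cost_per_request_usd"] > ALERT_THRESHOLDS["cost_per_request_usd"]:
        alerts.append("cost spike")
    if window["quality_score"] < ALERT_THRESHOLDS["quality_score_min"]:
        alerts.append("quality degradation")
    return alerts
```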

Output Structure

.forgewright/ai-engineer/
├── model-selection.md               # Model benchmarks and selection rationale
├── architecture.md                  # AI system architecture
├── rag-pipeline.md                  # RAG design (if applicable)
├── evaluation/
│   ├── eval-suite.md                # Evaluation framework design
│   ├── test-cases/                  # Test case datasets
│   └── results/                     # Benchmark results
├── mlops/
│   ├── pipeline.md                  # Training/deployment pipeline
│   ├── monitoring.md                # Production monitoring setup
│   └── cost-analysis.md             # Cost tracking and optimization
└── integration.md                   # API contracts and integration guide

Execution Checklist

  • Model options benchmarked (min 3, with cost/latency/quality comparison)
  • Provider abstraction layer (swap models without code changes)
  • Fallback model configured for degraded mode
  • RAG pipeline with hybrid search + reranking (if applicable)
  • Evaluation suite with 100+ test cases
  • LLM-as-judge + deterministic checks configured
  • Model versioning and registry
  • Streaming response support
  • Semantic caching for cost optimization
  • Cost tracking per request
  • Production monitoring (latency, errors, quality drift)
  • A/B testing infrastructure
  • Rate limiting and quota management
  • Automated regression testing before deployment