Cost-Aware LLM Pipeline

Patterns for controlling LLM API costs while maintaining quality. Combines model routing, budget tracking, retry logic, and prompt caching into a composable pipeline.

When to Activate

  • Building applications that call LLM APIs (Claude, GPT, etc.)
  • Processing batches of items with varying complexity
  • Need to stay within a budget for API spend
  • Optimizing cost without sacrificing quality on complex tasks

Core Concepts

1. Model Routing by Task Complexity

Automatically select cheaper models for simple tasks, reserving expensive models for complex ones.
python
MODEL_SONNET = "claude-sonnet-4-5-20250929"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

_SONNET_TEXT_THRESHOLD = 10_000  # chars
_SONNET_ITEM_THRESHOLD = 30     # items

def select_model(
    text_length: int,
    item_count: int,
    force_model: str | None = None,
) -> str:
    """Select model based on task complexity."""
    if force_model is not None:
        return force_model
    if text_length >= _SONNET_TEXT_THRESHOLD or item_count >= _SONNET_ITEM_THRESHOLD:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU  # Simple task (3-4x cheaper)

2. Immutable Cost Tracking

Track cumulative spend with frozen dataclasses. Recording a call returns a new tracker rather than mutating the existing one.
python
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        """Return new tracker with added record (never mutates self)."""
        return CostTracker(
            budget_limit=self.budget_limit,
            records=(*self.records, record),
        )

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit

3. Narrow Retry Logic

Retry only on transient errors. Fail fast on authentication or bad request errors.
python
import time

from anthropic import (
    APIConnectionError,
    InternalServerError,
    RateLimitError,
)

_RETRYABLE_ERRORS = (APIConnectionError, RateLimitError, InternalServerError)
_MAX_RETRIES = 3

def call_with_retry(func, *, max_retries: int = _MAX_RETRIES):
    """Retry only on transient errors, fail fast on others."""
    for attempt in range(max_retries):
        try:
            return func()
        except _RETRYABLE_ERRORS:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
    # AuthenticationError, BadRequestError etc. → raise immediately
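
To see the narrow-retry behavior without calling a real API, the sketch below generalizes the same loop. `TransientError`, `PermanentError`, `retry_transient`, the `retryable` parameter, and the `sleep` hook are hypothetical stand-ins introduced here for illustration; they are not part of the `anthropic` SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable error (e.g. a rate limit)."""


class PermanentError(Exception):
    """Hypothetical stand-in for a non-retryable error (e.g. a bad request)."""


def retry_transient(
    func: Callable[[], T],
    *,
    retryable: tuple[type[BaseException], ...] = (TransientError,),
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying only on the listed exception types."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except retryable:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last transient error.
            sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
    raise AssertionError("unreachable")


# Demo: fail twice with a transient error, then succeed.
calls = {"n": 0}
delays: list[float] = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary failure")
    return "ok"


# Passing delays.append as the sleep hook records the backoff instead of waiting.
result = retry_transient(flaky, sleep=delays.append)
```

Any exception type outside `retryable` is not caught by the `except` clause, so it propagates on the first attempt. That is the fail-fast half of the pattern.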

4. Prompt Caching

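Mark long system prompts as cacheable so the API can reuse the processed prefix instead of reprocessing it on every request. The prompt is still sent each time, but cache hits are billed at a reduced input rate.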
python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache this
            },
            {
                "type": "text",
                "text": user_input,  # Variable part
            },
        ],
    }
]
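
The `build_cached_messages` helper is used in the Composition section but never defined. One possible shape for it, assuming it simply wraps the structure above (the name comes from this document; the body is an assumption):

```python
from typing import Any


def build_cached_messages(system_prompt: str, user_input: str) -> list[dict[str, Any]]:
    """Return a messages list whose stable prefix is marked for caching."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": system_prompt,  # Stable prefix, identical across requests
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": user_input,  # Variable part, placed after the cached prefix
                },
            ],
        }
    ]


messages = build_cached_messages("You are a careful extractor.", "Item 42")
```

The cached block has to come first. Caching applies to the prompt prefix up to and including the marked block, so any change before the marker produces a cache miss.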

Composition

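Combine all four techniques in a single pipeline function. The sketch assumes `client`, `system_prompt`, `estimated_items`, `Config`, `Result`, `BudgetExceededError`, `build_cached_messages`, and `parse_result` are defined elsewhere: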
python
def process(text: str, config: Config, tracker: CostTracker) -> tuple[Result, CostTracker]:
    # 1. Route model
    model = select_model(len(text), estimated_items, config.force_model)

    # 2. Check budget
    if tracker.over_budget:
        raise BudgetExceededError(tracker.total_cost, tracker.budget_limit)

    # 3. Call with retry + caching
    response = call_with_retry(lambda: client.messages.create(
        model=model,
        max_tokens=1024,  # Required by the Messages API; tune per task
        messages=build_cached_messages(system_prompt, text),
    ))

    # 4. Track cost (immutable)
    record = CostRecord(model=model, input_tokens=..., output_tokens=..., cost_usd=...)
    tracker = tracker.add(record)

    return parse_result(response), tracker
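
The `cost_usd` value is left elided in the function above. One way to derive it is sketched here, on the assumption that cost is simply token count times a per-million-token price. It ignores cache read and write pricing, and `estimate_cost_usd` and its parameters are hypothetical names:

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate request cost from token counts and $/1M-token prices."""
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


# Example with the Sonnet 4.5 prices from the pricing table ($3.00 in, $15.00 out).
cost = estimate_cost_usd(2_000, 500, 3.00, 15.00)
```

With the Anthropic Python SDK, the token counts for a completed request are available as `response.usage.input_tokens` and `response.usage.output_tokens`.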

Pricing Reference (2025-2026)

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Relative Cost |
|---|---|---|---|
| Haiku 4.5 | $1.00 | $5.00 | 1x |
| Sonnet 4.5 | $3.00 | $15.00 | 3x |
| Opus 4.5 | $5.00 | $25.00 | 5x |

Best Practices

  • Start with the cheapest model and only route to expensive models when complexity thresholds are met
  • Set explicit budget limits before processing batches — fail early rather than overspend
  • Log model selection decisions so you can tune thresholds based on real data
  • Use prompt caching for long system prompts (at least 1024 tokens; the minimum cacheable length depends on the model) to save both cost and latency
  • Never retry on authentication or validation errors — only transient failures (network, rate limit, server error)
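
The advice to log model selection decisions can be sketched as follows. The logger name, the `log_routing_decision` helper, and the record fields are assumptions chosen for illustration; the idea is to record the inputs that drove each decision so that thresholds can be revisited later.

```python
import logging

logger = logging.getLogger("llm_pipeline.routing")  # Hypothetical logger name


def log_routing_decision(
    model: str, text_length: int, item_count: int, forced: bool = False
) -> dict[str, object]:
    """Log one routing decision together with the inputs that drove it."""
    decision = {
        "model": model,
        "text_length": text_length,
        "item_count": item_count,
        "forced": forced,
    }
    logger.info("model_selected %s", decision)
    return decision


logging.basicConfig(level=logging.INFO)
decision = log_routing_decision("claude-haiku-4-5-20251001", 1_200, 4)
```

Aggregating these records by model, alongside a quality signal for each result, shows whether the text-length and item-count thresholds are set too high or too low.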

Anti-Patterns to Avoid

  • Using the most expensive model for all requests regardless of complexity
  • Retrying on all errors (wastes budget on permanent failures)
  • Mutating cost tracking state (makes debugging and auditing difficult)
  • Hardcoding model names throughout the codebase (use constants or config)
  • Ignoring prompt caching for repetitive system prompts

When to Use

  • Any application calling Claude, OpenAI, or similar LLM APIs
  • Batch processing pipelines where cost adds up quickly
  • Multi-model architectures that need intelligent routing
  • Production systems that need budget guardrails