cost-aware-llm-pipeline

Cost-Aware LLM Pipeline

Patterns for controlling LLM API costs while maintaining quality. Combines model routing, budget tracking, retry logic, and prompt caching into a composable pipeline.

When to Activate

Building applications that call LLM APIs (Claude, GPT, etc.)
Processing batches of items with varying complexity
Staying within a fixed budget for API spend
Optimizing cost without sacrificing quality on complex tasks

Core Concepts

1. Model Routing by Task Complexity

Automatically select cheaper models for simple tasks, reserving expensive models for complex ones.

python

MODEL_SONNET = "claude-sonnet-4-5-20250929"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

_SONNET_TEXT_THRESHOLD = 10_000  # chars
_SONNET_ITEM_THRESHOLD = 30     # items

def select_model(
    text_length: int,
    item_count: int,
    force_model: str | None = None,
) -> str:
    """Select model based on task complexity."""
    if force_model is not None:
        return force_model
    if text_length >= _SONNET_TEXT_THRESHOLD or item_count >= _SONNET_ITEM_THRESHOLD:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU  # Simple task (3-4x cheaper)
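
To tune the thresholds from real data, it helps to record why each model was chosen. The following is a minimal sketch of that idea, not part of the original pipeline: `RoutingDecision` and `route` are hypothetical names, and the thresholds simply mirror the constants above.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass

logger = logging.getLogger("llm_router")

MODEL_SONNET = "claude-sonnet-4-5-20250929"
MODEL_HAIKU = "claude-haiku-4-5-20251001"
TEXT_THRESHOLD = 10_000  # chars, same value as _SONNET_TEXT_THRESHOLD above
ITEM_THRESHOLD = 30      # items, same value as _SONNET_ITEM_THRESHOLD above


@dataclass(frozen=True)
class RoutingDecision:
    """Hypothetical record of which model was chosen and why."""
    model: str
    reason: str


def route(text_length: int, item_count: int, force_model: str | None = None) -> RoutingDecision:
    """Apply the same rules as select_model, but keep the reason for later tuning."""
    if force_model is not None:
        decision = RoutingDecision(force_model, "forced")
    elif text_length >= TEXT_THRESHOLD:
        decision = RoutingDecision(MODEL_SONNET, "text_length")
    elif item_count >= ITEM_THRESHOLD:
        decision = RoutingDecision(MODEL_SONNET, "item_count")
    else:
        decision = RoutingDecision(MODEL_HAIKU, "below_thresholds")
    # One log line per decision is enough to see how often each rule fires.
    logger.info("model=%s reason=%s chars=%d items=%d",
                decision.model, decision.reason, text_length, item_count)
    return decision
```

Counting the logged reasons over a representative batch shows which threshold triggers most escalations and is therefore worth revisiting first.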

2. Immutable Cost Tracking

Track cumulative spend with frozen dataclasses. Each call to add() returns a new tracker and never mutates the existing one.

python

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        """Return new tracker with added record (never mutates self)."""
        return CostTracker(
            budget_limit=self.budget_limit,
            records=(*self.records, record),
        )

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit
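
The `cost_usd` field has to be computed by the caller from the token counts and the model's per-token prices. A minimal sketch of that arithmetic follows; `estimate_cost_usd` is a hypothetical helper, and the prices are passed in explicitly so that they can come from configuration rather than being hardcoded.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Convert token counts to dollars given prices per one million tokens."""
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


# Sonnet 4.5 rates from the pricing table: $3.00 input, $15.00 output per 1M tokens.
cost = estimate_cost_usd(2_000, 500, 3.00, 15.00)
print(f"${cost:.4f}")  # $0.0135
```

The result can be stored directly in `CostRecord.cost_usd` before the record is added to the tracker.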

3. Narrow Retry Logic

Retry only on transient errors. Fail fast on authentication or bad request errors.

python

import time

from anthropic import (
    APIConnectionError,
    InternalServerError,
    RateLimitError,
)

_RETRYABLE_ERRORS = (APIConnectionError, RateLimitError, InternalServerError)
_MAX_RETRIES = 3

def call_with_retry(func, *, max_retries: int = _MAX_RETRIES):
    """Retry only on transient errors, fail fast on others."""
    for attempt in range(max_retries):
        try:
            return func()
        except _RETRYABLE_ERRORS:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
    # AuthenticationError, BadRequestError etc. → raise immediately
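
One variant worth considering adds random jitter to the backoff, so that many workers hitting the same rate limit do not all retry at the same moment. The sketch below is not the original function: `TransientError` is a hypothetical stand-in for the SDK's retryable exceptions so that the example runs without network access, and the injectable `sleep` parameter is an assumption made for testability.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for retryable errors (network, rate limit, server error)."""


def call_with_jittered_retry(
    func: Callable[[], T],
    *,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry func on TransientError with exponential backoff plus full jitter."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time in [0, base_delay * 2**attempt].
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")
```

Any exception other than `TransientError` is not caught, so permanent failures still propagate on the first attempt.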

4. Prompt Caching

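Mark long, stable system prompts as cacheable so that repeated requests read the shared prefix from cache at a reduced rate instead of paying the full input price for it each time.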

python

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache this
            },
            {
                "type": "text",
                "text": user_input,  # Variable part
            },
        ],
    }
]
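
The Composition section calls a `build_cached_messages` helper that is not defined anywhere in this document. A plausible sketch, assuming it simply wraps the structure above and takes the two strings as arguments (the signature is inferred from that call site), could look like this:

```python
from typing import Any


def build_cached_messages(system_prompt: str, user_input: str) -> list[dict[str, Any]]:
    """Build a single user message whose stable prefix is marked cacheable."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": system_prompt,
                    # Everything up to and including this block forms the cached prefix.
                    "cache_control": {"type": "ephemeral"},
                },
                # The variable part goes after the cache breakpoint.
                {"type": "text", "text": user_input},
            ],
        }
    ]
```

Keeping the variable input after the marked block matters: a cache hit requires the prefix to be identical between requests, so anything that changes per request belongs after the breakpoint.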

Composition

Combine all four techniques in a single pipeline function:

python

def process(text: str, config: Config, tracker: CostTracker) -> tuple[Result, CostTracker]:
    # 1. Route model
    model = select_model(len(text), estimated_items, config.force_model)

    # 2. Check budget
    if tracker.over_budget:
        raise BudgetExceededError(tracker.total_cost, tracker.budget_limit)

    # 3. Call with retry + caching
    response = call_with_retry(lambda: client.messages.create(
        model=model,
        max_tokens=1024,  # required by the Messages API; 1024 is an illustrative value
        messages=build_cached_messages(system_prompt, text),
    ))

    # 4. Track cost (immutable)
    record = CostRecord(model=model, input_tokens=..., output_tokens=..., cost_usd=...)
    tracker = tracker.add(record)

    return parse_result(response), tracker
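
The function above raises `BudgetExceededError` without defining it, and it handles a single item. The following sketch shows one way the error type and a batch loop might look. For brevity it tracks spend as a plain float rather than a `CostTracker`, and `run_batch` and `process_one` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Iterable


class BudgetExceededError(RuntimeError):
    """Raised when cumulative spend has passed the configured limit."""

    def __init__(self, spent: float, limit: float) -> None:
        super().__init__(f"spent ${spent:.4f}, limit ${limit:.4f}")
        self.spent = spent
        self.limit = limit


def run_batch(
    items: Iterable[str],
    process_one: Callable[[str], tuple[str, float]],
    budget_limit: float,
) -> tuple[list[str], float]:
    """Process items in order, refusing to start a call once spend exceeds the limit.

    process_one is a hypothetical callable returning (result, cost_usd) for one item.
    """
    results: list[str] = []
    spent = 0.0
    for item in items:
        if spent > budget_limit:  # same comparison as CostTracker.over_budget
            raise BudgetExceededError(spent, budget_limit)
        result, cost = process_one(item)
        results.append(result)
        spent += cost
    return results, spent
```

Because the check runs before each call rather than after, total spend can exceed the limit by at most the cost of one call. If that margin matters, the limit can be set slightly below the real budget.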

Pricing Reference (2025-2026)

Model	Input ($/1M tokens)	Output ($/1M tokens)	Relative Cost
Haiku 4.5	$1.00	$5.00	1x
Sonnet 4.5	$3.00	$15.00	3x
Opus 4.5	$5.00	$25.00	5x

Best Practices

Start with the cheapest model and only route to expensive models when complexity thresholds are met
Set explicit budget limits before processing batches — fail early rather than overspend
Log model selection decisions so you can tune thresholds based on real data
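Use prompt caching for long, stable system prompts: prompts shorter than the model's minimum cacheable length (1024 tokens on Sonnet, higher on Haiku) are not cached, and cache hits reduce both cost and latency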
Never retry on authentication or validation errors — only transient failures (network, rate limit, server error)
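
As a rough illustration of the caching savings, the sketch below compares the input cost of a shared prefix with and without caching. It assumes that cache writes are billed at 1.25x the base input price and cache reads at 0.1x; these multipliers are assumptions based on Anthropic's published pricing for the default cache lifetime and should be checked against current pricing. `prefix_cost_usd` is a hypothetical helper.

```python
def prefix_cost_usd(
    prefix_tokens: int,
    requests: int,
    input_price_per_mtok: float,
    *,
    cached: bool,
    write_multiplier: float = 1.25,  # assumed cache-write surcharge
    read_multiplier: float = 0.10,   # assumed cache-read discount
) -> float:
    """Input cost of sending the same prefix `requests` times, with or without caching."""
    base = prefix_tokens * input_price_per_mtok / 1_000_000
    if not cached or requests == 0:
        return base * requests
    # The first request writes the cache; the rest read it (assuming no expiry in between).
    return base * write_multiplier + base * read_multiplier * (requests - 1)


# 5,000-token system prompt, 100 requests, Sonnet 4.5 input price of $3.00 per 1M tokens.
uncached = prefix_cost_usd(5_000, 100, 3.00, cached=False)
cached = prefix_cost_usd(5_000, 100, 3.00, cached=True)
print(f"uncached=${uncached:.4f} cached=${cached:.4f}")
```

Under these assumptions a single request is slightly more expensive with caching than without, because of the write surcharge, so caching pays off only when the same prefix is reused before the cache entry expires.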

Anti-Patterns to Avoid

Using the most expensive model for all requests regardless of complexity
Retrying on all errors (wastes budget on permanent failures)
Mutating cost tracking state (makes debugging and auditing difficult)
Hardcoding model names throughout the codebase (use constants or config)
Ignoring prompt caching for repetitive system prompts

When to Use

Any application calling Claude, OpenAI, or similar LLM APIs
Batch processing pipelines where cost adds up quickly
Multi-model architectures that need intelligent routing
Production systems that need budget guardrails
