Prompt Caching Skill

Leverage Anthropic's prompt caching to dramatically reduce latency and costs for repeated prompts.
When to Use This Skill
- RAG systems with large static documents
- Multi-turn conversations with long instructions
- Code analysis with large codebase context
- Batch processing with shared prefixes
- Document analysis and summarization
Core Concepts
Cache Control Placement
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with access to a large knowledge base...",
            "cache_control": {"type": "ephemeral"}  # Cache this content
        }
    ],
    messages=[{"role": "user", "content": "What is...?"}]
)
```
Cache Hierarchy
Cache breakpoints are checked in this order:
- Tools - Tool definitions cached first
- System - System prompts cached second
- Messages - Conversation history cached last
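Put together, a request can place one breakpoint at each level, in the order above. A minimal sketch of the raw request body — the `search_kb` tool and the placeholder texts are illustrative, not real definitions:

```python
# Sketch: one breakpoint per cache level, in hierarchy order.
request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "tools": [{
        "name": "search_kb",  # hypothetical tool
        "description": "Search the knowledge base",
        "input_schema": {"type": "object", "properties": {}},
        "cache_control": {"type": "ephemeral"},  # 1: tools cached first
    }],
    "system": [{
        "type": "text",
        "text": "Long, stable instructions...",
        "cache_control": {"type": "ephemeral"},  # 2: system cached second
    }],
    "messages": [{
        "role": "user",
        "content": [{
            "type": "text",
            "text": "Conversation so far...",
            "cache_control": {"type": "ephemeral"},  # 3: messages cached last
        }],
    }],
}
```

Because breakpoints are checked in this order, a change to an earlier level (e.g. a tool definition) also misses the cache for everything after it.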
TTL Options
| TTL | Write Cost | Read Cost | Use Case |
|---|---|---|---|
| 5 minutes (default) | 1.25x base | 0.1x base | Interactive sessions |
| 1 hour | 2.0x base | 0.1x base | Batch processing, stable docs |
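To see when each TTL pays off, the table's multipliers can be plugged into a quick cost comparison. A sketch with an illustrative rate (check current pricing); `batch_cost` is a hypothetical helper:

```python
BASE_RATE = 0.003  # assumed $ per 1K input tokens, for illustration only

def batch_cost(context_tokens, num_requests, write_multiplier, read_multiplier=0.1):
    """One cache write, then cache reads for the remaining requests."""
    per_k = context_tokens / 1000
    write = per_k * BASE_RATE * write_multiplier
    reads = per_k * BASE_RATE * read_multiplier * (num_requests - 1)
    return write + reads

# 100K-token shared context, 50 requests
uncached = (100_000 / 1000) * BASE_RATE * 50  # ~15.00: no caching at all
five_min = batch_cost(100_000, 50, 1.25)      # ~1.85: 5-minute TTL
one_hour = batch_cost(100_000, 50, 2.0)       # ~2.07: 1-hour TTL
```

The 1-hour TTL costs a little more to write, but stays warm across slow batches where a 5-minute cache would expire and have to be re-written.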
Cache Requirements
- Minimum tokens: 1024-4096 (varies by model)
- Maximum breakpoints: 4 per request
- Supported models: Claude Opus 4.5, Sonnet 4.5, Haiku 4.5
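Because blocks below the minimum are silently not cached (you pay normal input rates with no benefit), it can help to gate the marker on a rough size estimate. A sketch assuming the common ~4-characters-per-token heuristic rather than the real tokenizer; `system_block` is a hypothetical helper:

```python
def system_block(text, min_tokens=1024):
    """Attach cache_control only when the block plausibly clears the minimum.

    The 4-chars-per-token ratio is a rough heuristic, not the real tokenizer;
    some models require more than 1024 tokens.
    """
    block = {"type": "text", "text": text}
    if len(text) / 4 >= min_tokens:
        block["cache_control"] = {"type": "ephemeral"}
    return block
```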
Implementation Patterns
Pattern 1: Single Breakpoint (Recommended)
Best for: Document analysis, Q&A with static context

```python
system = [
    {
        "type": "text",
        "text": large_document_content,
        "cache_control": {"type": "ephemeral"}  # Single breakpoint at end
    }
]
```
Pattern 2: Multi-Turn Conversation
```python
# Cache grows with conversation. Note: cache_control goes on a
# content block, not on the message object itself.
messages = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {
        "role": "user",
        "content": [{
            "type": "text",
            "text": "Follow-up question",
            "cache_control": {"type": "ephemeral"}  # Cache entire conversation
        }]
    }
]
```
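A helper can keep this pattern going across turns by moving the single breakpoint to the newest user message, so each request re-reads the previously cached prefix. `add_turn` is a hypothetical helper, not part of the SDK:

```python
def add_turn(messages, user_text):
    """Append a user turn and move the cache breakpoint onto it."""
    # Clear any previous breakpoint so the request keeps at most one here.
    for msg in messages:
        msg.pop("cache_control", None)
        if isinstance(msg.get("content"), list):
            for block in msg["content"]:
                block.pop("cache_control", None)
    messages.append({
        "role": "user",
        "content": [{
            "type": "text",
            "text": user_text,
            "cache_control": {"type": "ephemeral"},
        }],
    })
    return messages
```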
Pattern 3: RAG with Multiple Breakpoints
```python
system = [
    {
        "type": "text",
        "text": "Tool definitions and instructions",
        "cache_control": {"type": "ephemeral"}  # Breakpoint 1: Tools
    },
    {
        "type": "text",
        "text": retrieved_documents,
        "cache_control": {"type": "ephemeral"}  # Breakpoint 2: Documents
    }
]
```
Pattern 4: Batch Processing with 1-Hour TTL
```python
# Warm the cache before batch
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": shared_context,
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
    }],
    messages=[{"role": "user", "content": "Initialize cache"}]
)

# Now run batch - all requests hit the cache
for item in batch_items:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": shared_context,
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }],
        messages=[{"role": "user", "content": item}]
    )
```
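After a warmed batch run, the per-request `usage` objects can be aggregated to confirm the cache actually held. A sketch; `usages` is assumed to be a list of `response.usage` values collected in the loop:

```python
def summarize_cache(usages):
    """Aggregate cache statistics across a batch of responses."""
    read = sum(u.cache_read_input_tokens for u in usages)
    write = sum(u.cache_creation_input_tokens for u in usages)
    fresh = sum(u.input_tokens for u in usages)
    total = read + write + fresh
    return {
        "hit_rate": read / total if total else 0.0,
        "cached_tokens_read": read,
        "cache_writes": write,
    }
```

A healthy warmed batch shows one large cache write up front and a hit rate approaching 1.0 as the batch grows.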
Performance Monitoring
Check Cache Usage
```python
response = client.messages.create(...)

# Monitor these fields
cache_write = response.usage.cache_creation_input_tokens  # New cache written
cache_read = response.usage.cache_read_input_tokens  # Cache hit!
uncached = response.usage.input_tokens  # After breakpoint

print(f"Cache hit rate: {cache_read / (cache_read + cache_write + uncached) * 100:.1f}%")
```
Cost Calculation
```python
def calculate_cost(usage, model="claude-sonnet-4-20250514"):
    # Example rates (check current pricing)
    base_input_rate = 0.003  # per 1K tokens
    write_cost = (usage.cache_creation_input_tokens / 1000) * base_input_rate * 1.25  # 5-min TTL write premium
    read_cost = (usage.cache_read_input_tokens / 1000) * base_input_rate * 0.1
    uncached_cost = (usage.input_tokens / 1000) * base_input_rate
    return write_cost + read_cost + uncached_cost
```
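Plugging concrete numbers into the same multipliers shows why cache reads dominate the savings. A worked example at the same illustrative rate:

```python
BASE = 0.003  # same example $ per 1K tokens as above

# A fully warmed request: 100K tokens read from cache plus a 50-token question.
read_cost = (100_000 / 1000) * BASE * 0.1  # 0.03
fresh_cost = (50 / 1000) * BASE            # 0.00015
uncached_cost = (100_050 / 1000) * BASE    # ~0.30, if nothing were cached

savings = 1 - (read_cost + fresh_cost) / uncached_cost  # ~90% cheaper
```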
Cache Invalidation
Changes that invalidate cache:
| Change | Impact |
|---|---|
| Tool definitions | Entire cache invalidated |
| System prompt | System + messages invalidated |
| Any content before breakpoint | That breakpoint + later invalidated |
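Since any edit before a breakpoint silently invalidates it, it can be worth fingerprinting the cached prefix in tests so accidental changes show up as a diff. A sketch; `prefix_fingerprint` is a hypothetical helper:

```python
import hashlib
import json

def prefix_fingerprint(tools, system):
    """Stable hash of everything before the last cache breakpoint."""
    payload = json.dumps({"tools": tools, "system": system}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Pin the fingerprint in a test: if a refactor touches the tools or system prompt, the hash changes and flags the (otherwise invisible) cache miss.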
Best Practices
DO:
- Place breakpoint at END of static content
- Keep tools/instructions stable across requests
- Use 1-hour TTL for batch processing
- Monitor cache_read_input_tokens for savings
DON'T:
- Place breakpoint in middle of dynamic content
- Change tool definitions frequently
- Expect cache to work with <1024 tokens
- Ignore the 20-block lookback limit
Integration with Extended Thinking
```python
# Cache + Extended Thinking
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    system=[{
        "type": "text",
        "text": large_context,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "Analyze this..."}]
)
```
See Also
- [[llm-integration]] - Claude API basics
- [[extended-thinking]] - Deep reasoning
- [[batch-processing]] - Bulk processing