Context Optimization Techniques

Context optimization extends the effective capacity of limited context windows through strategic compression, masking, caching, and partitioning. Effective optimization can double or triple effective context capacity without requiring larger models or longer windows — but only when applied with discipline. The techniques below are ordered by impact and risk.

When to Activate

Activate this skill when:
  • Context limits constrain task complexity
  • Optimizing for cost (fewer tokens = lower costs)
  • Reducing latency in long conversations
  • Implementing long-running agent systems
  • Handling larger documents or conversations
  • Building production systems at scale

Core Concepts

Apply four primary strategies in this priority order:
  1. KV-cache optimization — Reorder and stabilize prompt structure so the inference engine reuses cached Key/Value tensors. This is the cheapest optimization: zero quality risk, immediate cost and latency savings. Apply it first and unconditionally.
  2. Observation masking — Replace verbose tool outputs with compact references once their purpose has been served. Tool outputs consume 80%+ of tokens in typical agent trajectories, so masking them yields the largest capacity gains. The original content remains retrievable if needed downstream.
  3. Compaction — Summarize accumulated context when utilization exceeds 70%, then reinitialize with the summary. This distills the window's contents while preserving task-critical state. Compaction is lossy — apply it after masking has already removed the low-value bulk.
  4. Context partitioning — Split work across sub-agents with isolated contexts when a single window cannot hold the full problem. Each sub-agent operates in a clean context focused on its subtask. Reserve this for tasks where estimated context exceeds 60% of the window limit, because coordination overhead is real.
The governing principle: context quality matters more than quantity. Every optimization preserves signal while reducing noise. Measure before optimizing, then measure the optimization's effect.
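As a minimal sketch, the priority order and thresholds above can be expressed as a small dispatcher. The function and strategy names are illustrative, not from any library:

```python
def choose_optimizations(utilization: float, est_task_fraction: float) -> list[str]:
    """Pick strategies in the skill's priority order.

    utilization: current context tokens / window limit.
    est_task_fraction: estimated task context / window limit.
    """
    actions = ["kv_cache_ordering"]        # always first: zero quality risk
    actions.append("observation_masking")  # largest capacity win in agent loops
    if utilization > 0.70:
        actions.append("compaction")       # lossy: only after masking removed bulk
    if est_task_fraction > 0.60:
        actions.append("partitioning")     # coordination overhead is real
    return actions
```

Note that the first two strategies apply unconditionally; only the lossy and high-overhead strategies are gated on thresholds.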

Detailed Topics

Compaction Strategies

Trigger compaction when context utilization exceeds 70%: summarize the current context, then reinitialize with the summary. This distills the window's contents in a high-fidelity manner, enabling continuation with minimal performance degradation. Prioritize compressing tool outputs first (they consume 80%+ of tokens), then old conversation turns, then retrieved documents. Never compress the system prompt — it anchors model behavior and its removal causes unpredictable degradation.
Preserve different elements by message type:
  • Tool outputs: Extract key findings, metrics, error codes, and conclusions. Strip verbose raw output, stack traces (unless debugging is ongoing), and boilerplate headers.
  • Conversational turns: Retain decisions, commitments, user preferences, and context shifts. Remove filler, pleasantries, and exploratory back-and-forth that led to a conclusion already captured.
  • Retrieved documents: Keep claims, facts, and data points relevant to the active task. Remove supporting evidence and elaboration that served a one-time reasoning purpose.
Target 50-70% token reduction with less than 5% quality degradation. If compaction exceeds 70% reduction, audit the summary for critical information loss — over-aggressive compaction is the most common failure mode.
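A compaction pass following these rules might look like the sketch below. `summarize` stands in for a hypothetical LLM call (`list[dict] -> str`); message shapes and the audit threshold follow the skill body, not any specific API:

```python
def compact(messages, summarize, max_reduction=0.70):
    """Summarize all non-system messages, then reinitialize the context.

    Returns (compacted_messages, needs_audit); needs_audit flags
    over-aggressive compaction (>70% reduction) for review.
    """
    system = [m for m in messages if m["role"] == "system"]  # never compressed
    rest = [m for m in messages if m["role"] != "system"]
    # Compress highest-volume content first: tool outputs, then older turns
    # (sort is stable, so relative order within each group is preserved).
    rest.sort(key=lambda m: 0 if m["role"] == "tool" else 1)
    summary = summarize(rest)
    compacted = system + [
        {"role": "user", "content": f"[Compacted context]\n{summary}"}
    ]
    before = sum(len(m["content"]) for m in messages)
    after = sum(len(m["content"]) for m in compacted)
    needs_audit = (1 - after / before) > max_reduction
    return compacted, needs_audit
```

In production the length check would use token counts rather than character counts; characters keep the sketch dependency-free.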

Observation Masking

Mask observations selectively based on recency and ongoing relevance — not uniformly. Apply these rules:
  • Never mask: Observations critical to the current task, observations from the most recent turn, observations used in active reasoning chains, and error outputs when debugging is in progress.
  • Mask after 3+ turns: Verbose outputs whose key points have already been extracted into the conversation flow. Replace with a compact reference:
    [Obs:{ref_id} elided. Key: {summary}. Full content retrievable.]
  • Always mask immediately: Repeated/duplicate outputs, boilerplate headers and footers, outputs already summarized earlier in the conversation.
Masking should achieve 60-80% reduction in masked observations with less than 2% quality impact. The key is maintaining retrievability — store the full content externally and keep the reference ID in context so the agent can request the original if needed.
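A sketch of these rules, with a minimal in-memory store standing in for external storage (all names are illustrative):

```python
class ObservationStore:
    """Minimal external store so masked content stays retrievable."""

    def __init__(self):
        self._items, self._next = {}, 0

    def put(self, content: str) -> str:
        ref = f"obs-{self._next}"
        self._next += 1
        self._items[ref] = content
        return ref

    def get(self, ref: str) -> str:
        return self._items[ref]


def mask(observation: str, turns_old: int, store, *,
         is_error=False, debugging=False, key_points=None):
    # Never mask: recent observations, or error output during active debugging.
    if turns_old < 3 or (debugging and is_error):
        return observation
    ref = store.put(observation)           # full content stays retrievable
    key = key_points or observation[:60]   # stand-in for key-point extraction
    return f"[Obs:{ref} elided. Key: {key}. Full content retrievable.]"
```

A real implementation would replace the `observation[:60]` truncation with proper key-point extraction and would persist the store outside the process.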

KV-Cache Optimization

Maximize prefix cache hits by structuring prompts so that stable content occupies the prefix and dynamic content appears at the end. KV-cache stores Key and Value tensors computed during inference; when consecutive requests share an identical prefix, the cached tensors are reused, saving both cost and latency.
Apply this ordering in every prompt:
  1. System prompt (most stable — never changes within a session)
  2. Tool definitions (stable across requests)
  3. Frequently reused templates and few-shot examples
  4. Conversation history (grows but shares prefix with prior turns)
  5. Current query and dynamic content (least stable — always last)
Design prompts for cache stability: remove timestamps, session counters, and request IDs from the system prompt. Move dynamic metadata into a separate user message or tool result where it does not break the prefix. Even a single whitespace change in the prefix invalidates the entire cached block downstream of that change.
Target 70%+ cache hit rate for stable workloads. At scale, this translates to 50%+ cost reduction and 40%+ latency reduction on cached tokens.
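The ordering and the "dynamic metadata out of the system prompt" rule can be sketched as a single assembly function (message shapes are illustrative):

```python
def build_prompt(system_prompt, tool_defs, examples, history, query, today):
    """Assemble messages stable-first for prefix-cache reuse.

    `today` goes into the final user message, never the system prompt,
    so the cached prefix survives across days and requests.
    """
    return (
        [{"role": "system", "content": system_prompt}]           # most stable
        + [{"role": "system", "content": t} for t in tool_defs]  # stable
        + examples                                               # reused templates
        + history                                                # shared prefix
        + [{"role": "user", "content": f"(date: {today})\n{query}"}]  # dynamic last
    )
```

Two requests built this way on different days share every message before the last one, so only the final message misses the cache.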

Context Partitioning

Partition work across sub-agents when a single context cannot hold the full problem without triggering aggressive compaction. Each sub-agent operates in a clean, focused context for its subtask, then returns a structured result to a coordinator agent.
Plan partitioning when estimated task context exceeds 60% of the window limit. Decompose the task into independent subtasks, assign each to a sub-agent, and aggregate results. Validate that all partitions completed before merging, merge compatible results, and apply summarization if the aggregated output still exceeds budget.
This approach achieves separation of concerns — detailed search context stays isolated within sub-agents while the coordinator focuses on synthesis. However, coordination has real token cost: the coordinator prompt, result aggregation, and error handling all consume tokens. Only partition when the savings exceed this overhead.
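A coordinator for this flow might be sketched as follows; `run_subagent` and `merge` are hypothetical callables standing in for real sub-agent invocation and result aggregation:

```python
def partition_and_run(subtasks, run_subagent, merge):
    """Run each subtask in an isolated context, validate, then merge.

    subtasks: mapping of name -> task description.
    run_subagent: task -> result (None signals an incomplete partition).
    merge: dict of results -> aggregated output.
    """
    results = {name: run_subagent(task) for name, task in subtasks.items()}
    incomplete = [name for name, r in results.items() if r is None]
    if incomplete:  # validate all partitions completed before merging
        raise RuntimeError(f"incomplete partitions: {incomplete}")
    return merge(results)
```

If the merged output still exceeds budget, a summarization pass over `merge`'s result would follow, per the paragraph above.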

Budget Management

Allocate explicit token budgets across context categories before the session begins: system prompt, tool definitions, retrieved documents, message history, tool outputs, and a reserved buffer (5-10% of total). Monitor usage against budget continuously and trigger optimization when any category exceeds its allocation or total utilization crosses 70%.
Use trigger-based optimization rather than periodic optimization. Monitor these signals:
  • Token utilization above 80% — trigger compaction
  • Attention degradation indicators (repetition, missed instructions) — trigger masking + compaction
  • Quality score drops below baseline — audit context composition before optimizing
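A trigger-based monitor following these signals might look like the sketch below; the budget fractions are illustrative, not prescribed by the skill:

```python
BUDGET = {  # fraction of the window per category (illustrative allocation)
    "system": 0.05, "tools": 0.05, "documents": 0.20,
    "history": 0.30, "tool_outputs": 0.30, "buffer": 0.10,
}


def check_triggers(usage_tokens, window_limit, quality, baseline):
    """Return the optimizations to trigger, given current usage and quality."""
    triggers = []
    total = sum(usage_tokens.values()) / window_limit
    if total > 0.80:                       # utilization signal
        triggers.append("compaction")
    for cat, tokens in usage_tokens.items():
        if tokens / window_limit > BUDGET.get(cat, 0):
            triggers.append(f"over_budget:{cat}")
    if quality < baseline:                 # audit before optimizing
        triggers.append("audit_context")
    return triggers
```

The attention-degradation signal (repetition, missed instructions) is omitted here because detecting it requires model-output analysis beyond a budget check.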

Practical Guidance

Optimization Decision Framework

Select the optimization technique based on what dominates the context:
| Context Composition | First Action | Second Action |
| --- | --- | --- |
| Tool outputs dominate (>50%) | Observation masking | Compaction of remaining turns |
| Retrieved documents dominate | Summarization | Partitioning if docs are independent |
| Message history dominates | Compaction with selective preservation | Partitioning for new subtasks |
| Multiple components contribute | KV-cache optimization | Layer masking + compaction |
| Near-limit with active debugging | Mask resolved tool outputs only | Preserve error details |
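This decision framework reduces to a small dispatcher; the action names are illustrative labels, not library calls:

```python
def first_actions(composition, debugging=False):
    """Map the dominant context category to the framework's actions.

    composition: mapping of category -> fraction of context tokens.
    """
    if debugging:
        return ["mask_resolved_tool_outputs"]  # preserve error details
    if composition.get("tool_outputs", 0) > 0.50:
        return ["observation_masking", "compaction"]
    dominant = max(composition, key=composition.get)
    if dominant == "documents":
        return ["summarization", "partitioning"]
    if dominant == "history":
        return ["selective_compaction", "partitioning"]
    return ["kv_cache", "masking", "compaction"]  # multiple components
```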

Performance Targets

Track these metrics to validate optimization effectiveness:
  • Compaction: 50-70% token reduction, <5% quality degradation, <10% latency overhead from the compaction step itself
  • Masking: 60-80% reduction in masked observations, <2% quality impact, near-zero latency overhead
  • Cache optimization: 70%+ hit rate for stable workloads, 50%+ cost reduction, 40%+ latency reduction
  • Partitioning: Net token savings after accounting for coordinator overhead; break-even typically requires 3+ subtasks
Iterate on strategies based on measured results. If an optimization technique does not measurably improve the target metric, remove it — optimization machinery itself consumes tokens and adds latency.
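A sketch of the keep-or-remove decision, using the compaction targets above as defaults (quality scores are assumed to come from an external evaluation harness):

```python
def evaluate(tokens_before, tokens_after, quality_before, quality_after,
             target_reduction=0.50, max_degradation=0.05):
    """Compare one optimization run against its targets.

    Defaults match the compaction targets (50-70% reduction, <5%
    quality loss); pass different thresholds for masking or caching.
    """
    reduction = 1 - tokens_after / tokens_before
    degradation = (quality_before - quality_after) / quality_before
    return {
        "reduction": round(reduction, 3),
        "degradation": round(degradation, 3),
        # Remove the technique if it misses either target.
        "keep": reduction >= target_reduction and degradation < max_degradation,
    }
```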

Examples

Example 1: Compaction Trigger

```python
if context_tokens / context_limit > 0.8:
    context = compact_context(context)
```

Example 2: Observation Masking

```python
if len(observation) > max_length:
    ref_id = store_observation(observation)
    return f"[Obs:{ref_id} elided. Key: {extract_key(observation)}]"
```

Example 3: Cache-Friendly Ordering

```python
# Stable content first
context = [system_prompt, tool_definitions]  # Cacheable
context += [reused_templates]                # Reusable
context += [unique_content]                  # Unique
```

Guidelines

  1. Measure before optimizing — know your current state
  2. Apply masking before compaction — remove low-value bulk first, then summarize what remains
  3. Design for cache stability with consistent prompts
  4. Partition before context becomes problematic
  5. Monitor optimization effectiveness over time
  6. Balance token savings against quality preservation
  7. Test optimization at production scale
  8. Implement graceful degradation for edge cases

Gotchas

  1. Whitespace breaks KV-cache: Even a single whitespace or newline change in the prompt prefix invalidates the entire KV-cache block downstream of that point. Pin system prompts as immutable strings — do not interpolate timestamps, version numbers, or session IDs into them. Diff prompt templates byte-for-byte between deployments.
  2. Timestamps in system prompts destroy cache hit rates: Including
    Current date: {today}
    or similar dynamic content in the system prompt forces a full cache miss on every new day (or every request, if using time-of-day). Move dynamic metadata into a user message or a separate tool result appended after the stable prefix.
  3. Compaction under pressure loses critical state: When the model performing compaction is itself under context pressure (>85% utilization), its summarization quality degrades — it omits task goals, drops user constraints, and flattens nuanced state. Trigger compaction at 70-80%, not 90%+. If compaction must happen late, use a separate model call with a clean context containing only the material to summarize.
  4. Masking error outputs breaks debugging loops: Over-aggressive masking hides error messages, stack traces, and failure details that the agent needs in subsequent turns to diagnose and fix issues. During active debugging (error in the last 3 turns), suspend masking for all error-related observations until the issue is resolved.
  5. Partitioning overhead can exceed savings: Each sub-agent requires its own system prompt, tool definitions, and coordination messages. For tasks with fewer than 3 independent subtasks, the coordination overhead often exceeds the context savings. Estimate total tokens (coordinator + all sub-agents) before committing to partitioning.
  6. Cache miss cost spikes after deployment changes: Reordering tools, rewording the system prompt, or changing few-shot examples between deployments invalidates the entire prefix cache, causing a temporary cost spike of 2-5x until the new cache warms up. Roll out prompt changes gradually and monitor cache hit rate during deployment windows.
  7. Compaction creates false confidence in stale summaries: Once context is compacted, the summary looks authoritative but may reflect outdated state. If the task has evolved since compaction (new user requirements, corrected assumptions), the summary silently carries forward stale information. After compaction, re-validate the summary against the current task goal before proceeding.
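For gotchas 1, 2, and 6, a cheap guard is to fingerprint the stable prefix and compare digests across deployments; any byte change, even whitespace, changes the digest and signals a guaranteed cache miss. A minimal sketch (the record separator join is an illustrative choice):

```python
import hashlib


def prefix_fingerprint(messages, stable_count):
    """SHA-256 digest of the first `stable_count` messages' content.

    Record the digest at deploy time; a changed digest means the
    prefix cache will be cold until it re-warms.
    """
    stable = "\x1e".join(m["content"] for m in messages[:stable_count])
    return hashlib.sha256(stable.encode()).hexdigest()
```

Comparing fingerprints byte-for-byte between deployments catches accidental cache-busting edits before they hit production traffic.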

Integration

This skill builds on context-fundamentals and context-degradation. It connects to:
  • multi-agent-patterns - Partitioning as isolation
  • evaluation - Measuring optimization effectiveness
  • memory-systems - Offloading context to memory

References

Internal reference:
  • Optimization Techniques Reference - Read when: implementing a specific optimization technique and needing detailed code patterns, threshold tables, or integration examples beyond what the skill body provides
Related skills in this collection:
  • context-fundamentals - Read when: unfamiliar with context window mechanics, token counting, or attention distribution basics
  • context-degradation - Read when: diagnosing why agent performance has dropped and needing to identify which degradation pattern is occurring before selecting an optimization
  • evaluation - Read when: setting up metrics and benchmarks to measure whether an optimization technique actually improved outcomes
External resources:
  • Research on context window limitations - Read when: evaluating model-specific context behavior (e.g., lost-in-the-middle effects, attention decay curves)
  • KV-cache optimization techniques - Read when: implementing prefix caching at the inference infrastructure level (vLLM, TGI, or cloud provider APIs)
  • Production engineering guides - Read when: deploying context optimization in a production pipeline and needing operability patterns (monitoring, alerting, rollback)

Skill Metadata

Created: 2025-12-20
Last Updated: 2026-03-17
Author: Agent Skills for Context Engineering Contributors
Version: 2.0.0