context-degradation

Context Degradation Patterns


Diagnose and fix context failures before they cascade. Context degradation is not binary — it is a continuum that manifests through five distinct, predictable patterns: lost-in-middle, poisoning, distraction, confusion, and clash. Each pattern has specific detection signals and mitigation strategies. Treat degradation as an engineering problem with measurable thresholds, not an unpredictable failure mode.

When to Activate


Activate this skill when:
  • Agent performance degrades unexpectedly during long conversations
  • Debugging cases where agents produce incorrect or irrelevant outputs
  • Designing systems that must handle large contexts reliably
  • Evaluating context engineering choices for production systems
  • Investigating "lost in middle" phenomena in agent outputs
  • Analyzing context-related failures in agent behavior

Core Concepts


Structure context placement around the attention U-curve: beginning and end positions receive reliable attention, while middle positions suffer 10-40% reduced recall accuracy (Liu et al., 2023). This is not a model bug but a consequence of attention mechanics — the first token (often BOS) acts as an "attention sink" that absorbs disproportionate attention budget, leaving middle tokens under-attended as context grows.
Treat context poisoning as a circuit breaker problem. Once a hallucination, tool error, or incorrect retrieved fact enters context, it compounds through repeated self-reference. A poisoned goals section causes every downstream decision to reinforce incorrect assumptions. Detection requires tracking claim provenance; recovery requires truncating to before the poisoning point or restarting with verified-only context.
Filter aggressively before loading context — even a single irrelevant document measurably degrades performance on relevant tasks. Models cannot "skip" irrelevant context; they must attend to everything provided, creating attention competition between relevant and irrelevant content. Move information that might be needed but is not immediately relevant behind tool calls instead of pre-loading it.
Isolate task contexts to prevent confusion. When context contains multiple task types or switches between objectives, models incorporate constraints from the wrong task, call tools appropriate for a different context, or blend requirements from multiple sources. Explicit task segmentation with separate context windows eliminates cross-contamination.
Resolve context clash through priority rules, not accumulation. When multiple correct-but-contradictory sources appear in context (version conflicts, perspective conflicts, multi-source retrieval), models cannot determine which applies. Mark contradictions explicitly, establish source precedence, and filter outdated versions before they enter context.
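The first concept, U-curve placement, can be sketched as a tiny context assembler. The function and field names here are illustrative assumptions, not part of any SDK:

```python
def assemble_context(critical: list[str], background: list[str]) -> str:
    """Place critical items at the attention-favored edges of the window.

    Half of the critical items lead the context, bulky background sits in
    the under-attended middle, and the rest of the critical items close it.
    """
    head = critical[: (len(critical) + 1) // 2]   # first half at the start
    tail = critical[len(head):]                   # remainder at the end
    return "\n\n".join(head + background + tail)

prompt = assemble_context(
    critical=["GOAL: generate the quarterly report", "KEY FINDING: revenue up 15%"],
    background=["...pages of supporting data..."],
)
```

With two critical items, one anchors the start and one the end; the background never occupies an edge position.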

Detailed Topics


Lost-in-Middle: Detection and Placement Strategy


Place critical information at the beginning and end of context, never in the middle. The U-shaped attention curve means middle-positioned information suffers 10-40% reduced recall accuracy. For contexts over 4K tokens, this effect becomes significant.
Use summary structures that surface key findings at attention-favored positions. Add explicit section headers and structural markers — these help models navigate long contexts by creating attention anchors. When a document must be included in full, prepend a summary of its key points and append the critical conclusions.
Monitor for lost-in-middle symptoms: correct information exists in context but the model ignores it, responses contradict provided data, or the model "forgets" instructions given earlier in a long prompt.
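A minimal sketch of the summary-prepend / conclusions-append pattern for full documents; the section labels are assumptions:

```python
def sandwich(document: str, summary: str, conclusions: str) -> str:
    """Wrap a full document so its key points occupy the attention-favored
    edge positions rather than the under-attended middle."""
    return (
        f"SUMMARY OF KEY POINTS:\n{summary}\n\n"
        f"FULL DOCUMENT:\n{document}\n\n"
        f"CRITICAL CONCLUSIONS (restated):\n{conclusions}"
    )

wrapped = sandwich(
    document="the full fifty-page body",
    summary="revenue up 15%, costs down 8%",
    conclusions="recommend expanding Region A",
)
```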

Context Poisoning: Prevention and Recovery


Validate all external inputs before they enter context. Tool outputs, retrieved documents, and model-generated summaries are the three primary poisoning vectors. Each introduces unverified claims that subsequent reasoning treats as ground truth.
Detect poisoning through these signals: degraded output quality on previously-successful tasks, tool misalignment (wrong tools or parameters), and hallucinations that persist despite explicit correction. When these cluster, suspect poisoning rather than model capability issues.
Recover by removing poisoned content, not by adding corrections on top. Truncate to before the poisoning point, restart with clean context preserving only verified information, or explicitly mark the poisoned section and request re-evaluation from scratch. Layering corrections over poisoned context rarely works — the original errors retain attention weight.
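Recovery by truncation might look like the following over a chat-style message list; the message format and the way the poisoned turn is located are assumptions:

```python
def truncate_before_poisoning(messages: list[dict], poisoned_index: int) -> list[dict]:
    """Remove the poisoned turn and everything after it; layering
    corrections on top of poisoned context rarely works."""
    if not 0 <= poisoned_index < len(messages):
        raise ValueError("poisoned_index outside conversation history")
    return messages[:poisoned_index]

history = [
    {"role": "user", "content": "Summarize the report"},
    {"role": "assistant", "content": "The report says revenue fell 30%"},  # hallucination
    {"role": "user", "content": "Why did revenue fall?"},  # reinforces the error
]
clean = truncate_before_poisoning(history, poisoned_index=1)
```

Everything from the hallucinated turn onward is dropped, including the user turn that built on it.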

Context Distraction: Curation Over Accumulation


Curate what enters context rather than relying on models to ignore irrelevant content. Research shows even a single distractor document triggers measurable performance degradation — the effect follows a step function, not a linear curve. Multiple distractors compound the problem.
Apply relevance filtering before loading retrieved documents. Use namespacing and structural organization to make section boundaries clear. Prefer tool-call-based access over pre-loading: store reference material behind retrieval tools so it enters context only when directly relevant to the current reasoning step.
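A hedged sketch of threshold-based relevance filtering, using naive keyword overlap as a stand-in for a real relevance scorer:

```python
def filter_relevant(docs: list[str], score_fn, threshold: float = 0.5) -> list[str]:
    """Exclude documents below the relevance threshold before they enter
    context; even one distractor measurably degrades performance."""
    return [d for d in docs if score_fn(d) >= threshold]

def overlap_score(query_terms: set):
    """Toy scorer: fraction of query terms present in the document."""
    def score(doc: str) -> float:
        return len(set(doc.lower().split()) & query_terms) / len(query_terms)
    return score

kept = filter_relevant(
    ["Quarterly revenue report for Q3", "Office lunch menu for Friday"],
    overlap_score({"quarterly", "revenue", "report"}),
)
```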

Context Confusion: Task Isolation


Segment different tasks into separate context windows. Context confusion is distinct from distraction — it concerns the model applying wrong-context constraints to the current task, not just attention dilution. Signs include responses addressing the wrong aspect of a query, tool calls appropriate for a different task, and outputs mixing requirements from multiple sources.
Implement clear transitions between task contexts. Use state management that isolates objectives, constraints, and tool definitions per task. When task-switching within a single session is unavoidable, use explicit "context reset" markers that signal which constraints apply to the current segment.
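Per-task isolation with an explicit reset marker could be sketched like this; the marker wording and field names are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class TaskContext:
    """One isolated context: its own objective, constraints, and tools."""
    objective: str
    constraints: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)

def render_segment(ctx: TaskContext) -> str:
    """Emit an explicit reset marker so only this task's constraints
    apply to the current segment."""
    lines = ["=== CONTEXT RESET: previous task constraints no longer apply ==="]
    lines.append(f"OBJECTIVE: {ctx.objective}")
    lines += [f"CONSTRAINT: {c}" for c in ctx.constraints]
    lines += [f"AVAILABLE TOOL: {t}" for t in ctx.tools]
    return "\n".join(lines)

segment = render_segment(TaskContext("write unit tests", ["pytest only"], ["run_tests"]))
```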

Context Clash: Conflict Resolution Protocols


Establish source priority rules before conflicts arise. Context clash differs from poisoning — multiple pieces of information are individually correct but mutually contradictory (version conflicts, perspective differences, multi-source retrieval with divergent facts).
Implement version filtering to exclude outdated information before it enters context. When contradictions are unavoidable, mark them explicitly with structured conflict annotations: state what conflicts, which source each claim comes from, and which source takes precedence. Without explicit priority rules, models resolve contradictions unpredictably.
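A structured conflict annotation might look like this minimal sketch; the annotation format and priority scheme are assumptions:

```python
def annotate_conflict(claims: list[dict]) -> str:
    """Surface a contradiction with explicit source precedence instead of
    letting the model resolve it unpredictably.

    Each claim is {"source", "priority", "text"}; a lower priority number
    means higher precedence.
    """
    ranked = sorted(claims, key=lambda c: c["priority"])
    lines = ["CONFLICT DETECTED between sources:"]
    lines += [f"  [{c['source']}] {c['text']}" for c in ranked]
    lines.append(f"PRECEDENCE: trust [{ranked[0]['source']}]; treat the rest as superseded.")
    return "\n".join(lines)

note = annotate_conflict([
    {"source": "docs-v1", "priority": 2, "text": "timeout defaults to 10s"},
    {"source": "docs-v2", "priority": 1, "text": "timeout defaults to 30s"},
])
```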

Empirical Benchmarks and Thresholds


Use these benchmarks to set design constraints — not as universal truths. The RULER benchmark found only 50% of models claiming 32K+ context maintain satisfactory performance at that length. Near-perfect needle-in-haystack scores do not predict real-world long-context performance.
Model-Specific Degradation Thresholds
Degradation onset varies significantly by model family and task type. As a general rule, expect degradation to begin at 60-70% of the advertised context window for complex retrieval tasks. Key patterns:
  • Models with extended thinking reduce hallucination through step-by-step verification but at higher latency and token cost
  • Models optimized for agents/coding tend to have better attention management for tool-output-heavy contexts
  • Models with very large context windows (1M+) handle more raw context but still follow U-shaped degradation curves — bigger windows do not eliminate the problem, they delay it
Always benchmark degradation thresholds with your specific workload rather than relying on published benchmarks. Model-specific thresholds go stale with each model update (see Gotcha 2).
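The benchmarking advice pairs naturally with a trigger computed from a measured onset. The 70% safety factor mirrors Gotcha 6, and 60_000 below is a placeholder for your own workload measurement, not a recommended value:

```python
def compaction_trigger(measured_onset_tokens: int, safety_factor: float = 0.7) -> int:
    """Place the compaction trigger well before the measured cliff,
    not at the onset itself."""
    return int(measured_onset_tokens * safety_factor)

def should_compact(current_tokens: int, measured_onset_tokens: int) -> bool:
    """Check the active context against the safety-margin trigger."""
    return current_tokens >= compaction_trigger(measured_onset_tokens)
```

Re-derive the onset with each model update so the trigger tracks the real cliff rather than a stale benchmark.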

Counterintuitive Findings


Account for these research-backed surprises when designing context strategies:
Shuffled context can outperform coherent context. Studies found incoherent (shuffled) haystacks produce better retrieval performance than logically ordered ones. Coherent context creates false associations that confuse retrieval; incoherent context forces exact matching. Do not assume that better-organized context always yields better results — test both arrangements.
Single distractors have outsized impact. The performance hit from one irrelevant document is disproportionately large compared to adding more distractors after the first. Treat distractor prevention as binary: either keep context clean or accept significant degradation.
Low needle-question similarity accelerates degradation. Tasks requiring inference across dissimilar content degrade faster with context length than tasks with high surface-level similarity. Design retrieval to maximize semantic overlap between queries and retrieved content.

When Larger Contexts Hurt


Do not assume larger context windows improve performance. Performance remains stable up to a model-specific threshold, then degrades rapidly — the curve is non-linear with a cliff edge, not a gentle slope. For many models, meaningful degradation begins at 8K-16K tokens even when windows support much larger sizes.
Factor in cost: processing a 400K token context costs far more than twice a 200K context in time and compute, because attention cost grows super-linearly with length. For many applications, this makes large-context processing economically impractical.
Recognize the cognitive bottleneck: even with infinite context, asking a single model to maintain quality across dozens of independent tasks creates degradation that more context cannot solve. Split tasks across sub-agents instead of expanding context.

Practical Guidance


The Four-Bucket Mitigation Framework


Apply these four strategies based on which degradation pattern is active:
Write — Save context outside the window using scratchpads, file systems, or external storage. Use when context utilization exceeds 70% of the window. This keeps active context lean while preserving information access through tool calls.
Select — Pull only relevant context into the window through retrieval, filtering, and prioritization. Use when distraction or confusion symptoms appear. Apply relevance scoring before loading; exclude anything below threshold rather than including everything available.
Compress — Reduce tokens while preserving information through summarization, abstraction, and observation masking. Use when context is growing but all content is relevant. Replace verbose tool outputs with compact structured summaries; abstract repeated patterns into single references.
Isolate — Split context across sub-agents or sessions to prevent any single context from growing past its degradation threshold. Use when confusion or clash symptoms appear, or when tasks are independent. This is the most aggressive strategy but often the most effective for complex multi-task systems.
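One way to sketch the bucket choice is as a dispatcher; the symptom labels and the ordering of the checks are illustrative heuristics, not a prescribed algorithm:

```python
def pick_strategy(utilization: float, symptoms: set, all_relevant: bool) -> str:
    """Map observed context state to one of the four mitigation buckets."""
    if {"confusion", "clash"} & symptoms:
        return "isolate"   # split contexts per task or per source
    if "distraction" in symptoms:
        return "select"    # filter before loading
    if utilization > 0.7 and all_relevant:
        return "compress"  # summarize without discarding information
    if utilization > 0.7:
        return "write"     # move state to external storage
    return "none"
```

Real systems often combine strategies; the point is that the choice follows from observable symptoms and utilization, not guesswork.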

Architectural Patterns for Resilience


Implement just-in-time context loading: retrieve information only when the current reasoning step needs it, not preemptively. Use observation masking to replace verbose tool outputs with compact references after processing. Deploy sub-agent architectures where each agent holds only task-relevant context. Trigger compaction before context exceeds the model-specific degradation onset threshold — not after symptoms appear.
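Observation masking from the pattern above might be sketched like this; the reference format and the 200-character budget are assumptions:

```python
def mask_observation(tool_output: str, ref_id: str, max_chars: int = 200) -> str:
    """After processing, keep only a short excerpt of a verbose tool output
    in active context, plus a reference for just-in-time retrieval."""
    if len(tool_output) <= max_chars:
        return tool_output
    return f"{tool_output[:max_chars]}... [masked; retrieve full output with ref {ref_id}]"

masked = mask_observation("x" * 1000, ref_id="obs-17")
```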

Examples


Example 1: Detecting Degradation

```yaml
# Context grows during long conversation
turn_1: 1000 tokens
turn_5: 8000 tokens
turn_10: 25000 tokens
turn_20: 60000 tokens (degradation begins)
turn_30: 90000 tokens (significant degradation)
```

Example 2: Mitigating Lost-in-Middle

```markdown
# Organize context with critical info at edges
[CURRENT TASK]        # At start
- Goal: Generate quarterly report
- Deadline: End of week
[DETAILED CONTEXT]    # Middle (less attention)
- 50 pages of data
- Multiple analysis sections
- Supporting evidence
[KEY FINDINGS]        # At end
- Revenue up 15%
- Costs down 8%
- Growth in Region A
```

Guidelines


  1. Monitor context length and performance correlation during development
  2. Place critical information at beginning or end of context
  3. Implement compaction triggers before degradation becomes severe
  4. Validate retrieved documents for accuracy before adding to context
  5. Use versioning to prevent outdated information from causing clash
  6. Segment tasks to prevent context confusion across different objectives
  7. Design for graceful degradation rather than assuming perfect conditions
  8. Test with progressively larger contexts to find degradation thresholds

Gotchas


  1. Normal variance looks like degradation: Model output quality fluctuates naturally across runs. Do not diagnose degradation from a single drop in quality — establish a baseline over multiple runs and look for sustained, correlated decline tied to context growth. A 5-10% quality dip on one run is noise; the same dip consistently appearing after 40K tokens is signal.
  2. Model-specific thresholds go stale: The degradation onset values in benchmark tables reflect specific model versions. Provider updates, fine-tuning changes, and infrastructure shifts can move thresholds by 20-50% in either direction. Re-benchmark quarterly and after any major model update rather than treating published thresholds as permanent.
  3. Needle-in-haystack scores create false confidence: A model scoring 99% on needle-in-haystack does not mean it handles 128K tokens well in production. Needle tests measure single-fact retrieval from passive context — real workloads require multi-fact reasoning, instruction following, and synthesis across the full window. Use task-specific benchmarks that mirror actual workload patterns.
  4. Contradictory retrieved documents poison silently: When a RAG pipeline retrieves two documents that disagree on a fact, the model may silently pick one without signaling the conflict. This looks like a correct response but is effectively random. Implement contradiction detection in the retrieval layer before documents enter context.
  5. Prompt quality problems masquerade as degradation: Poor prompt structure (ambiguous instructions, missing constraints, unclear task framing) produces symptoms identical to context degradation — inconsistent outputs, ignored instructions, wrong tool usage. Before diagnosing degradation, verify the same prompt works correctly at low context lengths. If it fails at 2K tokens, the problem is the prompt, not the context.
  6. Degradation is non-linear with a cliff edge: Performance does not degrade gradually — it holds steady until a model-specific threshold, then drops sharply. Systems designed for "graceful degradation" often miss this pattern because monitoring checks assume linear decline. Set compaction triggers well before the cliff (at 70% of known onset), not at the onset itself.
  7. Over-organizing context can backfire: Intuitively, well-structured and coherent context should outperform disorganized content. Research shows shuffled haystacks sometimes outperform coherent ones for retrieval tasks because coherent context creates false associations. Test whether heavy structural formatting actually helps for the specific task — do not assume it does.

Integration


This skill builds on context-fundamentals and should be studied after understanding basic context concepts. It connects to:
  • context-optimization - Techniques for mitigating degradation
  • multi-agent-patterns - Using isolation to prevent degradation
  • evaluation - Measuring and detecting degradation in production

References


Internal reference:
  • Degradation Patterns Reference - Read when: debugging a specific degradation pattern and needing implementation-level detection code (attention analysis, poisoning tracking, relevance scoring, recovery procedures)
Related skills in this collection:
  • context-fundamentals - Read when: lacking foundational understanding of context windows, token budgets, or placement mechanics
  • context-optimization - Read when: degradation is diagnosed and specific mitigation techniques (compaction, compression, masking) are needed
  • evaluation - Read when: setting up production monitoring to detect degradation before it impacts users
External resources:
  • Liu et al., 2023 "Lost in the Middle" - Read when: needing primary research backing for U-shaped attention claims or designing position-aware context layouts
  • RULER benchmark documentation - Read when: evaluating model claims about long-context support or comparing models for context-heavy workloads
  • Production engineering guides from AI labs - Read when: implementing context management in production infrastructure


Skill Metadata


Created: 2025-12-20
Last Updated: 2026-03-17
Author: Agent Skills for Context Engineering Contributors
Version: 2.0.0