context-compression


Context Compression Strategies


When agent sessions generate millions of tokens of conversation history, compression becomes mandatory. The naive approach is aggressive compression to minimize tokens per request. The correct optimization target is tokens per task: total tokens consumed to complete a task, including re-fetching costs when compression loses critical information.

When to Activate


Activate this skill when:
  • Agent sessions exceed context window limits
  • Codebases exceed context windows (5M+ token systems)
  • Designing conversation summarization strategies
  • Debugging cases where agents "forget" what files they modified
  • Building evaluation frameworks for compression quality

Core Concepts


Context compression trades token savings against information loss. Select from three production-ready approaches based on session characteristics:
  1. Anchored Iterative Summarization: Implement this for long-running sessions where file tracking matters. Maintain structured, persistent summaries with explicit sections for session intent, file modifications, decisions, and next steps. When compression triggers, summarize only the newly-truncated span and merge with the existing summary rather than regenerating from scratch. This prevents drift that accumulates when summaries are regenerated wholesale — each regeneration risks losing details the model considers low-priority but the task requires. Structure forces preservation because dedicated sections act as checklists the summarizer must populate, catching silent information loss.
  2. Opaque Compression: Reserve this for short sessions where re-fetching costs are low and maximum token savings are required. It produces compressed representations optimized for reconstruction fidelity, achieving 99%+ compression ratios but sacrificing interpretability entirely. The tradeoff matters: there is no way to verify what was preserved without running probe-based evaluation, so never use this when debugging or artifact tracking is critical.
  3. Regenerative Full Summary: Use this when summary readability is critical and sessions have clear phase boundaries. It generates detailed structured summaries on each compression trigger. The weakness is cumulative detail loss across repeated cycles — each full regeneration is a fresh pass that may deprioritize details preserved in earlier summaries.

Detailed Topics


Optimize for Tokens-Per-Task, Not Tokens-Per-Request


Measure total tokens consumed from task start to completion, not tokens per individual request. When compression drops file paths, error messages, or decision rationale, the agent must re-explore, re-read files, and re-derive conclusions — wasting far more tokens than the compression saved. A strategy saving 0.5% more tokens per request but causing 20% more re-fetching costs more overall. Track re-fetching frequency as the primary quality signal: if the agent repeatedly asks to re-read files it already processed, compression is too aggressive.
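The tokens-per-task accounting described above can be sketched as a small ledger kept in agent scaffolding. This is a minimal illustration; the class and method names are assumptions, not any real framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class TaskTokenLedger:
    # Illustrative sketch: measure total tokens per task and flag re-fetching,
    # the primary signal that compression dropped something the task needed.
    total_tokens: int = 0
    refetch_tokens: int = 0
    seen_files: set = field(default_factory=set)

    def record_read(self, path: str, tokens: int) -> None:
        if path in self.seen_files:
            # Re-reading an already-processed file: compression likely lost it
            self.refetch_tokens += tokens
        self.seen_files.add(path)
        self.total_tokens += tokens

    def refetch_ratio(self) -> float:
        return self.refetch_tokens / self.total_tokens if self.total_tokens else 0.0

ledger = TaskTokenLedger()
ledger.record_read("auth.controller.ts", 900)
ledger.record_read("config/redis.ts", 1200)
ledger.record_read("config/redis.ts", 1200)  # forced re-fetch after compression
print(ledger.total_tokens)                   # 3300, the number that matters
print(round(ledger.refetch_ratio(), 3))      # 0.364
```

A rising `refetch_ratio` across sessions is the signal to loosen the compression ratio.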

Solve the Artifact Trail Problem First


Artifact trail integrity is the weakest dimension across all compression methods, scoring only 2.2-2.5 out of 5.0 in evaluations. Address this proactively because general summarization cannot reliably maintain it.
Preserve these categories explicitly in every compression cycle:
  • Which files were created (full paths)
  • Which files were modified and what changed (include function names, not just file names)
  • Which files were read but not changed
  • Specific identifiers: function names, variable names, error messages, error codes
Implement a separate artifact index or explicit file-state tracking in agent scaffolding rather than relying on the summarizer to capture these details. Even structured summarization with dedicated file sections struggles with completeness over long sessions.
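A minimal sketch of such an artifact index, maintained in scaffolding outside the summarizer; all names here are hypothetical:

```python
class ArtifactIndex:
    # Hypothetical scaffolding-side index: identifiers survive every
    # compression cycle verbatim instead of passing through a summarizer.
    _RANK = {"read": 0, "modified": 1, "created": 2}

    def __init__(self):
        self.files = {}  # path -> {"status": str, "notes": [str]}

    def record(self, path, status, note=None):
        entry = self.files.setdefault(path, {"status": "read", "notes": []})
        if self._RANK[status] >= self._RANK[entry["status"]]:
            entry["status"] = status  # never downgrade modified/created to read
        if note:
            entry["notes"].append(note)  # function names, not just file names

    def render(self):
        # Emitted into every compressed context, bypassing summarization
        return "\n".join(
            f"- {path} [{e['status']}] {'; '.join(e['notes'])}".rstrip()
            for path, e in sorted(self.files.items())
        )

index = ArtifactIndex()
index.record("middleware/cors.ts", "read")
index.record("config/redis.ts", "modified", "fixed connection pooling")
index.record("config/redis.ts", "read")  # later read must not erase "modified"
print(index.render())
```

The rendered index is prepended to the summary on each compression cycle, so paths never depend on the summarizer's recall.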

Structure Summaries with Mandatory Sections


Build structured summaries with explicit sections that prevent silent information loss. Each section acts as a checklist the summarizer must populate, making omissions visible rather than silent.
```markdown
## Session Intent
[What the user is trying to accomplish]

## Files Modified
- auth.controller.ts: Fixed JWT token generation
- config/redis.ts: Updated connection pooling
- tests/auth.test.ts: Added mock setup for new config

## Decisions Made
- Using Redis connection pool instead of per-request connections
- Retry logic with exponential backoff for transient failures

## Current State
- 14 tests passing, 2 failing
- Remaining: mock setup for session service tests

## Next Steps
1. Fix remaining test failures
2. Run full test suite
3. Update documentation
```

Adapt sections to the agent's domain. A debugging agent needs "Root Cause" and "Error Messages"; a migration agent needs "Source Schema" and "Target Schema." The structure matters more than the specific sections — any explicit schema outperforms freeform summarization.
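The checklist enforcement can be made mechanical with a completeness check run after each compression. This is a sketch; the `## Heading` convention and the specific section names are assumptions:

```python
import re

REQUIRED_SECTIONS = ["Session Intent", "Files Modified",
                     "Decisions Made", "Current State", "Next Steps"]

def missing_sections(summary_md: str, required=REQUIRED_SECTIONS):
    # A section counts as populated only if its "## Title" heading exists
    # AND has non-empty body text, so omissions are visible, not silent.
    missing = []
    for title in required:
        match = re.search(
            rf"^## {re.escape(title)}\s*\n(.*?)(?=^## |\Z)",
            summary_md, re.MULTILINE | re.DOTALL,
        )
        if match is None or not match.group(1).strip():
            missing.append(title)
    return missing

draft = """## Session Intent
Debug 401 on /api/auth/login

## Files Modified

## Next Steps
1. Fix failing tests
"""
print(missing_sections(draft))  # ['Files Modified', 'Decisions Made', 'Current State']
```

Rejecting a summary with missing sections forces the summarizer to populate every checklist entry before the raw history is discarded.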

Choose Compression Triggers Strategically


When to trigger compression matters as much as how to compress. Select a trigger strategy based on session predictability:
| Strategy | Trigger Point | Trade-off |
| --- | --- | --- |
| Fixed threshold | 70-80% context utilization | Simple but may compress too early |
| Sliding window | Keep last N turns + summary | Predictable context size |
| Importance-based | Compress low-relevance sections first | Complex but preserves signal |
| Task-boundary | Compress at logical task completions | Clean summaries but unpredictable timing |
Default to sliding window with structured summaries for coding agents — it provides the best balance of predictability and quality. Use task-boundary triggers when sessions have clear phase transitions (e.g., research then implementation then testing).
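A sliding-window trigger with a running summary might look like the following sketch; the crude token counter and the `summarize` callable stand in for a real tokenizer and an LLM call:

```python
def maybe_compress(messages, summary, summarize, *,
                   keep_last=20, budget=100_000,
                   count=lambda text: len(text) // 4):  # crude token estimate
    # Sliding window: the last N turns stay verbatim; older turns are folded
    # into the running summary only when the budget is exceeded.
    used = count(summary) + sum(count(m) for m in messages)
    if used <= budget or len(messages) <= keep_last:
        return messages, summary  # no trigger
    truncated, recent = messages[:-keep_last], messages[-keep_last:]
    # Anchored merge: summarize only the newly truncated span
    return recent, summarize(summary, truncated)

# Toy summarizer standing in for an LLM call (assumption)
fold = lambda summary, old: f"{summary} [+{len(old)} turns folded]".strip()

msgs = [f"turn {i}: " + "x" * 392 for i in range(50)]  # ~100 tokens each
msgs, summary = maybe_compress(msgs, "", fold, budget=2_000)
print(len(msgs), summary)  # 20 [+30 turns folded]
```

Because the trigger only fires over budget, context size stays predictable while recent turns remain verbatim.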

Evaluate Compression with Probes, Not Metrics


Traditional metrics like ROUGE or embedding similarity fail to capture functional compression quality. A summary can score high on lexical overlap while missing the one file path the agent needs to continue.
Use probe-based evaluation: after compression, pose questions that test whether critical information survived. If the agent answers correctly, compression preserved the right information. If not, it guesses or hallucinates.
| Probe Type | What It Tests | Example Question |
| --- | --- | --- |
| Recall | Factual retention | "What was the original error message?" |
| Artifact | File tracking | "Which files have we modified?" |
| Continuation | Task planning | "What should we do next?" |
| Decision | Reasoning chain | "What did we decide about the Redis issue?" |
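Probe checks like those in the table can be automated; in this sketch `ask_agent` is a stand-in for querying the agent over its compressed context, and the probe definitions are illustrative:

```python
PROBES = [
    ("recall",       "What was the original error message?", ["401"]),
    ("artifact",     "Which files have we modified?",        ["config/redis.ts"]),
    ("continuation", "What should we do next?",              ["test"]),
]

def run_probes(ask_agent, probes=PROBES):
    # A probe passes only if every required token survives verbatim in the
    # answer; lexical-overlap metrics would not catch a missing file path.
    return {
        kind: all(token in ask_agent(question) for token in required)
        for kind, question, required in probes
    }

# Stand-in agent whose compressed context lost the file paths (assumption)
def ask_agent(question):
    answers = {
        "What was the original error message?": "A 401 on /api/auth/login",
        "Which files have we modified?": "We updated the config file",
        "What should we do next?": "Fix the remaining test failures",
    }
    return answers[question]

print(run_probes(ask_agent))  # {'recall': True, 'artifact': False, 'continuation': True}
```

The failed artifact probe here is exactly the weakness the evaluations flag: the summary kept the story but dropped the path.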

Score Compression Across Six Dimensions


Evaluate compression quality for coding agents across these dimensions. Accuracy shows the largest variation between methods (0.6 point gap), making it the strongest discriminator. Artifact trail is universally weak (2.2-2.5), confirming it needs specialized handling beyond general summarization.
  1. Accuracy: Are technical details correct — file paths, function names, error codes?
  2. Context Awareness: Does the response reflect current conversation state?
  3. Artifact Trail: Does the agent know which files were read or modified?
  4. Completeness: Does the response address all parts of the question?
  5. Continuity: Can work continue without re-fetching information?
  6. Instruction Following: Does the response respect stated constraints?
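A scoring harness over these dimensions can refuse partial scorecards; the 1-5 scale matches the evaluations cited above, while the weakness threshold of 3.0 is an assumption:

```python
DIMENSIONS = ("accuracy", "context_awareness", "artifact_trail",
              "completeness", "continuity", "instruction_following")

def summarize_scores(scores, weak_below=3.0):
    # Every dimension must be rated (1.0-5.0) before a report is produced.
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    mean = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    weak = [d for d in DIMENSIONS if scores[d] < weak_below]
    return {"mean": round(mean, 2), "weak": weak}

report = summarize_scores({
    "accuracy": 4.1, "context_awareness": 3.8, "artifact_trail": 2.4,
    "completeness": 3.6, "continuity": 3.5, "instruction_following": 4.0,
})
print(report)  # artifact_trail is flagged, matching its 2.2-2.5 range
```

Tracking the weak list per compression method makes the artifact-trail gap visible instead of averaged away.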

Practical Guidance


Apply the Three-Phase Compression Workflow for Large Codebases


For codebases or agent systems exceeding context windows, compress through three sequential phases. Each phase narrows context so the next phase operates within budget.
  1. Research Phase: Explore architecture diagrams, documentation, and key interfaces. Compress exploration into a structured analysis of components, dependencies, and boundaries. Output: a single research document that replaces raw exploration.
  2. Planning Phase: Convert the research document into an implementation specification with function signatures, type definitions, and data flow. A 5M-token codebase compresses to approximately 2,000 words of specification at this stage.
  3. Implementation Phase: Execute against the specification. Context stays focused on the spec plus active working files, not raw codebase exploration. This phase rarely needs further compression because the spec is already compact.

Use Example Artifacts as Compression Seeds


When provided with a manual migration example or reference PR, use it as a template to understand the target pattern rather than exploring the codebase from scratch. The example reveals constraints static analysis cannot surface: which invariants must hold, which services break on changes, and what a clean implementation looks like.
This matters most when the agent cannot distinguish essential complexity (business requirements) from accidental complexity (legacy workarounds). The example artifact encodes that distinction implicitly, saving tokens that would otherwise go to trial-and-error exploration.

Implement Anchored Iterative Summarization Step by Step


  1. Define explicit summary sections matching the agent's domain (debugging, migration, feature development)
  2. On first compression trigger, summarize the truncated history into those sections
  3. On subsequent compressions, summarize only newly truncated content — do not re-summarize the existing summary
  4. Merge new information into existing sections rather than regenerating them, deduplicating by file path and decision identity
  5. Tag which information came from which compression cycle — this enables debugging when summaries drift
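Steps 3-5 might be sketched as follows; the section shapes and the `(item, cycle)` provenance tuples are illustrative, not a prescribed format:

```python
def merge_increment(existing, new_span, cycle):
    # Merge a summary of only the newly truncated span into the anchored
    # summary: append novel items, dedupe by identity, tag the source cycle.
    merged = {section: list(items) for section, items in existing.items()}
    for section, items in new_span.items():
        bucket = merged.setdefault(section, [])
        seen = {identity for identity, _ in bucket}
        for identity in items:
            if identity not in seen:       # dedupe by file path / decision text
                bucket.append((identity, cycle))  # provenance for drift debugging
    return merged

summary = {"files_modified": [("config/redis.ts", 1)],
           "decisions": [("use Redis connection pool", 1)]}
summary = merge_increment(summary, {
    "files_modified": ["config/redis.ts", "services/session.service.ts"],
    "next_steps": ["fix failing mocks"],
}, cycle=2)
print(summary["files_modified"])
# [('config/redis.ts', 1), ('services/session.service.ts', 2)]
```

Because the cycle-1 entries are never regenerated, earlier details cannot be deprioritized away by a later summarization pass.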

Select the Right Approach for the Session Profile


Use anchored iterative summarization when:
  • Sessions are long-running (100+ messages)
  • File tracking matters (coding, debugging)
  • Verification of preserved information is needed
Use opaque compression when:
  • Maximum token savings are required
  • Sessions are relatively short
  • Re-fetching costs are low (e.g., no file system access needed)
Use regenerative summaries when:
  • Summary interpretability is critical for human review
  • Sessions have clear phase boundaries
  • Full context review is acceptable on each compression trigger
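The three profiles can be collapsed into a decision helper; the thresholds and labels below are illustrative, not prescriptive:

```python
def choose_compression(n_messages, file_tracking, refetch_cost, human_review):
    # Decision order mirrors the profiles above; anchored iterative is the
    # safe default whenever the session's needs are ambiguous.
    if n_messages >= 100 or file_tracking:
        return "anchored-iterative"
    if human_review:
        return "regenerative"
    if refetch_cost == "low":
        return "opaque"  # short session, cheap to re-fetch what is lost
    return "anchored-iterative"

print(choose_compression(180, True, "high", False))  # anchored-iterative
print(choose_compression(30, False, "low", False))   # opaque
print(choose_compression(40, False, "high", True))   # regenerative
```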

Calibrate Compression Ratios by Method


| Method | Compression Ratio | Quality Score | Trade-off |
| --- | --- | --- | --- |
| Anchored Iterative | 98.6% | 3.70 | Best quality, slightly less compression |
| Regenerative | 98.7% | 3.44 | Good quality, moderate compression |
| Opaque | 99.3% | 3.35 | Best compression, quality loss |
The 0.7% additional tokens retained by structured summarization buys 0.35 quality points — a significant gap when compounded over multiple compression cycles. For any task where re-fetching costs exist, this tradeoff favors structured approaches.
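The compounding effect of repeated cycles is easy to quantify: at a removal ratio r, the surviving fraction after n cycles is (1 - r)^n. A small sketch:

```python
def retained_fraction(ratio, cycles):
    # ratio is the fraction removed per cycle; 0.95 means 5% survives each pass
    return (1 - ratio) ** cycles

print(f"{retained_fraction(0.95, 3):.4%}")   # 0.0125% after three 95% cycles
print(f"{retained_fraction(0.986, 3):.6%}")  # anchored iterative, three cycles
print(f"{retained_fraction(0.993, 3):.6%}")  # opaque, three cycles
```

The 0.7-point gap in per-cycle ratio roughly doubles the surviving fraction per cycle, so the retained-token advantage of the structured methods grows with every trigger.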

Examples


Example 1: Debugging Session Compression
Original context (89,000 tokens, 178 messages):
  • 401 error on /api/auth/login endpoint
  • Traced through auth controller, middleware, session store
  • Found stale Redis connection
  • Fixed connection pooling, added retry logic
  • 14 tests passing, 2 failing
Structured summary after compression:
```markdown
## Session Intent
Debug 401 Unauthorized error on /api/auth/login despite valid credentials.

## Root Cause
Stale Redis connection in session store. JWT generated correctly but session could not be persisted.

## Files Modified
- auth.controller.ts: No changes (read only)
- middleware/cors.ts: No changes (examined)
- config/redis.ts: Fixed connection pooling configuration
- services/session.service.ts: Added retry logic for transient failures
- tests/auth.test.ts: Updated mock setup

## Test Status
14 passing, 2 failing (mock setup issues)

## Next Steps
1. Fix remaining test failures (mock session service)
2. Run full test suite
3. Deploy to staging
```

**Example 2: Probe Response Quality**

After compression, asking "What was the original error?":

Good response (structured summarization):
> "The original error was a 401 Unauthorized response from the /api/auth/login endpoint. Users received this error with valid credentials. Root cause was stale Redis connection in session store."

Poor response (aggressive compression):
> "We were debugging an authentication issue. The login was failing. We fixed some configuration problems."

The structured response preserves endpoint, error code, and root cause. The aggressive response loses all technical detail.

Guidelines


  1. Optimize for tokens-per-task, not tokens-per-request
  2. Use structured summaries with explicit sections for file tracking
  3. Trigger compression at 70-80% context utilization
  4. Implement incremental merging rather than full regeneration
  5. Test compression quality with probe-based evaluation
  6. Track artifact trail separately if file tracking is critical
  7. Accept slightly lower compression ratios for better quality retention
  8. Monitor re-fetching frequency as a compression quality signal

Gotchas


  1. Never compress tool definitions or schemas: Compressing function call schemas, API specs, or tool definitions destroys agent functionality entirely. The agent cannot invoke tools whose parameter names or types have been summarized away. Treat tool definitions as immutable anchors that bypass compression.
  2. Compressed summaries hallucinate facts: When an LLM summarizes conversation history, it may introduce plausible-sounding details that never appeared in the original. Always validate compressed output against source material before discarding originals — especially for file paths, error codes, and numeric values that the summarizer may "round" or fabricate.
  3. Compression breaks artifact references: File paths, commit SHAs, variable names, and code snippets get paraphrased or dropped during compression. A summary saying "updated the config file" when the agent needs config/redis.ts causes re-exploration. Preserve identifiers verbatim in dedicated sections rather than embedding them in prose.
  4. Early turns contain irreplaceable constraints: The first few turns of a session often contain task setup, user constraints, and architectural decisions that cannot be re-derived. Protect early turns from compression or extract their constraints into a persistent preamble that survives all compression cycles.
  5. Aggressive ratios compound across cycles: A 95% compression ratio seems safe once, but applying it repeatedly compounds losses. After three cycles at 95%, only 0.0125% of original tokens remain. Calibrate ratios assuming multiple compression cycles, not a single pass.
  6. Code and prose need different compression: Prose compresses well because natural language is redundant. Code does not — removing a single token from a function signature or import path can make it useless. Apply domain-specific compression strategies: summarize prose sections aggressively while preserving code blocks and structured data verbatim.
  7. Probe-based evaluation gives false confidence: Probes can pass despite critical information being lost, because the probes test only what they ask about. A probe set that checks file names but not function signatures will miss signature loss. Design probes to cover all six evaluation dimensions, and rotate probe sets across evaluation runs to avoid blind spots.
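The code-versus-prose split in gotcha 6 can be enforced mechanically by splitting on code fences before summarizing; the regex and the stand-in summarizer below are assumptions:

```python
import re

def compress_mixed(text, summarize_prose):
    # Split on fenced code blocks; prose spans go through the (lossy)
    # summarizer, code spans are preserved token-for-token.
    parts = re.split(r"(```.*?```)", text, flags=re.DOTALL)
    out = []
    for part in parts:
        if part.startswith("```"):
            out.append(part)                   # code: verbatim, never summarized
        elif part.strip():
            out.append(summarize_prose(part))  # prose: redundancy tolerates loss
    return "\n".join(out)

doc = (
    "We traced the bug through the session store and decided to pool "
    "connections instead of opening one per request.\n"
    "```ts\nimport { createPool } from 'redis';\n```\n"
    "Tests mostly pass now."
)
shrunk = compress_mixed(doc, lambda prose: prose.strip()[:40] + "…")
print(shrunk)
```

The same split protects tool definitions and schemas (gotcha 1) if they are stored as fenced blocks.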

Integration


This skill connects to several others in the collection:
  • context-degradation - Compression is a mitigation strategy for degradation
  • context-optimization - Compression is one optimization technique among many
  • evaluation - Probe-based evaluation applies to compression testing
  • memory-systems - Compression relates to scratchpad and summary memory patterns

References


Internal reference:
  • Evaluation Framework Reference - Read when: building or calibrating a probe-based evaluation pipeline, or when needing scoring rubrics and LLM judge configuration for compression quality assessment
Related skills in this collection:
  • context-degradation - Read when: diagnosing why agent performance drops over long sessions, before applying compression as a mitigation
  • context-optimization - Read when: compression alone is insufficient and broader optimization strategies (pruning, caching, routing) are needed
  • evaluation - Read when: designing evaluation frameworks beyond compression-specific probes, including general LLM-as-judge methodology
External resources:
  • Factory Research: Evaluating Context Compression for AI Agents (December 2025) - Read when: needing benchmark data on compression method comparisons or the 36,000-message evaluation dataset
  • Research on LLM-as-judge evaluation methodology (Zheng et al., 2023) - Read when: implementing or validating LLM judge scoring to understand bias patterns and calibration
  • Netflix Engineering: "The Infinite Software Crisis" - Three-phase workflow and context compression at scale (AI Summit 2025) - Read when: implementing the three-phase compression workflow for large codebases or understanding production-scale context management


Skill Metadata

Created: 2025-12-22 Last Updated: 2026-03-17 Author: Agent Skills for Context Engineering Contributors Version: 1.2.0