latent-briefing


Latent Briefing and KV Cache Memory Sharing


Hierarchical multi-agent systems often pay for the same context twice. The orchestrator accumulates a long reasoning trajectory, but each worker usually receives only a narrow text handoff such as a subtask prompt plus raw document slices. Passing the full trajectory fixes coverage but drives token cost up on every worker call. Summarization introduces latency and information loss. Retrieval helps with document access but does not preserve the orchestrator's evolving reasoning state.
Latent Briefing addresses this by sharing memory at the representation level rather than the text level. The core idea is to compact the orchestrator trajectory in the worker model's KV cache, keeping positions that are most relevant to the current worker task. The method builds on Attention Matching (AM) KV cache compaction and adapts it for inference-time multi-agent handoff with task-guided queries, a shared token mask across heads, and robust thresholding.

When to Activate


Activate this skill when:
  • Designing orchestrator-worker or supervisor-specialist systems where workers need access to prior orchestrator state without replaying the full trajectory as text
  • Evaluating alternatives to LLM summarization or RAG for cross-agent state transfer
  • Implementing or studying KV cache compaction as a first-class inference primitive, not only prefix caching of identical prompts
  • Debugging token explosion in recursive, hierarchical, or tool-heavy agent graphs
  • Interpreting benchmarks that report worker-token savings, total-token savings, compaction overhead, and accuracy together

Core Concepts


The token explosion pattern. In recursive or REPL-style systems, the orchestrator repeatedly calls a worker to inspect evidence, verify hypotheses, or answer subquestions. The orchestrator's trajectory grows with partial conclusions, dead ends, tool output, and prior worker responses. If that trajectory is passed in full on every worker call, cost compounds quickly.
Representation-level sharing. Instead of summarizing the trajectory into natural language, the system operates on the worker model's KV cache. It retains the positions that the worker would attend to for the current task and drops the rest. This is more specific than ordinary prefix caching: prefix caching reuses identical prefixes, while Latent Briefing also performs task-conditioned selective retention inside the reused trajectory.
Attention Matching as the compaction engine. AM seeks a smaller cache whose attention outputs approximate the full cache. Latent Briefing adapts AM for multi-agent inference by changing the scoring signal and batching strategy:
  1. Use task-guided query vectors derived from the current worker prompt.
  2. Aggregate scores into a shared global mask instead of per-head independent subsets.
  3. Use a robust threshold such as `median + tau * MAD` rather than a fixed top-k per head.
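The robust threshold in step 3 can be sketched in a few lines, assuming per-position attention scores have already been computed; the `mad_threshold_mask` helper and its defaults are illustrative, not the published implementation:

```python
from statistics import median

def mad_threshold_mask(scores, tau=3.0):
    """Keep positions whose score exceeds median + tau * MAD.

    Unlike a fixed top-k per head, the cutoff adapts to the score
    distribution: a trajectory with many high-value positions keeps
    more of them, a noisy one keeps fewer.
    """
    med = median(scores)
    # MAD: median absolute deviation from the median, robust to outliers.
    mad = median(abs(s - med) for s in scores)
    cutoff = med + tau * mad
    return [s > cutoff for s in scores]
```

With a heavy-tailed score list, most positions fall below the cutoff and only clear outliers survive; raising `tau` drops more positions.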
Reference result shape. The public write-up reports substantial worker-token reduction, material total-token savings, and low-single-digit-second compaction overhead on long-document QA workloads. Treat these numbers as workload-specific evidence, not a general guarantee.

Detailed Topics


Why Text-Only Mitigations Fall Short


| Approach | Primary weakness |
| --- | --- |
| LLM summarization | High latency, lossy abstraction, and no guarantee the summary preserves what the next subtask needs |
| Retrieval / RAG | Depends on chunking and embeddings; can miss cross-chunk or cross-step dependencies |
| Pass full trajectory | Cost scales with every worker call and irrelevant context can degrade worker quality |
Latent Briefing is useful when the bottleneck is not document retrieval itself, but how to transfer orchestrator state into a worker efficiently and precisely.

Recursive Orchestrator-Worker Shape


Frameworks such as Recursive Language Models treat long context as an environment and recurse over it: an orchestrator decomposes work and delegates to workers. Latent Briefing fits the gap where the orchestrator has already built task-specific state that should inform the worker, but re-serializing that state as text is too expensive or noisy.
In the ideal setup, the worker maintains a persistent KV state for the orchestrator trajectory. New trajectory tokens extend that state, then compaction runs just before generation for the current subtask.
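The extend-then-compact call pattern can be sketched with a toy stand-in for the worker's persistent KV state; a real runtime holds per-layer key/value tensors, and `score_fn` here stands in for the task-guided attention scoring pass (all names are illustrative):

```python
from statistics import median


class WorkerState:
    """Toy stand-in for a worker's persistent KV state over the
    orchestrator trajectory. One token per cached position is enough
    to show the call pattern."""

    def __init__(self):
        self.positions = []

    def extend(self, new_tokens):
        self.positions.extend(new_tokens)

    def compact(self, keep_indices):
        self.positions = [self.positions[i] for i in keep_indices]


def worker_call(state, new_trajectory_tokens, score_fn, tau=1.0):
    """Extend the persistent state with new trajectory tokens, then
    compact just before generation for the current subtask."""
    state.extend(new_trajectory_tokens)
    scores = score_fn(state.positions)
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    keep = [i for i, s in enumerate(scores) if s > med + tau * mad]
    state.compact(keep)
    return state.positions  # what generation would condition on
```

The key design point is that `state` persists across worker calls, so each call pays only for the new trajectory tokens plus one compaction pass, not for re-prefilling the full history.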

Three Inference-Time Modifications


  1. Task-guided query vectors. Use queries from the current worker task prompt, not generic samples from the context. Forward-pass the trajectory plus current task through the worker model, then score trajectory positions by how strongly the task attends to them.
  2. Shared token selection. Aggregate scores across layers and heads into one per-position score. One shared mask enables batched operations and avoids hundreds of incompatible per-head solves.
  3. MAD thresholding. Keep positions above a robust outlier threshold such as `median + tau * MAD`. Higher `tau` is more aggressive. Optimal settings depend on task regime, trajectory quality, and document length.
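The shared-mask aggregation in step 2 can be sketched as follows, assuming attention scores have already been extracted per (layer, head) pair; mean pooling and the `tau` default are illustrative choices:

```python
from statistics import median

def shared_token_mask(per_head_scores, tau=2.0):
    """Aggregate per-(layer, head) position scores into one global mask.

    per_head_scores: one score list per (layer, head) pair, each of
    length n_positions, measuring how strongly the task-guided queries
    attend to each trajectory position.
    """
    n_positions = len(per_head_scores[0])
    # Mean over all layers and heads gives one score per position,
    # so a single keep/drop decision applies cache-wide.
    pooled = [
        sum(head[p] for head in per_head_scores) / len(per_head_scores)
        for p in range(n_positions)
    ]
    # Robust threshold over the pooled scores (step 3).
    med = median(pooled)
    mad = median(abs(s - med) for s in pooled)
    cutoff = med + tau * mad
    keep = [p for p, s in enumerate(pooled) if s > cutoff]
    return keep, pooled
```

Because every layer and head shares the same kept positions, the compacted cache stays rectangular and batched attention kernels apply unchanged, which is the practical payoff of one global mask over hundreds of per-head subsets.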

Infrastructure Preconditions


Latent Briefing is only practical when the system controls the worker inference runtime closely enough to inspect or transform KV state. It is a poor default for API-only stacks where internal KV tensors are inaccessible. It also assumes the orchestrator trajectory can be represented in the worker's model space. If orchestrator and worker differ materially in tokenizer, architecture, or attention layout, direct representation sharing may not be viable.

Decision Framework


Choose the mechanism that matches the bottleneck:
| Need | Prefer | Why |
| --- | --- | --- |
| Stable repeated prefix with minimal logic changes | Prefix caching | Cheapest optimization; no information loss |
| Human-readable and auditable cross-step state | Structured notes or summarization | Easy to inspect and store |
| Sparse lookup across a large external corpus | Retrieval / RAG | Finds documents efficiently |
| Worker needs task-specific slices of orchestrator state and runtime access exists | Latent Briefing | Transfers relevant latent state without replaying all text |
Latent Briefing is not a universal replacement for summarization or retrieval. It is a specialized optimization for systems that already run a controllable orchestrator-worker stack.

Threshold Regimes


Reported long-document QA results suggest:
  • Longer documents: lighter compaction can preserve broader evidence coverage while still saving tokens.
  • Harder questions: more aggressive compaction can help when the orchestrator trajectory contains speculative or low-value branches.
  • Shorter, easier contexts: moderate compaction may remove redundancy without dropping needed evidence.
These are tuning hypotheses, not portable laws. Re-measure on the target workload.
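These regimes can be explored with a simple sweep on validation scores; retention is non-increasing in `tau`, so each setting can be paired with validation accuracy and the lightest compaction that preserves quality chosen (a sketch, not the published tuning procedure):

```python
from statistics import median

def retention_rate(scores, tau):
    """Fraction of positions kept at a given tau under median + tau * MAD."""
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    return sum(s > med + tau * mad for s in scores) / len(scores)

def sweep(scores, taus):
    """Map each candidate tau to its retention rate; in practice each
    entry would also be paired with task accuracy on the same split."""
    return {tau: retention_rate(scores, tau) for tau in taus}
```

A sweep over validation trajectories makes the accuracy cliff visible before it happens in production: retention falls smoothly as `tau` grows, but accuracy tends to drop sharply once needed evidence starts being cut.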

Practical Guidance


  • Define the shared memory boundary first. Decide exactly what enters the trajectory cache: prior worker replies, tool output, chain-of-thought, or only selected artifacts. Compaction quality depends on what is allowed into the cache in the first place.
  • Tune on validation data, not anecdotes. Track task accuracy, worker tokens, total tokens, retention rate, and compaction overhead together.
  • Measure end-to-end latency. Compaction only pays off if compaction plus generation beats the best text-layer alternative for the same quality target.
  • Use strong baselines. Compare against prefix caching, structured notes, retrieval, and selective text handoff, not only "send everything."
  • Expect orchestrator variance. If decomposition strategy changes run to run, average over enough trials to separate compaction effects from orchestrator noise.
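A small helper showing the metrics the guidance above says to track together; the field names and single-dict summary are illustrative:

```python
def compaction_report(full_tokens, kept_tokens, accuracy_full,
                      accuracy_compacted, compaction_seconds):
    """Bundle the evaluation signals that should be read together:
    token savings mean little if accuracy drops or overhead dominates."""
    retention = kept_tokens / full_tokens
    return {
        "retention_rate": retention,
        "worker_token_savings": 1.0 - retention,
        "accuracy_delta": accuracy_compacted - accuracy_full,
        "compaction_overhead_s": compaction_seconds,
    }
```

For example, `compaction_report(1000, 250, 0.80, 0.78, 1.5)` describes a run with 75% worker-token savings bought at a two-point accuracy cost and 1.5 s of compaction overhead; whether that trade is worth taking depends on the quality target, which is why the numbers belong in one report.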

Examples


Scenario: orchestrator trajectory grows across worker calls
```text
Call 1: trajectory T1 -> worker answers subquestion A
Call 2: trajectory T2 = T1 + new reasoning + reply A
        compact KV(T2) using the task prompt for B
        worker answers subquestion B
```

The task prompt for B decides which parts of `T2` survive into the compacted worker state.

Guidelines


  1. Prefer Latent Briefing when the main waste comes from replaying orchestrator state into workers, not from retrieving source documents.
  2. Prefer plain text handoff when auditability, portability, or closed-model APIs matter more than token efficiency.
  3. Co-design compaction with evaluation. A small quality drop can erase large token savings.
  4. Expose compaction aggressiveness as a controlled parameter, not a hidden constant.

Gotchas


  1. Infrastructure access is the first gate. If the runtime cannot inspect and rewrite worker KV state, Latent Briefing is a research idea, not a deployable technique.
  2. Shared model space matters. KV compaction is defined in a specific model's attention space. Do not assume latent handoff works cleanly across unrelated model families.
  3. Threshold is workload-dependent. One global `tau` rarely works across long vs. short contexts and easy vs. hard tasks. Expect accuracy cliffs when compaction becomes too aggressive.
  4. Benchmark scope is narrow. Public results focus on long-document QA. Code generation, math, and multi-document synthesis may behave differently.
  5. Orchestrator variance can hide the signal. A stochastic orchestrator can change the trajectory enough to swamp small compaction gains or losses.
  6. Weak baselines inflate the apparent win. Compare against strong text-level alternatives before claiming a system-level advantage.

Integration


  • context-optimization - Prefix caching and observation masking remain the default first moves; Latent Briefing is a more specialized optimization for compatible orchestrator-worker stacks.
  • multi-agent-patterns - Applies when multi-agent token cost is driven by supervisor trajectory replay, not only by coordination overhead.
  • context-compression - Text-layer summaries remain preferable when human-readable state, portability, or audit logs matter.
  • memory-systems - Helps decide when to keep cross-step state in external memory versus in the worker's latent state.
  • tool-design - Worker call shapes and task prompts determine which tokens score highly during compaction.

References


Internal reference:
  • Attention Matching formulation and task-guided scoring - Read when: needing the AM objective, how task-guided scoring changes the query source, or why a shared global mask matters for batching
Related skills in this collection:
  • context-optimization - Read when: the main need is prefix caching, observation masking, or text-layer compaction rather than worker KV manipulation
  • multi-agent-patterns - Read when: deciding whether the architecture should be orchestrator-worker at all
  • context-compression - Read when: human-readable summaries may be a better fit than latent transfer
  • memory-systems - Read when: comparing in-model latent state with external persistent memory
External resources:


Skill Metadata


Created: 2026-04-14
Last Updated: 2026-04-14
Author: Agent Skills for Context Engineering Contributors; primary technical source Ramp Labs (public post)
Version: 1.1.0