audit-prompt-caching

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Prompt Cache Audit

提示缓存审计

Diagnose and fix LLM prompt/prefix cache misses. Treat caching as an engineering property of the request path: stable prefix, cache-aware routing, and cache entries that live long enough to be reused.
Caching is an optimization only when the prefix is stable, long enough, repeated, measurable, and safe. Do not add cache controls, cache keys, or routing hints blindly.
诊断并修复LLM提示/前缀缓存未命中问题。将缓存视为请求路径的一项工程属性:稳定前缀、缓存感知路由,以及存在时间足够长可被复用的缓存条目。
只有当前缀稳定、长度足够、可重复、可衡量且安全时,缓存才是有效的优化手段。请勿盲目添加缓存控制、缓存键或路由提示。

When to use

适用场景

Use this skill when reviewing or designing LLM calls where repeated prompt prefixes may reduce cost or latency through provider-native prompt caching, managed-router cache locality, or self-hosted KV reuse.
Typical triggers:
  • cached_tokens=0
    ,
    cache_read_input_tokens=0
    , or cache writes without reads.
  • Cache hit rate, TTFT, prefill latency, or input-token cost regressed.
  • User says LLM cost or speed regressed around repeated long prompts, long-context agents, or shared static context.
  • LLM request shape changed where repeated long prompts, TTFT, cached-token telemetry, or LLM cost matter.
  • Prompt text, message order, request builders, tools, schemas,
    response_format
    , provider API surface, model/router settings, agent loop structure, context compaction, or inference deployment changed.
  • The request uses long system prompts, tool catalogs, schemas, static documents, few-shot examples, or repeated RAG/CAG context.
  • The app uses OpenAI
    prompt_cache_key
    , Anthropic
    cache_control
    , Bedrock
    cachePoint
    , OpenRouter routing, Gemini/Qwen/DeepSeek cache fields, or Azure OpenAI cached-token telemetry.
  • An agent changes tools, compacts history, mutates early messages, or switches modes across steps.
  • vLLM/SGLang/self-hosted deployments have multi-replica routing, KV pressure, tokenizer/chat-template drift, or cache-aware routing questions.
当审查或设计LLM调用时,若重复的提示前缀可通过供应商原生提示缓存、托管路由缓存局部性或自托管KV复用降低成本或延迟,可使用本技能。
典型触发场景:
  • cached_tokens=0
    cache_read_input_tokens=0
    ,或仅写入缓存但未读取
  • 缓存命中率、TTFT、预填充延迟或输入令牌成本出现退化
  • 用户反馈重复长提示、长上下文Agent或共享静态上下文导致LLM成本或速度退化
  • LLM请求形状发生变化,且重复长提示、TTFT、缓存令牌遥测或LLM成本是关注点
  • 提示文本、消息顺序、请求构建器、工具、架构、
    response_format
    、供应商API接口、模型/路由设置、Agent循环结构、上下文压缩或推理部署发生变更
  • 请求使用长系统提示、工具目录、架构、静态文档、少样本示例或重复的RAG/CAG上下文
  • 应用使用OpenAI
    prompt_cache_key
    、Anthropic
    cache_control
    、Bedrock
    cachePoint
    、OpenRouter路由、Gemini/Qwen/DeepSeek缓存字段或Azure OpenAI缓存令牌遥测
  • Agent在步骤间变更工具、压缩历史、修改早期消息或切换模式
  • vLLM/SGLang/自托管部署存在多副本路由、KV压力、令牌器/聊天模板漂移或缓存感知路由问题

When not to use

不适用场景

Do not use this skill for:
  • generic prompt writing or prompt-quality editing without a caching concern
  • ordinary short prompt edits where no repeated long prefix, TTFT, cache telemetry, or LLM cost concern exists
  • generic RAG design unless repeated context placement/cacheability is part of the task
  • token counting or context-window sizing only
  • response caching only, unless comparing it with prompt prefix caching
  • non-LLM frontend/backend performance or non-inference Kubernetes routing
  • speculative savings claims without usage data or clearly stated assumptions
请勿将本技能用于:
  • 无缓存相关需求的通用提示编写或提示质量编辑
  • 无重复长前缀、TTFT、缓存遥测或LLM成本顾虑的普通短提示编辑
  • 通用RAG设计,除非重复上下文放置/可缓存性是任务的一部分
  • 仅令牌计数或上下文窗口大小调整
  • 仅响应缓存,除非与提示前缀缓存进行对比
  • 非LLM的前端/后端性能问题或非推理类Kubernetes路由
  • 无使用数据或明确假设的投机性节省主张

Modes

模式

  • Code audit: inspect prompt construction, tool/schema serialization, history management, provider calls, routing, and engine config. Propose focused diffs and verify them.
  • Advisory: if no codebase is available, ask targeted diagnostic questions and give provider-checked recommendations.
  • Agent audit: when tools, tool routing, MCP, agent loops, compaction, or multi-step trajectories are present, always run the agent-specific checks.
  • Deployment audit: when vLLM/SGLang, Kubernetes, Docker Compose, gateways, or multiple inference replicas are present, inspect routing and KV-cache capacity as first-class causes.
  • 代码审计:检查提示构建、工具/架构序列化、历史管理、供应商调用、路由和引擎配置。提出针对性的代码变更并验证。
  • 咨询建议:若无代码库可用,提出针对性诊断问题并给出经供应商验证的建议。
  • Agent审计:当存在工具、工具路由、MCP、Agent循环、压缩或多步骤轨迹时,务必执行Agent专属检查。
  • 部署审计:当存在vLLM/SGLang、Kubernetes、Docker Compose、网关或多个推理副本时,将路由和KV缓存容量作为首要检查对象。

Default Project Audit Workflow

默认项目审计工作流

When a project or repository is available, start with code and configuration. This is the primary workflow for the skill.
  1. Scan the repo for provider calls, cache controls, routing hints, prompt builders, tool/schema registries, and self-hosted engine signals. Use
    scripts/extract_llm_calls.py
    when useful.
  2. Inspect the request path in code: prompt rendering, system/developer messages, tool ordering, structured-output schemas, history management, compaction, and provider SDK parameters.
  3. Inspect config and deployment files: environment defaults, feature flags, gateway/router settings, Docker Compose, Kubernetes, Helm, vLLM/SGLang flags, and replica topology.
  4. Load only the relevant provider and scenario references, then apply the audit flow and anti-pattern checks.
  5. Ask for usage logs, rendered request payloads, traces, or billing exports only when code/config review needs telemetry evidence, prefix comparison, ROI math, or incident correlation.
当有项目或仓库可用时,从代码和配置开始。这是本技能的核心工作流。
  1. 扫描仓库中的供应商调用、缓存控制、路由提示、提示构建器、工具/架构注册表和自托管引擎信号。必要时使用
    scripts/extract_llm_calls.py
  2. 检查代码中的请求路径:提示渲染、系统/开发者消息、工具排序、结构化输出架构、历史管理、压缩和供应商SDK参数。
  3. 检查配置和部署文件:环境默认值、功能标志、网关/路由设置、Docker Compose、Kubernetes、Helm、vLLM/SGLang标志和副本拓扑。
  4. 仅加载相关的供应商和场景参考,然后应用审计流程和反模式检查。
  5. 仅当代码/配置审查需要遥测证据、前缀对比、ROI计算或事件关联时,才请求使用日志、渲染后的请求负载、跟踪数据或账单导出。

Audit Inputs

审计输入

Treat the repository, prompt code, and deployment configuration as the main audit inputs. Evidence artifacts such as provider usage logs, billing exports, rendered JSON request payloads, prompt snapshots, per-step agent traces, gateway route logs, and latency traces are supporting inputs for confirmation and measurement.
Bundled fixtures are only demo and regression-test data. Do not require users to convert production data into the repository's fixture layout before auditing. If a user asks whether fixtures are required, say no: the skill audits project code and configs first, and the scripts can also accept normal JSON, JSONL, CSV usage logs, or JSON request payloads directly.
This skill does not capture or intercept live traffic by itself. If telemetry is needed, ask the user to export or redact representative records from their own logging, tracing, provider dashboard, or billing pipeline.
将代码库、提示代码和部署配置视为主要审计输入。供应商使用日志、账单导出、渲染后的JSON请求负载、提示快照、Agent步骤跟踪数据、网关路由日志和延迟跟踪数据等证据工件作为确认和测量的辅助输入。
捆绑的测试数据仅用于演示和回归测试。无需用户将生产数据转换为仓库的测试数据格式后再进行审计。若用户询问是否需要测试数据,回答不需要:本技能优先审计项目代码和配置,脚本也可直接接受普通JSON、JSONL、CSV使用日志或JSON请求负载。
本技能本身不会捕获或拦截实时流量。若需要遥测数据,请用户从自身日志、跟踪系统、供应商仪表盘或账单管道导出或编辑代表性记录。

Project Context Gate

项目上下文校验

Before assigning severity or recommending project changes, review hot paths, repeat cadence, prompt families, and cache applicability. Identify which LLM routes are frequent enough, long enough, repeated enough, stable enough, and safe enough for prompt/prefix caching to matter.
Do this before deep provider advice:
  1. Map prompt families, request builders, model/provider routes, agent loops, deployment paths, and usage frequency.
  2. Separate hot repeated paths from rare jobs, one-off prompts, admin flows, experiments, and prompt families that cannot share a stable prefix.
  3. Mark each finding as applicable, conditionally applicable, or not applicable to the project path under review.
  4. Ask for telemetry only after code/config context shows the evidence needed next.
If project context shows a route runs rarely, changes prompt families often, has no stable long prefix, or is dominated by output/tool latency, say prompt caching is not the right lever for that route. Do not keep generic cache warnings in the findings list after they are known to be not applicable.
在确定问题严重性或推荐项目变更前,先审查核心路径、重复频率、提示族和缓存适用性。确定哪些LLM路由的调用频率、长度、重复次数、稳定性和安全性足够高,使得提示/前缀缓存能发挥作用。
在提供深入的供应商建议前完成以下步骤:
  1. 绘制提示族、请求构建器、模型/供应商路由、Agent循环、部署路径和使用频率的映射图。
  2. 将高频重复路径与罕见任务、一次性提示、管理流程、实验以及无法共享稳定前缀的提示族区分开。
  3. 将每个发现标记为适用于、有条件适用于或不适用于当前审查的项目路径。
  4. 仅当代码/配置上下文显示需要后续证据时,才请求遥测数据。
若项目上下文显示某路由调用频率低、提示族频繁变更、无稳定长前缀,或延迟主要由输出/工具导致,则提示缓存并非该路由的合适优化手段。确认不适用后,请勿将通用缓存警告保留在发现列表中。

Applicability Gate

适用性校验

Before recommending prompt-cache changes, check:
  1. Reusable prefix: Is there a static or semi-static prefix above the provider/model threshold or large enough to matter for self-hosted KV reuse?
  2. Repeat cadence: Is the same prefix reused often enough before cache expiry or eviction?
  3. Exact stability: Are tools, schemas, system/developer instructions, examples, images, and early messages byte/token stable across target requests?
  4. Telemetry: Are cache-read/write fields, input/output tokens, TTFT/prefill timing, model/route, and prompt version available?
  5. Cost shape: Is input prefill/input-token cost meaningful, or do output tokens/decode/tool latency dominate?
  6. Safety boundary: Would broader cache reuse violate tenant, privacy, data residency, ZDR, or side-channel requirements?
If the gate fails, report why caching is not the right lever yet and recommend measurement, prompt restructuring, routing fixes, or a different optimization.
在推荐提示缓存变更前,检查以下内容:
  1. 可复用前缀:是否存在达到供应商/模型阈值或长度足够支持自托管KV复用的静态或半静态前缀?
  2. 重复频率:同一前缀在缓存过期或被驱逐前是否足够频繁地被复用?
  3. 完全稳定性:工具、架构、系统/开发者指令、示例、图片和早期消息在目标请求间是否保持字节/令牌级稳定?
  4. 遥测能力:是否可获取缓存读/写字段、输入/输出令牌、TTFT/预填充时间、模型/路由和提示版本数据?
  5. 成本构成:输入预填充/输入令牌成本是否显著,还是输出令牌/解码/工具延迟占主导?
  6. 安全边界:更广泛的缓存复用是否会违反租户、隐私、数据驻留、ZDR或侧信道要求?
若校验不通过,说明缓存目前并非合适的优化手段,并建议进行测量、提示重构、路由修复或其他优化方式。

Language Match Rule

语言匹配规则

Answer in the user's language by default. Preserve provider/API field names exactly, such as
cached_tokens
,
cache_control
,
cachePoint
,
TTFT
,
prompt_cache_key
, and
response_format
, but explain their meaning in the user's language instead of switching the report into English. If the user asks in Russian, headings, severity explanations, and recommendations should be in Russian unless they are literal API names or quoted code.
默认使用用户的语言作答。完全保留供应商/API字段名称,如
cached_tokens
cache_control
cachePoint
TTFT
prompt_cache_key
response_format
,但用用户的语言解释其含义,而非将报告切换为英文。若用户用俄语提问,标题、严重性说明和建议应使用俄语,除非是字面API名称或引用的代码。

Agent-First Output Contracts

Agent优先输出约定

Pick the smallest contract that answers the user's actual request. Do not bury the decision under general prompt-cache advice.
  • Quick triage: use when artifacts are incomplete. Answer with provider/engine guess, most likely cache blocker, evidence needed next, and one safe next command or artifact request.
  • Code audit findings: use when code is available. Lead with a decision summary, then file-line findings in the report format, then clean checks, then verification commands.
  • Provider migration risk: use when moving between OpenAI, Anthropic, Bedrock, OpenRouter, Azure OpenAI, Gemini, Qwen, DeepSeek, or self-hosted engines. Compare cache semantics, usage fields, prefix layout risk, routing risk, and cost assumptions before recommending edits.
  • Agent loop audit: use for coding assistants, MCP clients, tool-using agents, compaction, mode switching, or long multi-step workflows. Always inspect stable tools, early messages, per-step prefix hashes, cache fields, output tokens, and compaction events.
  • Deployment audit: use for vLLM, SGLang, Kubernetes, Docker Compose, gateways, autoscaling, or multi-replica inference. Treat routing locality and KV budget as first-class causes, not secondary deployment details.
  • Not worth caching: use when the Applicability Gate fails or the evidence shows output decode, external tool latency, rate limits, or privacy isolation dominate. Say what should change instead and what evidence would reopen prompt-cache work.
When recommending project work, start with this compact decision summary:
text
Measurement change:
Prompt behavior change:
Provider/routing change:
Confidence:
Do first:
Do not do yet:
Prefer split recommendations over a single broad "yes" when the safe next step is measurement. For example,
Measurement change: yes
,
Prompt behavior change: pilot only after telemetry
, and
Provider/routing change: no, not yet
.
For "do we need to change the project?" questions, answer first with
Change needed: yes
,
Change needed: no
, or
Change needed: unknown until <specific evidence>
when a single answer is accurate. If the change types differ, use the split decision summary above. Then list exact files/settings to change or explicitly state that no project change is justified yet.
选择能准确响应用户实际请求的最小约定。请勿将决策淹没在通用提示缓存建议中。
  • 快速分类:当工件不完整时使用。给出供应商/引擎猜测、最可能的缓存阻塞因素、后续所需证据以及一个安全的下一步操作或工件请求。
  • 代码审计发现:当有代码可用时使用。以决策摘要开头,然后按报告格式列出文件-行发现,接着是合规检查,最后是验证命令。
  • 供应商迁移风险:当在OpenAI、Anthropic、Bedrock、OpenRouter、Azure OpenAI、Gemini、Qwen、DeepSeek或自托管引擎间迁移时使用。在推荐编辑前,对比缓存语义、使用字段、前缀布局风险、路由风险和成本假设。
  • Agent循环审计:用于编码助手、MCP客户端、使用工具的Agent、压缩、模式切换或长多步骤工作流。务必检查稳定工具、早期消息、每步前缀哈希、缓存字段、输出令牌和压缩事件。
  • 部署审计:用于vLLM、SGLang、Kubernetes、Docker Compose、网关、自动扩缩容或多副本推理。将路由局部性和KV预算视为首要原因,而非次要部署细节。
  • 无需缓存:当适用性校验不通过,或证据显示输出解码、外部工具延迟、速率限制或隐私隔离占主导时使用。说明应进行的替代变更,以及哪些证据会重新开启提示缓存优化工作。
推荐项目工作时,先以以下简洁的决策摘要开头:
text
测量变更:
提示行为变更:
供应商/路由变更:
置信度:
优先执行:
暂不执行:
当安全的下一步是测量时,优先选择拆分式建议而非单一的宽泛“是”。例如:
测量变更:是
提示行为变更:仅在获取遥测数据后试点
供应商/路由变更:否,暂不执行
对于“我们是否需要修改项目?”这类问题,若单一答案准确,先回答
需要变更:是
需要变更:否
需要变更:未知,需<特定证据>
。若变更类型不同,使用上述拆分式决策摘要。然后列出需修改的具体文件/设置,或明确说明目前无需变更项目。

Evidence-Bearing Findings

带证据的发现

Every actionable finding should make uncertainty visible. Include these fields in prose or in the extended report format:
text
source | severity | provider/engine | issue | evidence | evidence_type | confidence | impact_condition | cache impact | safe_first_action | fix | validation | do_not_do_yet
Use evidence types such as
confirmed from code
,
confirmed from telemetry
,
provider-doc hypothesis
, or
needs validation
. State impact conditions like "matters if this path is hot, repeated, and has a long stable prefix" instead of implying guaranteed savings. Include one safe first action, one validation metric/command, and one thing not to do yet when the next risky change would be premature.
For review-style responses, keep the result compact and project-specific:
  1. Confirmed findings: issues supported by code/config/telemetry and applicable to the project path.
  2. Hypotheses: plausible cache risks that need usage logs, rendered payloads, route metrics, or provider docs before severity can rise.
  3. Not applicable: generic cache advice that the project context rules out, such as prefix-cache work for a once-daily prompt family with no meaningful repeated stable prefix.
每个可执行的发现应明确不确定性。在 prose 或扩展报告格式中包含以下字段:
text
source | severity | provider/engine | issue | evidence | evidence_type | confidence | impact_condition | cache impact | safe_first_action | fix | validation | do_not_do_yet
使用证据类型如
代码确认
遥测确认
供应商文档假设
需验证
。说明影响条件,例如“仅当该路径是高频、重复且有长稳定前缀时才重要”,而非暗示必然节省。包含一个安全的第一步操作、一个验证指标/命令,以及一个当前不应执行的高风险操作。
对于评审类响应,保持结果简洁且针对项目:
  1. 已确认发现:由代码/配置/遥测支持且适用于项目路径的问题。
  2. 假设:合理的缓存风险,但需使用日志、渲染后的负载、路由指标或供应商文档才能确定严重性。
  3. 不适用:项目上下文已排除的通用缓存建议,例如针对每日仅调用一次且无有意义重复稳定前缀的提示族进行前缀缓存优化。

Explicit Review Default

显式评审默认规则

If this skill is explicitly invoked and the user asks only "review", "do a review", "сделай ревью", or equivalent, default to a cache-focused review of the available diff or repository. Treat the request as a prompt/prefix/KV cache audit: detect provider and engine signals, inspect LLM request shape, and report cache-impact findings first. Do not perform a general code review unless the user explicitly asks for one.
若本技能被显式调用,且用户仅要求“评审”、“做评审”或类似表述,默认对可用的代码差异或仓库进行以缓存为重点的评审。将请求视为提示/前缀/KV缓存审计:检测供应商和引擎信号,检查LLM请求形状,并优先报告影响缓存的发现。除非用户明确要求,否则不进行通用代码评审。

Use-Case Map

用例映射

Classify the work before auditing so you inspect the right artifacts. For a deeper role/artifact matrix, load
references/use-cases.md
.
ScenarioCommon triggersInspect first
Cost or migration auditbill increased, provider comparison, cache discount not visibleusage logs, billing export, static/dynamic/output token estimates, provider reference
Prompt/code audit
cached_tokens=0
, prompt builder changed, schema drift
prompt renderers, SDK calls,
tools
,
response_format
, JSON/schema serialization
Mechanics/latency auditcache hit did not reduce cost/latency, decode dominates, unclear prefill vs output
references/mechanics.md
, token/TTFT traces, output length, streaming timestamps
Managed-router auditOpenRouter cache writes without reads, provider fallback, sticky routing,
openrouter/auto
OpenRouter request body,
provider
routing fields, model(s), plugins, usage metadata
Agent/coding-assistant auditagent got more expensive, dynamic tools, MCP routing, compactionagent loop, tool registry, tool selection, history compaction, per-step cache logs
Deployment auditvLLM/SGLang cache misses, TTFT after scaling, multi-replica routing
docker-compose.yml
, Helm values, Kubernetes manifests, gateway config, engine flags
Observability/CI auditneed cache dashboard, release guardrail, prefix smoke testtraces, dashboards, rendered prompt snapshots, prefix/tool/schema hashes
在审计前先对工作进行分类,以便检查正确的工件。如需更详细的角色/工件矩阵,加载
references/use-cases.md
场景常见触发因素优先检查
成本或迁移审计账单增加、供应商对比、未体现缓存折扣使用日志、账单导出、静态/动态/输出令牌估算、供应商参考
提示/代码审计
cached_tokens=0
、提示构建器变更、架构漂移
提示渲染器、SDK调用、
tools
response_format
、JSON/架构序列化
机制/延迟审计缓存命中未降低成本/延迟、解码占主导、预填充与输出区分不清
references/mechanics.md
、令牌/TTFT跟踪数据、输出长度、流时间戳
托管路由审计OpenRouter仅写入缓存未读取、供应商 fallback、粘性路由、
openrouter/auto
OpenRouter请求体、
provider
路由字段、模型、插件、使用元数据
Agent/编码助手审计Agent成本增加、动态工具、MCP路由、压缩Agent循环、工具注册表、工具选择、历史压缩、每步缓存日志
部署审计vLLM/SGLang缓存未命中、扩缩容后TTFT问题、多副本路由
docker-compose.yml
、Helm配置、Kubernetes清单、网关配置、引擎标志
可观测性/CI审计需要缓存仪表盘、发布防护、前缀冒烟测试跟踪数据、仪表盘、渲染后的提示快照、前缀/工具/架构哈希

Scenario References

场景参考

Load only the reference needed for the detected scenario:
  • Cost or migration:
    references/economics.md
    for effective-cost variables, output-share checks, TTL/write-premium break-even, and migration cache risk.
  • Mechanics, latency, or self-hosted compute:
    references/mechanics.md
    for prefill vs decode, KV reuse, and what cache hits can and cannot improve.
  • Release, incident, deploy, or monitoring:
    references/predeploy-checklist.md
    for blocking checks, triage order, and observability dimensions.
  • OpenRouter or managed provider routing:
    references/openrouter.md
    for sticky routing, provider fallback/order, cache usage fields, and provider-specific cache controls through OpenRouter.
  • Agents, coding assistants, MCP, or dynamic tools:
    references/agent-tools.md
    for tool strategy selection, mode switching, and context compaction.
  • Self-hosted SGLang:
    references/sglang.md
    for RadixAttention, SGLang router, HiCache, tokenizer/chat-template drift, and cache-aware deployment checks.
  • Full audit deliverable:
    references/report-template.md
    when the user asks for a written report or when findings need a reusable handoff artifact.
  • Machine-readable rules:
    references/rules.json
    when scripting, rendering, or validating findings by anti-pattern ID.
仅加载检测到的场景所需的参考:
  • 成本或迁移
    references/economics.md
    ,包含有效成本变量、输出占比检查、TTL/写入溢价收支平衡点和迁移缓存风险。
  • 机制、延迟或自托管计算
    references/mechanics.md
    ,包含预填充与解码对比、KV复用以及缓存命中能改善和不能改善的内容。
  • 发布、事件、部署或监控
    references/predeploy-checklist.md
    ,包含阻塞检查、分类顺序和可观测性维度。
  • OpenRouter或托管供应商路由
    references/openrouter.md
    ,包含粘性路由、供应商 fallback/顺序、缓存使用字段以及通过OpenRouter实现的供应商专属缓存控制。
  • Agent、编码助手、MCP或动态工具
    references/agent-tools.md
    ,包含工具策略选择、模式切换和上下文压缩。
  • 自托管SGLang
    references/sglang.md
    ,包含RadixAttention、SGLang路由、HiCache、令牌器/聊天模板漂移和缓存感知部署检查。
  • 完整审计交付物
    references/report-template.md
    ,当用户要求书面报告或发现需要可复用的交接工件时使用。
  • 机器可读规则
    references/rules.json
    ,当通过反模式ID编写脚本、渲染或验证发现时使用。

Bundled Scripts

捆绑脚本

Use scripts when deterministic evidence is better than prose:
  • scripts/prefix_stability_check.py
    : compare two rendered prompts or JSON request payloads as raw bytes by default and find the first divergent prefix location; use
    --canonical-json
    only when sorted-key normalization is intentional.
  • scripts/layout_linter.py
    : inspect JSON request payloads, including Chat-style
    messages
    and Responses-style
    input
    /
    instructions
    , for volatile early content, unsorted tools, and dynamic schema fields before doing deeper manual layout review.
  • scripts/analyze_usage_logs.py
    : summarize JSON/JSONL/CSV usage logs across OpenAI, Anthropic-compatible, Bedrock-style, and OpenAI-compatible cache fields; use
    --jsonl-normalized
    when a downstream report or dashboard needs per-record canonical events.
  • scripts/estimate_cache_roi.py
    : estimate input-only and total-cost impact from static/dynamic/output tokens, hit rate, request count, and explicit pricing assumptions.
  • scripts/extract_llm_calls.py
    : scan a repository for likely LLM provider calls, cache-control fields, routing signals, and self-hosted engine hints before choosing provider references.
  • scripts/render_audit_report.py
    : combine usage-log summaries and one-line findings into a reusable Markdown or JSON audit report.
  • scripts/validate_skill_package.py
    : validate frontmatter, referenced files, JSON evals, and Python helper syntax before sharing or publishing the skill.
  • scripts/run_trigger_eval.py
    : summarize positive and negative trigger-eval coverage from
    evals/trigger_eval.json
    .
Do not treat these scripts as provider tokenizers or billing truth. Provider usage and billing exports remain authoritative.
当确定性证据比 prose 更有效时使用脚本:
  • scripts/prefix_stability_check.py
    :默认按原始字节对比两个渲染后的提示或JSON请求负载,找出第一个前缀分歧位置;仅当有意进行排序键归一化时使用
    --canonical-json
  • scripts/layout_linter.py
    :检查JSON请求负载,包括聊天式
    messages
    和响应式
    input
    /
    instructions
    ,查找易变的早期内容、未排序的工具和动态架构字段,再进行深入的手动布局审查。
  • scripts/analyze_usage_logs.py
    :汇总OpenAI、Anthropic兼容、Bedrock风格和OpenAI兼容缓存字段的JSON/JSONL/CSV使用日志;当下游报告或仪表盘需要每条记录的标准事件时使用
    --jsonl-normalized
  • scripts/estimate_cache_roi.py
    :根据静态/动态/输出令牌、命中率、请求次数和明确的定价假设估算仅输入成本和总成本影响。
  • scripts/extract_llm_calls.py
    :扫描仓库中可能的LLM供应商调用、缓存控制字段、路由信号和自托管引擎提示,再选择供应商参考。
  • scripts/render_audit_report.py
    :将使用日志摘要和单行发现合并为可复用的Markdown或JSON审计报告。
  • scripts/validate_skill_package.py
    :在共享或发布技能前,验证前置内容、引用文件、JSON评估和Python助手语法。
  • scripts/run_trigger_eval.py
    :汇总
    evals/trigger_eval.json
    中的正负触发评估覆盖情况。
请勿将这些脚本视为供应商令牌器或账单的权威来源。供应商使用数据和账单导出仍是权威依据。

Script Transparency Rule

脚本透明规则

Before running any bundled script, explain what each bundled script reads, writes, and whether it uses network. Also state why the script is needed, whether it scans the whole repository or a targeted path, and the expected runtime class: seconds, tens of seconds, or minutes.
Default to a targeted scan when the repo is large or the user asked a narrow question. If a script may read secrets, environment files, generated artifacts, large logs, or production exports, say that explicitly and ask for approval unless the user has already requested that exact scan. Do not send files to a network service from these scripts. If network access is needed for provider docs or package installation, treat it as a separate explicit action.
运行任何捆绑脚本前,解释每个脚本读取、写入的内容,以及是否使用网络。同时说明使用脚本的原因,是扫描整个仓库还是目标路径,以及预期运行时间:秒级、几十秒级或分钟级。
当仓库较大或用户提出的问题范围较窄时,默认进行目标扫描。若脚本可能读取机密信息、环境文件、生成的工件、大型日志或生产导出,需明确说明并请求批准,除非用户已明确要求该扫描。请勿通过这些脚本将文件发送到网络服务。若需要访问供应商文档或安装包等网络资源,视为单独的显式操作。

Freshness Gate

新鲜度校验

Provider facts are volatile. Before making exact provider claims, open the relevant provider reference and verify its official sources when browsing is available.
Verify before exact claims about:
  • pricing, cache discounts, storage charges, or write premiums
  • current model names, support matrices, and availability by region
  • minimum cacheable tokens, cache granularity, TTL, and retention
  • usage field names and API parameters
  • tool-search, allowed-tools, defer-loading, or cache-control semantics
If official docs cannot be checked, say the provider facts are unverified and avoid exact numbers. Use bundled references as heuristics, not current truth. If you say "verified today" or similar, cite the official source URLs or page names used in the run. Never copy prices or model names from articles/posts as current facts.
供应商信息易变。在做出明确的供应商声明前,若可浏览,打开相关供应商参考并验证其官方来源。
在做出以下明确声明前进行验证:
  • 定价、缓存折扣、存储费用或写入溢价
  • 当前模型名称、支持矩阵和区域可用性
  • 最小可缓存令牌数、缓存粒度、TTL和保留期
  • 使用字段名称和API参数
  • 工具搜索、允许的工具、延迟加载或缓存控制语义
若无法检查官方文档,说明供应商信息未经验证,避免使用精确数字。将捆绑参考视为启发式信息,而非当前事实。若使用“今日验证”等表述,需引用运行时使用的官方源URL或页面名称。切勿从文章/帖子中复制价格或模型名称作为当前事实。

Provider Detection

供应商检测

Search SDK imports, API base URLs, model names, deployment manifests, and config files. For self-hosted deployments, also search
docker-compose.yml
,
Dockerfile
, Helm values, Kubernetes
Deployment
/
Service
/
Ingress
, gateway config, and engine CLI flags.
SignalProvider/engineLoad
openrouter
,
openrouter.ai/api/v1
,
OPENROUTER_API_KEY
,
@openrouter/sdk
,
OpenRouter
,
openrouter/auto
OpenRouter
references/openrouter.md
AzureOpenAI
,
AZURE_OPENAI_ENDPOINT
,
azure.ai.openai
,
api-version
, Azure OpenAI endpoint URLs
Azure OpenAI
references/azure-openai.md
openai
,
responses.create
,
chat.completions
,
prompt_cache_key
,
prompt_cache_retention
OpenAI
references/openai.md
bedrock-runtime
,
BedrockRuntime
,
boto3.client("bedrock-runtime")
,
client.converse
,
converse_stream
,
InvokeModelCommand
,
ConverseCommand
,
invoke_model
,
cachePoint
,
CacheReadInputTokens
,
CacheWriteInputTokens
Amazon Bedrock
references/bedrock.md
anthropic
,
messages.create
,
cache_control
Anthropic
references/anthropic.md
vllm
,
--enable-prefix-caching
,
AsyncLLMEngine
,
LLM(
vLLM
references/vllm.md
sglang
,
sglang_router
,
RadixAttention
,
--disable-radix-cache
,
HiCache
SGLang
references/sglang.md
deepseek
,
api.deepseek.com
,
prompt_cache_hit_tokens
DeepSeek
references/deepseek.md
google.genai
,
google.generativeai
,
vertexai
,
CachedContent
Gemini
references/gemini.md
dashscope
,
qwen
,
bailian
,
aliyun
Qwen/DashScope
references/qwen.md
yandexgpt
,
foundationModels
,
llm.api.cloud.yandex.net
YandexGPT
references/yandexgpt.md
z.ai
,
zai
,
glm-
,
api.z.ai
z.ai
references/zai.md
Load only the relevant provider files. If OpenRouter, Azure, or Bedrock signals appear alongside OpenAI/Anthropic-compatible calls, prefer the router/provider wrapper reference over the generic direct-provider reference. If detection is ambiguous, ask which provider/engine is in use.
搜索SDK导入、API基础URL、模型名称、部署清单和配置文件。对于自托管部署,还需搜索
docker-compose.yml
Dockerfile
、Helm配置、Kubernetes
Deployment
/
Service
/
Ingress
、网关配置和引擎CLI标志。
信号供应商/引擎加载参考
openrouter
openrouter.ai/api/v1
OPENROUTER_API_KEY
@openrouter/sdk
OpenRouter
openrouter/auto
OpenRouter
references/openrouter.md
AzureOpenAI
AZURE_OPENAI_ENDPOINT
azure.ai.openai
api-version
、Azure OpenAI端点URL
Azure OpenAI
references/azure-openai.md
openai
responses.create
chat.completions
prompt_cache_key
prompt_cache_retention
OpenAI
references/openai.md
bedrock-runtime
BedrockRuntime
boto3.client("bedrock-runtime")
client.converse
converse_stream
InvokeModelCommand
ConverseCommand
invoke_model
cachePoint
CacheReadInputTokens
CacheWriteInputTokens
Amazon Bedrock
references/bedrock.md
anthropic
messages.create
cache_control
Anthropic
references/anthropic.md
vllm
--enable-prefix-caching
AsyncLLMEngine
LLM(
vLLM
references/vllm.md
sglang
sglang_router
RadixAttention
--disable-radix-cache
HiCache
SGLang
references/sglang.md
deepseek
api.deepseek.com
prompt_cache_hit_tokens
DeepSeek
references/deepseek.md
google.genai
google.generativeai
vertexai
CachedContent
Gemini
references/gemini.md
dashscope
qwen
bailian
aliyun
Qwen/DashScope
references/qwen.md
yandexgpt
foundationModels
llm.api.cloud.yandex.net
YandexGPT
references/yandexgpt.md
z.ai
zai
glm-
api.z.ai
z.ai
references/zai.md
仅加载相关的供应商文件。若OpenRouter、Azure或Bedrock信号与OpenAI/Anthropic兼容调用同时出现,优先使用路由/供应商包装器参考而非通用直接供应商参考。若检测结果不明确,询问使用的供应商/引擎。

Audit Flow

审计流程

  1. Detect mode, provider, and use-case scenario.
  2. Load the relevant scenario reference and provider reference; do not load unrelated references.
  3. Apply the Freshness Gate for provider-specific facts.
  4. Run the Applicability Gate.
  5. Map prompt structure in order: tools, structured-output schemas, system/developer instructions, few-shot examples, static documents/context, retrieved context, conversation history, user-specific data, volatile values.
  6. Mark each segment as static, semi-static, dynamic, or volatile.
  7. Measure the symptom: cache ratio, TTFT/prefill latency, output/decode time, cache writes vs reads, and whether the drop correlates with deploys, SDK changes, prompt changes, replica count, or agent steps.
  8. Scan universal anti-patterns below.
  9. If an agent loop or tools are present, run the Agent Tool Stability checks.
  10. Apply provider-specific checks from the loaded reference.
  11. Report findings with evidence type, confidence, impact condition, safe first action, concrete fix, validation metric, and any premature action to avoid.
  12. When making code changes, verify prefix stability before claiming success.
  1. 检测模式、供应商和用例场景。
  2. 加载相关的场景参考和供应商参考;不加载无关参考。
  3. 对供应商特定信息应用新鲜度校验。
  4. 执行适用性校验。
  5. 按顺序映射提示结构:工具、结构化输出架构、系统/开发者指令、少样本示例、静态文档/上下文、检索到的上下文、对话历史、用户特定数据、易变值。
  6. 将每个段标记为静态、半静态、动态或易变。
  7. 测量症状:缓存比率、TTFT/预填充延迟、输出/解码时间、缓存写入与读取对比,以及指标下降是否与部署、SDK变更、提示变更、副本数或Agent步骤相关。
  8. 扫描以下通用反模式。
  9. 若存在Agent循环或工具,执行Agent工具稳定性检查。
  10. 应用加载的参考中的供应商特定检查。
  11. 报告发现,包含证据类型、置信度、影响条件、安全的第一步操作、具体修复方案、验证指标以及当前不应执行的操作。
  12. 进行代码变更时,先验证前缀稳定性再宣布成功。

Audit Playbooks

审计手册

Use these as starting paths for common support and review requests. Still run provider detection and the Freshness Gate before exact claims.
  • OpenAI cached_tokens=0: check prompt length/threshold, first-prefix drift,
    responses.create
    vs Chat usage fields,
    prompt_cache_key
    granularity,
    prompt_cache_retention
    , output-token dominance, and whether an OpenAI-compatible wrapper is actually in use.
  • Claude/Bedrock/OpenRouter writes without reads: distinguish cache creation/write fields from read/hit fields, then inspect cache breakpoint placement, dynamic content before the breakpoint, TTL/retention, model/region/API support, fallback routing, and actual routed provider/model.
  • Dynamic tools in long agent loops: compare
    tools_count
    , sorted tool-name hash,
    prefix_hash
    , mode state, and cache fields per step. Prefer stable route-level tool bundles, sorted schemas, provider-supported allowed tools/tool search/deferred loading, or self-hosted masking after checking current docs.
  • High hit rate but no savings: separate input savings from total cost and final latency. Check output-token share, decode time, external tool time, TPM/rate-limit behavior, and cache read/write pricing assumptions before changing prompt layout.
  • OpenAI-compatible wrapper ambiguity: if
    base_url
    , Azure, OpenRouter, Bedrock, DashScope/Qwen, or another gateway wraps an OpenAI SDK, load the wrapper reference first and do not recommend direct OpenAI-only parameters until the wrapper docs support them.
  • Self-hosted multi-replica miss: inspect gateway/service routing, prefix-aware hashing, tokenizer/chat-template drift,
    max_model_len
    , KV block pressure, eviction metrics, and route/replica-level hit metrics.
  • New provider docs project-change audit: compare the new provider facts against current code, references, evals, and tests. Recommend no code change when the project already encodes the behavior or when the fact is not applicable to this provider path.
将这些作为常见支持和评审请求的起始路径。在做出明确声明前仍需执行供应商检测和新鲜度校验。
  • OpenAI cached_tokens=0:检查提示长度/阈值、前缀初始漂移、
    responses.create
    与Chat使用字段对比、
    prompt_cache_key
    粒度、
    prompt_cache_retention
    、输出令牌主导情况,以及是否实际使用的是OpenAI兼容包装器。
  • Claude/Bedrock/OpenRouter仅写入未读取:区分缓存创建/写字段与读取/命名字段,然后检查缓存断点位置、断点前的动态内容、TTL/保留期、模型/区域/API支持、fallback路由和实际路由的供应商/模型。
  • 长Agent循环中的动态工具:对比每步的
    tools_count
    、排序后的工具名称哈希、
    prefix_hash
    、模式状态和缓存字段。优先使用稳定的路由级工具包、排序后的架构、供应商支持的允许工具/工具搜索/延迟加载,或在检查当前文档后使用自托管掩码。
  • 高命中率但无节省:区分输入节省与总成本和最终延迟。在更改提示布局前,检查输出令牌占比、解码时间、外部工具时间、TPM/速率限制行为和缓存读写定价假设。
  • OpenAI兼容包装器歧义:若
    base_url
    、Azure、OpenRouter、Bedrock、DashScope/Qwen或其他网关包装了OpenAI SDK,优先加载包装器参考,在包装器文档支持前不推荐仅适用于OpenAI的直接参数。
  • 自托管多副本未命中:检查网关/服务路由、前缀感知哈希、令牌器/聊天模板漂移、
    max_model_len
    、KV块压力、驱逐指标和路由/副本级命中指标。
  • 新供应商文档项目变更审计:将新的供应商信息与当前代码、参考、评估和测试进行对比。若项目已包含相关行为,或该信息不适用于此供应商路径,建议无需代码变更。

Rule Categories

规则分类

Use this taxonomy to keep audits consistent:
PriorityCategoryExamples
P0Provider correctnessOpenAI automatic caching, Responses vs Chat usage fields, Anthropic
cache_control
, Bedrock
cachePoint
, provider thresholds, TTL/retention
P1Prefix stabilitystatic-first ordering, dynamic-last placement, no volatile early values, stable tools/schemas, deterministic document order
P2Measurementcache hit ratio, cache write/read distinction, output-token share, TTFT vs final latency, prompt/tool/schema hashes
P3Architectureprompt cache vs response cache, RAG vs CAG, multi-tenant boundaries, managed routing, self-hosted replica locality
P4Reportingfile-line findings, before/after prompt layout, ROI assumptions, validation commands
使用此分类法保持审计一致性:
优先级分类示例
P0供应商正确性OpenAI自动缓存、Responses与Chat使用字段、Anthropic
cache_control
、Bedrock
cachePoint
、供应商阈值、TTL/保留期
P1前缀稳定性静态优先排序、动态内容后置、无早期易变值、稳定的工具/架构、确定的文档顺序
P2测量缓存命中率、缓存写入/读取区分、输出令牌占比、TTFT与最终延迟对比、提示/工具/架构哈希
P3架构提示缓存 vs 响应缓存、RAG vs CAG、多租户边界、托管路由、自托管副本局部性
P4报告文件-行发现、提示布局前后对比、ROI假设、验证命令

Severity

严重性

Applicability Before Severity

先校验适用性再确定严重性

Assign severity only after the Project Context Gate and Applicability Gate. A real anti-pattern in cold, sparse, single-run, output-bound, or non-cacheable routes is not automatically a high-severity cache finding.
Do not mark prefix-cache findings high severity without evidence that the affected route is hot, repeated within cache lifetime, has a long stable prefix, and has meaningful input-cost or TTFT impact. If those conditions are unknown, use
medium
,
low
, or
needs validation
, and state exactly what telemetry would raise or lower the severity.
Assign severity from impact and evidence, not from the anti-pattern name alone:
  • Critical: confirmed metric drop or cache miss on a large shared prefix, expensive model, high traffic, long agent trajectory, or multi-replica production path.
  • High: likely cache killer found in a hot path, or telemetry/traffic/token evidence shows meaningful cache/cost/TTFT impact but metrics are still incomplete.
  • Medium: pattern can fragment cache but impact depends on traffic shape.
  • Low: defensive cleanup, documentation, or monitoring improvement.
If hotness, prefix length, repeat cadence, or cost impact is unknown, prefer
medium
and state the condition that would escalate it to
high
.
仅在完成项目上下文校验和适用性校验后才分配严重性。在冷路径、稀疏调用、单次运行、输出主导或不可缓存路由中存在的真实反模式,并不自动构成高严重性缓存发现。
若无证据表明受影响路由是高频、在缓存生命周期内重复、有长稳定前缀且对输入成本或TTFT有显著影响,请勿将前缀缓存发现标记为高严重性。若这些条件未知,使用
中等
需验证
,并明确说明哪些遥测数据会提高或降低严重性。
根据影响和证据分配严重性,而非仅根据反模式名称:
  • 严重:在大型共享前缀、昂贵模型、高流量、长Agent轨迹或多副本生产路径上确认出现指标下降或缓存未命中。
  • :在高频路径中发现可能导致缓存失效的问题,或遥测/流量/令牌证据显示对缓存/成本/TTFT有显著影响,但指标仍不完整。
  • 中等:模式可能导致缓存碎片化,但影响取决于流量形态。
  • :防御性清理、文档或监控改进。
若热度、前缀长度、重复频率或成本影响未知,优先使用
中等
并说明将其升级为
的条件。

Universal Anti-Patterns

通用反模式

AP-1: Volatile Data In Prefix

AP-1:前缀中包含易变数据

Dynamic values near the beginning make every request unique.
Search for:
datetime
,
time
,
uuid
,
session_id
,
request_id
,
trace_id
,
run_id
,
user.name
,
user_id
,
tenant
,
company
,
cwd
,
git status
,
platform
in system prompts, tool schemas, or early messages.
Fix:
python
undefined
开头附近的动态值会使每个请求都唯一。
搜索:系统提示、工具架构或早期消息中的
datetime
time
uuid
session_id
request_id
trace_id
run_id
user.name
user_id
tenant
company
cwd
git status
platform
修复:
python
undefined

bad

错误写法

system = f"Today: {datetime.now()}. You help {user.name}. {BASE_PROMPT}"
system = f"Today: {datetime.now()}. You help {user.name}. {BASE_PROMPT}"

good

正确写法

system = BASE_PROMPT user_msg = f"[ctx: time={datetime.now()}, user={user.name}] {query}"

Verify: render the cacheable prefix for multiple users/timestamps and confirm its fingerprint does not change.
system = BASE_PROMPT user_msg = f"[ctx: time={datetime.now()}, user={user.name}] {query}"

验证:为多个用户/时间戳渲染可缓存前缀,确认其指纹未变化。

AP-2: Non-Deterministic Tools, Schemas, Or Serialization

AP-2:工具、架构或序列化非确定性

Tool definitions, JSON schemas, or structured-output formats change order or include dynamic constants.
Search for:
json.dumps
without
sort_keys=True
, dict/set-derived tool lists, registry iteration without sorting, dynamic
requestId
or timestamps in
response_format
/ JSON schema.
Fix:
python
tools = sorted(registry.values(), key=lambda t: t["function"]["name"])
schema_text = json.dumps(response_format, ensure_ascii=False, sort_keys=True)
Keep
request_id
, tenant, trace, and telemetry outside cacheable schemas.
工具定义、JSON架构或结构化输出格式的顺序变化,或包含动态常量。
搜索:未使用
sort_keys=True
json.dumps
、字典/集合派生的工具列表、未排序的注册表迭代、
response_format
/JSON架构中的动态
requestId
或时间戳。
修复:
python
tools = sorted(registry.values(), key=lambda t: t["function"]["name"])
schema_text = json.dumps(response_format, ensure_ascii=False, sort_keys=True)
request_id
、租户、跟踪和遥测信息放在可缓存架构之外。

AP-3: Template, Whitespace, Media, Or SDK Drift

AP-3:模板、空白、媒体或SDK漂移

Multiple code paths render the same prompt differently.
Search for: duplicated system templates, manual string concatenation, inconsistent
.strip()
, different markdown wrappers, image
detail
changes, signed URLs with rotating query strings, URL vs base64 differences.
Fix: create one canonical render function for the cacheable prefix. Normalize whitespace and media parameters. Pin image representation and detail level where applicable.
多个代码路径渲染相同提示的方式不同。
搜索:重复的系统模板、手动字符串拼接、不一致的
.strip()
、不同的Markdown包装器、图片
detail
变更、带有旋转查询字符串的签名URL、URL与base64差异。
修复:为可缓存前缀创建一个标准渲染函数。标准化空白和媒体参数。在适用的情况下固定图片表示和细节级别。

AP-4: Dynamic Tool Set Inside Agent Loop

AP-4:Agent循环内的动态工具集

Changing the
tools
array between agent steps rewrites early prompt content. A shorter prompt can be more expensive if it destroys reuse for the growing trajectory.
Search for: per-step tool retrieval,
get_active_tools
,
select_tools
, dynamic MCP tool lists, feature-flagged tool inclusion, unordered subagent/tool descriptions.
Fix options:
  • Keep compact tools stable for the session and sort by name.
  • Use masking/constrained decoding in self-hosted inference.
  • Use provider mechanisms such as allowed-tools, tool search, or deferred loading only after checking current docs.
  • Route to fixed tool bundles before the agent loop for multi-domain apps.
Do not move tool definitions into user messages as a cache workaround unless the provider explicitly supports that pattern.
Agent步骤间更改
tools
数组会重写早期提示内容。更短的提示可能因破坏不断增长的轨迹的复用性而变得更昂贵。
搜索:每步工具检索、
get_active_tools
select_tools
、动态MCP工具列表、功能标志控制的工具包含、未排序的子Agent/工具描述。
修复选项:
  • 在会话中保持紧凑工具稳定,并按名称排序。
  • 在自托管推理中使用掩码/约束解码。
  • 仅在检查当前文档后使用供应商机制,如允许的工具、工具搜索或延迟加载。
  • 对于多域应用,在Agent循环前路由到固定工具包。
除非供应商明确支持该模式,否则请勿将工具定义移至用户消息中作为缓存解决方法。

AP-5: History Mutation Instead Of Append-Only Growth

AP-5:历史修改而非仅追加增长

Rewriting early messages breaks the prefix chain.
Search for:
summarize
,
compaction
,
truncate
,
messages.pop
,
del messages
, replacement of early turns, system prompt mutation mid-session.
Fix: preserve an anchor and mutate later content only.
python
def manage_context(messages, max_tokens):
    anchor = messages[:2]  # system + first stable turn
    tail = messages[2:]
    while tail and count_tokens(anchor + tail) > max_tokens:
        tail.pop(0)
    if count_tokens(anchor + tail) > max_tokens:
        raise ValueError("stable prefix anchor exceeds context budget")
    return anchor + tail
For agents: prefer raw history, then compact bulky tool results by preserving paths/IDs/URLs, and use lossy summarization only when compaction is insufficient. If the stable anchor alone exceeds the budget, do not silently drop it; choose a provider-specific strategy, split the route/tool bundle, or fail closed with a clear diagnostic.
重写早期消息会破坏前缀链。
搜索:
summarize
compaction
truncate
messages.pop
del messages
、替换早期对话轮次、会话中修改系统提示。
修复:保留锚点,仅修改后续内容。
python
def manage_context(messages, max_tokens):
    anchor = messages[:2]  # 系统消息 + 第一个稳定对话轮次
    tail = messages[2:]
    while tail and count_tokens(anchor + tail) > max_tokens:
        tail.pop(0)
    if count_tokens(anchor + tail) > max_tokens:
        raise ValueError("稳定前缀锚点超出上下文预算")
    return anchor + tail
对于Agent:优先保留原始历史,然后通过保留路径/ID/URL压缩庞大的工具结果,仅在压缩不足时使用有损摘要。若稳定锚点本身超出预算,请勿静默丢弃;选择供应商特定策略、拆分路由/工具包,或通过明确诊断关闭流程。

AP-6: Mode Switching Or Framework Injection Mutates The Prefix

AP-6:模式切换或框架注入修改前缀

Mode changes or framework metadata appear before the growing history.
Search for: plan/debug/read-only modes implemented by swapping system prompts or tool lists, framework-injected
run_id
,
trace_id
, timestamps,
cwd
, platform, git status, or user/session metadata in the cacheable prefix.
Fix: keep base instructions and tool definitions stable. Put mode state and dynamic environment facts later in messages or non-cacheable metadata when the provider supports it.
模式变更或框架元数据出现在不断增长的历史之前。
搜索:通过切换系统提示或工具列表实现的计划/调试/只读模式、框架注入的
run_id
trace_id
、时间戳、
cwd
、平台、git状态或用户/会话元数据出现在可缓存前缀中。
修复:保持基础指令和工具定义稳定。当供应商支持时,将模式状态和动态环境事实放在消息的后面或不可缓存的元数据中。

AP-7: Cache-Blind Routing Across Replicas

AP-7:跨副本的缓存盲路由

Stable prompts still miss if requests reach different machines.
Search for: k8s Services with multiple replicas, round-robin gateways, autoscaling LLM pods, multiple vLLM/SGLang replicas, no sticky/prefix-aware routing.
Fix depends on stack:
  • Managed API: use provider-supported cache key or routing hint only after checking docs.
  • OpenRouter: inspect sticky routing,
    provider.order
    ,
    provider.only
    ,
    provider.ignore
    , fallback/model routing, and first-message conversation identity.
  • vLLM/SGLang/self-hosted: use prefix-aware routing, consistent hashing, or a gateway that hashes the stable prefix.
  • Minimum viable: route by stable prefix family while monitoring hot spots.
即使提示稳定,若请求到达不同机器仍会出现缓存未命中。
搜索:带有多个副本的k8s服务、轮询网关、自动扩缩容的LLM Pod、多个vLLM/SGLang副本、无粘性/前缀感知路由。
修复取决于技术栈:
  • 托管API:仅在检查文档后使用供应商支持的缓存键或路由提示。
  • OpenRouter:检查粘性路由、
    provider.order
    provider.only
    provider.ignore
    、fallback/模型路由和第一条消息的对话标识。
  • vLLM/SGLang/自托管:使用前缀感知路由、一致性哈希,或对稳定前缀进行哈希的网关。
  • 最小可行方案:按稳定前缀族路由,同时监控热点。

AP-8: Parallel Fan-Out On A Cold Prefix

AP-8:冷前缀上的并行扇出

Concurrent requests sharing a prefix can all pay full prefill if they start before cache creation is visible.
Search for:
asyncio.gather
,
Promise.all
,
ThreadPoolExecutor
, batch LLM calls, map/reduce fan-out without warm-up.
Fix:
python
await warm_cache(shared_prefix, max_tokens=1, tools=[])
results = await asyncio.gather(*[call(shared_prefix, q) for q in batch])
Warm-up must be safe: disable tools or constrain them to read-only/no-op behavior, avoid prompts that can mutate external state, and verify the provider exposes cache creation before relying on this pattern. If no safe warm-up call exists, skip warm-up and report the trade-off. Verify with usage metadata on the second wave, not just latency.
共享前缀的并发请求若在缓存创建可见前启动,可能都需支付完整的预填充成本。
搜索:
asyncio.gather
Promise.all
ThreadPoolExecutor
、批量LLM调用、无预热的 map/reduce 扇出。
修复:
python
await warm_cache(shared_prefix, max_tokens=1, tools=[])
results = await asyncio.gather(*[call(shared_prefix, q) for q in batch])
预热必须安全:禁用工具或限制为只读/无操作行为,避免可能修改外部状态的提示,并在依赖此模式前验证供应商是否暴露缓存创建。若无安全的预热调用,跳过预热并报告权衡。使用第二波请求的使用元数据进行验证,而非仅依赖延迟。

AP-9: Cache Lifetime, KV Budget, Or Eviction Mismatch

AP-9:缓存生命周期、KV预算或驱逐不匹配

The prefix is stable but cache entries expire or get evicted before reuse.
Search for: sparse traffic, batch windows separated by long pauses, large number of prefix families, overlarge
max_model_len
, low KV cache capacity, eviction metrics, available GPU blocks near zero.
Fix: match TTL/retention to traffic cadence for managed APIs. For self-hosted inference, size KV cache for the working set, not the theoretical model context window.
前缀稳定,但缓存条目在被复用前已过期或被驱逐。
搜索:稀疏流量、长时间间隔的批处理窗口、大量前缀族、过大的
max_model_len
、低KV缓存容量、驱逐指标、可用GPU块接近零。
修复:对于托管API,将TTL/保留期与流量频率匹配。对于自托管推理,根据工作集大小调整KV缓存,而非理论模型上下文窗口。

AP-9b: Over-Isolation Fragments Shared Prefixes

AP-9b:过度隔离导致共享前缀碎片化

Security or tenant isolation can intentionally prevent reuse across users. This may be correct, but it should be an explicit trade-off.
Search for: per-request
cache_salt
, per-user cache keys, tenant-specific routing keys,
user_id
in cache key, cache namespace by session, full cache isolation flags.
Fix: choose the coarsest safe trust boundary. Prefer route/team/tenant prefixes only when the data isolation model allows it; use per-user isolation when compliance or side-channel risk requires it. Report the expected cache-efficiency loss instead of treating it as a bug.
安全或租户隔离可能有意阻止跨用户复用。这可能是正确的,但应是明确的权衡。
搜索:每请求
cache_salt
、每用户缓存键、租户特定路由键、缓存键中的
user_id
、按会话划分的缓存命名空间、完全缓存隔离标志。
修复:选择最粗粒度的安全信任边界。仅当数据隔离模型允许时,优先使用路由/团队/租户前缀;当合规或侧信道风险要求时,使用每用户隔离。报告预期的缓存效率损失,而非将其视为漏洞。

AP-10: Experiment Or Config Fragmentation

AP-10:实验或配置碎片化

A/B tests, prompt variants, model settings, managed-router settings, or reasoning/tool settings split reuse into many small caches.
Search for:
variant
,
experiment
,
feature_flag
, prompt version per request, changing reasoning effort, changing tool choice, random few-shot examples,
openrouter/auto
, multiple
models
,
provider.order
, provider fallback settings.
Fix: test sequentially where possible, move differences after the stable prefix, or bucket by stable route/version and measure each bucket separately.
A/B测试、提示变体、模型设置、托管路由设置或推理/工具设置将复用拆分为多个小缓存。
搜索:
variant
experiment
feature_flag
、每请求提示版本、变化的推理力度、变化的工具选择、随机少样本示例、
openrouter/auto
、多个
models
provider.order
、供应商fallback设置。
修复:尽可能按顺序测试,将差异放在稳定前缀之后,或按稳定路由/版本分组并分别测量每个分组。

Agent Tool Stability Checks

Agent工具稳定性检查

Run these whenever the app is an agent, coding assistant, MCP client, tool-using assistant, or multi-step workflow.
  • Log
    cached_tokens
    /
    cache_read_input_tokens
    on each step.
  • Log
    prefix_hash
    for canonical
    system + tools + stable early messages
    .
  • Log
    tools_count
    and a sorted list/hash of tool names.
  • Log output tokens and streaming timestamps when latency is the symptom.
  • Check whether cache drops exactly when the tool list changes.
  • Confirm tool descriptions are fixed at session start or loaded via provider-supported deferred mechanisms.
  • Confirm compaction does not rewrite
    system + tools + first messages
    .
  • Confirm framework metadata is not injected before the cacheable prefix.
  • For managed routers such as OpenRouter, log actual model/provider route when available.
Use this diagnostic helper when project-specific tokenization is unavailable:
python
import hashlib
import json


def prefix_fingerprint(system, tools=None, response_format=None, early_messages=None):
    payload = {
        "system": system,
        "tools": tools or [],
        "response_format": response_format,
        "early_messages": early_messages or [],
    }
    text = json.dumps(payload, ensure_ascii=False, sort_keys=True)
    return hashlib.sha256(text.encode()).hexdigest()[:12]
This is a guardrail, not a provider tokenizer. Provider usage metadata remains the source of truth. For production telemetry or tenant/user-derived prompt fingerprints, use a keyed hash such as HMAC-SHA256 rather than a bare digest.
当应用是Agent、编码助手、MCP客户端、使用工具的助手或多步骤工作流时,执行以下检查。
  • 记录每步的
    cached_tokens
    /
    cache_read_input_tokens
  • 记录标准
    system + tools + 稳定早期消息
    prefix_hash
  • 记录
    tools_count
    和排序后的工具名称列表/哈希。
  • 当延迟是症状时,记录输出令牌和流时间戳。
  • 检查缓存是否在工具列表变更时恰好下降。
  • 确认工具描述在会话开始时固定,或通过供应商支持的延迟机制加载。
  • 确认压缩未重写
    system + tools + 第一条消息
  • 确认框架元数据未注入可缓存前缀之前。
  • 对于OpenRouter等托管路由,记录实际模型/供应商路由(若可用)。
当项目特定令牌化不可用时,使用以下诊断助手:
python
import hashlib
import json


def prefix_fingerprint(system, tools=None, response_format=None, early_messages=None):
    payload = {
        "system": system,
        "tools": tools or [],
        "response_format": response_format,
        "early_messages": early_messages or [],
    }
    text = json.dumps(payload, ensure_ascii=False, sort_keys=True)
    return hashlib.sha256(text.encode()).hexdigest()[:12]
这是一个防护措施,而非供应商令牌器。供应商使用元数据仍是权威来源。对于生产遥测或租户/用户派生的提示指纹,使用带密钥的哈希如HMAC-SHA256而非裸摘要。

Report Format

报告格式

Default to terse findings first. Use the evidence-bearing format for actionable findings:
text
source | severity | provider/engine | issue | evidence | evidence_type | confidence | impact_condition | cache impact | safe_first_action | fix | validation | do_not_do_yet
When space is tight, the legacy compact format is acceptable if the surrounding prose still states evidence, confidence, and what not to do yet:
text
file:line | severity | provider/engine | issue | cache impact | fix | validation
When structure is the issue, include compact before/after prompt layout.
markdown
undefined
默认先给出简洁的发现。对可执行发现使用带证据的格式:
text
source | severity | provider/engine | issue | evidence | evidence_type | confidence | impact_condition | cache impact | safe_first_action | fix | validation | do_not_do_yet
当空间有限时,若周围 prose 仍说明证据、置信度和当前不应执行的操作,可使用传统的紧凑格式:
text
file:line | severity | provider/engine | issue | cache impact | fix | validation
当结构存在问题时,包含紧凑的提示布局前后对比。
markdown
undefined

Prefix Cache Audit Report

前缀缓存审计报告

Provider/engine: ... Mode: code audit / advisory / agent audit Provider facts: verified on YYYY-MM-DD / unverified Measurement change: yes / no / unknown Prompt behavior change: yes / no / pilot only / unknown Provider/routing change: yes / no / not yet / unknown Confidence: high / medium / low Do first: ... Do not do yet: ...
供应商/引擎: ... 模式: 代码审计 / 咨询建议 / Agent审计 供应商信息: YYYY-MM-DD验证 / 未验证 测量变更: 是 / 否 / 未知 提示行为变更: 是 / 否 / 仅试点 / 未知 供应商/路由变更: 是 / 否 / 暂不执行 / 未知 置信度: 高 / 中等 / 低 优先执行: ... 暂不执行: ...

Findings

发现

path/to/file.py:42 | medium | OpenAI | tool schema order changes between calls | registry iteration is unsorted in the request builder | confirmed from code | medium-high | matters on hot repeated routes with long tool schemas | every order change can invalidate downstream prefix reuse | add tool/schema hash logging first | sort tools by stable name and serialize schemas with sorted keys | compare prefix fingerprints across three requests and confirm cached-token fields increase | do not change provider routing yet
path/to/file.py:42 | 中等 | OpenAI | 工具架构顺序在调用间变化 | 请求构建器中的注册表迭代未排序 | 代码确认 | 中高 | 在高频重复且工具架构较长的路由中重要 | 每次顺序变更都会使下游前缀复用失效 | 先添加工具/架构哈希日志 | 按稳定名称排序工具并使用排序键序列化架构 | 对比三个请求的前缀指纹,确认cached_token字段增加 | 暂不修改供应商路由

Clean Checks

合规检查

  • AP-1 volatile prefix data: clean
  • AP-7 routing: not applicable, single managed endpoint
  • AP-1 易变前缀数据: 合规
  • AP-7 路由: 不适用,单一托管端点

Monitoring

监控

  • cache ratio definition for this provider
  • prefill/TTFT vs decode/output split
  • prefix hash dimensions
  • deploy/change correlation to watch
undefined
  • 此供应商的缓存比率定义
  • 预填充/TTFT vs 解码/输出拆分
  • 前缀哈希维度
  • 需监控的部署/变更关联
undefined

Agent-First Quality Bar

Agent优先质量标准

Before finalizing an audit response:
  • Answer the decision the user asked for: change needed, no change, or evidence missing.
  • Prefer wrapper/router references over generic provider references when both signals exist.
  • Do not make exact provider claims without loading the relevant reference and applying the Freshness Gate.
  • Distinguish cache miss, cache write-without-read, uneconomic cache hit, decode-bound latency, rate-limit pressure, and privacy-driven isolation.
  • Include validation that can falsify the recommendation: prefix fingerprints, provider usage fields, route/replica metrics, or cost/latency split.
  • Do not propose cache controls, cache keys, or routing hints when the Not worth caching contract applies.
在最终确定审计响应前:
  • 响应用户询问的决策:需要变更、无需变更或缺少证据。
  • 当同时存在包装器/路由和通用供应商信号时,优先使用包装器/路由参考。
  • 未加载相关参考并应用新鲜度校验前,不做出明确的供应商声明。
  • 区分缓存未命中、仅写入未读取缓存、不经济的缓存命中、解码主导延迟、速率限制压力和隐私驱动的隔离。
  • 包含可证伪建议的验证:前缀指纹、供应商使用字段、路由/副本指标或成本/延迟拆分。
  • 当“无需缓存”约定适用时,不提议缓存控制、缓存键或路由提示。

Verification

验证

Do not claim a fix works until one of these is true:
  • Prefix-stability fixes: rendered cacheable prefix fingerprint is unchanged across different users/timestamps/queries.
  • Provider fixes: repeated calls show cache-read/cached-token fields increasing according to the provider reference.
  • Routing fixes: repeated prefix families land on the intended route and cache metrics improve by route.
  • vLLM/self-hosted fixes: prefix cache hit metrics and KV block pressure metrics improve under a representative workload.
Recommend a CI/smoke check that renders representative prompts and fails when the cacheable prefix changes unexpectedly.
在以下任一情况成立前,请勿声称修复有效:
  • 前缀稳定性修复:不同用户/时间戳/查询的可缓存前缀渲染指纹未变化。
  • 供应商修复:重复调用显示缓存读取/cached_token字段按供应商参考增加。
  • 路由修复:重复前缀族路由到预期路径,且路由级缓存指标改善。
  • vLLM/自托管修复:在代表性工作负载下,前缀缓存命中指标和KV块压力指标改善。
建议添加CI/冒烟检查:渲染代表性提示,当可缓存前缀意外变更时失败。

Advisory Questions

咨询问题

If no codebase is available, ask only the missing questions needed to diagnose:
  1. Which provider or inference engine?
  2. Is this a cost/migration, prompt/code, agent, deployment, or observability/CI audit?
  3. What artifacts are available: request code, rendered prompts, usage logs, deployment config, dashboards, evals?
  4. What are median/p95 input tokens, static prefix tokens, output tokens, and agent steps?
  5. What cache usage fields are visible in responses?
  6. Are there multiple replicas or gateways?
  7. Are tools/schemas stable across requests and agent steps?
  8. Is history append-only, compacted, or summarized?
  9. Are cache keys, salts, or routing hints per-user/per-request or shared by prefix family?
  10. What changed before the cache hit rate or TTFT regressed?
若无代码库可用,仅询问诊断所需的缺失问题:
  1. 使用的供应商或推理引擎是什么?
  2. 这是成本/迁移、提示/代码、Agent、部署还是可观测性/CI审计?
  3. 有哪些可用工件:请求代码、渲染后的提示、使用日志、部署配置、仪表盘、评估?
  4. 输入令牌、静态前缀令牌、输出令牌和Agent步骤的中位数/p95值是多少?
  5. 响应中可见哪些缓存使用字段?
  6. 是否有多个副本或网关?
  7. 工具/架构在请求和Agent步骤间是否稳定?
  8. 历史是仅追加、压缩还是摘要?
  9. 缓存键、盐或路由提示是每用户/每请求还是按前缀族共享?
  10. 缓存命中率或TTFT退化前发生了什么变更?