Search Results: kv-cache

Found 9 Skills

context-optimization

This skill should be used when the user asks to "optimize context", "reduce token costs", "improve context efficiency", "implement KV-cache optimization", "partition context", or mentions context limits, observation masking, context budgeting, or extending effective context capacity. A core context engineering skill — also activates when the user mentions "context engineering" or "context-engineering" in the context of maximizing information density within token constraints.

1 script | Checked

AI & Machine Learning | eyadsibai/ltk

context-optimization

Use when optimizing agent context, reducing token costs, implementing KV-cache optimization, or asking about "context optimization", "token reduction", "context limits", "observation masking", "context budgeting", "context partitioning"

AI & Machine Learning | charleswiltgen/axiom

axiom-ios-ml

Use when deploying ANY machine learning model on-device, converting models to CoreML, compressing models, or implementing speech-to-text. Covers CoreML conversion, MLTensor, model compression (quantization/palettization/pruning), stateful models, KV-cache, multi-function models, async prediction, SpeechAnalyzer, SpeechTranscriber.

AI & Machine Learning | guanyang/antigravity-skil...

latent-briefing

This skill should be used when the user asks to "share memory between agents", "KV cache compaction for multi-agent", "orchestrator worker context", "latent briefing", "reduce worker tokens", "cross-agent memory without summarization", or discusses Attention Matching compaction, recursive language models with workers, or token explosion in hierarchical agents.

AI & Machine Learning | shipshitdev/library

context-optimization

Apply optimization techniques to extend effective context capacity. Use when context limits constrain agent performance, when optimizing for cost or latency, or when implementing long-running agent systems.

1 script | Checked

AI & Machine Learning | sernote/audit-prompt-cach...

audit-prompt-caching

Use whenever the user mentions LLM prompt/prefix cache misses, cached_tokens=0, cache_read_input_tokens/cache_creation_input_tokens, prompt_cache_key, cache_control/cachePoint placement, stable prefixes, tool/schema stability, TTFT/prefill latency, OpenAI/Claude/Bedrock/OpenRouter routing, vLLM/SGLang KV reuse, or LLM cost/speed regressions on repeated long prompts. Use when reviewing LLM request shape changes: prompt text, message order, request builders, tools, schemas, response_format, provider API surface, model/router settings, agent loop structure, context compaction, or inference deployment. Use for speeding up agents only when prompt-cache stability, TTFT, or cache cost is central. Do not use for generic prompt writing, generic RAG design, token counting, or non-LLM performance.

8 scripts | Attention

AI & Machine Learning | sickn33/antigravity-aweso...

context-optimization

Apply compaction, masking, and caching strategies

AI & Machine Learning | slowlyc/agent-gpu-skills

sglang-skill

Develop, debug, and optimize SGLang LLM serving engine. Use when the user mentions SGLang, sglang, srt, sgl-kernel, LLM serving, model inference, KV cache, attention backend, FlashInfer, MLA, MoE routing, speculative decoding, disaggregated serving, TP/PP/EP, radix cache, continuous batching, chunked prefill, CUDA graph, model loading, quantization FP8/GPTQ/AWQ, JIT kernel, triton kernel SGLang, or asks about serving LLMs with SGLang.

1 script | Attention

AI & Machine Learning | bbuf/sglang-auto-driven-s...

llm-serving-capacity-planner

Parse SGLang/vLLM startup logs to explain GPU memory use and request capacity. Use for KV cache budget, mem-fraction-static comparisons, OOM triage, and max-concurrency estimates.

1 script | Checked