monitoring-observability


Monitoring & Observability


Comprehensive patterns for infrastructure monitoring, LLM observability, and quality drift detection. Each category has individual rule files in `rules/`, loaded on demand.

Quick Reference


| Category | Rules | Impact | When to Use |
|----------|-------|--------|-------------|
| Infrastructure Monitoring | 3 | CRITICAL | Prometheus metrics, Grafana dashboards, alerting rules |
| LLM Observability | 3 | HIGH | Langfuse tracing, cost tracking, evaluation scoring |
| Drift Detection | 3 | HIGH | Statistical drift, quality regression, drift alerting |
| Silent Failures | 3 | HIGH | Tool skipping, quality degradation, loop/token spike alerting |

Total: 12 rules across 4 categories

Quick Start


Prometheus metrics with RED method

```python
from prometheus_client import Counter, Histogram

# RED method: the counter covers request rate and errors (via the status label),
# the histogram covers request duration.
http_requests = Counter(
    'http_requests_total', 'Total requests', ['method', 'endpoint', 'status']
)
http_duration = Histogram(
    'http_request_duration_seconds', 'Request latency',
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5],
)
```

Langfuse LLM tracing

```python
from langfuse import observe, get_client

@observe()
async def analyze_content(content: str):
    # Attach user, session, and tags to the trace created by @observe
    get_client().update_current_trace(
        user_id="user_123",
        session_id="session_abc",
        tags=["production", "orchestkit"],
    )
    # `llm` is the application's own LLM client, defined elsewhere
    return await llm.generate(content)
```

PSI drift detection

```python
import numpy as np

# calculate_psi, alert, and the two score arrays are defined elsewhere
psi_score = calculate_psi(baseline_scores, current_scores)
if psi_score >= 0.25:
    alert("Significant quality drift detected!")
```
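
The PSI snippet calls `calculate_psi` without defining it. As a minimal sketch of what such a helper might look like, the version below bins both samples into equal-width buckets and sums `(current - baseline) * ln(current / baseline)` over the buckets. The bin count, the `eps` floor for empty bins, and the choice of equal-width bins are assumptions, not taken from the rule files, and the sketch uses only the standard library rather than NumPy.

```python
import math
from typing import List, Sequence


def calculate_psi(baseline: Sequence[float], current: Sequence[float],
                  bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two samples, using equal-width bins."""
    lo = min(min(baseline), min(current))
    hi = max(max(baseline), max(current))
    if hi == lo:
        return 0.0  # every value is identical, so there is no shift to measure
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so the maximum value falls into the last bin
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps to avoid log(0) and division by zero
        return [max(c / len(values), eps) for c in counts]

    base_p = proportions(baseline)
    cur_p = proportions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base_p, cur_p))
```

Identical samples give a PSI of zero, and each bucket contributes a non-negative term, so the score only grows as the two distributions diverge.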

Infrastructure Monitoring


Prometheus metrics, Grafana dashboards, and alerting for application health.

| Rule | File | Key Pattern |
|------|------|-------------|
| Prometheus Metrics | `rules/monitoring-prometheus.md` | RED method, counters, histograms, cardinality |
| Grafana Dashboards | `rules/monitoring-grafana.md` | Golden Signals, SLO/SLI, health checks |
| Alerting Rules | `rules/monitoring-alerting.md` | Severity levels, grouping, escalation, fatigue prevention |

LLM Observability


Langfuse-based tracing, cost tracking, and evaluation for LLM applications.

| Rule | File | Key Pattern |
|------|------|-------------|
| Langfuse Traces | `rules/llm-langfuse-traces.md` | `@observe` decorator, OTEL spans, agent graphs |
| Cost Tracking | `rules/llm-cost-tracking.md` | Token usage, spend alerts, Metrics API |
| Eval Scoring | `rules/llm-eval-scoring.md` | Custom scores, evaluator tracing, quality monitoring |
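
The arithmetic behind a token-based spend alert can be sketched independently of Langfuse. Everything in the block below is hypothetical: the `Usage` record, the `example-model` name, and the per-million-token prices are placeholders to be replaced with your provider's real rates, and the functions are not part of any Langfuse API.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical prices in USD per million tokens; substitute real provider rates.
PRICES_PER_MILLION = {"example-model": {"input": 1.00, "output": 3.00}}


@dataclass
class Usage:
    model: str
    input_tokens: int
    output_tokens: int


def estimate_cost(usage: Usage) -> float:
    """Estimated USD cost of one LLM call from its token counts."""
    price = PRICES_PER_MILLION[usage.model]
    total = usage.input_tokens * price["input"] + usage.output_tokens * price["output"]
    return total / 1_000_000


def over_budget(usages: Iterable[Usage], daily_budget_usd: float) -> bool:
    """True when the summed estimated cost exceeds the daily budget."""
    return sum(estimate_cost(u) for u in usages) > daily_budget_usd
```

In practice the token counts would come from the traced generations, and the budget check would feed whatever alerting channel the team already uses.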

Drift Detection


Statistical and quality drift detection for production LLM systems.

| Rule | File | Key Pattern |
|------|------|-------------|
| Statistical Drift | `rules/drift-statistical.md` | PSI, KS test, KL divergence, EWMA |
| Quality Drift | `rules/drift-quality.md` | Score regression, baseline comparison, canary prompts |
| Drift Alerting | `rules/drift-alerting.md` | Dynamic thresholds, correlation, anti-patterns |
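
Of the statistical methods listed, EWMA is the simplest to illustrate. The sketch below smooths a stream of quality scores with an exponentially weighted moving average and flags the steps where the smoothed value strays from a baseline mean. The function name, the `alpha` smoothing factor, and the fixed `tolerance` are assumptions chosen for illustration, not values from `rules/drift-statistical.md`.

```python
from typing import Iterable, List


def ewma_drift_flags(scores: Iterable[float], baseline_mean: float,
                     alpha: float = 0.3, tolerance: float = 0.1) -> List[bool]:
    """Flag each step where the smoothed score strays from the baseline mean."""
    flags = []
    ewma = baseline_mean  # start the moving average at the baseline
    for score in scores:
        # Higher alpha reacts faster to new scores; lower alpha smooths more
        ewma = alpha * score + (1 - alpha) * ewma
        flags.append(abs(ewma - baseline_mean) > tolerance)
    return flags
```

Because the average is smoothed, a single low score does not trigger a flag on its own, while a sustained drop does after a few observations.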

Silent Failures


Detection and alerting for silent failures in LLM agents.

| Rule | File | Key Pattern |
|------|------|-------------|
| Tool Skipping | `rules/silent-tool-skipping.md` | Expected vs actual tool calls, Langfuse traces |
| Quality Degradation | `rules/silent-degraded-quality.md` | Heuristics + LLM-as-judge, z-score baselines |
| Silent Alerting | `rules/silent-alerting.md` | Loop detection, token spikes, escalation workflow |
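
Two of these checks reduce to simple comparisons over the list of tool names recorded in a trace. The sketch below is one possible shape, assuming the tool calls have already been extracted as a list of strings. The function names and the `max_repeats` cutoff are hypothetical, and a raw repeat count is only a rough loop signal.

```python
from collections import Counter
from typing import Dict, List, Set


def missing_tools(expected: Set[str], actual_calls: List[str]) -> Set[str]:
    """Tools the agent was expected to call but never did."""
    return expected - set(actual_calls)


def looping_tools(actual_calls: List[str], max_repeats: int = 3) -> Dict[str, int]:
    """Tools called more than max_repeats times in one trace."""
    counts = Counter(actual_calls)
    return {name: n for name, n in counts.items() if n > max_repeats}
```

A non-empty result from either function is a candidate for the escalation workflow rather than proof of failure, since some tasks legitimately skip or repeat a tool.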

Key Decisions


| Decision | Recommendation | Rationale |
|----------|----------------|-----------|
| Metric methodology | RED method (Rate, Errors, Duration) | Industry standard, covers essential service health |
| Log format | Structured JSON | Machine-parseable, supports log aggregation |
| Tracing | OpenTelemetry | Vendor-neutral, auto-instrumentation, broad ecosystem |
| LLM observability | Langfuse (not LangSmith) | Open-source, self-hosted, built-in prompt management |
| LLM tracing API | `@observe` + `get_client()` | OTEL-native, automatic span creation |
| Drift method | PSI for production, KS for small samples | PSI is stable for large datasets, KS is more sensitive |
| Threshold strategy | Dynamic (95th percentile) over static | Reduces alert fatigue, context-aware |
| Alert severity | 4 levels (Critical, High, Medium, Low) | Clear escalation paths, appropriate response times |
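
The dynamic threshold recommendation can be sketched with the standard library alone. The block below derives the alert threshold from the 95th percentile of recent observations instead of a fixed number. The function names are hypothetical, and how much history to keep is left to the caller.

```python
import statistics
from typing import Sequence


def dynamic_threshold(history: Sequence[float]) -> float:
    """95th percentile of recent values, used as the alert threshold."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
    return statistics.quantiles(history, n=20)[-1]


def should_alert(value: float, history: Sequence[float]) -> bool:
    """True when the new value exceeds the threshold derived from history."""
    return value > dynamic_threshold(history)
```

Because the threshold follows the recent distribution, a metric that is normally noisy gets a higher bar than one that is normally flat, which is the sense in which the strategy is context-aware.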

Detailed Documentation


| Resource | Description |
|----------|-------------|
| `references/` | Logging, metrics, tracing, Langfuse, drift analysis guides |
| `checklists/` | Implementation checklists for monitoring and Langfuse setup |
| `examples/` | Real-world monitoring dashboard and trace examples |
| `scripts/` | Templates: Prometheus, OpenTelemetry, health checks, Langfuse |

Related Skills


- `defense-in-depth` - Layer 8 observability as part of security architecture
- `devops-deployment` - Observability integration with CI/CD and Kubernetes
- `resilience-patterns` - Monitoring circuit breakers and failure scenarios
- `llm-evaluation` - Evaluation patterns that integrate with Langfuse scoring
- `caching` - Caching strategies that reduce costs tracked by Langfuse