monitoring-observability
Monitoring & Observability
Comprehensive patterns for infrastructure monitoring, LLM observability, and quality drift detection. Each category has individual rule files in `rules/`, loaded on demand.
Quick Reference
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Infrastructure Monitoring | 3 | CRITICAL | Prometheus metrics, Grafana dashboards, alerting rules |
| LLM Observability | 3 | HIGH | Langfuse tracing, cost tracking, evaluation scoring |
| Drift Detection | 3 | HIGH | Statistical drift, quality regression, drift alerting |
| Silent Failures | 3 | HIGH | Tool skipping, quality degradation, loop/token spike alerting |
Total: 12 rules across 4 categories
Quick Start
```python
# Prometheus metrics with RED method
from prometheus_client import Counter, Histogram

http_requests = Counter('http_requests_total', 'Total requests', ['method', 'endpoint', 'status'])
http_duration = Histogram('http_request_duration_seconds', 'Request latency',
                          buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5])
```

```python
# Langfuse LLM tracing
from langfuse import observe, get_client

@observe()
async def analyze_content(content: str):
    get_client().update_current_trace(
        user_id="user_123", session_id="session_abc",
        tags=["production", "orchestkit"],
    )
    # `llm` is your own async LLM client, defined elsewhere
    return await llm.generate(content)
```

```python
# PSI drift detection
import numpy as np

# `calculate_psi` and `alert` are project helpers, defined elsewhere
psi_score = calculate_psi(baseline_scores, current_scores)
if psi_score >= 0.25:
    alert("Significant quality drift detected!")
```
Infrastructure Monitoring
Prometheus metrics, Grafana dashboards, and alerting for application health.
| Rule | File | Key Pattern |
|---|---|---|
| Prometheus Metrics | | RED method, counters, histograms, cardinality |
| Grafana Dashboards | | Golden Signals, SLO/SLI, health checks |
| Alerting Rules | | Severity levels, grouping, escalation, fatigue prevention |
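
To make the RED method and histogram buckets in the table above concrete, here is a minimal standard-library sketch that summarizes a window of requests as Rate, Errors, and Duration. The `Request` record and `red_summary` function are hypothetical names, and counting only 5xx responses as errors is an assumption. The bucket bounds are copied from the quick-start histogram, and the cumulative counting mirrors the less-or-equal (`le`) semantics of Prometheus histograms.

```python
from bisect import bisect_left
from dataclasses import dataclass

# Bucket upper bounds in seconds, copied from the quick-start histogram.
BUCKETS = [0.01, 0.05, 0.1, 0.5, 1, 2, 5]


@dataclass
class Request:
    status: int      # HTTP status code
    duration: float  # request latency in seconds


def red_summary(requests, window_seconds):
    """Summarize a window of requests as Rate, Errors, Duration."""
    total = len(requests)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)

    # Cumulative counts: bucket i holds requests with duration <= BUCKETS[i].
    cumulative = [0] * len(BUCKETS)
    for r in requests:
        first = bisect_left(BUCKETS, r.duration)
        for i in range(first, len(BUCKETS)):
            cumulative[i] += 1

    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "le_buckets": dict(zip(BUCKETS, cumulative)),
    }


sample = [Request(200, 0.03), Request(200, 0.2), Request(500, 1.5), Request(200, 0.05)]
summary = red_summary(sample, window_seconds=60)
```

In a real service the Prometheus client and server do this bookkeeping for you; the sketch only shows what the three RED numbers mean.
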
LLM Observability
Langfuse-based tracing, cost tracking, and evaluation for LLM applications.
| Rule | File | Key Pattern |
|---|---|---|
| Langfuse Traces | | @observe decorator, OTEL spans, agent graphs |
| Cost Tracking | | Token usage, spend alerts, Metrics API |
| Eval Scoring | | Custom scores, evaluator tracing, quality monitoring |
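
As an illustration of the token-usage and spend-alert pattern listed above, the sketch below turns token counts into a cost and checks it against a budget. Everything here is hypothetical: the model name, the prices in `PRICES`, and the `Usage`, `cost_usd`, and `over_budget` names are placeholders, not part of Langfuse or of any provider's API.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; replace with your provider's rates.
PRICES = {"example-model": {"input": 3.00, "output": 15.00}}


@dataclass
class Usage:
    model: str
    input_tokens: int
    output_tokens: int


def cost_usd(usage):
    """Cost of one LLM call, given its token usage."""
    price = PRICES[usage.model]
    total = usage.input_tokens * price["input"] + usage.output_tokens * price["output"]
    return total / 1_000_000


def over_budget(usages, daily_budget_usd):
    """True when the summed cost of the calls exceeds the daily budget."""
    return sum(cost_usd(u) for u in usages) > daily_budget_usd


calls = [Usage("example-model", 2_000, 1_000), Usage("example-model", 500, 250)]
```

With the placeholder prices, the first call costs about 0.021 USD, so a budget of 0.01 USD would trigger an alert and a budget of 1 USD would not.
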
Drift Detection
Statistical and quality drift detection for production LLM systems.
| Rule | File | Key Pattern |
|---|---|---|
| Statistical Drift | | PSI, KS test, KL divergence, EWMA |
| Quality Drift | | Score regression, baseline comparison, canary prompts |
| Drift Alerting | | Dynamic thresholds, correlation, anti-patterns |
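
The quick start calls `calculate_psi` without defining it. The sketch below is one possible implementation in plain Python, not necessarily the one the rule files use. It assumes equal-width bins derived from the baseline range and a small `eps` floor so that empty bins do not produce a division by zero or `log(0)`; both choices are assumptions. The 0.25 threshold is the one used in the quick start.

```python
import math


def calculate_psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two non-empty samples of scores."""
    lo, hi = min(baseline), max(baseline)
    # Fall back to a width of 1.0 if every baseline value is identical.
    width = (hi - lo) / bins or 1.0

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range are clamped into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each fraction at eps so the log term below stays finite.
        return [max(c / len(values), eps) for c in counts]

    expected, actual = fractions(baseline), fractions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


baseline_scores = [i / 100 for i in range(100)]      # spread over 0.00-0.99
current_scores = [0.5 + i / 200 for i in range(100)]  # shifted to the upper half
psi_score = calculate_psi(baseline_scores, current_scores)
```

Identical samples give a PSI of zero, while the shifted sample above lands well past the 0.25 threshold. Quantile-based bins are a common alternative when scores cluster in a narrow range.
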
Silent Failures
Detection and alerting for silent failures in LLM agents.
| Rule | File | Key Pattern |
|---|---|---|
| Tool Skipping | | Expected vs actual tool calls, Langfuse traces |
| Quality Degradation | | Heuristics + LLM-as-judge, z-score baselines |
| Silent Alerting | | Loop detection, token spikes, escalation workflow |
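
The three patterns in this table can be sketched with the standard library alone. The function names, the z-score cutoff of 3, and the limit of 3 consecutive identical tool calls are illustrative assumptions; in practice the tool-call lists and token counts would come from your traces.

```python
from statistics import mean, stdev


def skipped_tools(expected, actual_calls):
    """Tools the agent was expected to call but never did."""
    called = set(actual_calls)
    return [tool for tool in expected if tool not in called]


def is_token_spike(history, current, z_threshold=3.0):
    """Flag a token count that sits far above the recent baseline."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > z_threshold


def has_loop(actual_calls, max_repeats=3):
    """Detect more than max_repeats consecutive calls to the same tool."""
    run = 1
    for prev, cur in zip(actual_calls, actual_calls[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeats:
            return True
    return False


missing = skipped_tools(["search", "fetch", "summarize"], ["search", "summarize"])
```

Each check is cheap enough to run on every trace; the escalation workflow then decides which findings page someone and which only get logged.
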
Key Decisions
| Decision | Recommendation | Rationale |
|---|---|---|
| Metric methodology | RED method (Rate, Errors, Duration) | Industry standard, covers essential service health |
| Log format | Structured JSON | Machine-parseable, supports log aggregation |
| Tracing | OpenTelemetry | Vendor-neutral, auto-instrumentation, broad ecosystem |
| LLM observability | Langfuse (not LangSmith) | Open-source, self-hosted, built-in prompt management |
| LLM tracing API | | OTEL-native, automatic span creation |
| Drift method | PSI for production, KS for small samples | PSI is stable for large datasets, KS more sensitive |
| Threshold strategy | Dynamic (95th percentile) over static | Reduces alert fatigue, context-aware |
| Alert severity | 4 levels (Critical, High, Medium, Low) | Clear escalation paths, appropriate response times |
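
The dynamic threshold recommendation can be illustrated as follows: instead of a fixed limit, the alert fires when the current value exceeds the 95th percentile of a recent window. This is a minimal sketch with hypothetical function names; the window size and the inclusive interpolation method are assumptions.

```python
from statistics import quantiles


def dynamic_threshold(recent_values, percentile=95):
    """Percentile of a recent window (needs at least two values)."""
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = quantiles(recent_values, n=100, method="inclusive")
    return cuts[percentile - 1]


def should_alert(recent_values, current):
    """Alert only when the current value exceeds the window's 95th percentile."""
    return current > dynamic_threshold(recent_values)


recent_latencies_ms = list(range(1, 101))  # stand-in for a rolling window
threshold = dynamic_threshold(recent_latencies_ms)
```

Because the threshold follows the recent window, a gradual change in normal load moves the limit with it, which is how this approach reduces alert fatigue compared with a static value.
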
Detailed Documentation
| Resource | Description |
|---|---|
| references/ | Logging, metrics, tracing, Langfuse, drift analysis guides |
| checklists/ | Implementation checklists for monitoring and Langfuse setup |
| examples/ | Real-world monitoring dashboard and trace examples |
| scripts/ | Templates: Prometheus, OpenTelemetry, health checks, Langfuse |
Related Skills
- `defense-in-depth` - Layer 8 observability as part of security architecture
- `devops-deployment` - Observability integration with CI/CD and Kubernetes
- `resilience-patterns` - Monitoring circuit breakers and failure scenarios
- `llm-evaluation` - Evaluation patterns that integrate with Langfuse scoring
- `caching` - Caching strategies that reduce costs tracked by Langfuse