monitoring-observability

Monitoring & Observability

Comprehensive patterns for infrastructure monitoring, LLM observability, and quality drift detection. Each category has individual rule files in `rules/`, loaded on demand.

Quick Reference

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Infrastructure Monitoring | 3 | CRITICAL | Prometheus metrics, Grafana dashboards, alerting rules |
| LLM Observability | 3 | HIGH | Langfuse tracing, cost tracking, evaluation scoring |
| Drift Detection | 3 | HIGH | Statistical drift, quality regression, drift alerting |
| Silent Failures | 3 | HIGH | Tool skipping, quality degradation, loop/token spike alerting |

Total: 12 rules across 4 categories

Quick Start

```python
# Prometheus metrics with RED method
from prometheus_client import Counter, Histogram

http_requests = Counter('http_requests_total', 'Total requests', ['method', 'endpoint', 'status'])
http_duration = Histogram('http_request_duration_seconds', 'Request latency',
                          buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5])
```

```python
# Langfuse LLM tracing
from langfuse import observe, get_client

@observe()
async def analyze_content(content: str):
    get_client().update_current_trace(
        user_id="user_123",
        session_id="session_abc",
        tags=["production", "orchestkit"],
    )
    # `llm` is the application's own LLM client, defined elsewhere.
    return await llm.generate(content)
```

```python
# PSI drift detection
import numpy as np

# `calculate_psi`, `alert`, and the two score arrays are defined elsewhere.
psi_score = calculate_psi(baseline_scores, current_scores)
if psi_score >= 0.25:
    alert("Significant quality drift detected!")
```

Infrastructure Monitoring

Prometheus metrics, Grafana dashboards, and alerting for application health.

| Rule | File | Key Pattern |
|---|---|---|
| Prometheus Metrics | `rules/monitoring-prometheus.md` | RED method, counters, histograms, cardinality |
| Grafana Dashboards | `rules/monitoring-grafana.md` | Golden Signals, SLO/SLI, health checks |
| Alerting Rules | `rules/monitoring-alerting.md` | Severity levels, grouping, escalation, fatigue prevention |

LLM Observability

Langfuse-based tracing, cost tracking, and evaluation for LLM applications.

| Rule | File | Key Pattern |
|---|---|---|
| Langfuse Traces | `rules/llm-langfuse-traces.md` | @observe decorator, OTEL spans, agent graphs |
| Cost Tracking | `rules/llm-cost-tracking.md` | Token usage, spend alerts, Metrics API |
| Eval Scoring | `rules/llm-eval-scoring.md` | Custom scores, evaluator tracing, quality monitoring |
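
To make the cost-tracking row concrete, here is a minimal sketch of turning token usage into spend and checking it against a budget. It is independent of Langfuse and uses only the standard library. The `Usage` class, the `example-model` name, and the prices are hypothetical placeholders, not real provider rates.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; replace with your provider's real rates.
PRICES_PER_MILLION = {
    "example-model": {"input": 3.00, "output": 15.00},
}


@dataclass
class Usage:
    """Token counts for a single LLM call (hypothetical record shape)."""
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(usage: Usage) -> float:
    """Cost of one call in USD, based on the price table above."""
    price = PRICES_PER_MILLION[usage.model]
    return (
        usage.input_tokens * price["input"]
        + usage.output_tokens * price["output"]
    ) / 1_000_000


def over_budget(usages, daily_budget_usd: float):
    """Return (total spend, True if the total exceeds the budget)."""
    total = sum(call_cost(u) for u in usages)
    return total, total > daily_budget_usd
```

In practice the usage records would come from your tracing backend rather than being built by hand; the budget check is where a spend alert would hook in.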

Drift Detection

Statistical and quality drift detection for production LLM systems.

| Rule | File | Key Pattern |
|---|---|---|
| Statistical Drift | `rules/drift-statistical.md` | PSI, KS test, KL divergence, EWMA |
| Quality Drift | `rules/drift-quality.md` | Score regression, baseline comparison, canary prompts |
| Drift Alerting | `rules/drift-alerting.md` | Dynamic thresholds, correlation, anti-patterns |
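
The Quick Start calls `calculate_psi` without defining it. One possible implementation is sketched below, assuming equal-width bins derived from the baseline and a small epsilon to avoid taking the log of zero. The bin count, the epsilon, and the binning strategy are illustrative choices, not necessarily what `rules/drift-statistical.md` prescribes. It uses only the standard library so it runs without extra dependencies.

```python
import math
from bisect import bisect_right


def calculate_psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores.

    PSI = sum over bins of (actual - expected) * ln(actual / expected),
    where expected/actual are the per-bin proportions of each sample.
    """
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has no spread; PSI is undefined")

    # Interior bin edges are computed from the baseline only.
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(1, bins)]

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range fall into the first or last bin.
            counts[bisect_right(edges, v)] += 1
        total = len(values)
        # Floor each proportion at eps so empty bins do not produce log(0).
        return [max(c / total, eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

Identical samples give a PSI of zero, and the score grows as the current distribution moves away from the baseline. The 0.25 cutoff used in the Quick Start matches a commonly cited rule of thumb for a significant shift.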

Silent Failures

Detection and alerting for silent failures in LLM agents.

| Rule | File | Key Pattern |
|---|---|---|
| Tool Skipping | `rules/silent-tool-skipping.md` | Expected vs actual tool calls, Langfuse traces |
| Quality Degradation | `rules/silent-degraded-quality.md` | Heuristics + LLM-as-judge, z-score baselines |
| Silent Alerting | `rules/silent-alerting.md` | Loop detection, token spikes, escalation workflow |
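
The three patterns in this table can each be reduced to a small check. The sketch below is a rough illustration rather than the rules' actual implementation: the function names, the "more than three identical consecutive calls" loop definition, and the use of a plain z-score for token spikes are all assumptions.

```python
from statistics import mean, stdev


def missing_tools(expected, actual):
    """Expected tool names that never appear in the actual calls, in order."""
    seen = set(actual)
    return [tool for tool in expected if tool not in seen]


def token_zscore(history, current):
    """z-score of the current token count against a baseline history.

    `history` needs at least two data points for a sample standard deviation.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return (current - mu) / sigma


def has_loop(calls, max_repeats=3):
    """True if the same call occurs more than `max_repeats` times in a row."""
    run = 1
    for prev, cur in zip(calls, calls[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeats:
            return True
    return False
```

In a real agent, the `actual` and `calls` sequences would be extracted from trace data, and a z-score above some chosen cutoff (3 is a common starting point) would feed the escalation workflow.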

Key Decisions

| Decision | Recommendation | Rationale |
|---|---|---|
| Metric methodology | RED method (Rate, Errors, Duration) | Industry standard, covers essential service health |
| Log format | Structured JSON | Machine-parseable, supports log aggregation |
| Tracing | OpenTelemetry | Vendor-neutral, auto-instrumentation, broad ecosystem |
| LLM observability | Langfuse (not LangSmith) | Open-source, self-hosted, built-in prompt management |
| LLM tracing API | `@observe` + `get_client()` | OTEL-native, automatic span creation |
| Drift method | PSI for production, KS for small samples | PSI is stable for large datasets, KS more sensitive |
| Threshold strategy | Dynamic (95th percentile) over static | Reduces alert fatigue, context-aware |
| Alert severity | 4 levels (Critical, High, Medium, Low) | Clear escalation paths, appropriate response times |
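
As an illustration of the threshold strategy row, a dynamic threshold can be derived from a window of recent values instead of being hard-coded. This sketch assumes the window is a plain list of recent measurements and uses the standard library's inclusive percentile interpolation; the function names and the windowing approach are hypothetical.

```python
from statistics import quantiles


def dynamic_threshold(recent_values, percentile=95):
    """Return the given percentile of the recent window.

    quantiles(..., n=100) yields 99 cut points, so index `percentile - 1`
    is the p-th percentile. Requires at least two data points.
    """
    cuts = quantiles(recent_values, n=100, method="inclusive")
    return cuts[percentile - 1]


def should_alert(value, recent_values, percentile=95):
    """Alert only when the value exceeds the window's dynamic threshold."""
    return value > dynamic_threshold(recent_values, percentile)
```

Because the threshold follows the recent distribution, a service that is normally slow at peak hours does not page on every peak, which is the alert-fatigue benefit the table refers to.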

Detailed Documentation

| Resource | Description |
|---|---|
| `references/` | Logging, metrics, tracing, Langfuse, drift analysis guides |
| `checklists/` | Implementation checklists for monitoring and Langfuse setup |
| `examples/` | Real-world monitoring dashboard and trace examples |
| `scripts/` | Templates: Prometheus, OpenTelemetry, health checks, Langfuse |
