dt-obs-services

Application Services Skill

Monitor application service performance, health, and runtime-specific metrics using DQL.

Core Capabilities

1. Service Performance (RED Metrics)

Monitor service Rate, Errors, and Duration (RED) using metrics-based timeseries queries.
Key Metrics:
  • dt.service.request.response_time - Response time (microseconds)
  • dt.service.request.count - Request count
  • dt.service.request.failure_count - Failed request count
Common Use Cases:
  • Response time monitoring (avg, p50, p95, p99)
  • Error rate tracking and spike detection
  • Traffic analysis (throughput, peaks, growth)
  • Performance degradation detection
  • Multi-cluster comparison
Quick Example:
```dql
timeseries {
  p95 = percentile(dt.service.request.response_time, 95),
  total_requests = sum(dt.service.request.count),
  failures = sum(dt.service.request.failure_count)
}, by: {dt.service.name}
| fieldsAdd p95_ms = p95[] / 1000, error_rate_pct = (failures[] * 100.0) / total_requests[]
```
For detailed queries: See references/service-metrics.md
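
The derived fields in the query above are plain element-wise arithmetic, which can be easier to sanity-check outside DQL. The following is a minimal Python sketch, not Dynatrace code: the function names are hypothetical, and returning None for empty buckets or zero traffic is a choice made for this sketch rather than documented DQL behavior.

```python
from typing import List, Optional


def to_ms(value_us: Optional[float]) -> Optional[float]:
    """Convert microseconds to milliseconds; an empty bucket (None) stays None."""
    return None if value_us is None else value_us / 1000


def error_rate_pct(failures: Optional[float], total: Optional[float]) -> Optional[float]:
    """Failed requests as a percentage of all requests.

    Returns None when there is no traffic (assumption of this sketch),
    so a quiet interval is not reported as a 0% error rate.
    """
    if failures is None or not total:
        return None
    return failures * 100.0 / total


def red_summary(p95_us: List[Optional[float]],
                total_requests: List[Optional[float]],
                failures: List[Optional[float]]) -> dict:
    """Apply the same per-bucket math as the fieldsAdd step of the query."""
    return {
        "p95_ms": [to_ms(v) for v in p95_us],
        "error_rate_pct": [
            error_rate_pct(f, t) for f, t in zip(failures, total_requests)
        ],
    }


summary = red_summary(
    p95_us=[250000, None, 1200000],
    total_requests=[200, 0, 50],
    failures=[4, 0, 5],
)
print(summary)
```

With these inputs the second bucket has no traffic, so both derived values for it are None instead of a misleading zero.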

2. Advanced Service Analysis

Span-based queries for complex scenarios requiring flexible filtering and custom aggregations.
Use Cases:
  • SLA compliance tracking with custom thresholds
  • Service health scoring (multi-dimensional)
  • Operation/endpoint-level performance analysis
  • Custom error classification
  • Failure pattern detection with error details
Quick Example:
```dql
fetch spans, from: now() - 1h
| filter request.is_root_span == true
| fieldsAdd meets_sla = if(request.is_failed == false AND duration < 3s, 1, else: 0)
| summarize total = count(), sla_compliant = sum(meets_sla), by: {dt.service.name}
| fieldsAdd sla_compliance_pct = (sla_compliant * 100.0) / total
```
For detailed queries: See references/service-metrics.md
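
The SLA logic in the query above can be traced step by step on a handful of records. This is a minimal Python sketch of the same filter, flag, and summarize sequence; the dictionary keys (`service`, `is_root_span`, `is_failed`, `duration_s`) are hypothetical stand-ins for the span fields, not the actual Dynatrace schema.

```python
from collections import defaultdict

SLA_THRESHOLD_S = 3.0  # mirrors "duration < 3s" in the query


def sla_compliance(spans):
    """Return the SLA compliance percentage per service.

    A root span meets the SLA when it did not fail AND finished
    under the threshold; non-root spans are ignored.
    """
    totals = defaultdict(int)
    compliant = defaultdict(int)
    for span in spans:
        if not span["is_root_span"]:
            continue  # same role as the root-span filter
        service = span["service"]
        totals[service] += 1
        if not span["is_failed"] and span["duration_s"] < SLA_THRESHOLD_S:
            compliant[service] += 1
    return {svc: compliant[svc] * 100.0 / totals[svc] for svc in totals}


# Illustrative records only; not real trace data.
spans = [
    {"service": "checkout", "is_root_span": True, "is_failed": False, "duration_s": 0.4},
    {"service": "checkout", "is_root_span": True, "is_failed": False, "duration_s": 3.5},
    {"service": "checkout", "is_root_span": True, "is_failed": True, "duration_s": 0.2},
    {"service": "checkout", "is_root_span": True, "is_failed": False, "duration_s": 0.1},
    {"service": "checkout", "is_root_span": False, "is_failed": True, "duration_s": 9.0},
    {"service": "cart", "is_root_span": True, "is_failed": False, "duration_s": 1.0},
]
compliance = sla_compliance(spans)
print(compliance)
```

Note that a fast but failed request counts against compliance just like a slow successful one, because both conditions must hold.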

3. Service Messaging Metrics

Monitor message-based service communication (queues, topics).
Key Metrics:
  • dt.service.messaging.publish.count - Messages sent to queues or topics
  • dt.service.messaging.receive.count - Messages received from queues or topics
  • dt.service.messaging.process.count - Messages successfully processed
  • dt.service.messaging.process.failure_count - Messages that failed processing
Use Cases:
  • Message throughput monitoring (publish/receive rates)
  • Message processing failure tracking
  • Queue/topic health analysis
  • Consumer lag detection (publish vs receive rate comparison)
Quick Example:
```dql
timeseries {
  published = sum(dt.service.messaging.publish.count),
  received = sum(dt.service.messaging.receive.count),
  processed = sum(dt.service.messaging.process.count),
  failed = sum(dt.service.messaging.process.failure_count)
}, by: {dt.service.name}
```
For detailed queries: See references/service-metrics.md
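
The four counters returned by the query above can be combined into two health signals: how far receiving trails publishing, and what share of processing attempts failed. The Python sketch below is one way to do that, under two assumptions that should be checked against your environment: that `process.count` counts only successes (as the metric description suggests), so attempts are successes plus failures, and that a publish/receive gap is only a rough proxy for consumer lag. The function name is hypothetical.

```python
def messaging_health(published, received, processed, failed):
    """Derive backlog growth and processing failure rate from raw counts."""
    # Messages published but not yet received in the same interval.
    backlog_growth = published - received
    # Assumption: processed counts successes only, so attempts = both outcomes.
    attempts = processed + failed
    failure_pct = failed * 100.0 / attempts if attempts else None
    return {
        "backlog_growth": backlog_growth,
        "processing_failure_pct": failure_pct,
    }


health = messaging_health(published=1000, received=900, processed=855, failed=45)
print(health)
```

A backlog growth that stays positive across consecutive intervals is the pattern worth investigating; a single positive value may only reflect messages in flight at the interval boundary.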

4. Service Mesh Monitoring

Monitor service mesh ingress performance and overhead.
Key Metrics:
  • dt.service.request.service_mesh.response_time - Mesh response time (microseconds)
  • dt.service.request.service_mesh.count - Mesh request count
  • dt.service.request.service_mesh.failure_count - Mesh failure count
Use Cases:
  • Mesh vs direct performance comparison
  • Mesh overhead calculation
  • Mesh failure analysis
  • gRPC traffic monitoring
  • Multi-cluster mesh performance
Quick Example:
```dql
timeseries {
  direct_p95 = percentile(dt.service.request.response_time, 95),
  mesh_p95 = percentile(dt.service.request.service_mesh.response_time, 95)
}, by: {dt.service.name}
| fieldsAdd mesh_overhead_ms = (mesh_p95[] - direct_p95[]) / 1000
```
For detailed queries: See references/service-metrics.md
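
The overhead field in the query above is a per-bucket difference of two p95 series, converted from microseconds to milliseconds. A minimal Python sketch of that calculation follows; the function name is hypothetical, and treating a bucket as None when either side is missing is an assumption of this sketch.

```python
def mesh_overhead_ms(mesh_p95_us, direct_p95_us):
    """Per-bucket mesh overhead in milliseconds (mesh p95 minus direct p95)."""
    overhead = []
    for mesh, direct in zip(mesh_p95_us, direct_p95_us):
        if mesh is None or direct is None:
            overhead.append(None)  # cannot compare without both values
        else:
            overhead.append((mesh - direct) / 1000)
    return overhead


overhead = mesh_overhead_ms(
    mesh_p95_us=[130000, None, 98000],
    direct_p95_us=[120000, 110000, 100000],
)
print(overhead)
```

Negative values are possible, as in the last bucket here. The two percentiles are computed over separate metrics, so a small negative overhead is more likely noise than evidence that the mesh speeds requests up.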

5. Runtime-Specific Monitoring

Technology-specific runtime performance and resource usage metrics.
Java/JVM - references/java.md
  • Memory: heap, pools, metaspace
  • GC: impact, suspension, frequency, pause time
  • Threads: count monitoring, leak detection
  • Classes: loading, unloading, growth
Node.js - references/nodejs.md
  • Event loop: utilization, active handles
  • V8 heap: memory used, total
  • GC: collection time, suspension
  • Process: RSS memory
.NET CLR - references/dotnet.md
  • Memory: consumption by generation
  • GC: collection count, suspension time
  • Thread pool: threads, queued work
  • JIT: compilation time
Python - references/python.md
  • Threads: active thread count
  • Heap: allocated blocks
  • GC: collection by generation, pause time
  • Objects: collected, uncollectable
PHP - references/php.md
  • OPcache: hit ratio, memory, restarts
  • GC: effectiveness, duration
  • JIT: buffer usage
  • Interned strings: usage, buffer
Go - references/go.md
  • Goroutines: count, leak detection
  • GC: suspension, collection time
  • Memory: heap by state, committed
  • Scheduler: worker threads, queue size
  • CGo: call frequency

When to Use This Skill

Use for:
  • Monitoring service performance (response time, errors, traffic)
  • Calculating SLA compliance
  • Analyzing service mesh performance
  • Monitoring messaging throughput and processing failures
  • Troubleshooting runtime-specific issues (GC, memory, threads)
  • Multi-cluster service comparison
  • Operation/endpoint-level analysis
Don't use for:
  • Infrastructure metrics (use infrastructure skills)
  • Log analysis (use logs skills)
  • Distributed tracing workflows (use traces/spans skills)
  • Database performance (use database skills)

Agent Instructions

Understanding User Intent

Map user questions to capabilities:
| User Request | Use Capability | Key Files |
|---|---|---|
| "service performance", "response time", "error rate" | Service Performance (RED) | service-metrics.md |
| "SLA tracking", "health scoring" | Advanced Service Analysis | service-metrics.md |
| "service mesh", "Istio", "Linkerd", "mesh overhead" | Service Mesh Monitoring | service-metrics.md |
| "messaging", "queue", "topic", "publish", "consumer" | Service Messaging Metrics | service-metrics.md |
| "JVM GC", "Java memory", "heap" | Runtime-Specific (Java) | java.md |
| "Node.js event loop", "V8 heap" | Runtime-Specific (Node.js) | nodejs.md |
| ".NET CLR", "GC generation" | Runtime-Specific (.NET) | dotnet.md |
| "Python GC", "thread count" | Runtime-Specific (Python) | python.md |
| "OPcache", "PHP GC" | Runtime-Specific (PHP) | php.md |
| "goroutines", "Go GC", "scheduler" | Runtime-Specific (Go) | go.md |
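
The mapping above can be sketched as a keyword lookup. The Python below is only an illustration of the idea, not how any agent actually resolves intent: `ROUTES` and `route` are hypothetical names, matching is naive case-insensitive substring search, and the entries are ordered so that more specific phrases ("V8 heap") are tried before generic ones ("heap").

```python
# (keywords, capability, key file), most specific entries first.
ROUTES = [
    (("node.js event loop", "v8 heap"), "Runtime-Specific (Node.js)", "nodejs.md"),
    ((".net clr", "gc generation"), "Runtime-Specific (.NET)", "dotnet.md"),
    (("python gc", "thread count"), "Runtime-Specific (Python)", "python.md"),
    (("opcache", "php gc"), "Runtime-Specific (PHP)", "php.md"),
    (("goroutines", "go gc", "scheduler"), "Runtime-Specific (Go)", "go.md"),
    (("jvm gc", "java memory", "heap"), "Runtime-Specific (Java)", "java.md"),
    (("service mesh", "istio", "linkerd", "mesh overhead"),
     "Service Mesh Monitoring", "service-metrics.md"),
    (("messaging", "queue", "topic", "publish", "consumer"),
     "Service Messaging Metrics", "service-metrics.md"),
    (("sla tracking", "health scoring"),
     "Advanced Service Analysis", "service-metrics.md"),
    (("service performance", "response time", "error rate"),
     "Service Performance (RED)", "service-metrics.md"),
]


def route(question):
    """Return (capability, key_file) for the first matching keyword, else None."""
    text = question.lower()
    for keywords, capability, key_file in ROUTES:
        if any(keyword in text for keyword in keywords):
            return capability, key_file
    return None


print(route("Why is the V8 heap growing?"))
print(route("Show p95 response time for checkout"))
```

Substring matching will misfire on real questions (for example, "topic" inside an unrelated sentence), so treat this as a way to read the table, not a classifier.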

Query Construction Patterns

1. Metrics-based (timeseries)
  • Use for: Standard monitoring, dashboards, alerting
  • Pattern:
    timeseries <metric> = <aggregation>(<metric_name>), by: {dimensions}
  • Files: service-metrics.md, all runtime-specific files
2. Span-based (fetch spans)
  • Use for: Complex filtering, custom logic, detailed analysis
  • Pattern:
    fetch spans | filter request.is_root_span == true | fieldsAdd ... | summarize ...
  • Files: service-metrics.md (Advanced Service Analysis section)
3. Comparison queries
  • Use append for baseline comparison
  • Use shift: -15m for time-shifted baselines
  • Example: Performance degradation detection
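
Once a comparison query has produced a current value and a baseline value per service, degradation detection reduces to a percent change against the baseline. The Python sketch below shows that step only; the function name and the 20% threshold are hypothetical choices for illustration, and the inputs stand in for whatever the current and time-shifted queries return.

```python
def degraded_services(current_ms, baseline_ms, threshold_pct=20.0):
    """Return services whose response time rose more than threshold_pct.

    Services with no baseline (or a zero baseline) are skipped, since a
    percent change against them is undefined.
    """
    degraded = {}
    for service, current in current_ms.items():
        baseline = baseline_ms.get(service)
        if not baseline:
            continue
        change_pct = (current - baseline) * 100.0 / baseline
        if change_pct > threshold_pct:
            degraded[service] = change_pct
    return degraded


# Illustrative values only; not real measurements.
current = {"checkout": 300.0, "cart": 105.0, "search": 80.0, "new-service": 50.0}
baseline = {"checkout": 200.0, "cart": 100.0, "search": 100.0}
flagged = degraded_services(current, baseline)
print(flagged)
```

Here only `checkout` is flagged (a 50% increase); `cart` rose 5%, `search` improved, and `new-service` has no baseline to compare against.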

Response Construction Guidelines

Always include:
  1. Metric name(s) - Clear metric identifiers
  2. Aggregation - How data is aggregated (avg, sum, percentile)
  3. Grouping - Dimensions used (dt.service.name, k8s.workload.name, etc.)
  4. Unit conversion - Convert microseconds to milliseconds where appropriate
  5. Filtering - Relevant thresholds or conditions
When referencing runtime-specific content:
  • Check user's technology stack first
  • Provide only relevant runtime queries (don't overwhelm with all 6 runtimes)
  • Explain runtime-specific metrics (e.g., "OPcache hit ratio" measures PHP opcode cache efficiency)

Common Workflows

Workflow: Service Health Check

1. Check response time (RED metrics)
2. Check error rate (RED metrics)
3. Check traffic patterns (RED metrics)
4. If runtime-specific issues suspected → Load runtime-specific reference

Workflow: SLA Monitoring

1. Define SLA criteria (e.g., < 3s response time AND < 1% error rate)
2. Use span-based query for custom SLA logic
3. Calculate compliance percentage
4. Filter non-compliant services

Workflow: Service Mesh Analysis

1. Check mesh response time
2. Compare mesh vs direct performance
3. Calculate mesh overhead
4. Analyze mesh failure rates

Workflow: Runtime Troubleshooting

1. Identify technology stack → Load runtime-specific reference
2. Check memory/GC metrics → threads/goroutines → runtime features

References

Core Service Monitoring:
  • references/service-metrics.md - Complete RED metrics, SLA tracking, service mesh queries
Runtime-Specific Monitoring:
  • references/java.md - Java/JVM monitoring
  • references/nodejs.md - Node.js monitoring
  • references/dotnet.md - .NET CLR monitoring
  • references/python.md - Python monitoring
  • references/php.md - PHP monitoring
  • references/go.md - Go runtime monitoring