dt-obs-tracing

Application Tracing Skill

Overview

Distributed traces in Dynatrace consist of spans - building blocks representing units of work. With Traces in Grail, every span is accessible via DQL with full-text searchability on all attributes. This skill covers trace fundamentals, common analysis patterns, and span-type specific queries.

Core Concepts

Understanding Traces and Spans

Spans represent logical units of work in distributed traces:
  • HTTP requests, RPC calls, database operations
  • Messaging system interactions
  • Internal function invocations
  • Custom instrumentation points
Span kinds:
  • span.kind: server - Incoming call to a service
  • span.kind: client - Outgoing call from a service
  • span.kind: consumer - Incoming message consumption call to a service
  • span.kind: producer - Outgoing message production call from a service
  • span.kind: internal - Internal operation within a service
Root spans: A request root span (request.is_root_span == true) represents an incoming call to a service. Use this to analyze end-to-end request performance.

Key Trace Attributes

Essential attributes for trace analysis:
| Attribute | Description |
| --- | --- |
| trace.id | Unique trace identifier |
| span.id | Unique span identifier |
| span.parent_id | Parent span ID (null for root spans) |
| request.is_root_span | Boolean, true for request entry points |
| request.is_failed | Boolean, true if request failed |
| duration | Span duration in nanoseconds |
| span.timing.cpu | Overall CPU time of the span (stable) |
| span.timing.cpu_self | CPU time excluding child spans (stable) |
| dt.smartscape.service | Service Smartscape node ID |
| dt.service.name | Dynatrace service name derived from service detection rules. It is equal to the Smartscape service node name. |
| endpoint.name | Endpoint/route name |

Service Context

Spans reference services via the Smartscape node ID (dt.smartscape.service) and the detected service name (dt.service.name), which is also present on every span.
dql
fetch spans
| summarize spans=count(), by: { dt.smartscape.service, dt.service.name }

Sampling and Extrapolation

One span can represent multiple real operations due to:
  • Aggregation: Multiple operations in one span (aggregation.count)
  • ATM (Adaptive Traffic Management): Head-based sampling by agent
  • ALR (Adaptive Load Reduction): Server-side sampling
  • Read Sampling: Query-time sampling via the samplingRatio parameter
When to extrapolate: Always extrapolate when counting actual operations (not just spans). Use the multiplicity factor:
dql
fetch spans
| fieldsAdd sampling.probability = (power(2, 56) - coalesce(sampling.threshold, 0)) * power(2, -56)
| fieldsAdd sampling.multiplicity = 1 / sampling.probability
| fieldsAdd multiplicity = coalesce(sampling.multiplicity, 1)
                         * coalesce(aggregation.count, 1)
                         * dt.system.sampling_ratio
| summarize operation_count = sum(multiplicity)
📖 Learn more: See Sampling and Extrapolation for detailed formulas and examples.
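To make the arithmetic concrete, the following Python sketch mirrors the multiplicity formula from the query above for a single span. The helper name span_multiplicity and the sample values are illustrative assumptions, not Dynatrace APIs; only the formula comes from the query. It also assumes that dt.system.sampling_ratio carries the read sampling ratio (for example 10).

```python
from typing import Optional

TWO_POW_56 = 2 ** 56  # size of the sampling threshold space used in the query


def span_multiplicity(
    sampling_threshold: Optional[int],
    aggregation_count: Optional[int],
    sampling_ratio: int,
) -> float:
    """Number of real operations one stored span stands for (hypothetical helper)."""
    # coalesce(sampling.threshold, 0): a missing threshold gives probability 1
    threshold = sampling_threshold if sampling_threshold is not None else 0
    probability = (TWO_POW_56 - threshold) * 2.0 ** -56
    sampling_multiplicity = 1 / probability
    # coalesce(aggregation.count, 1): a non-aggregated span counts once
    count = aggregation_count if aggregation_count is not None else 1
    return sampling_multiplicity * count * sampling_ratio


# Span with no threshold, no aggregation, read without read sampling
full = span_multiplicity(None, None, 1)
# Span kept with probability 0.5, aggregating 3 operations, read with ratio 10
half = span_multiplicity(2 ** 55, 3, 10)
print(full, half)
```

In the second case, a single stored span stands for 2 × 3 × 10 = 60 real operations, which is why counting spans alone can understate actual traffic.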

Common Query Patterns

Basic Span Access

Fetch a single span to inspect the available fields:
dql
fetch spans | limit 1
Explore spans by kind, code namespace, and function:
dql
fetch spans
| summarize count(), by: { span.kind, code.namespace, code.function }

Request Root Filtering

List request root spans (incoming service calls):
dql
fetch spans
| filter request.is_root_span == true
| fields trace.id, span.id, start_time, response_time = duration, endpoint.name
| limit 100

Service Performance Summary

Analyze service performance with error rates:
dql
fetch spans
| filter request.is_root_span == true
| summarize
    total_requests = count(),
    failed_requests = countIf(request.is_failed == true),
    avg_duration = avg(duration),
    p95_duration = percentile(duration, 95),
    by: {dt.service.name}
| fieldsAdd error_rate = (failed_requests * 100.0) / total_requests
| sort error_rate desc

Trace ID Lookup

Find all spans in a specific trace:
dql
fetch spans
| filter trace.id == toUid("abc123def456")
| fields span.name, duration, dt.service.name

Performance Analysis

Response Time Percentiles

Calculate percentiles by endpoint:
dql
fetch spans
| filter request.is_root_span == true
| summarize {
    requests=count(),
    avg_duration=avg(duration),
    p95=percentile(duration, 95),
    p99=percentile(duration, 99)
  }, by: { endpoint.name }
| sort p99 desc
💡 Best practice: Use percentiles (p95, p99) over averages for performance insights.
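A small worked example shows why the average can mislead. The Python sketch below uses made-up durations and a simple nearest-rank percentile; DQL's percentile() may use a different estimator, so treat the exact numbers as illustrative only.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct (illustrative, not DQL's estimator)."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# 100 hypothetical request durations in milliseconds: 90 fast, 10 slow outliers
durations_ms = [100] * 90 + [2000] * 10

average = sum(durations_ms) / len(durations_ms)
p50 = nearest_rank_percentile(durations_ms, 50)
p95 = nearest_rank_percentile(durations_ms, 95)
print(average, p50, p95)
```

Here the average of 290 ms describes neither the typical request (100 ms) nor the slow tail (2000 ms), while p50 and p95 report both directly.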

Slow Trace Detection

Find requests exceeding a threshold:
dql
fetch spans, from:now() - 2h
| filter request.is_root_span == true
| filter duration > 5s
| fields trace.id, span.name, dt.service.name, duration
| sort duration desc
| limit 50

Duration Buckets with Exemplars

Group requests into duration buckets with example traces:
dql
fetch spans, from:now() - 24h
| filter http.route == "/api/v1/storage/findByISBN"
| summarize {
    spans=count(),
    trace=takeAny(record(start_time, trace.id))
  }, by: { bin(duration, 10ms) }
| fields `bin(duration, 10ms)`, spans, trace.id=trace[trace.id], start_time=trace[start_time]
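The bucketing step can be pictured with a short Python sketch. It assumes that bin(duration, 10ms) aligns each value down to a multiple of the bucket width and that duration is stored in nanoseconds, as listed under Key Trace Attributes. The sample durations and the helper bucket_start are hypothetical.

```python
from collections import Counter

NS_PER_MS = 1_000_000
BUCKET_NS = 10 * NS_PER_MS  # 10ms bucket width, matching bin(duration, 10ms)


def bucket_start(duration_ns: int) -> int:
    """Lower edge of the bucket, assuming values are aligned down to the width."""
    return duration_ns // BUCKET_NS * BUCKET_NS


# Hypothetical span durations in nanoseconds: 3ms, 9.9ms, 10ms, 27ms
durations_ns = [3_000_000, 9_900_000, 10_000_000, 27_000_000]

# Count spans per bucket, keyed by the bucket start in milliseconds
histogram = Counter(bucket_start(d) // NS_PER_MS for d in durations_ns)
print(sorted(histogram.items()))
```

The first two durations share the 0 ms bucket, while 10 ms starts the next one, so bucket boundaries are inclusive at the lower edge under this assumption.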

Performance Timeseries

Extract response time as timeseries:
dql
fetch spans, from:now() - 24h
| filter request.is_root_span == true
| makeTimeseries {
    requests=count(),
    avg_duration=avg(duration),
    p95=percentile(duration, 95),
    p99=percentile(duration, 99)
  }, by: { endpoint.name }
📖 Learn more: See Performance Analysis for advanced patterns and timeseries techniques.

Failure Investigation

Failed Request Summary

Summarize failures by service:
dql
fetch spans
| filter request.is_root_span == true
| summarize
    total = count(),
    failed = countIf(request.is_failed == true),
  by: { dt.service.name }
| fieldsAdd failure_rate = (failed * 100.0) / total
| sort failure_rate desc

Failure Reason Analysis

Breakdown by failure detection reason:
dql
fetch spans
| filter request.is_failed == true and isNotNull(dt.failure_detection.results)
| expand dt.failure_detection.results
| summarize count(), by: { dt.failure_detection.results[reason] }
Failure reasons:
  • http_code - HTTP response code triggered failure
  • grpc_code - gRPC status code triggered failure
  • exception - Exception caused failure
  • span_status - Span status indicated failure
  • custom_rule - Custom failure detection rule matched

HTTP Code Failures

Find failures by HTTP status code:
dql
fetch spans
| filter request.is_failed == true
| filter iAny(dt.failure_detection.results[][reason] == "http_code")
| summarize count(), by: { http.response.status_code, endpoint.name }
| sort `count()` desc

Recent Failed Requests

List recent failures with details:
dql
fetch spans
| filter request.is_root_span == true and request.is_failed == true
| fields
    start_time,
    trace.id,
    endpoint.name,
    http.response.status_code,
    duration
| sort start_time desc
| limit 100
📖 Learn more: See Failure Detection for exception analysis and custom rule investigation.

Service Dependencies

Service Communication

Analyze incoming and outgoing service communication:
dql
fetch spans, from:now() - 1h
| filter isNotNull(server.address)
| fieldsAdd
    remote_side = server.address
| summarize
    call_count = count(),
    avg_duration = avg(duration),
    by: {dt.service.name, remote_side}
| sort call_count desc

Outgoing HTTP Calls

Identify external API dependencies:
dql
fetch spans
| filter span.kind == "client" and isNotNull(http.request.method)
| summarize
    calls = count(),
    avg_latency = avg(duration),
    p99_latency = percentile(duration, 99),
  by: { dt.service.name, server.address, server.port }
| sort calls desc

Trace Aggregation

Complete Trace Analysis

Aggregate all spans in a trace to understand full request flow:
dql
fetch spans, from:now() - 30m
| summarize {
    spans = count(),
    client_spans = countIf(span.kind == "client"),

    // Endpoints involved in the trace
    endpoints = toString(arrayRemoveNulls(collectDistinct(endpoint.name))),

    // Extract the first request root in the trace
    trace_root = takeMin(record(
        root_detection_helper = coalesce(
            if(request.is_root_span, 1),
            if(isNull(span.parent_id), 2),
            3),
        start_time, endpoint.name, duration
      ))
}, by: { trace.id }

| fieldsFlatten trace_root
| fieldsRemove trace_root.root_detection_helper, trace_root

| fields
    start_time = trace_root.start_time,
    endpoint = trace_root.endpoint.name,
    response_time = trace_root.duration,
    spans,
    client_spans,
    endpoints,
    trace.id
| sort start_time
| limit 100
Root detection strategy: Use takeMin(record(...)) with a detection helper to reliably find the root request:
  1. Priority 1: Spans with request.is_root_span == true
  2. Priority 2: Spans without a parent (root spans)
  3. Priority 3: All other spans
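The priority scheme above can be sketched in Python as a minimum over (priority, start_time) pairs. This assumes takeMin compares records field by field in the order they are listed, which is what the helper-first record layout in the query relies on. The sample spans and the helper root_priority are hypothetical.

```python
# Hypothetical spans of one trace; field names follow the query above
spans = [
    {"span.parent_id": "a1", "request.is_root_span": False,
     "endpoint.name": None, "start_time": 5},
    {"span.parent_id": None, "request.is_root_span": False,
     "endpoint.name": None, "start_time": 1},
    {"span.parent_id": "b2", "request.is_root_span": True,
     "endpoint.name": "GET /books", "start_time": 3},
    {"span.parent_id": "c3", "request.is_root_span": True,
     "endpoint.name": "GET /stock", "start_time": 4},
]


def root_priority(span: dict) -> int:
    """Mirror of the coalesce(if(...), if(...), 3) helper in the query."""
    if span["request.is_root_span"]:
        return 1
    if span["span.parent_id"] is None:
        return 2
    return 3


# Lowest priority number wins; ties are broken by the earliest start_time
trace_root = min(spans, key=lambda s: (root_priority(s), s["start_time"]))
print(trace_root["endpoint.name"])
```

The parentless span starts earliest, but it loses to the request root spans because the helper is compared first; among the two request roots, the earlier one is selected.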

Multi-Service Traces

Find traces spanning multiple services:
dql
fetch spans, from:now() - 1h
| summarize {
    services = collectDistinct(dt.service.name),
    trace_root = takeMin(record(
        root_detection_helper = coalesce(if(request.is_root_span, 1), 2),
        endpoint.name
      ))
}, by: { trace.id }
| fieldsAdd service_count = arraySize(services)
| filter service_count > 1
| fields
    endpoint = trace_root[endpoint.name],
    service_count,
    services = toString(services),
    trace.id
| sort service_count desc
| limit 50

Request-Level Analysis

Request Attributes

Access custom request attributes captured by OneAgent on request root spans:
dql
fetch spans
| filter request.is_root_span == true
| filter isNotNull(request_attribute.PaidAmount)
| makeTimeseries sum(request_attribute.PaidAmount)
Field pattern: request_attribute.<name>
For attribute names that contain spaces or special characters, wrap the whole field name in backticks:
dql
fetch spans
| filter isNotNull(`request_attribute.My Customer ID`)

Captured Attributes

Access attributes captured from method parameters (always as arrays):
dql
fetch spans
| filter isNotNull(captured_attribute.BookID_purchased)
| fields trace.id, span.id, code.namespace, code.function, captured_attribute.BookID_purchased
| limit 1
Field pattern: captured_attribute.<name>

Request ID Aggregation

Aggregate all spans belonging to a single request using request.id (OneAgent traces only):
dql
fetch spans
| filter isNotNull(request.id)
| summarize {
    spans = count(),
    client_spans = countIf(span.kind == "client"),
    request_root = takeMin(record(
        root_detection_helper = coalesce(if(request.is_root_span, 1), 2),
        start_time, endpoint.name, duration
      ))
}, by: { trace.id, request.id }
| fieldsFlatten request_root
| fields
    start_time = request_root.start_time,
    endpoint = request_root.endpoint.name,
    response_time = request_root.duration,
    spans,
    client_spans
| limit 100
📖 Learn more: See Request Attributes for complete patterns on request attributes, captured attributes, and request-level aggregation.

Span Types

HTTP Spans

HTTP spans capture web requests and API calls:
Server-side (incoming requests):
dql
fetch spans
| filter span.kind == "server" and isNotNull(http.request.method)
| summarize
    requests = count(),
    avg_duration = avg(duration),
  by: { http.request.method, http.route }
| sort requests desc
Client-side (outgoing calls):
dql
fetch spans
| filter span.kind == "client" and isNotNull(http.request.method)
| summarize
    calls = count(),
    avg_duration = avg(duration),
  by: { server.address, http.request.method }
| sort calls desc
📖 Learn more: See HTTP Span Analysis for status codes, payload analysis, and client IP tracking.

Database Spans

Database operations appear as client spans with db.* attributes:
dql
fetch spans
| filter span.kind == "client" and isNotNull(db.system) and isNotNull(db.namespace)
| summarize {
    spans=count(),
    avg_duration=avg(duration)
  }, by: { dt.service.name, db.system, db.namespace }
| sort spans desc
⚠️ Important: Database spans can be aggregated (one span = multiple calls). Always use extrapolation for accurate counts.
📖 Learn more: See Database Span Analysis for extrapolated counts and slow query detection.

Messaging Spans

Messaging spans capture Kafka, RabbitMQ, and SQS operations:
dql
fetch spans
| filter isNotNull(messaging.system)
| summarize
    spans = count(),
    messages = sum(coalesce(messaging.batch.message_count, 1)),
  by: { messaging.system, messaging.destination.name, messaging.operation.type }
| sort messages desc
📖 Learn more: See Messaging Span Analysis for throughput, latency, and failure patterns.

RPC Spans

RPC spans cover gRPC, SOAP, and other RPC frameworks:
dql
fetch spans
| filter isNotNull(rpc.system)
| summarize
    calls = count(),
    avg_duration = avg(duration),
  by: { rpc.system, rpc.service, rpc.method }
| sort calls desc
📖 Learn more: See RPC Span Analysis for gRPC status codes and service dependencies.

Serverless Spans

FaaS spans capture Lambda, Azure Functions, and GCP Cloud Functions invocations:
dql
fetch spans
| filter isNotNull(faas.name) and span.kind == "server"
| summarize
    invocations = count(),
    avg_duration = avg(duration),
    p99_duration = percentile(duration, 99),
  by: { faas.name, cloud.provider }
| sort invocations desc
📖 Learn more: See Serverless Span Analysis for cold start analysis and trigger types.

Advanced Topics

Exception Analysis

Exceptions are stored as span.events within spans:
dql
fetch spans
| filter iAny(span.events[][span_event.name] == "exception")
| expand span.events
| fieldsFlatten span.events, fields: { exception.type }
| summarize {
    count(),
    trace=takeAny(record(start_time, trace.id))
  }, by: { exception.type }
| fields exception.type, `count()`, trace.id=trace[trace.id], start_time=trace[start_time]
💡 Tip: Use iAny() to check conditions within span event arrays.

Logs and Traces Correlation

Join logs with traces using trace IDs:
dql
fetch spans, from:now() - 30m
| join [ fetch logs | fieldsAdd trace.id = toUid(trace_id) ]
  , on: { trace.id }
  , fields: { content, loglevel }
| fields start_time, trace.id, span.id, loglevel, content
| limit 100
📖 Learn more: See Logs Correlation for filtering traces by log content and finding logs for failed requests.

Network Analysis

Analyze IP addresses, DNS resolution, and client geography:
dql
fetch spans, from:now() - 24h
| filter isNotNull(client.ip)
| fieldsAdd client.ip = toIp(client.ip)
| fieldsAdd client.subnet = ipMask(client.ip, 24)
| summarize {
    requests=count(),
    unique_clients=countDistinct(client.ip)
  }, by: { client.subnet, endpoint.name }
| sort requests desc
📖 Learn more: See Network Analysis for server address resolution and communication mapping.
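The subnet grouping in the query above can be illustrated with Python's standard ipaddress module. The sketch assumes ipMask(client.ip, 24) keeps the first 24 bits of an IPv4 address and zeroes the rest. The helper mask_ipv4 and the sample addresses (taken from documentation-reserved ranges) are hypothetical.

```python
import ipaddress


def mask_ipv4(address: str, prefix_bits: int) -> str:
    """Zero all bits after the first prefix_bits (assumed equivalent of ipMask)."""
    # strict=False allows host bits to be set and masks them off
    network = ipaddress.ip_network(f"{address}/{prefix_bits}", strict=False)
    return str(network.network_address)


clients = ["203.0.113.57", "203.0.113.200", "198.51.100.9"]

# Distinct /24 subnets, the grouping key used by the summarize step
subnets = sorted({mask_ipv4(ip, 24) for ip in clients})
print(subnets)
```

The two addresses in 203.0.113.x collapse into one subnet, so a per-subnet summary would show two unique clients in that group and one in the other.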

Best Practices

Query Optimization

  • Filter early: Apply request.is_root_span == true and endpoint filters first
  • Use samplingRatio: Reduce data volume for better performance (e.g., samplingRatio:100 reads 1%)
  • Limit results: Always use limit for exploratory queries
  • Percentiles over averages: Use p95/p99 for performance insights

Node Lookups

  • Use getNodeName(): Simplest way to add service names
  • Prefer subqueries: Use Smartscape node filters and traverse for filtering
  • Cache node info: Store node lookups in fields for reuse

Aggregation Patterns

  • Request roots: Use request.is_root_span == true for end-to-end analysis
  • Trace-level: Group by trace.id for complete trace metrics
  • Request-level: Group by request.id for request metrics (OneAgent traces only)
  • Always extrapolate: Use multiplicity for accurate operation counts

Trace Exemplars

Include example traces for drilldown:
dql
| summarize {
    count(),
    trace=takeAny(record(start_time, trace.id))
  }, by: { grouping_field }
| fields ..., trace.id=trace[trace.id], start_time=trace[start_time]
Returning trace.id and start_time as fields enables the "Open With" functionality in the Dynatrace UI.

References

Detailed documentation for specific topics:
  • Performance Analysis - Advanced timeseries, duration buckets, endpoint ranking
  • Failure Detection - Failure reasons, exception investigation, custom rules
  • Sampling and Extrapolation - Multiplicity calculation, database extrapolation
  • Request Attributes - Request attributes, captured attributes, request ID aggregation
  • Entity Lookups - Advanced node lookups, infrastructure correlation, hardware analysis
  • HTTP Span Analysis - Status codes, payload analysis, client IPs
  • Database Span Analysis - Extrapolated counts, slow queries, statement analysis
  • Messaging Span Analysis - Kafka, RabbitMQ, SQS throughput and latency
  • RPC Span Analysis - gRPC, SOAP, service dependencies
  • Serverless Span Analysis - Lambda, Azure Functions, cold start analysis
  • Logs Correlation - Joining logs and traces, correlation patterns
  • Network Analysis - IP addresses, DNS resolution, communication mapping

Related Skills

  • dt-dql-essentials - Core DQL syntax for querying trace data
  • dt-app-dashboards - Embed trace queries in dashboards
  • dt-migration - Smartscape entity model and relationship navigation