dt-obs-problems

Problem Analysis Skill

Analyze Dynatrace AI-detected problems including root cause identification, impact assessment, and correlation with logs and metrics.

Overview

Dynatrace automatically detects anomalies, performance degradations, and failures across your environment. It creates problems that aggregate related alert, warning, and info-level events and provide root cause and impact insights.

What are Problems?

Problems are automatically detected software and infrastructure health and resilience issues that:
  • Automatically correlate related alert, warning, and info-level events across services, infrastructure, frontend applications, and user sessions
  • Identify root causes using causal analysis of Smartscape dependencies
  • Assess business impact by tracking affected users and services
  • Reduce alert noise by grouping related symptoms into single problems that share the same root cause and impact
  • Track problem lifecycle from early detection through resolution

Event Kinds

The event.kind field (stable, permission) identifies the high-level event type:

| event.kind value | Description |
|---|---|
| DAVIS_EVENT | Davis-detected infrastructure/application events |
| BIZ_EVENT | Business events (ingested via API or captured from spans) |
| RUM_EVENT | Real User Monitoring events |
| AUDIT_EVENT | Administrative/security audit events |

The event.provider field (stable, permission) identifies the event source.
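
To check which event kinds and providers actually occur in an environment, a quick aggregation can help. This is a minimal sketch, assuming Davis events are read from the dt.davis.events table and that a one-hour window is enough for your data volume; other event kinds may be stored in different tables.

```dql
// Count Davis events by kind and provider over the last hour
// Assumption: dt.davis.events is the table holding Davis events in your environment
fetch dt.davis.events, from:now() - 1h
| summarize event_count = count(), by:{event.kind, event.provider}
| sort event_count desc
```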

Problem Categories

Common event.category values:

| Category | Description | Example |
|---|---|---|
| AVAILABILITY | Infrastructure or service unavailable | Web service returns no data, synthetic test actively fails, database connection lost |
| ERROR | Increased error rates beyond baseline | API error rate jumped from 0.1% to 15% |
| SLOWDOWN | Performance degradation | Response time increased from 200ms to 5000ms |
| RESOURCE | Resource saturation | Container memory at 95%, causing OOM kills |
| CUSTOM | Custom anomaly detections | Business KPI (orders/minute) dropped below threshold |
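
The category values above can be used directly as filter values. The following sketch lists recent availability and error problems; the 24-hour window and the choice of categories are illustrative assumptions, not recommendations from the original text.

```dql
// Recent AVAILABILITY and ERROR problems, newest first
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| filter event.category == "AVAILABILITY" or event.category == "ERROR"
| fields event.start, display_id, event.name, event.category, event.status
| sort event.start desc
| limit 20
```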

Problem Lifecycle

text
Detection → ACTIVE → Under Investigation → CLOSED
  • ACTIVE: Currently occurring issues requiring attention
  • CLOSED: Resolved issues used for historical analysis
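
To see how problems are currently split between the two statuses, a conditional count per category is one option. This is a sketch under the assumption that a 24-hour window is representative; adjust it to your needs.

```dql
// Count ACTIVE and CLOSED problems per category over the last 24 hours
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| summarize
    active_count = countIf(event.status == "ACTIVE"),
    closed_count = countIf(event.status == "CLOSED"),
    by:{event.category}
| sort active_count desc
```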

Essential Fields

Common Field Name Mistakes

| ❌ WRONG | ✅ CORRECT | Description |
|---|---|---|
| title | event.name | Problem title/description |
| status | event.status | Problem lifecycle status |
| severity | event.category | Problem type/category |
| start | event.start | Problem start time |

Correct Status Values

dql
// ✅ CORRECT: Use these status values
fetch dt.davis.problems
| filter event.status == "ACTIVE"   // Currently occurring problems
//     or event.status == "CLOSED"  // Resolved problems
// ❌ INCORRECT: event.status == "OPEN" does not exist!
| limit 1

Key Fields Reference

dql
fetch dt.davis.problems, from:now() - 1h
| filter not(dt.davis.is_duplicate)
| fields
    event.start,                          // Problem start timestamp
    event.end,                            // Problem end timestamp (if closed)
    display_id,                           // Human-readable problem ID (P-XXXXX)
    event.name,                           // Problem title
    event.description,                    // Detailed description
    event.category,                       // Problem type
    event.status,                         // ACTIVE or CLOSED
    dt.smartscape_source.id,              // The smartscape ID for the affected resource
    dt.davis.affected_users_count,        // Number of affected users
    smartscape.affected_entity.ids,        // Array of affected entity IDs
    dt.smartscape.service,                // Affected services (may be array)
    dt.davis.root_cause_entity,           // Entity identified as root cause
    root_cause_entity_id,                 // Root cause entity ID
    root_cause_entity_name,               // Human-readable root cause name
    dt.davis.is_duplicate,                // Whether duplicate detection
    dt.davis.is_rootcause                 // Root cause vs. symptom
| limit 10

Standard Query Pattern

Always start problem queries with this foundation:
dql
fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| fields event.start, display_id, event.name, event.category
| sort event.start desc
| limit 20
Key components:
  • fetch dt.davis.problems - The problems data source
  • not(dt.davis.is_duplicate) - Filter out duplicate detections
  • event.status == "ACTIVE" - Show only active problems
  • Time range - Always specify a reasonable window

Common Query Patterns

Active Problems by Category

dql
fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| summarize problem_count = count(), by: {event.category}
| sort problem_count desc

High-Impact Active Problems (affecting many users)

dql
fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter dt.davis.affected_users_count > 100
| fields event.start, display_id, event.name, dt.davis.affected_users_count, event.category
| sort dt.davis.affected_users_count desc

High-Impact Active Problems (affecting many smartscape entities)

dql
fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter arraySize(affected_entity_ids) > 5
| fields event.start, display_id, event.name, affected_entity_ids, event.category, impacted_entity_count = arraySize(affected_entity_ids)
| sort impacted_entity_count desc

Specific Problem Details

dql
fetch dt.davis.problems
| filter display_id == "P-XXXXXXXXXX"
| fields event.start, event.end, event.name, event.description, affected_entity_ids, dt.davis.affected_users_count, root_cause_entity_id, root_cause_entity_name

Service-Specific Problem History

dql
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter in(dt.entity.service, "SERVICE-XXXXXXXXX") or in(dt.smartscape.service, toSmartscapeId("SERVICE-XXXXXXXXX"))
| summarize problems = count(), by: {event.category, event.status}

Important: Entity Filter DO and DON'T

  • DO use array-safe filters and include both deprecated and Smartscape service fields when filtering by service ID:
    dql
    | filter in(dt.entity.service, "SERVICE-00E66996F1555897") or in(dt.smartscape.service, toSmartscapeId("SERVICE-00E66996F1555897"))
  • DON'T use scalar equality on service fields or only one field variant:
    dql
    // Wrong: not array-safe and misses Smartscape-only matches
    | filter dt.entity.service == "SERVICE-00E66996F1555897"

Root Cause Analysis Patterns

Basic Root Cause Query

dql
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| fields
    display_id,
    event.name,
    event.description,
    root_cause_entity_id,
    root_cause_entity_name,
    smartscape.affected_entity.ids

Root Cause by Entity Type

Identify which entities are most frequently identified as the root cause of problems (this query groups by root cause entity name):
dql
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| summarize problem_count = count(), by:{root_cause_entity_name}
| sort problem_count desc
| limit 20

Affected entity is an AWS resource

dql
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(arrayToString(smartscape.affected_entity.types, delimiter:","), "AWS_")

Infrastructure Root Cause with Service Impact

dql
fetch dt.davis.problems, from:now() - 30m
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(root_cause_entity_id, "HOST-")
| filter isNotNull(dt.smartscape.service)
| fields display_id, event.name, root_cause_entity_name, dt.smartscape.service

Problem Blast Radius

Calculate entity impact per root cause:
dql
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| fieldsAdd affected_count = arraySize(smartscape.affected_entity.ids)
| summarize
    avg_affected = avg(affected_count),
    max_affected = max(affected_count),
    problem_count = count(),
    by:{root_cause_entity_name}
| sort avg_affected desc

Recurring Root Causes

Identify entities repeatedly causing problems:
dql
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| summarize
    problem_count = count(),
    first_occurrence = min(event.start),
    last_occurrence = max(event.start),
    by:{root_cause_entity_id, root_cause_entity_name}
| filter problem_count > 3
| sort problem_count desc

Problem Trending and Pattern Analysis

Track problem trends over time, identify recurring issues, and analyze resolution performance.
Primary Files:
  • references/problem-trending.md - Timeseries analysis and pattern detection
Common Use Cases:
  • Active problems over time with makeTimeseries
  • Problem creation rate by category
  • Recurring problem detection by schedule
  • Resolution time trends and P95 duration analysis
Key Techniques:
  • makeTimeseries vs bin(): Choose the right approach for lifecycle spans vs discrete events
  • NULL handling: Use coalesce(event.end, now()) for active problems
  • Peak hours analysis: Identify when problems occur most frequently
  • Impact trending: Track user impact changes over time
See references/problem-trending.md for complete query patterns and best practices.
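
The difference between the two techniques can be sketched as follows. Both queries are illustrations under assumed 24-hour windows and one-hour intervals, not the patterns from references/problem-trending.md. The first treats each problem as a lifecycle span, so a problem is counted in every interval between its start and its end; coalesce(event.end, now()) keeps still-active problems (which have no end yet) in the series.

```dql
// Lifecycle spans: number of problems open during each hour
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| makeTimeseries active_problems = count(),
    spread: timeframe(from: event.start, to: coalesce(event.end, now())),
    interval: 1h
```

The second treats problem creation as a discrete event, so each problem is counted once, in the hour in which it started.

```dql
// Discrete events: problems created per hour and category
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| summarize created = count(), by:{hour = bin(event.start, 1h), event.category}
| sort hour asc
```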

Best Practices

Essential Rules

  1. Always filter duplicates: Use not(dt.davis.is_duplicate) to avoid counting the same problem multiple times
  2. Use correct status values: "ACTIVE" or "CLOSED", never "OPEN"
  3. Specify time ranges: Always include time bounds to optimize performance
  4. Include display_id: Essential for problem identification and linking
  5. Test incrementally: Add one filter or field at a time when building queries
  6. Filter early: Apply not(dt.davis.is_duplicate) immediately after fetch

Query Development

  • Start simple: Begin with basic filtering, then add complexity
  • Test fields first: Run with | limit 1 to verify field names exist
  • Use meaningful time ranges: Too broad wastes resources, too narrow misses data
  • Document problem IDs: Always capture and store display_id for reference

Root Cause Verification

  • Filter on isNotNull(root_cause_entity_id) whenever the analysis depends on an identified root cause
  • Cross-reference events using dt.davis.event_ids
  • Consider time delays: the root cause may appear in logs minutes before the problem is detected
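
Cross-referencing through dt.davis.event_ids could look roughly like the sketch below. It assumes that the underlying events are stored in dt.davis.events and that their identifier field is event.id; both are assumptions to verify in your environment. P-XXXXXXXXXX is a placeholder for a real display_id, and the davis_event. prefix is an arbitrary name chosen for this example.

```dql
// Expand the event IDs attached to one problem and look up the matching Davis events
fetch dt.davis.problems, from:now() - 24h
| filter display_id == "P-XXXXXXXXXX"
| expand dt.davis.event_ids
| lookup [
    fetch dt.davis.events, from:now() - 24h
    | fields event.id, event.name, event.start
  ], sourceField: dt.davis.event_ids, lookupField: event.id, prefix: "davis_event."
| fields display_id, dt.davis.event_ids, davis_event.event.name, davis_event.event.start
```

Because a root cause may surface before the problem itself, the window of the inner fetch may need to start earlier than the problem's event.start.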

Time Range Guidelines

dql
// ✅ GOOD - Specific time range
fetch dt.davis.problems, from:now() - 4h
dql
// ❌ BAD - No explicit time range; results depend on the default or UI-selected timeframe
fetch dt.davis.problems

Related Documentation

  • references/problem-trending.md: Problem trending and timeseries analysis patterns
  • references/problem-correlation.md: Correlating problems with logs and other telemetry
  • references/impact-analysis.md: Business and technical impact assessment
  • references/problem-merging.md: When and why Davis merges events into problems

Related Skills

  • dt-dql-essentials - Core DQL syntax and query structure for problem queries
  • dt-obs-logs - Correlate problems with application and infrastructure logs
  • dt-obs-tracing - Investigate problems through distributed trace analysis