dt-obs-problems

Problem Analysis Skill

Analyze Dynatrace AI-detected problems including root cause identification, impact assessment, and correlation with logs and metrics.

Overview

Dynatrace automatically detects anomalies, performance degradations, and failures across your environment. It creates problems that aggregate related alert, warning, and info-level events and provide root cause and impact insights.

What are Problems?

Problems are automatically detected software and infrastructure health and resilience issues that:

Automatically correlate related alert, warning, and info-level events across services, infrastructure, frontend applications, and user sessions
Identify root causes using causal analysis of Smartscape dependencies
Assess business impact by tracking affected users and services
Reduce alert noise by grouping related symptoms into single problems that share the same root cause and impact
Track problem lifecycle from early detection through resolution

Event Kinds

The `event.kind` field (stable, permission) identifies the high-level event type:

`event.kind` value	Description
`DAVIS_EVENT`	Davis-detected infrastructure/application events
`BIZ_EVENT`	Business events (ingested via API or captured from spans)
`RUM_EVENT`	Real User Monitoring events
`AUDIT_EVENT`	Administrative/security audit events

The `event.provider` field (stable, permission) identifies the event source.

Problem Categories

Common `event.category` values:

Category	Description	Example
AVAILABILITY	Infrastructure or service unavailable	Web service returns no data, synthetic test actively fails, database connection lost
ERROR	Increased error rates beyond baseline	API error rate jumped from 0.1% to 15%
SLOWDOWN	Performance degradation	Response time increased from 200ms to 5000ms
RESOURCE	Resource saturation	Container memory at 95%, causing OOM kills
CUSTOM	Custom anomaly detections	Business KPI (orders/minute) dropped below threshold
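
To narrow a query to a single category, compare `event.category` against one of the values in the table. The following is a minimal sketch, not a canonical pattern: the 24-hour window and the choice of `SLOWDOWN` are illustrative assumptions.

```dql
// Sketch: active SLOWDOWN problems from the last 24 hours
// (window and category are illustrative; swap in any value from the table)
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| filter event.status == "ACTIVE" and event.category == "SLOWDOWN"
| fields event.start, display_id, event.name, event.category
| sort event.start desc
```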

Problem Lifecycle

text

Detection → ACTIVE → Under Investigation → CLOSED

ACTIVE: Currently occurring issues requiring attention
CLOSED: Resolved issues used for historical analysis
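
As an example of using CLOSED problems for historical analysis, the sketch below lists recently resolved problems with how long each one lasted. It assumes `event.end` is populated for closed problems; the 7-day window and the `duration` field name are illustrative.

```dql
// Sketch: longest-running problems that closed within the last 7 days
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate) and event.status == "CLOSED"
| fieldsAdd duration = event.end - event.start   // subtracting two timestamps yields a duration
| fields display_id, event.name, event.start, event.end, duration
| sort duration desc
| limit 20
```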

Essential Fields

Common Field Name Mistakes

❌ WRONG	✅ CORRECT	Description
`title`	`event.name`	Problem title/description
`status`	`event.status`	Problem lifecycle status
`severity`	`event.category`	Problem type/category
`start`	`event.start`	Problem start time

Correct Status Values

dql

// ✅ CORRECT: Use these status values
fetch dt.davis.problems
| filter event.status == "ACTIVE"   // Currently occurring problems
//     or event.status == "CLOSED"  // Resolved problems
// ❌ INCORRECT: event.status == "OPEN" does not exist!
| limit 1

Key Fields Reference

dql

fetch dt.davis.problems, from:now() - 1h
| filter not(dt.davis.is_duplicate)
| fields
    event.start,                          // Problem start timestamp
    event.end,                            // Problem end timestamp (if closed)
    display_id,                           // Human-readable problem ID (P-XXXXX)
    event.name,                           // Problem title
    event.description,                    // Detailed description
    event.category,                       // Problem type
    event.status,                         // ACTIVE or CLOSED
    dt.smartscape_source.id,              // The smartscape ID for the affected resource
    dt.davis.affected_users_count,        // Number of affected users
    smartscape.affected_entity.ids,       // Array of affected entity IDs
    dt.smartscape.service,                // Affected services (may be array)
    dt.davis.root_cause_entity,           // Entity identified as root cause
    root_cause_entity_id,                 // Root cause entity ID
    root_cause_entity_name,               // Human-readable root cause name
    dt.davis.is_duplicate,                // Whether this is a duplicate detection
    dt.davis.is_rootcause                 // Root cause vs. symptom
| limit 10

Standard Query Pattern

Always start problem queries with this foundation:

dql

fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| fields event.start, display_id, event.name, event.category
| sort event.start desc
| limit 20

Key components:

`fetch dt.davis.problems` - The problems data source
`not(dt.davis.is_duplicate)` - Filter out duplicate detections
`event.status == "ACTIVE"` - Show only active problems
Time range - Always specify a reasonable window

Common Query Patterns

Active Problems by Category

dql

fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| summarize problem_count = count(), by: {event.category}
| sort problem_count desc

High-Impact Active Problems (affecting many users)

dql

fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter dt.davis.affected_users_count > 100
| fields event.start, display_id, event.name, dt.davis.affected_users_count, event.category
| sort dt.davis.affected_users_count desc

High-Impact Active Problems (affecting many Smartscape entities)

dql

fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter arraySize(affected_entity_ids) > 5
| fields event.start, display_id, event.name, affected_entity_ids, event.category, impacted_entity_count = arraySize(affected_entity_ids)
| sort impacted_entity_count desc

Specific Problem Details

dql

fetch dt.davis.problems
| filter display_id == "P-XXXXXXXXXX"
| fields event.start, event.end, event.name, event.description, affected_entity_ids, dt.davis.affected_users_count, root_cause_entity_id, root_cause_entity_name

Service-Specific Problem History

dql

fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter in(dt.entity.service, "SERVICE-XXXXXXXXX") or in(dt.smartscape.service, toSmartscapeId("SERVICE-XXXXXXXXX"))
| summarize problems = count(), by: {event.category, event.status}

Important: Entity Filter DO and DON'T

DO use array-safe filters and include both deprecated and Smartscape service fields when filtering by service ID:

dql

| filter in(dt.entity.service, "SERVICE-00E66996F1555897") or in(dt.smartscape.service, toSmartscapeId("SERVICE-00E66996F1555897"))

DON'T use scalar equality on service fields or only one field variant:

dql

// Wrong: not array-safe and misses Smartscape-only matches
| filter dt.entity.service == "SERVICE-00E66996F1555897"

Root Cause Analysis Patterns

Basic Root Cause Query

dql

fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| fields
    display_id,
    event.name,
    event.description,
    root_cause_entity_id,
    root_cause_entity_name,
    smartscape.affected_entity.ids

Root Cause by Entity Type

Identify which root cause entities most frequently cause problems (the query groups by entity name):

dql

fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| summarize problem_count = count(), by:{root_cause_entity_name}
| sort problem_count desc
| limit 20

Affected entity is an AWS resource

dql

fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(arrayToString(smartscape.affected_entity.types, delimiter:","), "AWS_")

Infrastructure Root Cause with Service Impact

dql

fetch dt.davis.problems, from:now() - 30m
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(root_cause_entity_id, "HOST-")
| filter isNotNull(dt.smartscape.service)
| fields display_id, event.name, root_cause_entity_name, dt.smartscape.service

Problem Blast Radius

Calculate entity impact per root cause:

dql

fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| fieldsAdd affected_count = arraySize(smartscape.affected_entity.ids)
| summarize
    avg_affected = avg(affected_count),
    max_affected = max(affected_count),
    problem_count = count(),
    by:{root_cause_entity_name}
| sort avg_affected desc

Recurring Root Causes

Identify entities repeatedly causing problems:

dql

fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| summarize
    problem_count = count(),
    first_occurrence = min(event.start),
    last_occurrence = max(event.start),
    by:{root_cause_entity_id, root_cause_entity_name}
| filter problem_count > 3
| sort problem_count desc

Problem Trending and Pattern Analysis

Track problem trends over time, identify recurring issues, and analyze resolution performance.

Primary Files:

`references/problem-trending.md` - Timeseries analysis and pattern detection

Common Use Cases:

Active problems over time with `makeTimeseries`
Problem creation rate by category
Recurring problem detection by schedule
Resolution time trends and P95 duration analysis
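
The first use case, counting problems that are open over time, can be sketched as follows. This is a hedged example rather than the reference implementation: it assumes the `spread` parameter of `makeTimeseries` is available in your environment, and the 7-day window, 1-hour interval, and `active_problems` name are illustrative.

```dql
// Sketch: how many problems were open during each hour of the last 7 days
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| makeTimeseries active_problems = count(),
    // spread counts each problem in every interval between its start and end;
    // coalesce() treats still-active problems (no event.end) as lasting until now
    spread: timeframe(from: event.start, to: coalesce(event.end, now())),
    interval: 1h
```

Spreading across the lifecycle matters here because a problem that started yesterday and is still open should count toward every interval in between, not only the one in which it started.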

Key Techniques:

`makeTimeseries` vs `bin()`: Choose the right approach for lifecycle spans vs discrete events
NULL handling: Use `coalesce(event.end, now())` for active problems
Peak hours analysis: Identify when problems occur most frequently
Impact trending: Track user impact changes over time

See `references/problem-trending.md` for complete query patterns and best practices.
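
For discrete events such as problem creation, bucketing the start time with `bin()` is usually enough. The sketch below counts problems created per hour and category; the 7-day window, 1-hour bucket, and the `hour` and `created` names are illustrative assumptions, not values taken from the reference file.

```dql
// Sketch: problem creation rate per hour, split by category
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| summarize created = count(), by: {hour = bin(event.start, 1h), event.category}
| sort hour asc
```

Each problem lands in exactly one bucket (the hour it started), which suits creation-rate questions but undercounts problems that stay open across several buckets.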

Best Practices

Essential Rules

Always filter duplicates: Use `not(dt.davis.is_duplicate)` to avoid counting the same problem multiple times
Use correct status values: `"ACTIVE"` or `"CLOSED"`, never `"OPEN"`
Specify time ranges: Always include time bounds to optimize performance
Include `display_id`: Essential for problem identification and linking
Test incrementally: Add one filter or field at a time when building queries
Filter early: Apply `not(dt.davis.is_duplicate)` immediately after `fetch`

Query Development

Start simple: Begin with basic filtering, then add complexity
Test fields first: Run with `| limit 1` to verify field names exist
Use meaningful time ranges: Too broad wastes resources, too narrow misses data
Document problem IDs: Always capture and store `display_id` for reference

Root Cause Verification

Filter on `isNotNull(root_cause_entity_id)` whenever the analysis depends on an identified root cause
Cross-reference events using `dt.davis.event_ids`
Consider time delays: the root cause may appear in logs minutes before the problem is raised
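
Cross-referencing through `dt.davis.event_ids` might look like the sketch below. It rests on assumptions that are not confirmed by this document: that the underlying events can be fetched from `dt.davis.events`, that they are keyed by `event.id`, and that the `evt.` prefix is an acceptable name for the joined fields. Verify the field names with `| limit 1` before relying on it.

```dql
// Sketch: list the Davis events linked to one problem
// Assumptions: dt.davis.events is the event source and event.id is its key
fetch dt.davis.problems, from:now() - 24h
| filter display_id == "P-XXXXXXXXXX"
| expand dt.davis.event_ids                 // one record per linked event ID
| lookup [
    fetch dt.davis.events, from:now() - 24h
    | fields event.id, event.name, event.start, event.category
  ], sourceField: dt.davis.event_ids, lookupField: event.id, prefix: "evt."
| fields display_id, dt.davis.event_ids, evt.event.name, evt.event.start, evt.event.category
```

To account for time delays, consider widening the inner `from:` window so that events starting shortly before the problem are still matched.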

Time Range Guidelines

dql

// ✅ GOOD - Specific time range
fetch dt.davis.problems, from:now() - 4h

dql

// ❌ BAD - Scans all historical data
fetch dt.davis.problems
