dt-obs-problems
Problem Analysis Skill
Analyze Dynatrace AI-detected problems including root cause identification, impact assessment, and correlation with logs and metrics.
Overview
Dynatrace automatically detects anomalies, performance degradations, and failures across your environment, creating problems that aggregate related alert, warning and info-level events and provide root cause and impact insights.
What are Problems?
Problems are automatically detected software and infrastructure health and resilience issues that:
- Automatically correlate related alert, warning, and info-level events across services, infrastructure, frontend applications, and user sessions
- Identify root causes using causal analysis of Smartscape dependencies
- Assess business impact by tracking affected users and services
- Reduce alert noise by grouping related symptoms into single problems that share the same root cause and impact
- Track problem lifecycle from early detection through resolution
Event Kinds
The event.kind field (stable, permission) identifies the high-level event type:
| event.kind | Description |
|---|---|
| | Davis-detected infrastructure/application events |
| | Business events (ingested via API or captured from spans) |
| | Real User Monitoring events |
| | Administrative/security audit events |
Problem Categories
Common event.category values:
| Category | Description | Example |
|---|---|---|
| AVAILABILITY | Infrastructure or service unavailable | Web service returns no data, synthetic test actively fails, database connection lost |
| ERROR | Increased error rates beyond baseline | API error rate jumped from 0.1% to 15% |
| SLOWDOWN | Performance degradation | Response time increased from 200ms to 5000ms |
| RESOURCE | Resource saturation | Container memory at 95%, causing OOM kills |
| CUSTOM | Custom anomaly detections | Business KPI (orders/minute) dropped below threshold |
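As a small illustration of how these categories can be used in a query, the sketch below narrows active problems to a single category. It reuses only fields that appear elsewhere in this skill; the SLOWDOWN value and the 24-hour window are illustrative choices, not recommendations.

```dql
// Sketch: active SLOWDOWN problems from the last 24 hours.
// Swap "SLOWDOWN" for any other category value from the table above.
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter event.category == "SLOWDOWN"
| fields event.start, display_id, event.name, event.category
| sort event.start desc
| limit 20
```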
Problem Lifecycle
text
Detection → ACTIVE → Under Investigation → CLOSED
- ACTIVE: Currently occurring issues requiring attention
- CLOSED: Resolved issues used for historical analysis
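For historical analysis of CLOSED problems, a query along the following lines may help. It is a minimal sketch that assumes event.end is populated for closed problems and that subtracting two timestamps yields a duration; duration_open is an illustrative field name, not part of the data model.

```dql
// Sketch: how long did recently closed problems stay open?
// duration_open is a hypothetical name introduced for this example.
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "CLOSED"
| fieldsAdd duration_open = event.end - event.start
| fields display_id, event.name, event.start, event.end, duration_open
| sort duration_open desc
| limit 20
```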
Essential Fields
Common Field Name Mistakes
| ❌ WRONG | ✅ CORRECT | Description |
|---|---|---|
| | event.name | Problem title/description |
| | event.status | Problem lifecycle status |
| | event.category | Problem type/category |
| | event.start | Problem start time |
Correct Status Values
dql
// ✅ CORRECT: Use these status values
fetch dt.davis.problems
| filter event.status == "ACTIVE" // Currently occurring problems
// or event.status == "CLOSED" // Resolved problems
// ❌ INCORRECT: event.status == "OPEN" does not exist!
| limit 1
Key Fields Reference
dql
fetch dt.davis.problems, from:now() - 1h
| filter not(dt.davis.is_duplicate)
| fields
    event.start, // Problem start timestamp
    event.end, // Problem end timestamp (if closed)
    display_id, // Human-readable problem ID (P-XXXXX)
    event.name, // Problem title
    event.description, // Detailed description
    event.category, // Problem type
    event.status, // ACTIVE or CLOSED
    dt.smartscape_source.id, // Smartscape ID of the affected resource
    dt.davis.affected_users_count, // Number of affected users
    smartscape.affected_entity.ids, // Array of affected entity IDs
    dt.smartscape.service, // Affected services (may be array)
    dt.davis.root_cause_entity, // Entity identified as root cause
    root_cause_entity_id, // Root cause entity ID
    root_cause_entity_name, // Human-readable root cause name
    dt.davis.is_duplicate, // Whether this record is a duplicate detection
    dt.davis.is_rootcause // Root cause vs. symptom
| limit 10
Standard Query Pattern
Always start problem queries with this foundation:
dql
fetch dt.davis.problems, from:now() - 2h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| fields event.start, display_id, event.name, event.category
| sort event.start desc
| limit 20
Key components:
- fetch dt.davis.problems - The problems data source
- not(dt.davis.is_duplicate) - Filter out duplicate detections
- event.status == "ACTIVE" - Show only active problems
- Time range - Always specify a reasonable window
Common Query Patterns
Active Problems by Category
dql
fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| summarize problem_count = count(), by: {event.category}
| sort problem_count desc
High-Impact Active Problems (affecting many users)
dql
fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter dt.davis.affected_users_count > 100
| fields event.start, display_id, event.name, dt.davis.affected_users_count, event.category
| sort dt.davis.affected_users_count desc
High-Impact Active Problems (affecting many Smartscape entities)
dql
fetch dt.davis.problems
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter arraySize(affected_entity_ids) > 5
| fields event.start, display_id, event.name, affected_entity_ids, event.category, impacted_entity_count = arraySize(affected_entity_ids)
| sort impacted_entity_count desc
Specific Problem Details
dql
fetch dt.davis.problems
| filter display_id == "P-XXXXXXXXXX"
| fields event.start, event.end, event.name, event.description, affected_entity_ids, dt.davis.affected_users_count, root_cause_entity_id, root_cause_entity_name
Service-Specific Problem History
dql
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter in(dt.entity.service, "SERVICE-XXXXXXXXX") or in(dt.smartscape.service, toSmartscapeId("SERVICE-XXXXXXXXX"))
| summarize problems = count(), by: {event.category, event.status}
Important: Entity Filter DO and DON'T
- DO use array-safe filters and include both the deprecated dt.entity.service field and the Smartscape dt.smartscape.service field when filtering by service ID:
dql
| filter in(dt.entity.service, "SERVICE-00E66996F1555897") or in(dt.smartscape.service, toSmartscapeId("SERVICE-00E66996F1555897"))
- DON'T use scalar equality on service fields or only one field variant:
dql
// Wrong: not array-safe and misses Smartscape-only matches
| filter dt.entity.service == "SERVICE-00E66996F1555897"
Root Cause Analysis Patterns
Basic Root Cause Query
dql
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| fields
    display_id,
    event.name,
    event.description,
    root_cause_entity_id,
    root_cause_entity_name,
    smartscape.affected_entity.ids
Root Cause by Entity Type
Identify which entities are most frequently reported as the root cause of problems (the query groups by root_cause_entity_name):
dql
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| summarize problem_count = count(), by:{root_cause_entity_name}
| sort problem_count desc
| limit 20
Affected entity is an AWS resource
dql
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(arrayToString(smartscape.affected_entity.types, delimiter:","), "AWS_")
Infrastructure Root Cause with Service Impact
dql
fetch dt.davis.problems, from:now() - 30m
| filter not(dt.davis.is_duplicate) and event.status == "ACTIVE"
| filter matchesPhrase(root_cause_entity_id, "HOST-")
| filter isNotNull(dt.smartscape.service)
| fields display_id, event.name, root_cause_entity_name, dt.smartscape.service
Problem Blast Radius
Calculate entity impact per root cause:
dql
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| fieldsAdd affected_count = arraySize(smartscape.affected_entity.ids)
| summarize
    avg_affected = avg(affected_count),
    max_affected = max(affected_count),
    problem_count = count(),
    by:{root_cause_entity_name}
| sort avg_affected desc
Recurring Root Causes
Identify entities repeatedly causing problems:
dql
fetch dt.davis.problems, from:now() - 24h
| filter not(dt.davis.is_duplicate)
| filter isNotNull(root_cause_entity_id)
| summarize
    problem_count = count(),
    first_occurrence = min(event.start),
    last_occurrence = max(event.start),
    by:{root_cause_entity_id, root_cause_entity_name}
| filter problem_count > 3
| sort problem_count desc
Problem Trending and Pattern Analysis
Track problem trends over time, identify recurring issues, and analyze resolution performance.
Primary Files:
- references/problem-trending.md - Timeseries analysis and pattern detection
Common Use Cases:
- Active problems over time with makeTimeseries
- Problem creation rate by category
- Recurring problem detection by schedule
- Resolution time trends and P95 duration analysis
Key Techniques:
- makeTimeseries vs bin(): Choose the right approach for lifecycle spans vs discrete events
- NULL handling: Use coalesce(event.end, now()) for active problems
- Peak hours analysis: Identify when problems occur most frequently
- Impact trending: Track user impact changes over time
See references/problem-trending.md for complete query patterns and best practices.
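The two approaches named under Key Techniques can be sketched as follows. These are minimal illustrations rather than the patterns from references/problem-trending.md; the 7-day window, the 1h interval, and the names open_problems, created, and hour are assumptions made for the example. The first sketch assumes makeTimeseries accepts a spread: timeframe for count(), which lets each problem count in every interval between its start and end:

```dql
// Sketch 1 (lifecycle spans): problems open at each point in time.
// coalesce(event.end, now()) treats still-active problems as open until now.
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| makeTimeseries open_problems = count(),
    spread: timeframe(from: event.start, to: coalesce(event.end, now())),
    interval: 1h
```

The second sketch treats problem creation as a discrete event and buckets start times with bin():

```dql
// Sketch 2 (discrete events): problems created per hour and category.
fetch dt.davis.problems, from:now() - 7d
| filter not(dt.davis.is_duplicate)
| summarize created = count(), by:{hour = bin(event.start, 1h), event.category}
| sort hour asc
```

The first form answers "how many problems were open at this time", the second "how many problems started in this hour", so the choice depends on which question is being asked.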
Best Practices
Essential Rules
- Always filter duplicates: Use not(dt.davis.is_duplicate) to avoid counting the same problem multiple times
- Use correct status values: "ACTIVE" or "CLOSED", never "OPEN"
- Specify time ranges: Always include time bounds to optimize performance
- Include display_id: Essential for problem identification and linking
- Test incrementally: Add one filter or field at a time when building queries
- Filter early: Apply not(dt.davis.is_duplicate) immediately after fetch
Query Development
- Start simple: Begin with basic filtering, then add complexity
- Test fields first: Run with | limit 1 to verify field names exist
- Use meaningful time ranges: Too broad wastes resources, too narrow misses data
- Document problem IDs: Always capture and store display_id for reference
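The "test fields first" step can be as small as the following sketch, which returns a single record so that the available field names can be inspected before any filters or field lists are added. The one-hour window is an arbitrary illustrative choice.

```dql
// Sketch: fetch one problem record to inspect which fields exist.
fetch dt.davis.problems, from:now() - 1h
| limit 1
```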
Root Cause Verification
- Filter on isNotNull(root_cause_entity_id) whenever a query depends on an identified root cause
- Cross-reference events using dt.davis.event_ids
- Consider time delays: the root cause may appear in logs minutes before the problem is raised
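One possible starting point for the cross-referencing step is sketched below. It assumes dt.davis.event_ids is an array on the problem record, and it reuses the "P-XXXXXXXXXX" placeholder display ID from the Specific Problem Details query; the 24-hour window is illustrative.

```dql
// Sketch: list the event IDs attached to one problem, one row per ID.
// Assumes dt.davis.event_ids is an array field; replace the placeholder ID.
fetch dt.davis.problems, from:now() - 24h
| filter display_id == "P-XXXXXXXXXX"
| expand dt.davis.event_ids
| fields display_id, event.name, dt.davis.event_ids
```

The expanded IDs could then be matched against the corresponding events, assuming an events table with a matching ID field is available in your environment.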
Time Range Guidelines
dql
// ✅ GOOD - Specific time range
fetch dt.davis.problems, from:now() - 4h
dql
// ❌ BAD - Scans all historical data
fetch dt.davis.problems
Related Documentation
- references/problem-trending.md: Problem trending and timeseries analysis patterns
- references/problem-correlation.md: Correlating problems with logs and other telemetry
- references/impact-analysis.md: Business and technical impact assessment
- references/problem-merging.md: When and why Davis merges events into problems
Related Skills
- dt-dql-essentials - Core DQL syntax and query structure for problem queries
- dt-obs-logs - Correlate problems with application and infrastructure logs
- dt-obs-tracing - Investigate problems through distributed trace analysis