dt-obs-frontends


Frontend Observability Skill

Monitor web and mobile frontends using Real User Monitoring (RUM) with DQL queries. This skill targets the new RUM experience only; do not use classic RUM data.

Overview

This skill helps you:
  • Monitor Core Web Vitals and frontend performance
  • Track user sessions, engagement, and behavior
  • Analyze errors and correlate with backend traces
  • Optimize mobile app startup and stability
  • Diagnose performance issues with detailed timing analysis
Data Sources:
  • Metrics: timeseries with dt.frontend.* (trends, alerting)
  • Events: fetch user.events (individual page views, requests, clicks, errors)
  • Sessions: fetch user.sessions (session-level aggregates: duration, bounce, counts)

Quick Reference

Common Metrics

  • dt.frontend.user_action.count
    - User action volume
  • dt.frontend.user_action.duration
    - User action duration
  • dt.frontend.request.count
    - Request volume
  • dt.frontend.request.duration
    - Request latency (ms)
  • dt.frontend.error.count
    - Error counts
  • dt.frontend.session.active.estimated_count
    - Active sessions
  • dt.frontend.user.active.estimated_count
    - Unique users
  • dt.frontend.web.page.cumulative_layout_shift
    - CLS metric
  • dt.frontend.web.navigation.dom_interactive
    - DOM interactive time
  • dt.frontend.web.page.first_input_delay
    - FID metric (legacy; prefer INP)
  • dt.frontend.web.page.largest_contentful_paint
    - LCP metric
  • dt.frontend.web.page.interaction_to_next_paint
    - INP metric
  • dt.frontend.web.navigation.load_event_end
    - Load event end
  • dt.frontend.web.navigation.time_to_first_byte
    - Time to first byte
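
These metrics can be queried with the timeseries command for trend views. A minimal sketch, assuming the standard timeseries parameters (by:, interval:, filter:) — adjust the metric and interval to your needs:
dql
timeseries action_duration = avg(dt.frontend.user_action.duration),
   by: {frontend.name},
   interval: 5m,
   filter: dt.rum.user_type == "real_user"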

Common Filters

  • frontend.name
    - Filter by frontend name (e.g. my-frontend)
  • dt.rum.user_type
    - Exclude synthetic monitoring
  • geo.country.iso_code
    - Geographic filtering
  • device.type
    - Mobile, desktop, tablet
  • browser.name
    - Browser filtering
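
These filters compose in ordinary filter pipes. A minimal sketch combining them on event data (my-frontend is a placeholder name):
dql
fetch user.events, from: now() - 24h
| filter frontend.name == "my-frontend" and dt.rum.user_type == "real_user"
| filter device.type == "mobile"
| summarize events = count(), by: {browser.name, geo.country.iso_code}
| sort events desc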

Common Timeseries Dimensions

Use these for dt.frontend.* timeseries splits and breakdowns:
  • frontend.name
    - Frontend name
  • geo.country.iso_code
  • device.type
  • browser.name
  • os.name
  • user_type
    - real_user, synthetic, robot
dql
fetch user.events, from: now() - 2h
| filter characteristics.has_page_summary == true
| summarize page_views = count(), by: {frontend.name}
| sort page_views desc

Event Characteristics

  • characteristics.has_page_summary
    - Page views (web)
  • characteristics.has_view_summary
    - Views (mobile)
  • characteristics.has_navigation
    - Navigation events
  • characteristics.has_user_interaction
    - Clicks, forms, etc.
  • characteristics.has_request
    - Network request events
  • characteristics.has_error
    - Error events
  • characteristics.has_crash
    - Mobile crashes
  • characteristics.has_long_task
    - Long JavaScript tasks
  • characteristics.has_csp_violation
    - CSP violations
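
Each characteristic flag selects one event type, so a single breakdown query shows which event types a frontend is actually emitting. A sketch using only the flags listed above (my-frontend is a placeholder name):
dql
fetch user.events, from: now() - 24h
| filter frontend.name == "my-frontend"
| summarize
   page_views = countIf(characteristics.has_page_summary == true),
   interactions = countIf(characteristics.has_user_interaction == true),
   requests = countIf(characteristics.has_request == true),
   errors = countIf(characteristics.has_error == true)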

Session Data (user.sessions)

user.sessions contains session-level aggregates produced by the session aggregation service from user.events. Field names differ from user.events — sessions use underscores where events use dots.
Session identity and context:
  • dt.rum.session.id — Session ID (NOT dt.rum.session_id)
  • dt.rum.instance.id — Instance ID
  • frontend.name — Array of frontends involved in the session
  • dt.rum.application.type — web or mobile
  • dt.rum.user_type — real_user, synthetic, or robot
Session aggregates (underscore naming — NOT dot):
Field                     Description                 ⚠️ NOT this
navigation_count          Number of navigations       navigation.count
user_interaction_count    Clicks, form submissions    user_interaction.count
user_action_count         User actions                user_action.count
request_count             XHR/fetch requests          request.count
event_count               Total events in session     event.count
page_summary_count        Page views (web)            page_summary.count
view_summary_count        Views (mobile/SPA)          view_summary.count
Error fields (dot naming — same as events):
  • error.count, error.exception_count, error.http_4xx_count, error.http_5xx_count
  • error.anr_count, error.csp_violation_count, error.has_crash
Session lifecycle:
  • start_time, end_time, duration (nanoseconds)
  • end_reason — timeout, synthetic_execution_finished, etc.
  • characteristics.is_bounce — Boolean bounce flag
  • characteristics.has_replay — Session replay available
User identity:
  • dt.rum.user_tag — User identifier (typically an email, username, or customer ID), set via the dtrum.identifyUser() API call in the instrumented frontend. Not always populated — only present when the frontend explicitly calls identifyUser().
  • When dt.rum.user_tag is empty, dt.rum.instance.id is often the only user differentiator. The value is a random ID assigned by the RUM agent on the client side, so it is not personally identifiable but can be used to distinguish unique users when user_tag is not set. On web it is backed by a persistent cookie, so the user can delete it.
  • The user tag is a session-level field — query it from user.sessions, not user.events (where it may be empty even if the session has one).
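To see how much of your traffic is identified at all, sessions can be summarized by whether a user tag is present. A minimal sketch using only the session fields above:
dql
fetch user.sessions, from: now() - 24h
| filter dt.rum.user_type == "real_user"
| summarize
   total = count(),
   identified = countIf(isNotNull(dt.rum.user_tag))
| fieldsAdd identified_pct = round((identified * 100.0) / total, decimals: 1)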
Client/device context:
  • browser.name, browser.version, device.type, os.name
  • geo.country.iso_code, client.ip, client.isp
Synthetic-only fields:
  • dt.entity.synthetic_test, dt.entity.synthetic_location, dt.entity.synthetic_test_step
Time window behavior:
  • fetch user.sessions, from: X, to: Y only returns sessions that started in [X, Y] — NOT sessions that were merely active during that window.
  • Sessions can last 8h+ (the aggregation service waits 30+ minutes of inactivity before closing a session).
  • To find all sessions active during a time window, extend the lookback by at least 8 hours: e.g., to cover events from the last 24h, query fetch user.sessions, from: now() - 32h.
  • This matters for correlation queries (e.g., matching user.events to user.sessions by session ID) — a narrow user.sessions window will miss long-running sessions and produce false "orphans."
Session creation delay:
  • The session aggregation service waits for ~30+ minutes of inactivity before closing a session and writing the user.sessions record.
  • This means recent events (the last ~1 hour) will not yet have a matching user.sessions entry — this is normal, not a data gap.
  • When correlating user.events with user.sessions, exclude recent data (e.g., use to: now() - 1h) to avoid counting in-progress sessions as orphans.
Zombie sessions (events without a user.sessions record):
  • Not every dt.rum.session.id in user.events has a corresponding user.sessions record. The session aggregation service intentionally skips zombie sessions — sessions with no real user activity (zero navigations and zero user interactions).
  • Zombie sessions contain only background, machine-driven activity (e.g., automatic XHR requests, heartbeats) with no page views or clicks. Serializing them would add no value for users.
  • When correlating user.events with user.sessions, expect a large number of unmatched session IDs. This is by design, not a data gap. Filter to sessions with activity before diagnosing orphans:
    dql
    fetch user.events, from: now() - 2h, to: now() - 1h
    | filter isNotNull(dt.rum.session.id)
    | summarize navs = countIf(characteristics.has_navigation == true),
        interactions = countIf(characteristics.has_user_interaction == true),
        by: {dt.rum.session.id}
    | filter navs > 0 or interactions > 0
Example — bounce rate and session quality:
dql
fetch user.sessions, from: now() - 24h
| filter dt.rum.user_type == "real_user"
| summarize
    total_sessions = count(),
    bounces = countIf(characteristics.is_bounce == true),
    zero_activity = countIf(toLong(navigation_count) == 0 and toLong(user_interaction_count) == 0),
    avg_duration_s = avg(toLong(duration)) / 1000000000
| fieldsAdd bounce_rate_pct = round((bounces * 100.0) / total_sessions, decimals: 1)

Performance Thresholds

  • LCP: Good <2.5s | Poor >4.0s
  • INP: Good <200ms | Poor >500ms
  • CLS: Good <0.1 | Poor >0.25
  • Cold Start: Good <3s | Poor >5s
  • Long Tasks: >50ms problematic, >250ms severe
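
The thresholds can be checked directly against the Web Vitals metrics. A sketch for LCP — confirm the unit the metric reports in your environment before comparing against the 2.5s/4.0s bands, and note that Core Web Vitals are conventionally assessed at the 75th percentile (avg is used here only for simplicity):
dql
timeseries lcp = avg(dt.frontend.web.page.largest_contentful_paint),
   by: {frontend.name},
   filter: dt.rum.user_type == "real_user"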

Core Workflows

1. Web Performance Monitoring

Track Core Web Vitals, page performance, and request latency for SEO and UX optimization.
Primary Files:
  • references/WebVitals.md
    - Core Web Vitals (LCP, INP, CLS)
  • references/performance-analysis.md
    - Request and page performance
Common Queries:
  • All Core Web Vitals summary
  • Web Vitals by page/device
  • Request duration SLA monitoring
  • Page load performance trends

2. User Session & Behavior Analysis

Understand user engagement, navigation patterns, and session characteristics. Analyze button clicks, form interactions, and user journeys.
Data source choice:
  • Use fetch user.sessions for session-level analysis (bounce rate, session duration, session counts)
  • Use fetch user.events for event-level detail (individual clicks, navigation timing, specific pages)
Primary Files:
  • references/user-sessions.md
    - Session tracking and user analytics
  • references/performance-analysis.md
    - Navigation and engagement patterns
Common Queries:
  • Active sessions by frontend
  • Sessions by custom property
  • Bounce rate analysis (use user.sessions with characteristics.is_bounce)
  • Session quality (zero-activity sessions via navigation_count, user_interaction_count)
  • Click analysis on UI elements (use user.events with characteristics.has_user_interaction)
  • External referrers (traffic sources)
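
For the click analysis query, a minimal event-level sketch that locates the pages where interactions happen (drill into specific elements from there):
dql
fetch user.events, from: now() - 24h
| filter characteristics.has_user_interaction == true
| summarize interactions = count(), by: {frontend.name, page.url.path}
| sort interactions desc
| limit 20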

3. Error Tracking & Debugging

Monitor error rates, analyze exceptions, and correlate frontend issues with backend.
Primary Files:
  • references/error-tracking.md
    - Error analysis and debugging
  • references/performance-analysis.md
    - Trace correlation
Common Queries:
  • Error rate monitoring
  • JavaScript exceptions by type
  • Failed requests with backend traces
  • Request timing breakdown

4. Mobile Frontend Monitoring

Track mobile app performance, startup times, and crash analytics for iOS and Android. Analyze app version performance and device-specific issues.
Primary Files:
  • references/mobile-monitoring.md
    - App starts, crashes, and mobile-specific metrics
Common Queries:
  • Cold start performance by app version (iOS, Android)
  • Warm start and hot start metrics
  • Crash rate by device model and OS version
  • ANR events (Android)
  • Native crash signals
  • App version comparison

5. Advanced Performance Optimization

Deep performance diagnostics including JavaScript profiling, main thread blocking, UI jank analysis, and geographic performance.
Primary Files:
  • references/performance-analysis.md
    - Advanced diagnostics and long tasks
Common Queries:
  • Long JavaScript tasks blocking main thread
  • UI jank and rendering delays
  • Tasks >50ms impacting responsiveness
  • Third-party long tasks (iframes)
  • Single-page app performance issues
  • Geographic performance distribution
  • Performance degradation detection

Best Practices

  1. Use metrics for trends, events for debugging
    • Metrics: Timeseries dashboards, alerting, capacity planning
    • Events: Root cause analysis, detailed diagnostics
  2. Filter by frontend in multi-app environments
    • Always use frontend.name for clarity
  3. Match interval to time range
    • 5m intervals for hours, 1h for days, 1d for weeks
  4. Exclude synthetic traffic when analyzing real users
    • Filter dt.rum.user_type to focus on genuine behavior
  5. Combine metrics with events for complete insights
    • Start with metric trends, drill into events for details
  6. Extend the user.sessions time window for correlation queries
    • user.sessions only returns sessions that started in the query window
    • Sessions can last 8h+, so extend the lookback by at least 8h when joining with user.events
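
Practice 3 maps directly onto the interval: parameter of timeseries. A sketch for a multi-day range, assuming timeseries accepts the same from: parameter as fetch:
dql
timeseries requests = sum(dt.frontend.request.count),
   from: now() - 7d,
   interval: 1h,
   by: {frontend.name}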

Slow Page Load Playbook

Start by segmenting the problem by page, browser, geo location, and dt.rum.user_type.
Heuristics:
  • High TTFB -> slow backend
  • High LCP with normal TTFB -> render bottleneck
  • High CLS -> layout shifts (late-loading content, ads, fonts)
  • Long tasks dominate -> JavaScript execution bottlenecks (heavy frameworks, large bundles)
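The segmentation step can start from page-view events; a minimal sketch splitting traffic by the dimensions above before applying the heuristics:
dql
fetch user.events, from: now() - 24h
| filter characteristics.has_page_summary == true
| summarize page_views = count(), by: {browser.name, geo.country.iso_code, dt.rum.user_type}
| sort page_views desc
| limit 50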

Backend latency (high TTFB)

dql
fetch user.events
| filter frontend.name == "my-frontend" and characteristics.has_request == true
| filter page.url.path == "/checkout"
| summarize avg_ttfb = avg(request.time_to_first_byte), avg_duration = avg(duration)
If TTFB is high, analyze backend spans by correlating frontend events with backend traces using dt.rum.trace_id.

Heavy JavaScript execution (long tasks)

Long tasks by page:
dql
fetch user.events, from: now() - 2h
| filter characteristics.has_long_task == true
| summarize
   long_task_count = count(),
   total_blocking_time = sum(duration),
   by: {frontend.name, page.url.path}
| sort total_blocking_time desc
| limit 20
Long tasks by script source:
dql
fetch user.events, from: now() - 2h
| filter frontend.name == "my-frontend"
| filter characteristics.has_long_task == true
| summarize
   long_task_count = count(),
   total_blocking_time = sum(duration),
   by: {long_task.attribution.container_src}
| sort total_blocking_time desc
| limit 20

Large JavaScript bundles

dql
fetch user.events
| filter frontend.name == "my-frontend"
| filter characteristics.has_request
| filter endsWith(url.full, ".js")
| summarize dls = max(performance.decoded_body_size), by: url.full
| sort dls desc
| limit 20

Large resources

dql
fetch user.events
| filter frontend.name == "my-frontend"
| filter characteristics.has_request
| summarize dls = max(performance.decoded_body_size), by: url.full
| sort dls desc
| limit 20

Cache effectiveness

dql
fetch user.events, from: now() - 2h
| filter frontend.name == "my-frontend"
| filter characteristics.has_request == true
| fieldsAdd cache_status = if(
   performance.incomplete_reason == "local_cache" or performance.transfer_size == 0 and
   (performance.encoded_body_size > 0 or performance.decoded_body_size > 0),
   "cached",
   else: if(performance.transfer_size > 0, "network", else: "uncached")
  )
| summarize
   request_count = count(),
   avg_duration = avg(duration),
   by: {url.domain, cache_status}

Compression waste

dql
fetch user.events, from: now() - 2h
| filter characteristics.has_request == true
| filter isNotNull(performance.encoded_body_size) and isNotNull(performance.decoded_body_size)
| filter performance.encoded_body_size > 0
| fieldsAdd
   expansion_ratio = performance.decoded_body_size / performance.encoded_body_size,
   wasted_bytes = performance.decoded_body_size - performance.encoded_body_size
| summarize
   requests = count(),
   avg_expansion_ratio = avg(expansion_ratio),
   total_wasted_bytes = sum(wasted_bytes),
   by: {request.url.host, request.url.path}
| sort total_wasted_bytes desc
| limit 50

Network issues

Compare by location and domain when TTFB is high but backend performance is good:
dql
fetch user.events, from: now() - 2h
| filter characteristics.has_request == true
| summarize
   request_count = count(),
   avg_duration = avg(duration),
   p75_duration = percentile(duration, 75),
   p95_duration = percentile(duration, 95),
   by: {geo.country.iso_code, request.url.domain}
| sort p95_duration desc
| limit 50
Analyze DNS time:
dql
fetch user.events, from: now() - 2h
| filter characteristics.has_request == true
| filter isNotNull(performance.domain_lookup_start) and isNotNull(performance.domain_lookup_end)
| fieldsAdd dns_ms = performance.domain_lookup_end - performance.domain_lookup_start
| summarize
   request_count = count(),
   avg_dns_ms = avg(dns_ms),
   p75_dns_ms = percentile(dns_ms, 75),
   p95_dns_ms = percentile(dns_ms, 95),
   by: {request.url.domain}
| sort p95_dns_ms desc
| limit 50
Analyze by protocol (http/1.1, h2, h3):
dql
fetch user.events
| filter characteristics.has_request
| summarize cnt = count(), by: {url.domain, performance.next_hop_protocol}
| sort cnt desc
| limit 50

Third-party dependencies

Analyze request performance by domain:
dql
fetch user.events, from: now() - 2h
| filter characteristics.has_request == true
| summarize
   request_count = count(),
   avg_duration = avg(duration),
   p75_duration = percentile(duration, 75),
   p95_duration = percentile(duration, 95),
   by: {request.url.domain}
| sort p95_duration desc
| limit 50

Troubleshooting

Handling Zero Results

When queries return no data, follow this diagnostic workflow:
  1. Validate Timeframe
    • Check if the timeframe is appropriate for the data type
    • RUM data may be delayed (1-2 minutes for recent events)
    • Verify the timeframe syntax: now()-1h to now() or similar
    • Try expanding the timeframe: now()-24h for initial exploration
  2. Verify Frontend Configuration
    • Confirm the frontend is instrumented and sending RUM data
    • Check that the frontend.name filter is correct
    • Test without a frontend filter to see if any RUM data exists
    • Verify the frontend name matches the environment
  3. Check Data Availability
    • Run a basic query: fetch user.events | limit 1
    • If no events exist, RUM may not be configured
    • Check if the timeframe predates the frontend deployment
    • Verify the user has access to the environment
  4. Review Query Syntax
    • Validate that filters aren't too restrictive
    • Check for typos in field names or metric names
    • Test the query incrementally: start simple, add filters gradually
    • Verify characteristics filters match the event types
When to Ask User for Clarification:
  • No RUM data exists in environment → "Is RUM configured for this frontend?"
  • Timeframe unclear → "What time period should I analyze?"
  • Expected data missing → "Has this frontend sent data recently?"
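The workflow above can be run as a short query ladder — start maximally broad, then add one constraint at a time until the data disappears. A sketch (run each step as a separate query; the // comments assume DQL comment syntax):
dql
// Step 1: does any RUM data exist at all?
fetch user.events, from: now() - 24h
| limit 1

// Step 2: which frontends are reporting?
fetch user.events, from: now() - 24h
| summarize events = count(), by: {frontend.name}
| sort events desc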

Handling Anomalous Results

When query results seem unexpected or suspicious:
Unexpected High Values:
  • Metric spikes: Verify interval aggregation (avg vs. max vs. sum)
  • Session counts: Check for bot traffic or synthetic monitoring
  • Error rates: Confirm error definition matches expectations
  • Performance degradation: Look for deployment or infrastructure changes
Unexpected Low Values:
  • Missing sessions: Verify the dt.rum.user_type filter isn't excluding real users
  • Low request counts: Check if frontend filter is too narrow
  • Few errors: Confirm error characteristics filter is correct
  • Missing mobile data: Verify platform-specific fields exist
Inconsistent Data:
  • Metrics vs. Events mismatch: Different aggregation methods are expected
  • Geographic anomalies: Check timezone assumptions
  • Device distribution skew: May reflect actual user base
  • Version mismatches: Verify app version filtering logic
当查询结果看起来不符合预期或可疑时:
异常高值:
  • 指标突增:验证区间聚合方式(平均值vs最大值vs求和)
  • 会话计数异常:检查是否有机器人流量或合成监控
  • 错误率异常:确认错误定义与预期一致
  • 性能降级:排查部署或基础设施变更
异常低值:
  • 会话缺失:验证
    dt.rum.user_type
    过滤器是否排除了真实用户
  • 请求计数低:检查前端过滤器是否过窄
  • 错误很少:确认错误特征过滤器是否正确
  • 移动端数据缺失:验证平台特有字段是否存在
数据不一致:
  • 指标与事件不匹配:聚合方式不同属于预期情况
  • 地理异常:检查时区假设
  • 设备分布倾斜:可能反映真实用户群体特征
  • 版本不匹配:验证应用版本过滤逻辑
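To check whether a spike is an aggregation artifact (the first bullet under "Unexpected High Values"), compare aggregations of the same metric side by side. A sketch, assuming standard DQL timeseries syntax:

```dql
// A spike visible under max() but absent under avg() usually points
// to a few outliers rather than a broad regression
timeseries avg_dur = avg(dt.frontend.request.duration),
           max_dur = max(dt.frontend.request.duration),
           interval: 5m
```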

Decision Tree: Ask vs. Investigate

决策树:询问vs调查

Query returns unexpected results
├─ Is this a zero-result scenario?
│  ├─ YES → Follow "Handling Zero Results" workflow
│  └─ NO → Continue
├─ Can I validate the result independently?
│  ├─ YES → Run validation query
│  │        ├─ Validation confirms result → Report findings
│  │        └─ Validation contradicts → Investigate further
│  └─ NO → Continue
├─ Is the anomaly clearly explained by data?
│  ├─ YES → Report with explanation
│  └─ NO → Continue
├─ Do I need domain knowledge to interpret?
│  ├─ YES → Ask user for context
│  │        Example: "The error rate is 15%. Is this expected for your frontend?"
│  └─ NO → Continue
└─ Is the issue ambiguous or requires clarification?
   ├─ YES → Ask specific question with data context
   │        Example: "I see two frontends named 'web-app'. Which frontend name should I use?"
   └─ NO → Investigate and report findings with caveats
查询返回异常结果
├─ 是否是零结果场景?
│  ├─ 是 → 遵循"处理零结果"流程
│  └─ 否 → 继续
├─ 我能否独立验证结果?
│  ├─ 是 → 运行验证查询
│  │        ├─ 验证确认结果 → 上报发现
│  │        └─ 验证结果矛盾 → 进一步调查
│  └─ 否 → 继续
├─ 异常是否能被数据清晰解释?
│  ├─ 是 → 附带解释上报
│  └─ 否 → 继续
├─ 我是否需要领域知识来解读?
│  ├─ 是 → 向用户请求上下文
│  │        示例:"错误率为15%,这对您的前端来说是否属于预期情况?"
│  └─ 否 → 继续
└─ 问题是否模糊或需要澄清?
   ├─ 是 → 结合数据上下文提出具体问题
   │        示例:"我发现有两个名为'web-app'的前端,我应该使用哪个前端名称?"
   └─ 否 → 调查并附带说明上报结果
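The "run validation query" branch can be as simple as cross-checking an event count against the corresponding metric over the same timeframe. A sketch (the event.type value is an assumption; inspect the actual values in your environment first):

```dql
// Count error events directly
fetch user.events
| filter event.type == "error"   // assumed value; verify event.type values
| summarize event_count = count()

// Compare with the error metric; totals should be in the same ballpark,
// though differing aggregation methods can cause small mismatches
timeseries errors = sum(dt.frontend.error.count)
```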

Common Investigation Steps

常用调查步骤

For Performance Issues:
  1. Compare to baseline: Query same metric for previous week
  2. Segment by dimension: Break down by device, browser, geography
  3. Check for outliers: Use percentiles (p50, p95, p99) vs. averages
  4. Correlate with deployments: Filter by app version or time windows
For Data Availability Issues:
  1. Start broad: Query all RUM data without filters
  2. Add filters incrementally: Isolate which filter eliminates data
  3. Check related metrics: If events missing, try timeseries
  4. Validate entity relationships: Confirm frontend-to-service links
For Unexpected Patterns:
  1. Expand timeframe: Look for historical context
  2. Cross-reference data sources: Compare events and metrics
  3. Check sampling: Verify no sampling is affecting results
  4. Consider external factors: Holidays, outages, traffic changes
性能问题排查:
  1. 与基线对比:查询上一周的相同指标
  2. 按维度拆分:按设备、浏览器、地域拆解
  3. 检查异常值:使用分位数(p50、p95、p99)而非平均值
  4. 与部署关联:按应用版本或时间窗口过滤
数据可用性问题排查:
  1. 从宽泛查询开始:不加过滤器查询所有RUM数据
  2. 增量添加过滤器:定位哪个过滤器排除了数据
  3. 检查相关指标:如果事件缺失,尝试查询时序数据
  4. 验证实体关系:确认前端与服务的关联
异常模式排查:
  1. 扩大时间范围:查看历史上下文
  2. 交叉对比数据源:对比事件和指标
  3. 检查采样:确认无采样影响结果
  4. 考虑外部因素:节假日、故障、流量变化
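Steps 2 and 3 of the performance workflow above might look like the following. This is a sketch, assuming DQL supports percentile() inside timeseries and that browser.name is the dimension field in your environment:

```dql
// Segment by dimension to localize a regression
timeseries avg(dt.frontend.request.duration), by: { browser.name }

// Percentiles expose outliers that averages hide
timeseries p50 = percentile(dt.frontend.request.duration, 50),
           p95 = percentile(dt.frontend.request.duration, 95),
           p99 = percentile(dt.frontend.request.duration, 99)
```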

Red Flags: When to Stop and Ask

红色警报:何时停止并询问用户

Always ask the user when:
  • ❌ No RUM data exists anywhere in the environment
  • ❌ Multiple frontends match the user's description
  • ❌ Results explicitly contradict the user's stated expectations
  • ❌ Data suggests monitoring is misconfigured
  • ❌ Query requires business context (e.g., "acceptable error rate")
  • ❌ Timeframe is ambiguous and affects interpretation significantly
Example clarifying questions:
  • "I found two frontends named 'checkout'. Which one: checkout-web or checkout-mobile?"
  • "The query returns 0 results for the past hour. Should I expand the timeframe, or do you expect real-time data?"
  • "The average LCP is 8 seconds, which exceeds the 4-second threshold. Is this frontend known to have performance issues?"
  • "I see only synthetic traffic. Should I include dt.rum.user_type='REAL_USER' to focus on real users?"
出现以下情况时始终询问用户:
  • ❌ 环境中完全不存在RUM数据
  • ❌ 多个前端匹配用户的描述
  • ❌ 结果明确与用户的预期矛盾
  • ❌ 数据显示监控配置错误
  • ❌ 查询需要业务上下文(例如"可接受的错误率")
  • ❌ 时间范围模糊且会显著影响解读
澄清问题示例:
  • "我找到了两个名为'checkout'的前端,应该用哪个:checkout-web 还是 checkout-mobile?"
  • "过去一小时的查询返回0结果,我应该扩大时间范围,还是您期望查询实时数据?"
  • "平均LCP为8秒,超过了4秒的阈值,该前端是否已知存在性能问题?"
  • "我仅看到合成监控流量,是否需要添加 dt.rum.user_type='REAL_USER' 过滤聚焦真实用户?"
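The last clarifying question above corresponds to a session filter like the following; a sketch using the dt.rum.user_type field and REAL_USER value mentioned earlier in this document:

```dql
// Keep only real-user sessions, excluding synthetic monitoring traffic
fetch user.sessions
| filter dt.rum.user_type == "REAL_USER"
| summarize real_sessions = count()
```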

When to Use This Skill

何时使用本Skill

Use frontend-observability skill when:
  • Monitoring web or mobile frontend performance
  • Analyzing Core Web Vitals for SEO
  • Tracking user sessions, engagement, or behavior
  • Analyzing click events and button interactions
  • Debugging frontend errors or slow requests
  • Correlating frontend issues with backend traces
  • Optimizing mobile app startup or crash rates (iOS, Android)
  • Analyzing app version performance
  • Diagnosing UI jank and main thread blocking
  • Analyzing security compliance (CSP violations)
  • Profiling JavaScript performance (long tasks)
Do NOT use for:
  • Backend service monitoring (use services skill)
  • Infrastructure metrics (use infrastructure skill)
  • Log analysis (use logs skill)
  • Business process monitoring (use business-events skill)
符合以下场景时使用前端可观测性Skill:
  • 监控网页或移动端前端性能
  • 分析Core Web Vitals用于SEO优化
  • 追踪用户会话、参与度或行为
  • 分析点击事件和按钮交互
  • 调试前端错误或缓慢请求
  • 关联前端问题与后端链路
  • 优化移动应用启动速度或崩溃率(iOS、Android)
  • 分析应用版本性能
  • 诊断UI卡顿和主线程阻塞
  • 分析安全合规性(CSP违规)
  • 剖析JavaScript性能(长任务)
请勿用于以下场景:
  • 后端服务监控(使用服务Skill)
  • 基础设施指标(使用基础设施Skill)
  • 日志分析(使用日志Skill)
  • 业务流程监控(使用业务事件Skill)

Progressive Disclosure

渐进式披露

Always Available

始终可用

  • FrontendBasics.md - RUM fundamentals and quick reference
  • FrontendBasics.md - RUM基础和快速参考

Loaded by Workflow

按工作流加载

  • Web Performance: WebVitals.md, performance-analysis.md
  • User Behavior: user-sessions.md, performance-analysis.md
  • Error Analysis: error-tracking.md, performance-analysis.md
  • Mobile Apps: mobile-monitoring.md
  • 网页性能:WebVitals.md、performance-analysis.md
  • 用户行为:user-sessions.md、performance-analysis.md
  • 错误分析:error-tracking.md、performance-analysis.md
  • 移动应用:mobile-monitoring.md

Load on Explicit Request

显式请求时加载

  • Advanced diagnostics (long tasks, user actions)
  • Security compliance (CSP violations, visibility tracking)
  • Specialized mobile features (platform-specific phases)
  • 高级诊断(长任务、用户操作)
  • 安全合规(CSP违规、可见性追踪)
  • 移动端特有功能(平台特有阶段)

Reference Files

参考文件

Core Reference Documents

核心参考文档

  • references/WebVitals.md - Core Web Vitals monitoring
  • references/user-sessions.md - Session and user analytics
  • references/error-tracking.md - Error analysis and debugging
  • references/mobile-monitoring.md - Mobile app performance and crashes
  • references/performance-analysis.md - Advanced performance diagnostics
  • references/WebVitals.md - Core Web Vitals监控
  • references/user-sessions.md - 会话和用户分析
  • references/error-tracking.md - 错误分析与调试
  • references/mobile-monitoring.md - 移动应用性能和崩溃
  • references/performance-analysis.md - 高级性能诊断