dd-audit-cost-spike-investigation

Audit Trail: Cost / Usage Spike Investigation

Identify what caused a Datadog usage spike by correlating billing data with configuration change history.
The causal chain is: someone changed something → that change increased data volume → usage spiked → cost went up. Usage Metering tells you when and what; Audit Trail tells you who made the change.

Prerequisites

bash
pup auth login   # OAuth2 (recommended); covers audit queries

# Usage Metering queries also need DD_API_KEY + DD_APP_KEY
export DD_API_KEY=<your-api-key>
export DD_APP_KEY=<your-app-key>
export DD_SITE=datadoghq.com
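Because the Usage Metering call in Step 1 depends on all three variables, a quick preflight check can save a confusing authentication error. The following is a minimal sketch; `missing_dd_vars` is a hypothetical helper name introduced here for illustration and is not part of pup or the Datadog API.

```bash
# Sketch: list which required Datadog variables are unset or empty.
# missing_dd_vars is a hypothetical helper, not a pup or Datadog command.
missing_dd_vars() {
  missing=""
  for name in DD_API_KEY DD_APP_KEY DD_SITE; do
    # Indirect lookup of the variable named in $name (portable across sh and bash).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="${missing:+$missing }$name"
    fi
  done
  # Prints a space-separated list of missing names, or an empty line if none.
  printf '%s\n' "$missing"
}
```

Calling `missing_dd_vars` before Step 1 and stopping when it prints anything keeps the failure close to its cause.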

Scope Boundary

This skill identifies configuration changes that may have caused a spike. It does not identify which specific user or process submitted the data (e.g., which service sent the LLM spans). For per-submission attribution, use LLM Observability traces or APM instrumentation.

Investigation Workflow

Step 1 — Identify the spike window and product family

bash
START=$(date -u -v-7d +"%Y-%m-%dT%H:%M:%SZ" 2>/dev/null || date -u -d "7 days ago" +"%Y-%m-%dT%H:%M:%SZ")
END=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

curl -s -G "https://api.${DD_SITE}/api/v2/usage/hourly_usage" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  --data-urlencode "filter[timestamp][start]=${START}" \
  --data-urlencode "filter[timestamp][end]=${END}" \
  --data-urlencode "filter[product_families]=all" \
  | jq '[.data[] | {
      timestamp: .attributes.timestamp,
      product: .attributes.product_family,
      measurements: [.attributes.measurements[] | {type: .usage_type, value: .value}]
    }]'
Product families with LLM/AI coverage: llm_observability, bits_ai, logs, apm

Step 2 — Pinpoint the spike

From Step 1, identify the hour/day where volume jumped. Note the timestamp as SPIKE_TIME.
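Scanning a week of hourly rows by eye is error-prone, so the jump can be ranked mechanically. The sketch below assumes the Step 1 output (the array of `timestamp` / `product` / `measurements` objects) is piped in on stdin. `find_spikes` is a hypothetical helper name, and using the per-series median as the baseline is an assumption made here, not something the workflow prescribes.

```bash
# Sketch: rank each (product, usage type) series by peak / median.
# find_spikes is a hypothetical helper; the median baseline is an assumption.
find_spikes() {
  jq '
    # Flatten to one row per (hour, usage type), dropping null values.
    [ .[] | . as $h | .measurements[] | select(.value != null)
      | {product: $h.product, type: .type, timestamp: $h.timestamp, value: .value} ]
    | group_by([.product, .type])
    | map(
        (map(.value) | sort | .[length / 2 | floor]) as $median
        | max_by(.value) as $peak
        | select($median > 0)          # skip series with no usable baseline
        | {product: $peak.product, type: $peak.type,
           spike_time: $peak.timestamp,
           baseline: $median, peak: $peak.value,
           magnitude: ($peak.value / $median)}
      )
    | sort_by(-.magnitude)             # largest relative jump first
  '
}
```

The first element's `spike_time` is a candidate for SPIKE_TIME, and its `baseline`, `peak`, and `magnitude` fields map onto the Volume line of the output format. A series whose median is zero is skipped, so a product that went from no usage to some usage still needs a manual look.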

Step 3 — Search Audit Trail for config changes in the 24h preceding the spike

bash
pup audit-logs search \
  --query "@action:(created OR modified OR deleted)" \
  --from "SPIKE_TIME_MINUS_24H" \
  --to "SPIKE_TIME" \
  --limit 200 \
  -o json \
  | jq '[.data[] | {
      timestamp: .attributes.timestamp,
      user: .attributes.attributes.usr.email,
      actor_type: .attributes.attributes.evt.actor.type,
      action: .attributes.attributes.action,
      event_category: .attributes.attributes.evt.name,
      resource_type: .attributes.attributes.asset.type,
      resource_id: .attributes.attributes.asset.id
    }]'
Note: --from and --to accept ISO timestamps (e.g., 2026-05-01T14:00:00Z) or relative values (1h, 24h, 7d).
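SPIKE_TIME_MINUS_24H in the query above is a placeholder, so it has to be computed from SPIKE_TIME before the command runs. The following sketch shows one way to do that with `date`, trying the GNU form first and falling back to the BSD/macOS form, in the same spirit as the fallback in Step 1. The helper names `to_epoch`, `from_epoch`, and `hours_before` are hypothetical.

```bash
# Sketch: derive the --from value for Step 3 from SPIKE_TIME.
# to_epoch, from_epoch, and hours_before are hypothetical helper names.

# Convert an ISO 8601 UTC timestamp (e.g. 2026-05-01T14:00:00Z) to epoch seconds.
to_epoch() {
  date -u -d "$1" +%s 2>/dev/null \
    || date -u -j -f "%Y-%m-%dT%H:%M:%SZ" "$1" +%s
}

# Convert epoch seconds back to an ISO 8601 UTC timestamp.
from_epoch() {
  date -u -d "@$1" +"%Y-%m-%dT%H:%M:%SZ" 2>/dev/null \
    || date -u -r "$1" +"%Y-%m-%dT%H:%M:%SZ"
}

# hours_before <iso-timestamp> <hours>: print the timestamp that many hours earlier.
hours_before() {
  epoch=$(to_epoch "$1") || return 1
  from_epoch $(( epoch - $2 * 3600 ))
}

SPIKE_TIME="2026-05-01T14:00:00Z"   # example value taken from the note above
SPIKE_TIME_MINUS_24H=$(hours_before "$SPIKE_TIME" 24)
echo "$SPIKE_TIME_MINUS_24H"
```

Going through epoch seconds avoids relying on how a particular `date` implementation parses relative expressions appended to a timestamp. The same helper covers a wider lookback, for example `hours_before "$SPIKE_TIME" 72`.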

Step 4 — Narrow to product-relevant config changes

Filter to the audit categories most likely to affect the spiking product:

| If this product spiked | Add to query |
|---|---|
| llm_observability | @evt.name:(Integration OR APM OR "Log Management") |
| logs / indexed_logs | @evt.name:"Log Management" @asset.type:(pipeline OR index OR exclusion_filter) |
| apm / indexed_spans | @evt.name:APM @asset.type:(retention_filter OR sampling_rate) |
| rum | @evt.name:RUM |
| metrics | @evt.name:Metrics |

Example for LLM Observability spike:
bash
pup audit-logs search \
  --query "@evt.name:(Integration OR APM OR \"Log Management\") @action:(created OR modified)" \
  --from "SPIKE_TIME_MINUS_24H" \
  --to "SPIKE_TIME" \
  --limit 100 \
  -o json \
  | jq '[.data[] | {
      timestamp: .attributes.timestamp,
      user: .attributes.attributes.usr.email,
      action: .attributes.attributes.action,
      category: .attributes.attributes.evt.name,
      resource_type: .attributes.attributes.asset.type,
      resource_id: .attributes.attributes.asset.id
    }]'
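When this workflow is scripted, the product-to-filter mapping from Step 4 can live in one place. The sketch below encodes only the filters listed in this step; `audit_filter_for_product` is a hypothetical helper name, and any product family not listed returns a non-zero status rather than a guessed filter.

```bash
# Sketch: map a spiking product family to its Step 4 audit filter.
# audit_filter_for_product is a hypothetical helper; the filter strings
# are copied from the Step 4 mapping, nothing else is assumed.
audit_filter_for_product() {
  case "$1" in
    llm_observability)
      printf '%s\n' '@evt.name:(Integration OR APM OR "Log Management")' ;;
    logs|indexed_logs)
      printf '%s\n' '@evt.name:"Log Management" @asset.type:(pipeline OR index OR exclusion_filter)' ;;
    apm|indexed_spans)
      printf '%s\n' '@evt.name:APM @asset.type:(retention_filter OR sampling_rate)' ;;
    rum)
      printf '%s\n' '@evt.name:RUM' ;;
    metrics)
      printf '%s\n' '@evt.name:Metrics' ;;
    *)
      # Unknown family: fall back to the unfiltered Step 3 query instead of guessing.
      return 1 ;;
  esac
}

# Usage (not executed here):
#   pup audit-logs search \
#     --query "$(audit_filter_for_product rum) @action:(created OR modified)" ...
```

Passing the filter through command substitution inside double quotes keeps the embedded quotes around "Log Management" intact, so no manual escaping is needed.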

Output Format

Usage spike detected:
  Product: <product_family>
  Spike time: <SPIKE_TIME>
  Volume: <baseline> → <spike_value> (<magnitude>×)

Configuration changes in 24h preceding spike:
  <timestamp> | <user_email> | <action> <resource_type> <resource_id> | <category>

Likely causal change: <most-proximate change matching the product family>

Confidence: HIGH (single clear change) / MEDIUM (multiple candidates) / LOW (no matching changes)

Next steps:
  - Confirm with <user_email> whether the change was intentional
  - If unintentional: revert <resource_id> and monitor volume
  - If intentional: update cost forecasts and alert thresholds
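The confidence line in the template follows directly from how many audit events matched the product family. A minimal sketch of that rule, assuming the count comes from something like `jq 'length'` on the Step 4 output; `confidence_for_count` is a hypothetical helper name.

```bash
# Sketch: turn the number of matching config changes into the confidence label.
# confidence_for_count is a hypothetical helper name.
confidence_for_count() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;        # reject empty or non-numeric input
    0) echo "LOW" ;;                # no matching changes
    1) echo "HIGH" ;;               # single clear change
    *) echo "MEDIUM" ;;             # multiple candidates
  esac
}
```

A count alone cannot tell whether a single change is actually the cause, so HIGH here means only that there is one candidate to confirm with its author.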

When No Causal Change Is Found

  1. The change may predate the 24h window: rerun Step 3 with a 72h lookback.
  2. The increase may come from application-side instrumentation changes: check deploys around SPIKE_TIME.
  3. The increase may be organic traffic growth: correlate with a product launch or traffic event.
