dd-audit-cost-spike-investigation

Audit Trail: Cost / Usage Spike Investigation

Identify what caused a Datadog usage spike by correlating billing data with configuration change history.

The causal chain is: someone changed something → that change increased data volume → usage spiked → cost went up. Usage Metering tells you when and what; Audit Trail tells you who made the change.

Prerequisites

bash

pup auth login   # OAuth2 (recommended) — covers audit queries

# Usage Metering queries also need DD_API_KEY + DD_APP_KEY
export DD_API_KEY=<your-api-key>
export DD_APP_KEY=<your-app-key>
export DD_SITE=datadoghq.com
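
Because the Usage Metering queries fail without these variables, a small preflight check can catch a missing key before any request is sent. This is a minimal sketch; `require_env` is a hypothetical helper name, not part of `pup` or any Datadog tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every named variable that is unset or empty.
require_env() {
  local missing=0 name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Usage: stop early if credentials are absent.
# require_env DD_API_KEY DD_APP_KEY DD_SITE || exit 1
```

The function returns non-zero when at least one variable is missing, so it can gate the rest of a script with `|| exit 1`.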

Scope Boundary

This skill identifies configuration changes that may have caused a spike. It does not identify which specific user or process submitted the data (e.g., which service sent the LLM spans). For per-submission attribution, use LLM Observability traces or APM instrumentation.

Investigation Workflow

Step 1 — Identify the spike window and product family

bash

START=$(date -u -v-7d +"%Y-%m-%dT%H:%M:%SZ" 2>/dev/null || date -u -d "7 days ago" +"%Y-%m-%dT%H:%M:%SZ")
END=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

curl -s -G "https://api.${DD_SITE}/api/v2/usage/hourly_usage" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  --data-urlencode "filter[timestamp][start]=${START}" \
  --data-urlencode "filter[timestamp][end]=${END}" \
  --data-urlencode "filter[product_families]=all" \
  | jq '[.data[] | {
      timestamp: .attributes.timestamp,
      product: .attributes.product_family,
      measurements: [.attributes.measurements[] | {type: .usage_type, value: .value}]
    }]'

Product families with LLM/AI coverage: `llm_observability`, `bits_ai`, `logs`, `apm`.

Step 2 — Pinpoint the spike

From the Step 1 output, identify the hour or day where volume jumped. Record that timestamp as `SPIKE_TIME`.
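
Spotting the jump by eye gets tedious over a 7-day hourly series. The sketch below assumes the Step 1 output has already been flattened to one `timestamp<TAB>value` row per hour for a single product and usage type; `find_spike` is a hypothetical helper, and the sample numbers are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "timestamp<TAB>value" rows in time order on stdin
# and print the row with the largest jump relative to the previous row as
# "timestamp<TAB>previous<TAB>current<TAB>ratio".
find_spike() {
  awk -F'\t' '
    NR > 1 && prev > 0 {
      ratio = $2 / prev
      if (ratio > best) { best = ratio; ts = $1; from = prev; to = $2 }
    }
    { prev = $2 }
    END { if (ts != "") printf "%s\t%s\t%s\t%.1f\n", ts, from, to, best }
  '
}

# Sample data (invented): volume jumps fivefold at 14:00.
printf '%s\t%s\n' \
  2026-05-01T12:00:00Z 1000 \
  2026-05-01T13:00:00Z 1100 \
  2026-05-01T14:00:00Z 5500 \
  2026-05-01T15:00:00Z 5600 | find_spike
# → 2026-05-01T14:00:00Z  1100  5500  5.0   (tab-separated)
```

The first field of the result is a candidate `SPIKE_TIME`; the second and third give the baseline and spike values. An hour-over-hour ratio is only one heuristic, so workloads with strong daily cycles may need a comparison against the same hour on the previous day instead.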

Step 3 — Search Audit Trail for config changes in the 24h preceding the spike

bash

pup audit-logs search \
  --query "@action:(created OR modified OR deleted)" \
  --from "SPIKE_TIME_MINUS_24H" \
  --to "SPIKE_TIME" \
  --limit 200 \
  -o json \
  | jq '[.data[] | {
      timestamp: .attributes.timestamp,
      user: .attributes.attributes.usr.email,
      actor_type: .attributes.attributes.evt.actor.type,
      action: .attributes.attributes.action,
      event_category: .attributes.attributes.evt.name,
      resource_type: .attributes.attributes.asset.type,
      resource_id: .attributes.attributes.asset.id
    }]'

Note: `--from` and `--to` accept ISO timestamps (e.g., `2026-05-01T14:00:00Z`) or relative values (`1h`, `24h`, `7d`).
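
`SPIKE_TIME_MINUS_24H` in the command above is a placeholder, so the actual ISO timestamp has to be computed from `SPIKE_TIME`. A minimal sketch follows, using the same GNU/BSD `date` fallback pattern as Step 1; `hours_before` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: shift an ISO-8601 UTC timestamp back by N hours.
# Tries GNU date first, then falls back to BSD/macOS date.
hours_before() {
  local ts="$1" hours="$2" fmt="%Y-%m-%dT%H:%M:%SZ" epoch
  # Convert to epoch seconds first so the subtraction is plain arithmetic.
  epoch=$(date -u -d "${ts}" +%s 2>/dev/null \
    || date -u -j -f "${fmt}" "${ts}" +%s) || return 1
  epoch=$((epoch - hours * 3600))
  date -u -d "@${epoch}" +"${fmt}" 2>/dev/null || date -u -r "${epoch}" +"${fmt}"
}

SPIKE_TIME="2026-05-01T14:00:00Z"   # example timestamp
SPIKE_TIME_MINUS_24H=$(hours_before "${SPIKE_TIME}" 24)
echo "${SPIKE_TIME_MINUS_24H}"      # → 2026-04-30T14:00:00Z
```

The two variables can then be passed as `--from "${SPIKE_TIME_MINUS_24H}"` and `--to "${SPIKE_TIME}"`. Calling `hours_before "${SPIKE_TIME}" 72` gives the wider lower bound if the 24h window turns up nothing.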

Step 4 — Narrow to product-relevant config changes

Filter to the audit categories most likely to affect the spiking product:

| If this product spiked | Add to query |
| --- | --- |
| `llm_observability` | `@evt.name:(Integration OR APM OR "Log Management")` |
| `logs` / `indexed_logs` | `@evt.name:"Log Management" @asset.type:(pipeline OR index OR exclusion_filter)` |
| `apm` / `indexed_spans` | `@evt.name:APM @asset.type:(retention_filter OR sampling_rate)` |
| `rum` | `@evt.name:RUM` |
| `metrics` | `@evt.name:Metrics` |

Example for LLM Observability spike:

bash

pup audit-logs search \
  --query "@evt.name:(Integration OR APM OR \"Log Management\") @action:(created OR modified)" \
  --from "SPIKE_TIME_MINUS_24H" \
  --to "SPIKE_TIME" \
  --limit 100 \
  -o json \
  | jq '[.data[] | {
      timestamp: .attributes.timestamp,
      user: .attributes.attributes.usr.email,
      action: .attributes.attributes.action,
      category: .attributes.attributes.evt.name,
      resource_type: .attributes.attributes.asset.type,
      resource_id: .attributes.attributes.asset.id
    }]'
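
When this step is scripted, the product-to-filter table above can be encoded as a lookup so the query is built from the product family found in Step 1. This is a sketch only; `audit_filter_for` is a hypothetical helper, and the filter strings are copied from the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a spiking product family to the Audit Trail
# filter from the table. Unknown products produce an empty filter.
audit_filter_for() {
  case "$1" in
    llm_observability) echo '@evt.name:(Integration OR APM OR "Log Management")' ;;
    logs|indexed_logs) echo '@evt.name:"Log Management" @asset.type:(pipeline OR index OR exclusion_filter)' ;;
    apm|indexed_spans) echo '@evt.name:APM @asset.type:(retention_filter OR sampling_rate)' ;;
    rum)               echo '@evt.name:RUM' ;;
    metrics)           echo '@evt.name:Metrics' ;;
    *)                 echo '' ;;
  esac
}

# Build the full query for a logs spike.
QUERY="$(audit_filter_for logs) @action:(created OR modified)"
echo "${QUERY}"
```

Passing `"${QUERY}"` to `--query` avoids the escaped quotes needed when the filter is written inline.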

Output Format

Usage spike detected:
  Product: <product_family>
  Spike time: <SPIKE_TIME>
  Volume: <baseline> → <spike_value> (<magnitude>×)

Configuration changes in 24h preceding spike:
  <timestamp> | <user_email> | <action> <resource_type> <resource_id> | <category>

Likely causal change: <most-proximate change matching the product family>

Confidence: HIGH (single clear change) / MEDIUM (multiple candidates) / LOW (no matching changes)

Next steps:
  - Confirm with <user_email> whether the change was intentional
  - If unintentional: revert <resource_id> and monitor volume
  - If intentional: update cost forecasts and alert thresholds
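
The confidence label in the template follows directly from how many product-matching changes the Step 4 query returned. A minimal sketch of that rule, assuming the count has already been obtained (for example with `jq length` on the Step 4 output); `confidence_for` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn the number of product-matching changes found in
# the window into the confidence label used in the report template.
confidence_for() {
  local matches="$1"
  if [ "${matches}" -eq 1 ]; then
    echo "HIGH"      # single clear change
  elif [ "${matches}" -gt 1 ]; then
    echo "MEDIUM"    # multiple candidates
  else
    echo "LOW"       # no matching changes
  fi
}

confidence_for 3   # prints MEDIUM
```

A count alone cannot tell whether a single change is actually related to the spike, so the label is a starting point for the follow-up with the change author rather than a verdict.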

When No Causal Change Is Found

- The change may predate the 24h window: expand the search to 72h.
- The increase may come from application-side instrumentation changes: check deploy history for the same period.
- The increase may be organic traffic growth: correlate it with a product launch or traffic event.
