dd-audit-cost-spike-investigation
Audit Trail: Cost / Usage Spike Investigation
Identify what caused a Datadog usage spike by correlating billing data with configuration change history.
The causal chain is: someone changed something → that change increased data volume → usage spiked → cost went up. Usage Metering tells you when and what; Audit Trail tells you who made the change.
Prerequisites
bash
pup auth login   # OAuth2 (recommended) — covers audit queries

# Usage Metering queries also need DD_API_KEY + DD_APP_KEY
export DD_API_KEY=<your-api-key>
export DD_APP_KEY=<your-app-key>
export DD_SITE=datadoghq.com
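Because the Usage Metering call depends on three environment variables, a quick pre-flight check can save a confusing authentication error later. The sketch below is only an illustration: `check_dd_env` is a hypothetical helper name, not part of pup or any Datadog tooling.

```shell
# Hypothetical helper (not part of pup or Datadog tooling): report which of
# the variables used by the Usage Metering call are unset or empty.
# The function body runs in a subshell, so its variables do not leak.
check_dd_env() (
  missing=0
  for name in DD_API_KEY DD_APP_KEY DD_SITE; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "$value" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  exit "$missing"
)
```

Running `check_dd_env || exit 1` before Step 1 stops the investigation early if any of the three variables is missing.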
Scope Boundary
This skill identifies configuration changes that may have caused a spike. It does not identify which specific user or process submitted the data (e.g., which service sent the LLM spans). For per-submission attribution, use LLM Observability traces or APM instrumentation.
Investigation Workflow
Step 1 — Identify the spike window and product family
bash
# Query window: the last 7 days in UTC (BSD date first, GNU date as fallback)
START=$(date -u -v-7d +"%Y-%m-%dT%H:%M:%SZ" 2>/dev/null || date -u -d "7 days ago" +"%Y-%m-%dT%H:%M:%SZ")
END=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

# Hourly usage for every product family, reduced to timestamp, product, and measurements
curl -s -G "https://api.${DD_SITE}/api/v2/usage/hourly_usage" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  --data-urlencode "filter[timestamp][start]=${START}" \
  --data-urlencode "filter[timestamp][end]=${END}" \
  --data-urlencode "filter[product_families]=all" \
  | jq '[.data[] | {
      timestamp: .attributes.timestamp,
      product: .attributes.product_family,
      measurements: [.attributes.measurements[] | {type: .usage_type, value: .value}]
    }]'

Product families with LLM/AI coverage: `llm_observability`, `bits_ai`, `logs`, `apm`
Step 2 — Pinpoint the spike
From Step 1, identify the hour/day where volume jumped. Note the timestamp as `SPIKE_TIME`.
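Spotting the jump by eye works for a short window, but a small filter can flag the largest hour-over-hour increase. The following is a minimal sketch, not a definitive detector. It assumes the Step 1 output has already been flattened to one `timestamp value` line per hour for a single usage type, sorted oldest first. `find_spike` is a hypothetical helper name.

```shell
# Hypothetical helper: read "timestamp value" lines (oldest first) on stdin
# and print the hour with the largest increase relative to the previous hour
# as: <timestamp> <previous value> <value> <ratio>.
find_spike() {
  awk '
    NR > 1 && prev > 0 {
      ratio = $2 / prev
      if (ratio > best) { best = ratio; ts = $1; base = prev; peak = $2 }
    }
    { prev = $2 }
    END { if (ts != "") printf "%s %s %s %.1f\n", ts, base, peak, best }
  '
}
```

One way to produce that input, assuming each record carries a single measurement, is to pipe the Step 1 result through `jq -r '.[] | [.timestamp, .measurements[0].value] | @tsv' | sort`. Records with a null value would need to be filtered out first. The first field of the output is the candidate `SPIKE_TIME`.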
Step 3 — Search Audit Trail for config changes in the 24h preceding the spike
bash
# Replace SPIKE_TIME_MINUS_24H and SPIKE_TIME with real timestamps before running
pup audit-logs search \
  --query "@action:(created OR modified OR deleted)" \
  --from "SPIKE_TIME_MINUS_24H" \
  --to "SPIKE_TIME" \
  --limit 200 \
  -o json \
  | jq '[.data[] | {
      timestamp: .attributes.timestamp,
      user: .attributes.attributes.usr.email,
      actor_type: .attributes.attributes.evt.actor.type,
      action: .attributes.attributes.action,
      event_category: .attributes.attributes.evt.name,
      resource_type: .attributes.attributes.asset.type,
      resource_id: .attributes.attributes.asset.id
    }]'

Note: `--from` and `--to` accept ISO timestamps (e.g., `2026-05-01T14:00:00Z`) or relative values (`1h`, `24h`, `7d`).
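The `SPIKE_TIME_MINUS_24H` placeholder has to be computed from `SPIKE_TIME`. A minimal sketch of that arithmetic follows, using the same BSD-then-GNU `date` fallback as Step 1. `hours_before` is a hypothetical helper name, and the input is assumed to be a UTC timestamp in the `2026-05-01T14:00:00Z` form.

```shell
# Hypothetical helper: print the UTC timestamp HOURS hours before TS.
# TS must look like 2026-05-01T14:00:00Z.
hours_before() (
  ts="$1"
  hours="$2"
  fmt="%Y-%m-%dT%H:%M:%SZ"
  # Convert TS to epoch seconds: BSD/macOS date first, then GNU date.
  epoch="$(date -u -j -f "$fmt" "$ts" +%s 2>/dev/null || date -u -d "$ts" +%s)" || exit 1
  target=$((epoch - hours * 3600))
  # Convert back to ISO form: BSD/macOS uses -r, GNU date accepts @epoch.
  date -u -r "$target" +"$fmt" 2>/dev/null || date -u -d "@${target}" +"$fmt"
)
```

With this in place, the `--from` value can be written as `"$(hours_before "$SPIKE_TIME" 24)"`, where `SPIKE_TIME` is a shell variable holding the timestamp from Step 2.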
Step 4 — Narrow to product-relevant config changes
Filter to the audit categories most likely to affect the spiking product:
| If this product spiked | Add to query |
|---|---|
| |
| |
| |
| |
| |
Example for LLM Observability spike:
bash
pup audit-logs search \
  --query "@evt.name:(Integration OR APM OR \"Log Management\") @action:(created OR modified)" \
  --from "SPIKE_TIME_MINUS_24H" \
  --to "SPIKE_TIME" \
  --limit 100 \
  -o json \
  | jq '[.data[] | {
      timestamp: .attributes.timestamp,
      user: .attributes.attributes.usr.email,
      action: .attributes.attributes.action,
      category: .attributes.attributes.evt.name,
      resource_type: .attributes.attributes.asset.type,
      resource_id: .attributes.attributes.asset.id
    }]'
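The query in the example follows a fixed shape: a list of event categories joined with `OR`, plus the action filter. A small builder keeps the quoting right when a category name contains a space. This is a sketch under that assumption only. `build_audit_query` is a hypothetical helper, and the categories passed in should come from the table above.

```shell
# Hypothetical helper: build an Audit Trail query from a list of event
# categories, quoting any category name that contains a space.
build_audit_query() (
  terms=""
  for category in "$@"; do
    case "$category" in
      *" "*) category="\"${category}\"" ;;
    esac
    if [ -z "$terms" ]; then
      terms="$category"
    else
      terms="${terms} OR ${category}"
    fi
  done
  printf '@evt.name:(%s) @action:(created OR modified)\n' "$terms"
)
```

Calling `build_audit_query Integration APM "Log Management"` reproduces the query string used in the LLM Observability example, so the result can be passed directly to `--query`.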
Output Format
Usage spike detected:
  Product: <product_family>
  Spike time: <SPIKE_TIME>
  Volume: <baseline> → <spike_value> (<magnitude>×)

Configuration changes in 24h preceding spike:
  <timestamp> | <user_email> | <action> <resource_type> <resource_id> | <category>

Likely causal change: <most-proximate change matching the product family>
Confidence: HIGH (single clear change) / MEDIUM (multiple candidates) / LOW (no matching changes)

Next steps:
  - Confirm with <user_email> whether the change was intentional
  - If unintentional: revert <resource_id> and monitor volume
  - If intentional: update cost forecasts and alert thresholds
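Two fields in the report are derived rather than copied: the magnitude and the confidence label. The sketch below shows one way to compute them. It assumes the confidence rule is driven purely by the number of product-relevant changes found in Step 4. Both function names are hypothetical.

```shell
# Hypothetical helpers for filling in the report template.

# Print spike/baseline as a multiplier with one decimal place,
# or "n/a" when the baseline is zero or negative.
spike_magnitude() {
  awk -v base="$1" -v peak="$2" 'BEGIN {
    if (base <= 0) { print "n/a"; exit }
    printf "%.1f\n", peak / base
  }'
}

# Map the number of product-relevant changes to the confidence label.
confidence_for() {
  case "$1" in
    0) echo "LOW" ;;
    1) echo "HIGH" ;;
    *) echo "MEDIUM" ;;
  esac
}
```

For example, a baseline of 110 and a spike value of 990 with a single matching change would be reported as a 9.0× increase with HIGH confidence.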
When No Causal Change Is Found
- The change may predate the 24h window — expand to 72h
- The increase may be from application-side instrumentation changes — check deploys
- The increase may be organic traffic growth — correlate with product launch or traffic event
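Widening the window can be done step by step so that the narrowest window with any result is the one reported. The following is a rough sketch of that loop. `widen_lookback` is a hypothetical helper, and the search command it calls is assumed to take the look-back size in hours as its only argument and to print nothing when no changes are found.

```shell
# Hypothetical helper: call SEARCH_CMD with each look-back size (in hours)
# until it prints something, then print the window size that produced results.
# Exits non-zero if no window returns anything.
widen_lookback() (
  search_cmd="$1"
  shift
  for hours in "$@"; do
    results="$("$search_cmd" "$hours" || true)"
    if [ -n "$results" ]; then
      echo "$hours"
      exit 0
    fi
  done
  exit 1
)
```

A call such as `widen_lookback my_search 24 48 72` would try each window in turn, where `my_search` stands for a wrapper around the Step 3 `pup audit-logs search` call. If it exits non-zero, the remaining explanations in the list (deploys, organic growth) are the next things to check.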