auditing-warehouse-data-health
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseAuditing data warehouse health
审计数据仓库健康状况
This skill produces a project-wide audit of the data warehouse pipeline. Use it when the user wants a summary of
everything broken, not a deep-dive on one sync. The deep-dive on individual failures is
; this skill is the scan that tells them where to look first.
diagnosing-failed-warehouse-syncs此技能可对数据仓库管道进行全项目范围的审计。当用户需要所有损坏项的汇总,而非对单个同步进行深入排查时使用。针对单个故障的深入排查技能为;本技能是用于指导用户优先排查方向的扫描工具。
diagnosing-failed-warehouse-syncsWhen to use this skill
何时使用此技能
- "What's broken in my warehouse?" / "Give me a health check"
- "Audit my data pipeline"
- The user is new to a project and wants to know what they've inherited
- Weekly or monthly review of pipeline health
- Dashboards are stale and the user isn't sure which source is at fault
- “我的仓库里有什么损坏的?” / “帮我做健康检查”
- “审计我的数据管道”
- 用户刚接手项目,想要了解当前的遗留状况
- 每周或每月的管道健康检查
- 仪表板数据过时,用户不确定是哪个数据源出了问题
Available tools
可用工具
| Tool | Purpose |
|---|---|
| One-shot: all failed/degraded items across the whole pipeline |
| All sources with status and latest error |
| All schemas with status, last_synced_at, latest_error |
| All saved queries / materialized views with status and latest_error |
| Run history for a specific materialized view |
| Check per-source webhook state (not covered by data-health-issues) |
The endpoint already aggregates across materializations, sync schemas, sources, batch export
destinations, and transformations — it's the fastest path to a summary. Use the list endpoints when you need more
context than the summary provides (row counts, non-failing items, schema-level detail).
data-health-issues| Tool | 用途 |
|---|---|
| 一次性获取:整个管道中所有失败/降级的项 |
| 获取所有数据源及其状态和最新错误信息 |
| 获取所有模式及其状态、last_synced_at、最新错误信息 |
| 获取所有已保存查询/materialized views及其状态和最新错误信息 |
| 获取特定materialized view的运行历史 |
| 检查各数据源的webhook状态(data-health-issues未覆盖此内容) |
data-health-issuesWhat counts as an "issue"
什么属于“问题”
The data-health endpoint returns items from five categories:
| Trigger | Typical urgency |
|---|---|---|
| | High |
| schema in Failed or BillingLimitReached state (the data-health endpoint returns | Medium–High |
| | Medium |
| Batch export's latest run is FAILED / FAILED_RETRYABLE / TIMEDOUT / TERMINATED | Medium |
| HogFunction transformation in DISABLED / DEGRADED / FORCEFULLY_* state | Low–Medium |
Each entry includes , , , , , , , and (for syncs/sources)
.
idnametypestatuserrorfailed_aturlsource_typeNote the data-health endpoint only reports active failures. It doesn't flag:
- Schemas paused by the user ()
should_sync = false - Non-materialized views with errors (only materialized views are reported)
- Schemas that are slow or stale but technically
Completed - Webhook problems on schemas. The bulk-sync safety net can succeed while the webhook push channel is silently broken (deregistered, disabled on the remote side, failing signature verification). These don't surface in
sync_type: "webhook"— check per-source withdata-health-issues.webhook-info-retrieve
If the user asks about staleness or unused items, reach beyond this endpoint — see Step 4.
数据健康端点返回五类项的相关问题:
| 触发条件 | 典型紧急程度 |
|---|---|---|
| | 高 |
| schema处于Failed或BillingLimitReached状态(data-health端点分别返回 | 中–高 |
| | 中 |
| Batch export的最新运行状态为FAILED / FAILED_RETRYABLE / TIMEDOUT / TERMINATED | 中 |
| HogFunction transformation处于DISABLED / DEGRADED / FORCEFULLY__*状态 | 低–中 |
每条记录包含、、、、、、,以及(针对syncs/sources)。
idnametypestatuserrorfailed_aturlsource_type注意:data-health端点仅报告当前活跃故障。它不会标记以下情况:
- 用户暂停的schemas()
should_sync = false - 存在错误的Non-materialized views(仅报告materialized views)
- 运行缓慢或数据过时但状态显示为的schemas
Completed - 类型schemas的Webhook问题:批量同步的安全机制可能成功运行,但webhook推送通道已静默损坏(如注销、远程端禁用、签名验证失败)。这些问题不会在
sync_type: "webhook"中显示——需通过data-health-issues逐个检查数据源。webhook-info-retrieve
如果用户询问数据过时或未使用项的问题,需超出此端点的范围进行检查——详见步骤4。
Workflow
工作流程
Step 1 — One-shot pull
步骤1:一次性拉取数据
Call . This returns every actively failing item in one request.
data-warehouse-data-health-issues-retrieveIf the response is empty, tell the user their pipeline is healthy and stop. Don't invent problems.
调用。此请求将返回所有当前活跃的故障项。
data-warehouse-data-health-issues-retrieve如果响应为空,告知用户其管道运行正常并结束流程。不要虚构问题。
Step 2 — Group and prioritize
步骤2:分组并确定优先级
Group the issues by and sort within each group by severity:
type- Sources in Error first. A source failure cascades — every schema under it is effectively dead until the source reconnects. Fix these first.
- Sync schemas next, in this order:
- entries (billing issue, non-technical — flag and route to billing)
status: "billing_limit" - on heavily-used tables (user asks / check row counts via schemas-list if needed)
Failed - on less-used tables
Failed
- Materialized views. Usually independent of sources — a view failure is a HogQL or data issue in the view itself.
- Batch export destinations. Affect data going out of PostHog — important but generally not blocking reads.
- Transformations. Affect ingestion. Flag separately since these are HogFunction issues, not warehouse syncs.
按对问题进行分组,并在每组内按严重程度排序:
type- 优先处理处于Error状态的Sources:数据源故障会引发连锁反应——在数据源重新连接前,其下所有schema都会失效。需优先修复这些问题。
- 其次处理Sync schemas,顺序如下:
- 的记录(计费问题,非技术类——标记并转至计费部门处理)
status: "billing_limit" - 高频使用表的状态(用户询问时/必要时通过schemas-list检查行数)
Failed - 低频使用表的状态
Failed
- Materialized views:通常独立于数据源——视图故障是视图自身的HogQL或数据问题。
- Batch export destinations:影响从PostHog导出的数据——重要但通常不会阻塞读取操作。
- Transformations:影响数据摄入。需单独标记,因为这些是HogFunction问题,而非仓库同步问题。
Step 3 — Present the audit
步骤3:展示审计结果
Render a prioritized report. Don't dump the raw JSON — human-readable table per category:
text
undefined生成优先级报告。不要直接输出原始JSON——按类别生成易读的表格:
text
undefinedData warehouse health — 7 issues
Data warehouse health — 7 issues
🔴 Sources (1)
🔴 Sources (1)
- Stripe — authentication failed (failed 2h ago)
→ on this source
diagnosing-failed-warehouse-syncs
- Stripe — authentication failed (failed 2h ago)
→ on this source
diagnosing-failed-warehouse-syncs
🟠 Sync schemas (3)
🟠 Sync schemas (3)
- postgres_prod.orders (Failed 6h ago) — column "updated_at" does not exist
- postgres_prod.invoices (Failed 6h ago) — column "updated_at" does not exist
- hubspot.contacts (BillingLimitReached) — team quota exceeded
- postgres_prod.orders (Failed 6h ago) — column "updated_at" does not exist
- postgres_prod.invoices (Failed 6h ago) — column "updated_at" does not exist
- hubspot.contacts (BillingLimitReached) — team quota exceeded
🟠 Materialized views (2)
🟠 Materialized views (2)
- monthly_revenue — view failed (syntax error in HogQL)
- active_users_30d — view failed (missing table reference)
- monthly_revenue — view failed (syntax error in HogQL)
- active_users_30d — view failed (missing table reference)
🟡 Destinations (1)
🟡 Destinations (1)
- S3 export "daily-events" (FAILED_RETRYABLE 3 runs in a row)
Recommended order:
- Stripe auth (everything under it is dead)
- Schema-drift on postgres_prod.orders / invoices — looks like upstream renamed a column
- Billing limit on hubspot
- Materialized views (independent — can be tackled any time)
The exact format is less important than: prioritized, grouped, actionable, and hinting at the right next skill.- S3 export "daily-events" (FAILED_RETRYABLE 3 runs in a row)
Recommended order:
- Stripe auth (everything under it is dead)
- Schema-drift on postgres_prod.orders / invoices — looks like upstream renamed a column
- Billing limit on hubspot
- Materialized views (independent — can be tackled any time)
格式并非关键,重点在于:按优先级排序、分组展示、具备可操作性,并提示正确的后续技能。Step 4 — Go beyond active failures (when asked)
步骤4:超出活跃故障范围检查(用户要求时)
If the user wants more than just "what's on fire" — e.g. "what else should I look at?" — cross-check:
Stale but "Completed" schemas:
Call and look for schemas with old relative to their .
A schema on frequency that last synced 3 days ago is effectively broken even if status says .
external-data-schemas-listlast_synced_atsync_frequency1hourCompletedUnused materialized views:
Call . Materialized views cost storage and compute every run. If any are marked materialized but haven't
been queried lately, surface them — territory (not yet implemented, but the data
is available).
view-listcleaning-up-stale-warehouse-viewsSources with zero sync activity:
Sources where every schema has or . These were set up and then abandoned —
candidates for cleanup via .
should_sync: falsestatus = Pausedexternal-data-sources-destroyBroken webhooks on webhook-type schemas:
Iterate the sources that have any schema with (visible via ). For
each, call :
sync_type: "webhook"external-data-schemas-listexternal-data-sources-webhook-info-retrieve({source_id})- while a schema is
exists: false→ webhook was never registered, or was deleted. Push channel is dead; only the bulk fallback is ingesting.sync_type: "webhook" - present → remote service is reporting a problem (permission revoked, endpoint deleted on their dashboard).
external_status.error - not
external_status.status→ remote has disabled the endpoint (often after repeated delivery failures)."enabled"
Report these separately from the primary audit — they're a different shape of problem than the other categories,
and the fix is a different skill ( scenario I, or
step 5.5).
diagnosing-failed-warehouse-syncssetting-up-a-data-warehouse-sourceOnly run these extra checks if the user explicitly asks for a broader audit — they involve more tool calls and
heuristics.
如果用户需要的不仅仅是“当前紧急故障”——例如“我还应该检查什么?”——需进行交叉检查:
状态为"Completed"但数据过时的schemas:
调用,查找远早于其的schemas。例如,同步频率为但最后一次同步在3天前的schemas,即使状态显示为,实际上已失效。
external-data-schemas-listlast_synced_atsync_frequency1hourCompleted未使用的materialized views:
调用。Materialized views每次运行都会消耗存储和计算资源。如果有标记为materialized但近期未被查询的视图,需指出这些视图——属于技能的处理范围(尚未实现,但相关数据已可获取)。
view-listcleaning-up-stale-warehouse-views无同步活动的Sources:
所有schema均为或状态的数据源。这些数据源已被设置后弃用——可通过进行清理。
should_sync: falsestatus = Pausedexternal-data-sources-destroywebhook类型schemas的损坏Webhook:
遍历包含类型schema的数据源(可通过查看)。对每个此类数据源,调用:
sync_type: "webhook"external-data-schemas-listexternal-data-sources-webhook-info-retrieve({source_id})- 当schema为但
sync_type: "webhook"→ Webhook从未注册或已被删除。推送通道已失效;仅批量同步的 fallback 机制在摄入数据。exists: false - 存在→ 远程服务报告问题(如权限被撤销、端点在其控制台被删除)。
external_status.error - 不为
external_status.status→ 远程端已禁用该端点(通常在多次投递失败后)。"enabled"
将这些问题与主审计报告分开呈现——它们与其他类别的问题性质不同,修复时需使用不同的技能(场景I,或步骤5.5)。
diagnosing-failed-warehouse-syncssetting-up-a-data-warehouse-source仅当用户明确要求进行更全面的审计时,才执行这些额外检查——它们涉及更多工具调用和启发式判断。
Step 5 — Offer the next step
步骤5:提供后续步骤建议
End the audit with a clear hand-off:
- "Want me to dig into the Stripe failure?" → hands off to
diagnosing-failed-warehouse-syncs - "Want me to fix the schema drift on orders?" → hands off to
tuning-incremental-sync-config - "Want to disable the billing-capped schemas?" → one-click via
external-data-schemas-partial-update
Never start applying fixes autonomously from an audit — the audit's job is to report and recommend, not remediate.
Any fix should be confirmed explicitly before executing.
审计结束时,需明确提供后续操作选项:
- “需要我深入排查Stripe故障吗?” → 转至技能
diagnosing-failed-warehouse-syncs - “需要我修复orders的schema漂移问题吗?” → 转至技能
tuning-incremental-sync-config - “需要禁用受计费限制的schemas吗?” → 通过一键操作
external-data-schemas-partial-update
切勿在审计后自主开始修复操作——审计的职责是报告和建议,而非修复。任何修复操作都需在用户明确确认后执行。
Example interaction
示例交互
text
User: "Check the health of our data warehouse"
Agent:
- data-warehouse-data-health-issues-retrieve → 4 items
[{type: "source", name: "Stripe", error: "401 Unauthorized", ...},
{type: "external_data_sync", name: "postgres_prod.orders",
error: "column \"updated_at\" does not exist", ...},
{type: "materialized_view", name: "monthly_revenue",
error: "syntax error at or near \"FORM\"", ...},
{type: "destination", name: "S3 daily-events", error: "AccessDenied", ...}]
- Report:
"Found 4 issues in your data warehouse:
🔴 Sources (1)
- Stripe — auth failure 2h ago (401). All 8 tables under it are currently dead.
🟠 Sync schemas (1)
- postgres_prod.orders — column `updated_at` was dropped on the source side,
incremental sync can't find it.
🟠 Materialized views (1)
- monthly_revenue — HogQL syntax error ('FORM' instead of 'FROM').
🟡 Destinations (1)
- S3 export 'daily-events' — access denied on write.
Suggest fixing in this order:
1. Stripe credentials (unblocks everything under it)
2. orders incremental-field drift
3. monthly_revenue typo
4. S3 permissions
Want me to start on Stripe?"text
User: "Check the health of our data warehouse"
Agent:
- data-warehouse-data-health-issues-retrieve → 4 items
[{type: "source", name: "Stripe", error: "401 Unauthorized", ...},
{type: "external_data_sync", name: "postgres_prod.orders",
error: "column \"updated_at\" does not exist", ...},
{type: "materialized_view", name: "monthly_revenue",
error: "syntax error at or near \"FORM\"", ...},
{type: "destination", name: "S3 daily-events", error: "AccessDenied", ...}]
- Report:
"Found 4 issues in your data warehouse:
🔴 Sources (1)
- Stripe — auth failure 2h ago (401). All 8 tables under it are currently dead.
🟠 Sync schemas (1)
- postgres_prod.orders — column `updated_at` was dropped on the source side,
incremental sync can't find it.
🟠 Materialized views (1)
- monthly_revenue — HogQL syntax error ('FORM' instead of 'FROM').
🟡 Destinations (1)
- S3 export 'daily-events' — access denied on write.
Suggest fixing in this order:
1. Stripe credentials (unblocks everything under it)
2. orders incremental-field drift
3. monthly_revenue typo
4. S3 permissions
Want me to start on Stripe?"Important notes
重要注意事项
- The audit is read-only. Never call destructive tools from the audit flow. Hand off to the diagnosis/tuning skills — which in turn confirm before acting.
- Empty = healthy. Don't pad an empty audit with hypothetical issues. "No issues found" is a good answer.
- Source failures cascade. When reporting a source in Error, also mention which schemas under it are affected (or will be, once they try to sync again). The user needs to understand the blast radius.
- Billing limits aren't technical problems. Flag them but route to billing / quota discussion, not to a recovery action.
- Transformation issues are separate. HogFunctions aren't warehouse syncs — they show up in the audit because
they're part of the broader pipeline, but they live in the ingestion side. Route those to pipeline skills rather than trying to fix in-place here.
posthog - only surfaces active failures. For staleness, unused views, or abandoned sources, you need to cross-check the list endpoints. Only do this when the user explicitly asks for a deeper audit.
data-health-issues - Webhook health is separate from schema health. The data-health endpoint doesn't know about webhook state.
When a user's request mentions "real-time", "Stripe webhook", or "why is data hours behind on a webhook
source", go straight to rather than inferring from schema status.
webhook-info-retrieve
- 审计为只读操作:切勿在审计流程中调用破坏性工具。转至诊断/调优技能处理——这些技能会在执行操作前先确认用户意愿。
- 无结果即健康:不要在空审计结果中添加假设问题。“未发现问题”就是合适的答复。
- 数据源故障会连锁扩散:报告处于Error状态的数据源时,需同时提及受影响的schema(或再次尝试同步时会受影响的schema)。用户需要了解故障影响范围。
- 计费限制不属于技术问题:标记此类问题,但需转至计费/配额讨论,而非执行恢复操作。
- Transformation问题需单独处理:HogFunctions不属于仓库同步——它们出现在审计结果中是因为属于更广泛的管道的一部分,但实际位于数据摄入端。需将这些问题转至管道技能处理,而非在此处尝试修复。
posthog - 仅显示活跃故障:对于数据过时、未使用视图或已弃用数据源,需交叉检查列表类端点。仅当用户明确要求深入审计时才执行此操作。
data-health-issues - Webhook健康状况与schema健康状况分离:data-health端点无法获取webhook状态。当用户请求提及“实时”“Stripe webhook”或“为什么webhook数据源的数据滞后数小时”时,直接调用,而非通过schema状态推断。
webhook-info-retrieve