auditing-warehouse-data-health

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Auditing data warehouse health

审计数据仓库健康状况

This skill produces a project-wide audit of the data warehouse pipeline. Use it when the user wants a summary of everything broken, not a deep-dive on one sync. The deep-dive on individual failures is

diagnosing-failed-warehouse-syncs

; this skill is the scan that tells them where to look first.

此技能可对数据仓库管道进行全项目范围的审计。当用户需要所有损坏项的汇总，而非对单个同步进行深入排查时使用。针对单个故障的深入排查技能为

diagnosing-failed-warehouse-syncs

；本技能是用于指导用户优先排查方向的扫描工具。

When to use this skill

何时使用此技能

"What's broken in my warehouse?" / "Give me a health check"
"Audit my data pipeline"
The user is new to a project and wants to know what they've inherited
Weekly or monthly review of pipeline health
Dashboards are stale and the user isn't sure which source is at fault

“我的仓库里有什么损坏的？” / “帮我做健康检查”
“审计我的数据管道”
用户刚接手项目，想要了解当前的遗留状况
每周或每月的管道健康检查
仪表板数据过时，用户不确定是哪个数据源出了问题

Available tools

可用工具

Tool	Purpose
`data-warehouse-data-health-issues-retrieve`	One-shot: all failed/degraded items across the whole pipeline
`external-data-sources-list`	All sources with status and latest error
`external-data-schemas-list`	All schemas with status, last_synced_at, latest_error
`view-list`	All saved queries / materialized views with status and latest_error
`view-run-history`	Run history for a specific materialized view
`external-data-sources-webhook-info-retrieve`	Check per-source webhook state (not covered by data-health-issues)

The

data-health-issues

endpoint already aggregates across materializations, sync schemas, sources, batch export destinations, and transformations — it's the fastest path to a summary. Use the list endpoints when you need more context than the summary provides (row counts, non-failing items, schema-level detail).

Tool	用途
`data-warehouse-data-health-issues-retrieve`	一次性获取：整个管道中所有失败/降级的项
`external-data-sources-list`	获取所有数据源及其状态和最新错误信息
`external-data-schemas-list`	获取所有模式及其状态、last_synced_at、最新错误信息
`view-list`	获取所有已保存查询/materialized views及其状态和最新错误信息
`view-run-history`	获取特定materialized view的运行历史
`external-data-sources-webhook-info-retrieve`	检查各数据源的webhook状态（data-health-issues未覆盖此内容）

data-health-issues

端点已汇总了materializations、sync schemas、sources、batch export destinations和transformations的相关信息——这是获取汇总结果的最快途径。当你需要比汇总结果更详细的上下文信息（如行数、未失败项、模式级细节）时，可使用列表类端点。

What counts as an "issue"

什么属于“问题”

The data-health endpoint returns items from five categories:

`type`	Trigger	Typical urgency
`source`	`ExternalDataSource.status = Error` — whole source connection broken	High
`external_data_sync`	schema in Failed or BillingLimitReached state (the data-health endpoint returns `status: "failed"` or `status: "billing_limit"` respectively)	Medium–High
`materialized_view`	`DataWarehouseSavedQuery.is_materialized=true, status=Failed`	Medium
`destination`	Batch export's latest run is FAILED / FAILED_RETRYABLE / TIMEDOUT / TERMINATED	Medium
`transformation`	HogFunction transformation in DISABLED / DEGRADED / FORCEFULLY_* state	Low–Medium

Each entry includes

id

name

type

status

error

failed_at

url

, and (for syncs/sources)

source_type

Note the data-health endpoint only reports active failures. It doesn't flag:

Schemas paused by the user (
```
should_sync = false
```
)
Non-materialized views with errors (only materialized views are reported)
Schemas that are slow or stale but technically
```
Completed
```
Webhook problems on
sync_type: "webhook"
schemas. The bulk-sync safety net can succeed while the webhook push channel is silently broken (deregistered, disabled on the remote side, failing signature verification). These don't surface in
```
data-health-issues
```
— check per-source with
```
webhook-info-retrieve
```
.

If the user asks about staleness or unused items, reach beyond this endpoint — see Step 4.

数据健康端点返回五类项的相关问题：

`type`	触发条件	典型紧急程度
`source`	`ExternalDataSource.status = Error` — 整个数据源连接损坏	高
`external_data_sync`	schema处于Failed或BillingLimitReached状态（data-health端点分别返回 `status: "failed"` 或 `status: "billing_limit"` ）	中–高
`materialized_view`	`DataWarehouseSavedQuery.is_materialized=true, status=Failed`	中
`destination`	Batch export的最新运行状态为FAILED / FAILED_RETRYABLE / TIMEDOUT / TERMINATED	中
`transformation`	HogFunction transformation处于DISABLED / DEGRADED / FORCEFULLY__*状态	低–中

每条记录包含

id

、

name

、

type

、

status

、

error

、

failed_at

、

url

，以及（针对syncs/sources）

source_type

。

注意：data-health端点仅报告当前活跃故障。它不会标记以下情况：

用户暂停的schemas（
```
should_sync = false
```
）
存在错误的Non-materialized views（仅报告materialized views）
运行缓慢或数据过时但状态显示为
```
Completed
```
的schemas
sync_type: "webhook"
类型schemas的Webhook问题：批量同步的安全机制可能成功运行，但webhook推送通道已静默损坏（如注销、远程端禁用、签名验证失败）。这些问题不会在
```
data-health-issues
```
中显示——需通过
```
webhook-info-retrieve
```
逐个检查数据源。

如果用户询问数据过时或未使用项的问题，需超出此端点的范围进行检查——详见步骤4。

Workflow

工作流程

Step 1 — One-shot pull

步骤1：一次性拉取数据

Call

data-warehouse-data-health-issues-retrieve

. This returns every actively failing item in one request.

If the response is empty, tell the user their pipeline is healthy and stop. Don't invent problems.

调用

data-warehouse-data-health-issues-retrieve

。此请求将返回所有当前活跃的故障项。

如果响应为空，告知用户其管道运行正常并结束流程。不要虚构问题。

Step 2 — Group and prioritize

步骤2：分组并确定优先级

Group the issues by

type

and sort within each group by severity:

Sources in Error first. A source failure cascades — every schema under it is effectively dead until the source reconnects. Fix these first.
Sync schemas next, in this order:
- ```
status: "billing_limit"
```
  entries (billing issue, non-technical — flag and route to billing)
- ```
Failed
```
  on heavily-used tables (user asks / check row counts via schemas-list if needed)
- ```
Failed
```
  on less-used tables
Materialized views. Usually independent of sources — a view failure is a HogQL or data issue in the view itself.
Batch export destinations. Affect data going out of PostHog — important but generally not blocking reads.
Transformations. Affect ingestion. Flag separately since these are HogFunction issues, not warehouse syncs.

按

type

对问题进行分组，并在每组内按严重程度排序：

优先处理处于Error状态的Sources：数据源故障会引发连锁反应——在数据源重新连接前，其下所有schema都会失效。需优先修复这些问题。
其次处理Sync schemas，顺序如下：
- ```
status: "billing_limit"
```
  的记录（计费问题，非技术类——标记并转至计费部门处理）
- 高频使用表的
```
Failed
```
  状态（用户询问时/必要时通过schemas-list检查行数）
- 低频使用表的
```
Failed
```
  状态
Materialized views：通常独立于数据源——视图故障是视图自身的HogQL或数据问题。
Batch export destinations：影响从PostHog导出的数据——重要但通常不会阻塞读取操作。
Transformations：影响数据摄入。需单独标记，因为这些是HogFunction问题，而非仓库同步问题。

Step 3 — Present the audit

步骤3：展示审计结果

Render a prioritized report. Don't dump the raw JSON — human-readable table per category:

text

undefined

生成优先级报告。不要直接输出原始JSON——按类别生成易读的表格：

text

undefined

Data warehouse health — 7 issues

🔴 Sources (1)

Stripe — authentication failed (failed 2h ago) →
```
diagnosing-failed-warehouse-syncs
```
on this source

Stripe — authentication failed (failed 2h ago) →
```
diagnosing-failed-warehouse-syncs
```
on this source

🟠 Sync schemas (3)

postgres_prod.orders (Failed 6h ago) — column "updated_at" does not exist
postgres_prod.invoices (Failed 6h ago) — column "updated_at" does not exist
hubspot.contacts (BillingLimitReached) — team quota exceeded

postgres_prod.orders (Failed 6h ago) — column "updated_at" does not exist
postgres_prod.invoices (Failed 6h ago) — column "updated_at" does not exist
hubspot.contacts (BillingLimitReached) — team quota exceeded

🟠 Materialized views (2)

monthly_revenue — view failed (syntax error in HogQL)
active_users_30d — view failed (missing table reference)

monthly_revenue — view failed (syntax error in HogQL)
active_users_30d — view failed (missing table reference)

🟡 Destinations (1)

S3 export "daily-events" (FAILED_RETRYABLE 3 runs in a row)

Recommended order:

Stripe auth (everything under it is dead)
Schema-drift on postgres_prod.orders / invoices — looks like upstream renamed a column
Billing limit on hubspot
Materialized views (independent — can be tackled any time)


The exact format is less important than: prioritized, grouped, actionable, and hinting at the right next skill.

S3 export "daily-events" (FAILED_RETRYABLE 3 runs in a row)

Recommended order:

Stripe auth (everything under it is dead)
Schema-drift on postgres_prod.orders / invoices — looks like upstream renamed a column
Billing limit on hubspot
Materialized views (independent — can be tackled any time)


格式并非关键，重点在于：按优先级排序、分组展示、具备可操作性，并提示正确的后续技能。

Step 4 — Go beyond active failures (when asked)

步骤4：超出活跃故障范围检查（用户要求时）

If the user wants more than just "what's on fire" — e.g. "what else should I look at?" — cross-check:

Stale but "Completed" schemas: Call

external-data-schemas-list

and look for schemas with old

last_synced_at

relative to their

sync_frequency

. A schema on

1hour

frequency that last synced 3 days ago is effectively broken even if status says

Completed

Unused materialized views: Call

view-list

. Materialized views cost storage and compute every run. If any are marked materialized but haven't been queried lately, surface them —

cleaning-up-stale-warehouse-views

territory (not yet implemented, but the data is available).

Sources with zero sync activity: Sources where every schema has

should_sync: false

status = Paused

. These were set up and then abandoned — candidates for cleanup via

external-data-sources-destroy

Broken webhooks on webhook-type schemas: Iterate the sources that have any schema with

sync_type: "webhook"

(visible via

external-data-schemas-list

). For each, call

external-data-sources-webhook-info-retrieve({source_id})

```
exists: false
```
while a schema is
```
sync_type: "webhook"
```
→ webhook was never registered, or was deleted. Push channel is dead; only the bulk fallback is ingesting.
```
external_status.error
```
present → remote service is reporting a problem (permission revoked, endpoint deleted on their dashboard).
```
external_status.status
```
not
```
"enabled"
```
→ remote has disabled the endpoint (often after repeated delivery failures).

Report these separately from the primary audit — they're a different shape of problem than the other categories, and the fix is a different skill (

diagnosing-failed-warehouse-syncs

scenario I, or

setting-up-a-data-warehouse-source

step 5.5).

Only run these extra checks if the user explicitly asks for a broader audit — they involve more tool calls and heuristics.

如果用户需要的不仅仅是“当前紧急故障”——例如“我还应该检查什么？”——需进行交叉检查：

状态为"Completed"但数据过时的schemas：调用

external-data-schemas-list

，查找

last_synced_at

远早于其

sync_frequency

的schemas。例如，同步频率为

1hour

但最后一次同步在3天前的schemas，即使状态显示为

Completed

，实际上已失效。

未使用的materialized views：调用

view-list

。Materialized views每次运行都会消耗存储和计算资源。如果有标记为materialized但近期未被查询的视图，需指出这些视图——属于

cleaning-up-stale-warehouse-views

技能的处理范围（尚未实现，但相关数据已可获取）。

无同步活动的Sources：所有schema均为

should_sync: false

或

status = Paused

状态的数据源。这些数据源已被设置后弃用——可通过

external-data-sources-destroy

进行清理。

webhook类型schemas的损坏Webhook：遍历包含

sync_type: "webhook"

类型schema的数据源（可通过

external-data-schemas-list

查看）。对每个此类数据源，调用

external-data-sources-webhook-info-retrieve({source_id})

：

当schema为
```
sync_type: "webhook"
```
但
```
exists: false
```
→ Webhook从未注册或已被删除。推送通道已失效；仅批量同步的 fallback 机制在摄入数据。
存在
```
external_status.error
```
→ 远程服务报告问题（如权限被撤销、端点在其控制台被删除）。
```
external_status.status
```
不为
```
"enabled"
```
→ 远程端已禁用该端点（通常在多次投递失败后）。

将这些问题与主审计报告分开呈现——它们与其他类别的问题性质不同，修复时需使用不同的技能（

diagnosing-failed-warehouse-syncs

场景I，或

setting-up-a-data-warehouse-source

步骤5.5）。

仅当用户明确要求进行更全面的审计时，才执行这些额外检查——它们涉及更多工具调用和启发式判断。

Step 5 — Offer the next step

步骤5：提供后续步骤建议

End the audit with a clear hand-off:

"Want me to dig into the Stripe failure?" → hands off to
```
diagnosing-failed-warehouse-syncs
```
"Want me to fix the schema drift on orders?" → hands off to
```
tuning-incremental-sync-config
```
"Want to disable the billing-capped schemas?" → one-click via
```
external-data-schemas-partial-update
```

Never start applying fixes autonomously from an audit — the audit's job is to report and recommend, not remediate. Any fix should be confirmed explicitly before executing.

审计结束时，需明确提供后续操作选项：

“需要我深入排查Stripe故障吗？” → 转至
```
diagnosing-failed-warehouse-syncs
```
技能
“需要我修复orders的schema漂移问题吗？” → 转至
```
tuning-incremental-sync-config
```
技能
“需要禁用受计费限制的schemas吗？” → 通过
```
external-data-schemas-partial-update
```
一键操作

切勿在审计后自主开始修复操作——审计的职责是报告和建议，而非修复。任何修复操作都需在用户明确确认后执行。

Example interaction

示例交互

text

User: "Check the health of our data warehouse"

Agent:
- data-warehouse-data-health-issues-retrieve → 4 items
   [{type: "source", name: "Stripe", error: "401 Unauthorized", ...},
    {type: "external_data_sync", name: "postgres_prod.orders",
     error: "column \"updated_at\" does not exist", ...},
    {type: "materialized_view", name: "monthly_revenue",
     error: "syntax error at or near \"FORM\"", ...},
    {type: "destination", name: "S3 daily-events", error: "AccessDenied", ...}]

- Report:
  "Found 4 issues in your data warehouse:

   🔴 Sources (1)
   - Stripe — auth failure 2h ago (401). All 8 tables under it are currently dead.

   🟠 Sync schemas (1)
   - postgres_prod.orders — column `updated_at` was dropped on the source side,
     incremental sync can't find it.

   🟠 Materialized views (1)
   - monthly_revenue — HogQL syntax error ('FORM' instead of 'FROM').

   🟡 Destinations (1)
   - S3 export 'daily-events' — access denied on write.

   Suggest fixing in this order:
   1. Stripe credentials (unblocks everything under it)
   2. orders incremental-field drift
   3. monthly_revenue typo
   4. S3 permissions

   Want me to start on Stripe?"

text

User: "Check the health of our data warehouse"

Agent:
- data-warehouse-data-health-issues-retrieve → 4 items
   [{type: "source", name: "Stripe", error: "401 Unauthorized", ...},
    {type: "external_data_sync", name: "postgres_prod.orders",
     error: "column \"updated_at\" does not exist", ...},
    {type: "materialized_view", name: "monthly_revenue",
     error: "syntax error at or near \"FORM\"", ...},
    {type: "destination", name: "S3 daily-events", error: "AccessDenied", ...}]

- Report:
  "Found 4 issues in your data warehouse:

   🔴 Sources (1)
   - Stripe — auth failure 2h ago (401). All 8 tables under it are currently dead.

   🟠 Sync schemas (1)
   - postgres_prod.orders — column `updated_at` was dropped on the source side,
     incremental sync can't find it.

   🟠 Materialized views (1)
   - monthly_revenue — HogQL syntax error ('FORM' instead of 'FROM').

   🟡 Destinations (1)
   - S3 export 'daily-events' — access denied on write.

   Suggest fixing in this order:
   1. Stripe credentials (unblocks everything under it)
   2. orders incremental-field drift
   3. monthly_revenue typo
   4. S3 permissions

   Want me to start on Stripe?"

Important notes

重要注意事项

The audit is read-only. Never call destructive tools from the audit flow. Hand off to the diagnosis/tuning skills — which in turn confirm before acting.
Empty = healthy. Don't pad an empty audit with hypothetical issues. "No issues found" is a good answer.
Source failures cascade. When reporting a source in Error, also mention which schemas under it are affected (or will be, once they try to sync again). The user needs to understand the blast radius.
Billing limits aren't technical problems. Flag them but route to billing / quota discussion, not to a recovery action.
Transformation issues are separate. HogFunctions aren't warehouse syncs — they show up in the audit because they're part of the broader pipeline, but they live in the
```
posthog
```
ingestion side. Route those to pipeline skills rather than trying to fix in-place here.
data-health-issues
only surfaces active failures. For staleness, unused views, or abandoned sources, you need to cross-check the list endpoints. Only do this when the user explicitly asks for a deeper audit.
Webhook health is separate from schema health. The data-health endpoint doesn't know about webhook state. When a user's request mentions "real-time", "Stripe webhook", or "why is data hours behind on a webhook source", go straight to
```
webhook-info-retrieve
```
rather than inferring from schema status.

审计为只读操作：切勿在审计流程中调用破坏性工具。转至诊断/调优技能处理——这些技能会在执行操作前先确认用户意愿。
无结果即健康：不要在空审计结果中添加假设问题。“未发现问题”就是合适的答复。
数据源故障会连锁扩散：报告处于Error状态的数据源时，需同时提及受影响的schema（或再次尝试同步时会受影响的schema）。用户需要了解故障影响范围。
计费限制不属于技术问题：标记此类问题，但需转至计费/配额讨论，而非执行恢复操作。
Transformation问题需单独处理：HogFunctions不属于仓库同步——它们出现在审计结果中是因为属于更广泛的管道的一部分，但实际位于
```
posthog
```
数据摄入端。需将这些问题转至管道技能处理，而非在此处尝试修复。
data-health-issues
仅显示活跃故障：对于数据过时、未使用视图或已弃用数据源，需交叉检查列表类端点。仅当用户明确要求深入审计时才执行此操作。
Webhook健康状况与schema健康状况分离：data-health端点无法获取webhook状态。当用户请求提及“实时”“Stripe webhook”或“为什么webhook数据源的数据滞后数小时”时，直接调用
```
webhook-info-retrieve
```
，而非通过schema状态推断。