Setting up a data warehouse source
Use this skill when the user wants to connect an external data source to PostHog's data warehouse for the first time.
The setup has a specific three-step flow (wizard → db-schema → create) — skipping steps leads to failed sources and
confused users.
When to use this skill
- The user wants to connect a new source: "connect Stripe", "import my Postgres orders table", "sync Hubspot contacts"
- The user isn't sure what source types PostHog supports
- The user has credentials but doesn't know how to structure the `schemas` payload
- The user wants guidance on which sync method to pick per table
Available tools
| Tool | Purpose |
|---|---|
| `external-data-sources-wizard` | Discover which source types exist and what fields each needs |
| `external-data-sources-db-schema` | Validate credentials and list tables with available sync methods per table |
| `external-data-sources-create` | Create the source — requires a `schemas[]` built from the db-schema response |
| `external-data-sources-check-cdc-prerequisites-create` | Postgres CDC pre-flight check (optional, only for Postgres CDC) |
| `external-data-sources-webhook-info-retrieve` | Check if a source supports webhooks and whether one has been registered |
| `external-data-sources-create-webhook-create` | Register a webhook with the external service after source creation |
| `external-data-sources-update-webhook-inputs-create` | Supply the signing secret manually when auto-registration failed |
| `external-data-sources-list` | After creation, confirm the source is listed and see its initial status |
| `external-data-schemas-list` | See per-table sync status once the source is created |
The three-step flow
Every source setup follows the same shape. Don't try to shortcut straight to `external-data-sources-create` — you need the db-schema response to build a valid `schemas` payload.

```text
┌────────────────────┐
│ 1. wizard          │  What source types exist? What fields does each need?
└────────┬───────────┘
         ▼
┌────────────────────┐
│ 2. db-schema       │  Validate creds. List tables + available sync methods per table.
└────────┬───────────┘
         ▼
┌────────────────────┐
│ 3. create          │  Send source_type + credentials + schemas[] to actually create.
└────────────────────┘
```
Workflow
Step 1 — Discover the source type
Call `external-data-sources-wizard` (no params). The response is a dict keyed by source type. Each entry describes:

- `name` — the canonical source_type string you'll pass to later calls (e.g. `"Postgres"`, `"Stripe"`, `"Hubspot"`).
- `label` / `caption` — human-readable name and description.
- `fields` — the config fields needed (host, port, database, api_key, client_id/secret, ...). Each has `name`, `type` (input, password, switch, select, file-upload), and `required`.
- `featured`, `unreleasedSource` — use to gauge readiness. Skip sources marked `unreleasedSource: true` unless the user explicitly asked for a preview.

Match the user's request to a source. If they said "Postgres", look up `Postgres`. If they said something ambiguous like "database", present the top relevant matches (Postgres, MySQL, MongoDB, BigQuery, Snowflake, Redshift) and let them pick.

For OAuth-based sources (Hubspot, Salesforce, Google Ads), the wizard entry hints at an OAuth flow. These typically need the user to authorize in the PostHog UI rather than pasting credentials — explain this and direct them to the source setup page rather than trying to collect tokens in chat. OAuth is about authentication, not about how data flows; OAuth sources still use polling bulk sync, not webhooks.

Gather the required credentials from the user. Never ask for more fields than the wizard entry says are required — asking for an unnecessary `port` when the source doesn't need one confuses users.
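As a minimal sketch, the filtering described above could look like this; the wizard-response shape here is an assumption inferred from the field names in this section, not a confirmed API contract:

```python
# Hypothetical wizard response, shaped after the fields described above
# (name, label, fields, featured, unreleasedSource); the exact structure
# returned by external-data-sources-wizard may differ.
WIZARD = {
    "Postgres": {
        "name": "Postgres",
        "label": "Postgres",
        "fields": [
            {"name": "host", "type": "input", "required": True},
            {"name": "port", "type": "input", "required": True},
            {"name": "password", "type": "password", "required": True},
            {"name": "ssh_tunnel", "type": "switch", "required": False},
        ],
        "featured": True,
    },
    "Preview": {"name": "Preview", "fields": [], "unreleasedSource": True},
}


def released_sources(wizard):
    """Source types safe to offer; skip unreleased previews by default."""
    return [k for k, v in wizard.items() if not v.get("unreleasedSource")]


def required_fields(wizard, source_type):
    """Only the fields marked required; never ask the user for more."""
    return [f["name"] for f in wizard[source_type]["fields"] if f.get("required")]


print(released_sources(WIZARD))             # ['Postgres']
print(required_fields(WIZARD, "Postgres"))  # ['host', 'port', 'password']
```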
Step 2 — Validate credentials and discover tables
Call `external-data-sources-db-schema` with `source_type` plus all credential fields. This does two things at once:

- Validates the credentials against the live source. Returns 400 with a `message` if anything is wrong (bad host, wrong password, permission denied). Show the error verbatim — it's often actionable ("password authentication failed for user 'x'").
- If valid, returns an array of table entries. Each entry:

```text
{
  "table": "orders",
  "should_sync": false,
  "rows": 1_250_000,
  "incremental_available": true,   # can do sync_type=incremental
  "append_available": true,        # can do sync_type=append
  "cdc_available": true,           # can do sync_type=cdc (null = not enabled for team)
  "supports_webhooks": false,      # can do sync_type=webhook for real-time push
  "incremental_fields": [          # candidates: usually updated_at, created_at, id
    {"field": "updated_at", "type": "datetime", "label": "updated_at", ...},
    {"field": "created_at", "type": "datetime", ...},
    {"field": "id", "type": "integer", ...}
  ],
  "detected_primary_keys": ["id"],
  "available_columns": [{"field": "id", "type": "integer", "nullable": false}, ...],
  "description": "..."
}
```

Present this to the user. Don't dump the raw JSON — summarize: which tables were found, row counts, and the default sync method recommendation per table (see sync-type decision guide).
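One way to turn the raw entries into that summary, as a sketch; treating `full_refresh` as always available is an assumption consistent with the sync-type guide in this document:

```python
def summarize_tables(tables):
    """Render db-schema table entries as a short human-readable summary."""
    lines = []
    for t in tables:
        methods = [label for label, key in (
            ("incremental", "incremental_available"),
            ("append", "append_available"),
            ("cdc", "cdc_available"),        # None means CDC not enabled for the team
            ("webhook", "supports_webhooks"),
        ) if t.get(key)]
        methods.append("full_refresh")       # assumed always available
        lines.append(f"{t['table']}: ~{t['rows']:,} rows; methods: {', '.join(methods)}")
    return "\n".join(lines)


tables = [{"table": "orders", "rows": 1_250_000,
           "incremental_available": True, "append_available": True,
           "cdc_available": None, "supports_webhooks": False}]
print(summarize_tables(tables))
# orders: ~1,250,000 rows; methods: incremental, append, full_refresh
```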
Step 3 — Confirm per-table sync configuration
For each table the user wants to sync, pick a sync_type. See the sync-type decision guide for detailed rules, but the short version is:

- Small / dimension tables (<50k rows, no natural ordering column): `full_refresh` — simple and always correct.
- Large tables with an `updated_at` / `modified_at`: `incremental` — much cheaper per sync.
- Append-only immutable tables (logs, events): `append` if available — preserves history.
- Postgres with CDC enabled and you need near-real-time: `cdc` — requires primary keys and Postgres prerequisites.
- Sources that support webhooks (currently Stripe): for near-real-time ingestion set `sync_type: "webhook"` on the tables where `supports_webhooks: true`, then register the webhook as a post-create step (see step 6 below). Tables that don't support webhooks on the same source still need a bulk sync_type.

For each schema that will use `incremental` / `append` / `cdc`, you also need:

- `incremental_field` — which column to track for high-water-mark ordering. Pick from the `incremental_fields` list returned by db-schema. Prefer `updated_at` over `created_at` (updated_at catches late-arriving updates; created_at misses them). For integer-only tables, use the monotonically increasing primary key.
- `incremental_field_type` — must match the chosen field's type (`datetime`, `timestamp`, `date`, `integer`, `numeric`, `objectid`).
- `primary_key_columns` — required for CDC. Use `detected_primary_keys` from db-schema.
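The decision rules above can be sketched as a helper over one db-schema table entry. This is an illustrative sketch, not a PostHog API; the `near_real_time` and `immutable` parameters are assumptions capturing intent the API can't know (the user wants push-style freshness; the table is append-only):

```python
def recommend_sync_type(t, *, near_real_time=False, immutable=False):
    """Apply the short-version sync-type rules to one db-schema table entry."""
    if near_real_time and t.get("supports_webhooks"):
        return "webhook"
    if near_real_time and t.get("cdc_available") and t.get("detected_primary_keys"):
        return "cdc"
    if immutable and t.get("append_available"):
        return "append"
    # Large tables with a modification timestamp are cheapest as incremental.
    has_mtime = any(f["field"] in ("updated_at", "modified_at")
                    for f in t.get("incremental_fields", []))
    if t.get("incremental_available") and has_mtime and t.get("rows", 0) >= 50_000:
        return "incremental"
    return "full_refresh"


orders = {"table": "orders", "rows": 1_250_000, "incremental_available": True,
          "incremental_fields": [{"field": "updated_at", "type": "datetime"}]}
print(recommend_sync_type(orders))  # incremental
```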
Step 4 — Pick a good prefix
The source's `prefix` is prepended to table names in HogQL. Tables end up as `{prefix}_{table_name}`.

- Default to the source type lowercased if there's only one source of that type: `stripe`, `postgres`.
- If the user already has a Postgres source, pick something distinguishing: `postgres_prod`, `postgres_analytics`.
- Use lowercase, underscore-separated. The prefix becomes part of every HogQL query the user writes.

Confirm the prefix with the user before creating — changing it later is possible but renames every table.
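A small sketch of the prefix conventions above (lowercase, underscore-separated, not already taken); the collision-handling suffix is an assumption, not PostHog behavior:

```python
import re


def suggest_prefix(source_type, existing, qualifier=""):
    """Suggest a lowercase, underscore-separated prefix that avoids collisions."""
    base = re.sub(r"[^a-z0-9]+", "_", source_type.lower()).strip("_")
    candidate = f"{base}_{qualifier}" if qualifier else base
    if candidate not in existing:
        return candidate
    # Assumed fallback: append a numeric suffix rather than reuse a taken prefix.
    n = 2
    while f"{candidate}_{n}" in existing:
        n += 1
    return f"{candidate}_{n}"


print(suggest_prefix("Postgres", set()))                 # postgres
print(suggest_prefix("Postgres", {"postgres"}, "prod"))  # postgres_prod
print(suggest_prefix("Google Ads", set()))               # google_ads
```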
Step 5 — Create the source
Call `external-data-sources-create` with:

```json
{
  "source_type": "Postgres",
  "prefix": "postgres_prod",
  "payload": {
    "host": "...",
    "port": "5432",
    "dbname": "...",
    "user": "...",
    "password": "...",
    "schema": "public",
    "schemas": [
      {
        "name": "orders",
        "should_sync": true,
        "sync_type": "incremental",
        "incremental_field": "updated_at",
        "incremental_field_type": "datetime",
        "primary_key_columns": ["id"]
      },
      {
        "name": "users",
        "should_sync": true,
        "sync_type": "full_refresh"
      },
      {
        "name": "audit_log",
        "should_sync": false
      }
    ]
  }
}
```

Rules for the `schemas` array:

- Every table returned by db-schema should be included, even ones the user doesn't want (set `should_sync: false`). Tables the user didn't mention default to `should_sync: false`.
- `sync_type` is required only when `should_sync: true`.
- `incremental_field` / `incremental_field_type` must be present when `sync_type` is `incremental` or `append`.
- `primary_key_columns` must be present when `sync_type` is `cdc`.

On success you'll get back a source with a new `id`. The first sync is triggered automatically.
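The rules for the `schemas` array can be enforced before calling create. A sketch of such a builder, assuming the db-schema table list and a per-table settings dict as inputs:

```python
def build_schemas(tables, config):
    """Build the schemas[] array from a db-schema table list.

    `config` maps table name -> settings for tables the user opted in to;
    every other table is still included, with should_sync: false.
    """
    schemas = []
    for t in tables:
        name = t["table"]
        if name not in config:
            schemas.append({"name": name, "should_sync": False})
            continue
        entry = {"name": name, "should_sync": True, **config[name]}
        st = entry.get("sync_type")
        if st is None:
            raise ValueError(f"{name}: sync_type is required when should_sync is true")
        if st in ("incremental", "append") and not (
                entry.get("incremental_field") and entry.get("incremental_field_type")):
            raise ValueError(f"{name}: incremental_field(_type) required for {st}")
        if st == "cdc" and not entry.get("primary_key_columns"):
            raise ValueError(f"{name}: primary_key_columns required for cdc")
        schemas.append(entry)
    return schemas


tables = [{"table": "orders"}, {"table": "users"}, {"table": "audit_log"}]
config = {
    "orders": {"sync_type": "incremental", "incremental_field": "updated_at",
               "incremental_field_type": "datetime"},
    "users": {"sync_type": "full_refresh"},
}
result = build_schemas(tables, config)
print([s["name"] for s in result if not s["should_sync"]])  # ['audit_log']
```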
Step 6 — Register a webhook (only when any schema is `sync_type: "webhook"`)

Webhook-type schemas don't start receiving data just by existing — the external service needs to know where to POST events, and PostHog needs to know how to verify them. This is a second call after source creation, not part of the `external-data-sources-create` payload. Do this before telling the user the setup is complete, otherwise they hear "syncs are running" while the push channel is still unregistered.

Only needed when at least one schema on the source has `sync_type: "webhook"` and `should_sync: true`. Currently only Stripe implements this flow; for everything else skip this step.

Before calling create-webhook, check `external-data-sources-webhook-info-retrieve({id})`. If it already returns `exists: true`, do NOT call create-webhook again — each successful call registers a new external endpoint and would result in duplicate deliveries.
- Call `external-data-sources-create-webhook-create({id})`. PostHog:
  - creates the HogFunction that will receive webhook POSTs,
  - builds a schema_mapping from external event types to PostHog schema ids,
  - calls the source's API (e.g. Stripe) to register the webhook URL and subscribe to the relevant events,
  - on Stripe, auto-captures the `signing_secret` and stores it securely.

  Returns `{success, webhook_url, error}`. On success report the `webhook_url` to the user for their records — but they don't need to paste it anywhere; registration is already done.
- If `success: false` with a permissions error like "API key doesn't have permission to create webhooks":
  - The HogFunction is still created, just disabled.
  - Ask the user to create the webhook manually in the source's dashboard using the returned `webhook_url`.
  - Have them copy the signing secret from the source's webhook settings.
  - Call `external-data-sources-update-webhook-inputs-create({id}, {inputs: {signing_secret: "whsec_..."}})` to store it. The HogFunction picks it up and verifies incoming payloads.
- Verify with `external-data-sources-webhook-info-retrieve({id})`. A healthy webhook has `exists: true`, `external_status.status: "enabled"`, and no `error`.

Webhooks are supplementary to bulk sync. The first load of a webhook-enabled schema is still done via polling (`initial_sync_complete` flips to true when done); after that, the webhook becomes the primary ingestion path. A webhook schema will still have a `sync_frequency` that schedules a periodic bulk refresh as a safety net. This is expected — not something to "fix".
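The branching in this step can be condensed into a small decision helper. A sketch, assuming the `{exists, external_status, error}` response shape described above:

```python
def webhook_action(info):
    """Decide the next webhook step from a webhook-info style response."""
    if not info.get("exists"):
        return "register"   # safe to call create-webhook once
    status = (info.get("external_status") or {}).get("status")
    if info.get("error") or status != "enabled":
        return "repair"     # manual webhook + signing-secret path
    return "healthy"        # never register again: that would duplicate deliveries


print(webhook_action({"exists": False}))  # register
print(webhook_action({"exists": True,
                      "external_status": {"status": "enabled"}}))  # healthy
```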
Step 7 — Confirm and explain what happens next
After creation (and, for webhook schemas, after Step 6):

- Call `external-data-schemas-list` to show the user the initial state.
- Explain: every enabled schema enters `Running`, then moves to `Completed` when the first sync finishes. First syncs can take anywhere from seconds to hours depending on row count — a multi-million-row table is fine, just slow.
- Tell them how to query: `SELECT * FROM {prefix}_{table_name} LIMIT 10` in HogQL.
- Offer to check back in a few minutes to confirm the initial syncs succeeded.
CDC setup for Postgres (optional, when requested)
If the user wants near-real-time replication from Postgres:

- Before calling db-schema, run `external-data-sources-check-cdc-prerequisites-create` with their Postgres creds. It returns `{valid, errors[]}` listing anything missing (wal_level, replication slot, publication, permissions).
- If `valid: false`, present the errors and ask the user to fix on the Postgres side. Don't try to create a CDC source that will immediately fail.
- Once prerequisites pass, proceed to db-schema and create. Set `sync_type: "cdc"` on the tables that need it, and include `primary_key_columns` for each (CDC requires them).
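A minimal sketch of gating on the pre-flight result, assuming the `{valid, errors[]}` shape described above:

```python
def cdc_ready(check):
    """Interpret a {valid, errors[]} pre-flight result before attempting CDC."""
    if check.get("valid"):
        return True, "Prerequisites met; proceed to db-schema and create."
    errors = "; ".join(check.get("errors", [])) or "unknown error"
    return False, f"Fix on the Postgres side first: {errors}"


ok, msg = cdc_ready({"valid": False, "errors": ["wal_level must be 'logical'"]})
print(ok, msg)  # False Fix on the Postgres side first: wal_level must be 'logical'
```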
Important notes
- Always validate creds with db-schema before create. The create endpoint will accept invalid creds and then fail asynchronously — the source appears in the list with status `Error` and no tables. Skipping the validation step just pushes the failure into the background.
- Present the table list before creating. Large databases may have hundreds of tables. Don't auto-select them all — row counts and relevance matter for billing. Let the user opt in explicitly.
- Don't invent schemas. Every entry in the `schemas` array must correspond to a real table from the db-schema response. You can't "also add an orders table" unless db-schema found one.
- Prefix is load-bearing. It's part of every HogQL query the user will ever write against these tables. Pick something short, descriptive, and not already taken.
- OAuth sources are different. Hubspot, Salesforce, Google Ads etc. need the user to authorize via the PostHog UI. Direct them there — don't try to collect OAuth tokens in chat.
- Webhooks are a separate step after create. Setting `sync_type: "webhook"` on a schema doesn't register the webhook — the `create-webhook` call does. Always follow create → create-webhook → webhook-info for webhook-type schemas, and never leave a webhook schema dangling without registration (it just won't receive events).
- Webhook support is source-specific and sparse. Currently only Stripe implements `WebhookSource`. Don't promise webhooks for Hubspot, Salesforce, or Postgres — they'll use polling sync.
- Row counts drive billing. Warehouse syncing is metered by rows synced. A chatty 500M-row events table synced hourly is very different from a 10k-row dimension table synced daily. Flag large tables and offer longer sync frequencies (`sync_frequency: "24hour"`) as the default.