# Setting up a data warehouse source

Use this skill when the user wants to connect an external data source to PostHog's data warehouse for the first time. The setup has a specific three-step flow (wizard → db-schema → create) — skipping steps leads to failed sources and confused users.

## When to use this skill

- The user wants to connect a new source: "connect Stripe", "import my Postgres orders table", "sync Hubspot contacts"
- The user isn't sure what source types PostHog supports
- The user has credentials but doesn't know how to structure the `schemas` payload
- The user wants guidance on which sync method to pick per table

## Available tools

| Tool | Purpose |
| --- | --- |
| `external-data-sources-wizard` | Discover which source types exist and what fields each needs |
| `external-data-sources-db-schema` | Validate credentials and list tables with available sync methods per table |
| `external-data-sources-create` | Create the source — requires a `schemas` array built from the db-schema response |
| `external-data-sources-check-cdc-prerequisites-create` | Postgres CDC pre-flight check (optional, only for Postgres CDC) |
| `external-data-sources-webhook-info-retrieve` | Check if a source supports webhooks and whether one has been registered |
| `external-data-sources-create-webhook-create` | Register a webhook with the external service after source creation |
| `external-data-sources-update-webhook-inputs-create` | Supply the signing secret manually when auto-registration failed |
| `external-data-sources-list` | After creation, confirm the source is listed and see its initial status |
| `external-data-schemas-list` | See per-table sync status once the source is created |

## The three-step flow

Every source setup follows the same shape. Don't try to shortcut to `external-data-sources-create` — you need the db-schema response to build a valid `schemas` payload.

```text
         ┌────────────────────┐
         │ 1. wizard          │  What source types exist? What fields does each need?
         └────────┬───────────┘
         ┌────────────────────┐
         │ 2. db-schema       │  Validate creds. List tables + available sync methods per table.
         └────────┬───────────┘
         ┌────────────────────┐
         │ 3. create          │  Send source_type + credentials + schemas[] to actually create.
         └────────────────────┘
```

## Workflow

### Step 1 — Discover the source type

Call `external-data-sources-wizard` (no params). The response is a dict keyed by source type. Each entry describes:

- `name` — the canonical source_type string you'll pass to later calls (e.g. `"Postgres"`, `"Stripe"`, `"Hubspot"`).
- `label` / `caption` — human-readable.
- `fields` — the config fields needed (host, port, database, api_key, client_id/secret, ...). Each has `name`, `type` (input, password, switch, select, file-upload), and `required`.
- `featured`, `unreleasedSource` — use to gauge readiness. Skip sources marked `unreleasedSource: true` unless the user explicitly asked for a preview.

Match the user's request to a source. If they said "Postgres", look up `Postgres`. If they said something ambiguous like "database", present the top relevant matches (Postgres, MySQL, MongoDB, BigQuery, Snowflake, Redshift) and let them pick.

For OAuth-based sources (Hubspot, Salesforce, Google Ads), the wizard entry hints at an OAuth flow. These typically need the user to authorize in the PostHog UI rather than pasting credentials — explain this and direct them to the source setup page rather than trying to collect tokens in chat. OAuth is about authentication, not about how data flows; OAuth sources still use polling bulk sync, not webhooks.

Gather the required credentials from the user. Never ask for more fields than the wizard entry says are required — asking for an unnecessary `port` when the source doesn't need one confuses users.
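For orientation, here is a trimmed sketch of what one wizard entry might look like (the keys follow the description above; the exact values and the full field list are illustrative):

```json
{
  "Postgres": {
    "name": "Postgres",
    "label": "Postgres",
    "caption": "Sync tables from your Postgres database",
    "fields": [
      {"name": "host", "type": "input", "required": true},
      {"name": "user", "type": "input", "required": true},
      {"name": "password", "type": "password", "required": true}
    ],
    "featured": true,
    "unreleasedSource": false
  }
}
```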

### Step 2 — Validate credentials and discover tables

Call `external-data-sources-db-schema` with `source_type` plus all credential fields. This does two things at once:

1. Validates the credentials against the live source. Returns 400 with a `message` if anything is wrong (bad host, wrong password, permission denied). Show the error verbatim — it's often actionable ("password authentication failed for user 'x'").
2. If valid, returns an array of table entries. Each entry:

```text
{
  "table": "orders",
  "should_sync": false,
  "rows": 1_250_000,
  "incremental_available": true,   # can do sync_type=incremental
  "append_available": true,        # can do sync_type=append
  "cdc_available": true,           # can do sync_type=cdc  (null = not enabled for team)
  "supports_webhooks": false,      # can do sync_type=webhook for real-time push
  "incremental_fields": [          # candidates: usually updated_at, created_at, id
    {"field": "updated_at", "type": "datetime", "label": "updated_at", ...},
    {"field": "created_at", "type": "datetime", ...},
    {"field": "id", "type": "integer", ...}
  ],
  "detected_primary_keys": ["id"],
  "available_columns": [{"field": "id", "type": "integer", "nullable": false}, ...],
  "description": "..."
}
```

Present this to the user. Don't dump the raw JSON — summarize: which tables were found, row counts, and the default sync method recommendation per table (see sync-type decision guide).
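As a sketch of the request itself, assuming the credential fields sit alongside `source_type` as described above (the exact field names come from the wizard entry; the values here are placeholders):

```json
{
  "source_type": "Postgres",
  "host": "db.internal.example.com",
  "port": "5432",
  "dbname": "app",
  "user": "posthog_readonly",
  "password": "...",
  "schema": "public"
}
```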

### Step 3 — Confirm per-table sync configuration

For each table the user wants to sync, pick a sync_type. See the sync-type decision guide for detailed rules, but the short version is:

- Small / dimension tables (<50k rows, no natural ordering column): `full_refresh` — simple and always correct.
- Large tables with an `updated_at` / `modified_at`: `incremental` — much cheaper per sync.
- Append-only immutable tables (logs, events): `append` if available — preserves history (sketched below this list).
- Postgres with CDC enabled and you need near-real-time: `cdc` — requires primary keys and Postgres prerequisites.
- Sources that support webhooks (currently Stripe): for near-real-time ingestion set `sync_type: "webhook"` on the tables where `supports_webhooks: true`, then register the webhook as a post-create step (see step 6 below). Tables that don't support webhooks on the same source still need a bulk sync_type.

For each schema that will use `incremental` / `append` / `cdc`, you also need:

- `incremental_field` — which column to track for high-water-mark ordering. Pick from the `incremental_fields` list returned by db-schema. Prefer `updated_at` over `created_at` (updated_at catches late-arriving updates; created_at misses them). For integer-only tables, use the monotonically increasing primary key.
- `incremental_field_type` — must match the chosen field's type (`datetime`, `timestamp`, `date`, `integer`, `numeric`, `objectid`).
- `primary_key_columns` — required for CDC. Use `detected_primary_keys` from db-schema.
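Applying those rules to an append-only table yields a schema entry like this sketch (the table name `events_log` is hypothetical; the integer primary key follows the rule above for tables without timestamps):

```json
{
  "name": "events_log",
  "should_sync": true,
  "sync_type": "append",
  "incremental_field": "id",
  "incremental_field_type": "integer"
}
```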

### Step 4 — Pick a good prefix

The source's `prefix` is prepended to table names in HogQL. Tables end up as `{prefix}_{table_name}`.

- Default to the source type lowercased if there's only one source of that type: `stripe`, `postgres`.
- If the user already has a Postgres source, pick something distinguishing: `postgres_prod`, `postgres_analytics`.
- Use lowercase, underscore-separated. The prefix becomes part of every HogQL query the user writes.

Confirm the prefix with the user before creating — changing it later is possible but renames every table.
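For example, a prefix of `postgres_prod` surfaces the tables from step 2 in HogQL as:

```text
orders     →  postgres_prod_orders
users      →  postgres_prod_users
audit_log  →  postgres_prod_audit_log
```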

### Step 5 — Create the source

Call `external-data-sources-create` with:

```json
{
  "source_type": "Postgres",
  "prefix": "postgres_prod",
  "payload": {
    "host": "...",
    "port": "5432",
    "dbname": "...",
    "user": "...",
    "password": "...",
    "schema": "public",
    "schemas": [
      {
        "name": "orders",
        "should_sync": true,
        "sync_type": "incremental",
        "incremental_field": "updated_at",
        "incremental_field_type": "datetime",
        "primary_key_columns": ["id"]
      },
      {
        "name": "users",
        "should_sync": true,
        "sync_type": "full_refresh"
      },
      {
        "name": "audit_log",
        "should_sync": false
      }
    ]
  }
}
```

Rules for the `schemas` array:

- Every table returned by db-schema should be included, even ones the user doesn't want (set `should_sync: false`). Tables the user didn't mention default to `should_sync: false`.
- `sync_type` is required only when `should_sync: true`.
- `incremental_field` / `incremental_field_type` must be present when `sync_type` is `incremental` or `append`.
- `primary_key_columns` must be present when `sync_type` is `cdc`.

On success you'll get back a source with a new `id`. The first sync is triggered automatically.
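For a webhook-capable source, the same call simply mixes sync types inside `schemas`. A sketch for Stripe (the credential field name and the table names here are illustrative; take the real ones from the wizard and db-schema responses):

```json
{
  "source_type": "Stripe",
  "prefix": "stripe",
  "payload": {
    "stripe_secret_key": "sk_live_...",
    "schemas": [
      {
        "name": "invoices",
        "should_sync": true,
        "sync_type": "webhook"
      },
      {
        "name": "balance_transactions",
        "should_sync": true,
        "sync_type": "incremental",
        "incremental_field": "created",
        "incremental_field_type": "integer"
      }
    ]
  }
}
```

The `"webhook"` schema still needs the registration call in step 6 before it receives anything.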

### Step 6 — Register a webhook (only when any schema is `sync_type: "webhook"`)

Webhook-type schemas don't start receiving data just by existing — the external service needs to know where to POST events, and PostHog needs to know how to verify them. This is a second call after source creation, not part of the `external-data-sources-create` payload. Do this before telling the user the setup is complete, otherwise they hear "syncs are running" while the push channel is still unregistered.

Only needed when at least one schema on the source has `sync_type: "webhook"` and `should_sync: true`. Currently only Stripe implements this flow; for everything else skip this step.

Before calling create-webhook, check `external-data-sources-webhook-info-retrieve({id})`. If it already returns `exists: true`, do NOT call create-webhook again — each successful call registers a new external endpoint and would result in duplicate deliveries.

1. Call `external-data-sources-create-webhook-create({id})`. PostHog:
   - creates the HogFunction that will receive webhook POSTs,
   - builds a schema_mapping from external event types to PostHog schema ids,
   - calls the source's API (e.g. Stripe) to register the webhook URL and subscribe to the relevant events,
   - on Stripe, auto-captures the `signing_secret` and stores it securely.

   Returns `{success, webhook_url, error}`. On success report the `webhook_url` to the user for their records — but they don't need to paste it anywhere; registration is already done.
2. If `success: false` with a permissions error like "API key doesn't have permission to create webhooks":
   - The HogFunction is still created, just disabled.
   - Ask the user to create the webhook manually in the source's dashboard using the returned `webhook_url`.
   - Have them copy the signing secret from the source's webhook settings.
   - Call `external-data-sources-update-webhook-inputs-create({id}, {inputs: {signing_secret: "whsec_..."}})` to store it. The HogFunction picks it up and verifies incoming payloads.
3. Verify with `external-data-sources-webhook-info-retrieve({id})`. A healthy webhook has `exists: true`, `external_status.status: "enabled"`, and no `error`.

Webhooks are supplementary to bulk sync. The first load of a webhook-enabled schema is still done via polling (`initial_sync_complete` flips to true when done); after that, the webhook becomes the primary ingestion path. A webhook schema will still have a `sync_frequency` that schedules a periodic bulk refresh as a safety net. This is expected — not something to "fix".
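A healthy webhook-info response, sketched from the fields named above (the real payload may carry more fields):

```json
{
  "exists": true,
  "webhook_url": "https://...",
  "external_status": {"status": "enabled"},
  "error": null
}
```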

### Step 7 — Confirm and explain what happens next

After creation (and, for webhook schemas, after Step 6):

- Call `external-data-schemas-list` to show the user the initial state (a sketch of the response follows this list).
- Explain: every enabled schema enters `Running`, then moves to `Completed` when the first sync finishes. First syncs can take anywhere from seconds to hours depending on row count — a multi-million-row table is fine, just slow.
- Tell them how to query: `SELECT * FROM {prefix}_{table_name} LIMIT 10` in HogQL.
- Offer to check back in a few minutes to confirm the initial syncs succeeded.
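A sketch of what that initial state might look like (only the `Running`/`Completed` statuses are given above; the surrounding field names are illustrative):

```json
[
  {"name": "orders", "sync_type": "incremental", "status": "Running"},
  {"name": "users", "sync_type": "full_refresh", "status": "Completed"}
]
```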

## CDC setup for Postgres (optional, when requested)

If the user wants near-real-time replication from Postgres:

1. Before calling db-schema, run `external-data-sources-check-cdc-prerequisites-create` with their Postgres creds. It returns `{valid, errors[]}` listing anything missing (wal_level, replication slot, publication, permissions); a failure sketch follows this list.
2. If `valid: false`, present the errors and ask the user to fix on the Postgres side. Don't try to create a CDC source that will immediately fail.
3. Once prerequisites pass, proceed to db-schema and create. Set `sync_type: "cdc"` on the tables that need it, and include `primary_key_columns` for each (CDC requires them).
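A failing pre-flight might come back like this (the error wording is illustrative; only the `{valid, errors[]}` shape is documented above):

```json
{
  "valid": false,
  "errors": [
    "wal_level must be 'logical' (currently 'replica')",
    "user lacks the REPLICATION privilege"
  ]
}
```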

## Important notes

- **Always validate creds with db-schema before create.** The create endpoint will accept invalid creds and then fail asynchronously — the source appears in the list with status `Error` and no tables. Skipping the validation step just pushes the failure into the background.
- **Present the table list before creating.** Large databases may have hundreds of tables. Don't auto-select them all — row counts and relevance matter for billing. Let the user opt in explicitly.
- **Don't invent schemas.** Every entry in the `schemas` array must correspond to a real table from the db-schema response. You can't "also add an orders table" unless db-schema found one.
- **Prefix is load-bearing.** It's part of every HogQL query the user will ever write against these tables. Pick something short, descriptive, and not already taken.
- **OAuth sources are different.** Hubspot, Salesforce, Google Ads etc. need the user to authorize via the PostHog UI. Direct them there — don't try to collect OAuth tokens in chat.
- **Webhooks are a separate step after create.** Setting `sync_type: "webhook"` on a schema doesn't register the webhook — the `create-webhook` call does. Always follow create → create-webhook → webhook-info for webhook-type schemas, and never leave a webhook schema dangling without registration (it just won't receive events).
- **Webhook support is source-specific and sparse.** Currently only Stripe implements `WebhookSource`. Don't promise webhooks for Hubspot, Salesforce, or Postgres — they'll use polling sync.
- **Row counts drive billing.** Warehouse syncing is metered by rows synced. A chatty 500M-row events table synced hourly is very different from a 10k-row dimension table synced daily. Flag large tables and offer longer sync frequencies (`sync_frequency: "24hour"`) as the default; a sketch follows this list.
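That default, assuming `sync_frequency` is accepted per schema entry in the create payload (the notes above name the field but not where it lives, so treat the placement as an assumption):

```json
{
  "name": "events",
  "should_sync": true,
  "sync_type": "append",
  "incremental_field": "id",
  "incremental_field_type": "integer",
  "sync_frequency": "24hour"
}
```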