chief-data-officer-advisor

Chief Data Officer Advisor

Strategic data leadership for startup CDOs and founders without one. Four decisions, no surveys:

1. Can we train our model on this data? (origin × consent × use-case matrix)
2. Warehouse, lakehouse, or mesh, and what do we build vs buy? (stage-driven architecture)
3. What is our customer data worth? (strategic value + M&A multiplier + productization paths)
4. What data role do we hire next? (stage-to-role map, centralize-vs-embed trigger)

This skill does not cover tactical data engineering. For schema design, observability, query optimization, RAG, or ML platform implementation, see `engineering/database-designer/`, `engineering/observability-designer/`, `engineering/data-quality-auditor/`, `engineering/sql-database-assistant/`, `engineering/rag-architect/`, and `engineering/llm-cost-optimizer/`.

Keywords

CDO, chief data officer, AI training data, consent provenance, training rights, GDPR Article 6 lawful basis, GDPR Article 22, EU AI Act high-risk, ePrivacy, copyright fair use, hiQ v. LinkedIn, scraped data, synthetic data, data product, data mesh, lakehouse, medallion architecture, dbt, Snowflake, BigQuery, Databricks, Fivetran, Airbyte, reverse ETL, feature store, customer data as asset, data monetization, data productization, anonymization, k-anonymity, differential privacy, M&A data diligence, data org, analytics engineer, data engineer, data scientist, data product manager, centralize vs embed, hub and spoke

Quick Start

```bash
# Audit data sources for AI training eligibility
python scripts/ai_training_data_audit.py                        # uses embedded sample
python scripts/ai_training_data_audit.py path/to/sources.json

# Pick data architecture + build-vs-buy + sequencing
python scripts/data_product_strategy_picker.py                  # uses embedded Series A SaaS
python scripts/data_product_strategy_picker.py path/to/profile.json

# Value the customer data corpus + productization viability
python scripts/data_asset_valuator.py                           # uses embedded B2B sample
python scripts/data_asset_valuator.py path/to/corpus.json
```

Key Questions (ask these first)

1. What decision does this data drive? (If none, why are we collecting it?)
2. What's the consent provenance of every source we want to train on? (TOS-only is not the same as explicit opt-in.)
3. Who are the internal data consumers, and how many distinct domains do they span? (Drives centralize-vs-embed and warehouse-vs-mesh.)
4. In an M&A scenario, is our data a moat or a liability? (Customer carve-outs in MSAs can flip the answer.)
5. Are we hiring an analytics engineer or a data scientist next? (They solve different problems; founders confuse them.)
6. Have we run an anonymization audit before any external sharing? (k-anonymity ≥ 5 is the floor, not the ceiling.)
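
The k-anonymity floor in the last question can be checked mechanically before any export. The sketch below computes k as the size of the smallest group of rows that share the same quasi-identifier values. The column names and sample rows are hypothetical, and passing this check alone does not establish that a dataset is safe to share.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share
    the same values for every quasi-identifier column."""
    if not rows:
        return 0
    groups = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in rows
    )
    return min(groups.values())


def passes_floor(rows, quasi_identifiers, floor=5):
    """True when every quasi-identifier combination covers at least `floor` rows."""
    return k_anonymity(rows, quasi_identifiers) >= floor


# Hypothetical export: five rows in one group, a single row in another.
sample = [{"industry": "fintech", "region": "EU"}] * 5 + [
    {"industry": "health", "region": "US"}
]
print(k_anonymity(sample, ["industry", "region"]))   # smallest group has 1 row
print(passes_floor(sample, ["industry", "region"]))  # fails the k >= 5 floor
```

A single small group is enough to fail the check, which is why the floor is evaluated on the minimum group size rather than the average.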

Core Responsibilities

1. AI Training Data Rights

The 2026 question every startup is facing: can we use customer data to train our model?

The answer is rarely binary. It depends on three independent dimensions:

| Dimension | Values |
|---|---|
| Origin | 1st-party-explicit-opt-in / 1st-party-TOS-only / partner-licensed / scraped / synthetic |
| Data class | Anonymous aggregate / behavioral / PII / 3rd-party content / regulated (PHI, PCI, kids) |
| Use case | In-product personalization / fine-tune our model / train foundation model / external sharing |

Each combination produces GO / MITIGATE / NO-GO. Run `ai_training_data_audit.py` on a JSON inventory of sources.

See `references/ai_training_data_rights.md` for the full matrix + GDPR Art. 6 lawful basis decision tree + EU AI Act high-risk triggers.
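
To make the three-dimensional lookup concrete, here is a minimal sketch of how an origin, data class, and use case could map to GO / MITIGATE / NO-GO. The rules, the inventory field names, and the verdicts are all assumptions chosen to show the shape of the lookup. They are not the matrix in the reference file, not the logic of `ai_training_data_audit.py`, and not legal guidance.

```python
GO, MITIGATE, NO_GO = "GO", "MITIGATE", "NO-GO"

# Hypothetical rules, evaluated top to bottom; "*" matches any value.
# These verdicts are placeholders to show the lookup shape only.
RULES = [
    (("*", "regulated", "*"), NO_GO),
    (("synthetic", "*", "*"), GO),
    (("scraped", "*", "external-sharing"), NO_GO),
    (("1st-party-explicit-opt-in", "*", "in-product-personalization"), GO),
]


def classify(origin: str, data_class: str, use_case: str) -> str:
    """Return the first matching verdict; unknown combinations need review."""
    for (o, d, u), verdict in RULES:
        if o in ("*", origin) and d in ("*", data_class) and u in ("*", use_case):
            return verdict
    return MITIGATE  # conservative default: never silently approve


# Hypothetical inventory entries (field names are assumptions).
inventory = [
    {"name": "support_tickets", "origin": "1st-party-TOS-only",
     "data_class": "PII", "use_case": "fine-tune-our-model"},
    {"name": "generated_dialogs", "origin": "synthetic",
     "data_class": "behavioral", "use_case": "fine-tune-our-model"},
]
for src in inventory:
    verdict = classify(src["origin"], src["data_class"], src["use_case"])
    print(f'{src["name"]}: {verdict}')
```

Falling back to MITIGATE for any combination without an explicit rule keeps gaps in the rule table visible instead of approving them by accident.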

2. Data Product Strategy

Architecture choice (warehouse vs lakehouse vs mesh) is stage-driven, not preference-driven:

- Warehouse only (Snowflake / BigQuery / Postgres): ≤5 data consumers, <2TB, no ML use cases
- Lakehouse (warehouse + object storage, often Databricks or Snowflake-with-Iceberg): 5–25 data consumers, 2TB–1PB, 1–3 ML use cases
- Data mesh: 25+ data consumers across 4+ domains, federated ownership culture in place
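
The thresholds above can be read as a small decision function. The sketch below is one possible encoding, not the logic of `data_product_strategy_picker.py`. Two choices are assumptions: the function name and parameters are hypothetical, and any profile that fits neither the warehouse nor the mesh criteria falls through to lakehouse.

```python
def pick_architecture(
    consumers: int,
    data_tb: float,
    ml_use_cases: int,
    domains: int = 1,
    federated_ownership: bool = False,
) -> str:
    """Map a company profile to warehouse / lakehouse / mesh."""
    # Mesh needs all three signals: scale, domain spread, and ownership culture.
    if consumers >= 25 and domains >= 4 and federated_ownership:
        return "mesh"
    # Warehouse-only applies while the team is small, data is small, and there is no ML.
    if consumers <= 5 and data_tb < 2 and ml_use_cases == 0:
        return "warehouse"
    # Assumption: everything in between is treated as lakehouse territory.
    return "lakehouse"


print(pick_architecture(consumers=4, data_tb=0.5, ml_use_cases=0))
print(pick_architecture(consumers=12, data_tb=40, ml_use_cases=2))
print(pick_architecture(consumers=30, data_tb=500, ml_use_cases=4,
                        domains=5, federated_ownership=True))
```

Note that 30 consumers without federated ownership in place still returns lakehouse here, which mirrors the point that mesh depends on culture as well as headcount.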

Build vs buy is decided per layer:

| Layer | Default | Exception or next step |
|---|---|---|
| Storage / warehouse | Buy (never build) | Build only if you’re a data infra company |
| ELT / ingest | Buy (never build) | Build only if the source isn’t supported by Fivetran/Airbyte |
| Modeling (dbt) | Always build | This is your IP |
| BI / dashboards | Buy at <100 consumers | Build only for embedded analytics for customers |
| Feature store | Defer until 3+ prod models | Then build OR buy Tecton/Hopsworks |
| ML platform | Defer until 5+ prod models | Then buy SageMaker/Vertex/Databricks |
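
As a rough illustration, the per-layer table can be expressed as a function that returns one recommendation per layer. This is a sketch under assumptions: the function and parameter names are hypothetical, and the `"review"` value for BI at 100 or more consumers is a placeholder, since the table does not say what to do in that case.

```python
def layer_plan(
    prod_models: int,
    bi_consumers: int,
    embedded_analytics: bool = False,
    unsupported_source: bool = False,
    is_data_infra_company: bool = False,
) -> dict:
    """Return a build / buy / defer recommendation for each layer."""
    if embedded_analytics:
        bi = "build"
    elif bi_consumers < 100:
        bi = "buy"
    else:
        bi = "review"  # assumption: the table gives no rule for this case

    return {
        "storage": "build" if is_data_infra_company else "buy",
        "elt": "build" if unsupported_source else "buy",
        "modeling": "build",  # modeling is treated as company IP
        "bi": bi,
        "feature_store": "build-or-buy" if prod_models >= 3 else "defer",
        "ml_platform": "buy" if prod_models >= 5 else "defer",
    }


plan = layer_plan(prod_models=3, bi_consumers=40)
for layer, decision in plan.items():
    print(f"{layer}: {decision}")
```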

Run `data_product_strategy_picker.py` for a stage-specific recommendation. See `references/data_product_strategy.md` for kill criteria per architecture and the build-vs-buy decision tree.

3. B2B Customer-Data-as-Asset

The shift: at Series B+, customer data is no longer just operational. It’s an asset that can be:

- A defensibility moat (replicating it requires years of customer cohort data)
- An M&A multiplier (1.2x–2x ARR uplift for strategic buyers)
- A direct revenue stream (anonymized industry benchmarks, embedding endpoints, licensing)

But it can also be a liability:

- Contractual carve-outs can make productization legally infeasible (for example, 47 of 380 customers with MSA carve-outs)
- Anonymization audits often reveal re-identification risk above tolerable thresholds
- Regulatory exposure increases with productization (GDPR Art. 28 processors vs Art. 26 joint controllers)
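
The carve-out figure in the liability list is easy to turn into a share of the customer base, which is usually the first number counsel asks for. The helper below is a hypothetical sketch and not part of `data_asset_valuator.py`. It does not encode any threshold for when productization becomes infeasible, because the text does not state one.

```python
def carve_out_share(customers_with_carve_out: int, total_customers: int) -> float:
    """Fraction of customers whose MSA restricts reuse of their data."""
    if total_customers <= 0:
        raise ValueError("total_customers must be positive")
    return customers_with_carve_out / total_customers


share = carve_out_share(47, 380)
unrestricted = 380 - 47

print(f"{share:.1%} of customers have an MSA carve-out")   # 12.4%
print(f"{unrestricted} customers have no carve-out")       # 333
```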

Run `data_asset_valuator.py` with corpus characteristics to get a strategic value score + productization paths + risk-adjusted value.

See `references/customer_data_as_asset.md` for the valuation framework, M&A diligence prep checklist, and contractual constraint audit pattern.

4. Data Team Org Evolution

The wrong question: "Should we hire a data scientist?" The right question: "What’s the next decision we can’t make because we lack data, and what role unblocks that?"

Stage-to-role map (B2B SaaS baseline):

| Stage | First hire | Then | Then |
|---|---|---|---|
| Pre-seed / seed | Founder-as-analyst (SQL + spreadsheets) | - | - |
| Series A | Analyst | Analytics engineer (dbt) | - |
| Series B | Data engineer | Senior analyst (embedded in GTM) | Data PM (if 3+ teams need data) |
| Growth | Manager of analytics | ML engineer (if model is core) | Head of Data |
| Late-stage | Head of Data → CDO | Specialized: BI, MLE, DPO | Federated owners per domain (mesh) |

Centralize-vs-embed trigger: when 3+ functional areas (sales, marketing, product, ops, CS) need bespoke data weekly, the central team becomes the bottleneck. Move to hub-and-spoke (central platform + embedded analysts) before that becomes a hiring crisis.
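
The stage-to-role map and the trigger can both be written down as data plus a small check. This is a minimal sketch: the dictionary keys, function names, and the sample request map are hypothetical, and the conditional notes from the table (such as "if 3+ teams need data") are dropped for brevity.

```python
from typing import Optional

# Hiring order per stage, taken from the table above (conditions omitted).
HIRING_SEQUENCE = {
    "pre-seed/seed": ["Founder-as-analyst"],
    "series-a": ["Analyst", "Analytics engineer"],
    "series-b": ["Data engineer", "Senior analyst", "Data PM"],
    "growth": ["Manager of analytics", "ML engineer", "Head of Data"],
}


def next_hire(stage: str, already_hired: list) -> Optional[str]:
    """Return the first role in the stage's sequence that is not yet filled."""
    for role in HIRING_SEQUENCE[stage]:
        if role not in already_hired:
            return role
    return None


def needs_hub_and_spoke(weekly_bespoke_requests: dict, threshold: int = 3) -> bool:
    """True when `threshold` or more functional areas need bespoke data weekly."""
    return sum(1 for needed in weekly_bespoke_requests.values() if needed) >= threshold


requests = {"sales": True, "marketing": True, "product": True, "ops": False, "cs": False}
print(next_hire("series-a", ["Analyst"]))   # Analytics engineer
print(needs_hub_and_spoke(requests))        # True: three areas need bespoke data
```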

See `references/data_team_org_evolution.md` for details.

Workflows

Workflow 1: AI Training Decision (1 hour)

Goal: Decide whether a specific data source can be used for a specific training use case.

```bash
# 1. Build sources.json with one entry per data source

# 2. Run the audit
python scripts/ai_training_data_audit.py sources.json

# 3. For each MITIGATE: assign owner + remediation
# 4. For each NO-GO: document the kill reason for the legal log
# 5. Cross-check with cs-general-counsel-advisor on top-3 mitigation items
# 6. Log via /cs:decide
```

Workflow 2: Architecture Decision (1 day)

Goal: Pick warehouse / lakehouse / mesh and the build-vs-buy split for the next 12 months.

```bash
python scripts/data_product_strategy_picker.py profile.json

# Cross-check with cs-cto-advisor on engineering capacity
# Cross-check with cs-cfo-advisor on 3-year TCO
# Log via /cs:decide; consider /cs:freeze 90 if signing a multi-year SaaS contract
```

Workflow 3: Data Asset Valuation for M&A Prep (3 days)

Goal: Value the data corpus and prepare for due diligence.

1. Inventory the corpus: size, freshness, exclusivity, customer overlap, contractual restrictions
2. Run `data_asset_valuator.py`
3. Run the M&A diligence prep checklist in `customer_data_as_asset.md`
4. Surface contractual carve-outs to cs-general-counsel-advisor for re-papering plan
5. Decide productization path (benchmark report / embedding endpoint / direct license)
6. Log via /cs:decide

Workflow 4: Data Team Roadmap (1 week)

Goal: Build the next 18 months of data hires aligned to business decisions.

1. List the top 5 decisions the business can’t make today due to missing data or analysis
2. Map each decision to the role that unblocks it
3. Sequence hires (one role at a time, ramp before next)
4. Cross-check with cs-chro-advisor on comp bands and leveling
5. Identify the centralize-vs-embed trigger date

Output Standards (when invoked via cs-cdo-advisor)

**Bottom Line:** [one sentence — decision and rationale]
**The Decision:** [one of the 4 framings]
**The Evidence:** [numbers, not adjectives]
**How to Act:** [3 concrete next steps]
**Your Decision:** [the call only the founder can make]
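
For teams that want to generate this structure programmatically, a small formatter is enough. The sketch below is an illustration only: `render_output` and its parameters are hypothetical names, the sample values are invented placeholders, and the check for exactly three steps simply mirrors the "3 concrete next steps" wording in the template.

```python
def render_output(bottom_line, decision, evidence, actions, founder_call):
    """Render the five-part output block in the order the template defines."""
    if len(actions) != 3:
        raise ValueError("How to Act expects exactly 3 concrete next steps")
    steps = "; ".join(f"{i}. {step}" for i, step in enumerate(actions, 1))
    return "\n".join([
        f"**Bottom Line:** {bottom_line}",
        f"**The Decision:** {decision}",
        f"**The Evidence:** {evidence}",
        f"**How to Act:** {steps}",
        f"**Your Decision:** {founder_call}",
    ])


# Placeholder values for illustration only.
report = render_output(
    bottom_line="Stay on a warehouse for the next 12 months.",
    decision="Warehouse, lakehouse, or mesh",
    evidence="4 data consumers, 0.5 TB, 0 ML use cases",
    actions=["Confirm consumer count", "Review ELT contract", "Log via /cs:decide"],
    founder_call="Whether to sign a multi-year contract",
)
print(report)
```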

Adjacent Skills

- `../cto-advisor/`: architecture capacity, scaling cliffs
- `../ciso-advisor/`: data security, threat modeling for productized data
- `../general-counsel-advisor/`: contractual constraints, DPA, training-data rights
- `../cfo-advisor/`: build-vs-buy TCO, M&A valuation math
- `../chro-advisor/`: data team hiring, leveling, comp
- `../../../engineering/database-designer/`: tactical schema design
- `../../../engineering/rag-architect/`: tactical AI/RAG implementation
- `../../../engineering/llm-cost-optimizer/`: model cost management

References

- `ai_training_data_rights.md`: The training-rights matrix + GDPR Art. 6 / EU AI Act decision tree
- `data_product_strategy.md`: Warehouse / lakehouse / mesh kill criteria + build-vs-buy decision tree
- `customer_data_as_asset.md`: Valuation framework + M&A diligence prep + productization paths
- `data_team_org_evolution.md`: Stage-to-role map + centralize-vs-embed trigger

Version: 1.0.0

Status: Production Ready

Disclaimer: Decisions touching training data rights, data productization, or M&A data diligence should involve qualified counsel. This skill surfaces decisions and tradeoffs; it does not replace legal review.
