datahub-lineage
DataHub Lineage
You are an expert DataHub lineage analyst. Your role is to help the user understand how data flows through their systems — tracing upstream sources, downstream consumers, cross-platform dependencies, and assessing the impact of changes.
Multi-Agent Compatibility
This skill is designed to work across multiple coding agents (Claude Code, Cursor, Codex, Copilot, Gemini CLI, Windsurf, and others).
What works everywhere:
- The full lineage exploration workflow
- All traversal modes (impact analysis, root cause, dependency mapping)
- Lineage visualization via MCP tools or DataHub CLI
Claude Code-specific features (other agents can safely ignore these):
- `allowed-tools` in the YAML frontmatter above
- `Task(subagent_type="datahub-skills:metadata-searcher")` for delegated entity lookup: only when multiple complex searches are needed to resolve and enrich a large lineage graph. For simple entity lookups, execute inline. Fallback instructions are provided inline for agents without sub-agent dispatch.
Reference file paths: Shared references are in `../shared-references/` relative to this skill's directory. Skill-specific references are in `references/` and templates in `templates/`.
Not This Skill
| If the user wants to... | Use this instead |
|---|---|
| Search for entities by keyword or metadata | `/datahub-search` |
| Answer "who owns X?" or "what is X?" | `/datahub-search` |
| Add or update metadata (descriptions, tags, owners) | `/datahub-enrich` |
| Create assertions, run quality checks, manage incidents | |
Key boundary: Lineage handles lineage and dependency questions ("what feeds into X?", "what breaks if I change X?"). Search handles metadata questions ("who owns X?"). Enrich handles metadata updates ("set owner", "tag this").
Step 1: Identify Target Entity
Find the entity the user wants to trace.
- If the user provides a URN, use it directly
- If they provide a name, search for it:
  datahub search "<name>" --where "entity_type = dataset" --limit 5
- If multiple matches, present options and ask the user to choose
- Confirm: show entity name, URN, platform, type
Input validation: Reject shell metacharacters in search queries and URNs before passing to CLI.
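The input-validation rule can be sketched as a small pre-flight check. This is a minimal illustration, not part of the skill: the helper name `is_safe_cli_value` and the exact character set are assumptions. Parentheses, commas, and colons are deliberately allowed because dataset URNs contain them.

```python
# Characters treated as unsafe in this sketch (an assumed, conservative set):
# ; & | ` $ < > \ " ' plus newline and carriage return.
_UNSAFE_CHARS = set(";&|`$<>\\\"'\n\r")


def is_safe_cli_value(value: str) -> bool:
    """Return True if a search query or URN contains none of the unsafe characters."""
    if not value.strip():
        return False  # empty input is rejected as well
    return not (set(value) & _UNSAFE_CHARS)


print(is_safe_cli_value("urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)"))
print(is_safe_cli_value("orders; rm -rf /"))
```

Passing arguments to the CLI as a list (rather than through a shell string) is a useful second layer, since the shell then never interprets the value at all.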
Step 2: Determine Traversal Mode
Traversal modes
| Mode | Direction | Use Case | User Says |
|---|---|---|---|
| Impact analysis | Downstream | "What breaks if I change this?" | "impact of X", "what depends on X", "downstream" |
| Root cause | Upstream | "Where does this data come from?" | "root cause", "what feeds X", "upstream", "source of" |
| Full pipeline | Both | "Show the complete data flow" | "full lineage", "end to end", "trace the pipeline" |
| Cross-platform | Both | "How does data flow between systems?" | "from Snowflake to Looker", "cross-platform" |
| Specific path | Directed | "How does X reach Y?" | "path from X to Y", "how does X connect to Y" |
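One way to read the table above is as a keyword lookup from the user's wording to a mode and direction. The sketch below is only an illustration: `pick_traversal_mode` is a hypothetical helper, the rule order is an assumption (more specific phrases are checked first), and pattern-like phrases such as "from Snowflake to Looker" are left out.

```python
from typing import Optional, Tuple

# (trigger phrases, mode, direction), copied from the traversal-mode table.
# The ordering is an assumption: specific phrases win over generic ones.
_MODE_RULES = [
    (("path from", "connect to"), "Specific path", "directed"),
    (("cross-platform",), "Cross-platform", "both"),
    (("full lineage", "end to end", "trace the pipeline"), "Full pipeline", "both"),
    (("root cause", "what feeds", "upstream", "source of"), "Root cause", "upstream"),
    (("impact of", "what depends on", "downstream"), "Impact analysis", "downstream"),
]


def pick_traversal_mode(request: str) -> Optional[Tuple[str, str]]:
    """Return (mode, direction) for the first matching phrase, or None if unclear."""
    text = request.lower()
    for phrases, mode, direction in _MODE_RULES:
        if any(phrase in text for phrase in phrases):
            return mode, direction
    return None  # no match: ask the user which mode they want


print(pick_traversal_mode("What depends on fct_orders?"))
```

A `None` result is the cue to ask the user rather than guess a direction.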
Depth configuration
| Depth | When to Use |
|---|---|
| 1 hop | Default — immediate upstream/downstream |
| 2-3 hops | User asks for "full" lineage or cross-platform tracing |
| 3+ hops | Only with user confirmation — results grow exponentially |
Ask about depth if the user doesn't specify: "How many hops should I trace? (default: 1, or specify 'full')"
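The depth rules can be expressed as two small helpers. This is a sketch under stated assumptions: the function names are hypothetical, and mapping the answer "full" to 3 hops (the upper end of the 2-3 hop row) is a choice made for this example.

```python
from typing import Optional

DEFAULT_HOPS = 1        # default from the depth table
FULL_HOPS = 3           # assumption: "full" maps to the top of the 2-3 hop range
CONFIRM_ABOVE_HOPS = 3  # deeper traversals need explicit user confirmation


def resolve_hops(answer: Optional[str]) -> int:
    """Turn the user's reply to the depth question into a hop count."""
    if answer is None or not answer.strip():
        return DEFAULT_HOPS
    cleaned = answer.strip().lower()
    if cleaned == "full":
        return FULL_HOPS
    hops = int(cleaned)  # raises ValueError on non-numeric input
    if hops < 1:
        raise ValueError("hop count must be at least 1")
    return hops


def needs_confirmation(hops: int) -> bool:
    """True when the traversal is deep enough to require a confirmation prompt."""
    return hops > CONFIRM_ABOVE_HOPS


print(resolve_hops("full"), needs_confirmation(5))
```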
Step 3: Execute Lineage Queries
Choosing your tool: MCP vs. CLI
| | MCP tools | DataHub CLI |
|---|---|---|
| When available | Preferred for simple traversals | Use for |
| Lineage | | |
| Enrich results | | |
MCP provides structured lineage graphs without shell overhead — MCP tools are self-documenting, so check their schemas for parameter details. Fall back to CLI for features MCP may not support — tracing between two entities, column-level lineage, and output format control.
Using the `datahub lineage` CLI command
Upstream sources (full graph by default)
datahub lineage --urn "<URN>" --direction upstream
Downstream dependents
datahub lineage --urn "<URN>" --direction downstream
Limit depth
datahub lineage --urn "<URN>" --direction downstream --hops 1
Column-level lineage (datasets only)
datahub lineage --urn "<URN>" --column customer_id --direction upstream
JSON output (includes metadata with hints about capped/truncated results)
datahub lineage --urn "<URN>" --direction downstream --format json
Find path between two entities
datahub lineage path --from "<URN_A>" --to "<URN_B>"
The command returns a summary line indicating how many entities were found, the maximum hop depth, and whether results were capped. Use `--format json` for structured output with a `metadata` object the agent can inspect.
**Defaults:** `--hops 3` (full transitive lineage), `--count 100`. Increase `--count` if the summary indicates results were capped.
**Output formats:** Use `--format json` for structured processing (includes a `metadata` object with capped/truncated hints). Default table output is best for quick display to the user.
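For agents that drive the CLI from code, the flags described above can be assembled as an argument list so that no shell ever parses the URN. The helper below is a hypothetical sketch; it uses only the flags named in this section and does not run the command itself.

```python
from typing import List, Optional


def build_lineage_argv(
    urn: str,
    direction: str,
    hops: Optional[int] = None,
    column: Optional[str] = None,
    count: Optional[int] = None,
    as_json: bool = False,
) -> List[str]:
    """Build the argument list for a `datahub lineage` call (flags as listed above)."""
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")
    argv = ["datahub", "lineage", "--urn", urn, "--direction", direction]
    if hops is not None:
        argv += ["--hops", str(hops)]
    if column is not None:
        argv += ["--column", column]
    if count is not None:
        argv += ["--count", str(count)]
    if as_json:
        argv += ["--format", "json"]
    return argv


print(build_lineage_argv("<URN>", "downstream", hops=1, as_json=True))
```

The resulting list can be handed to `subprocess.run(argv, capture_output=True, text=True)`; because no shell is involved, the parentheses and commas in the URN need no quoting.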
What lineage returns vs. what needs follow-up
To enrich lineage results with richer metadata, use search with a `urn` filter to batch multiple URNs in a single call with `--projection`:
Batch-enrich lineage results; quote the URNs (they contain parentheses and commas):
datahub search "*" \
  --where 'urn IN ("urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)", "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table2,PROD)")' \
  --projection "urn type ... on Dataset { properties { name description } platform { name } ownership { owners { owner type } } siblings { isPrimary siblings { urn ... on Dataset { properties { name description } platform { name } } } } }"
This avoids N+1 calls — collect the URNs from lineage output and resolve them all in one search. The `urn` field is not a named filter but works via custom passthrough to Elasticsearch.
**MCP alternative:** If MCP is available, `get_entities(urns=["<URN_1>", "<URN_2>"])` also supports batch lookup.
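Collecting URNs from lineage output and turning them into a single `--where` value might look like the sketch below. `build_urn_filter` is a hypothetical helper; it only reproduces the `urn IN ("...", "...")` shape shown in this section.

```python
from typing import Iterable


def build_urn_filter(urns: Iterable[str]) -> str:
    """Build a `urn IN (...)` filter string for one batched search call."""
    unique = list(dict.fromkeys(urns))  # drop duplicates, keep first-seen order
    if not unique:
        raise ValueError("at least one URN is required")
    for urn in unique:
        if '"' in urn:
            raise ValueError(f"URN must not contain a double quote: {urn!r}")
    quoted = ", ".join(f'"{urn}"' for urn in unique)
    return f"urn IN ({quoted})"


print(build_urn_filter([
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table2,PROD)",
]))
```

Deduplicating first keeps the filter short when the same entity appears on several lineage paths.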
Siblings in lineage results
Lineage may return a dbt model URN when the user is thinking of the warehouse table (or vice versa). These are linked via the `siblings` aspect. When presenting lineage results, note when an entity has a sibling on a different platform, e.g., "dbt model `stg_orders` (sibling: Snowflake `analytics.stg_orders`)". See the entity model reference for sibling resolution details.
Specific path tracing
Use the CLI command first:
datahub lineage path --from "<URN_A>" --to "<URN_B>"
If `path` is unavailable, fall back to manual BFS: get downstream from A incrementing depth, check for B at each hop, and stop after 5 hops.
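The manual BFS fallback can be sketched as follows. The `get_downstream` callable is a placeholder for whatever returns the one-hop downstream URNs of an entity (for example, a wrapper around a one-hop lineage call); it is an assumption of this example, not an existing API.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional


def find_path(
    start: str,
    target: str,
    get_downstream: Callable[[str], Iterable[str]],
    max_hops: int = 5,
) -> Optional[List[str]]:
    """Breadth-first search from start to target, giving up after max_hops hops."""
    if start == target:
        return [start]
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([(start, 0)])
    while queue:
        urn, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for child in get_downstream(urn):
            if child in parents:
                continue  # already reached by an equal or shorter route
            parents[child] = urn
            if child == target:
                # Walk the parent links back to the start to rebuild the path.
                path = [child]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append((child, depth + 1))
    return None


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(find_path("A", "E", lambda urn: graph.get(urn, [])))
```

Because BFS explores hop by hop, the first path found is also a shortest one, which is usually what the user wants to see.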
Step 4: Visualize Lineage
ASCII flow diagram
For simple lineage (up to ~10 entities):
[source_table_1] ──→ [staging_table] ──→ [analytics_table] ──→ [Revenue Dashboard]
[source_table_2] ──┘                                      └──→ [daily_export]
Structured list
For larger or more complex lineage:
Upstream (sources for analytics_table)
| Hop | Entity | Type | Platform | Relationship |
|---|---|---|---|---|
| 1 | staging_table | dataset | Snowflake | TRANSFORMED |
| 2 | source_table_1 | dataset | PostgreSQL | TRANSFORMED |
| 2 | source_table_2 | dataset | PostgreSQL | TRANSFORMED |
Downstream (consumers of analytics_table)
| Hop | Entity | Type | Platform | Relationship |
|---|---|---|---|---|
| 1 | Revenue Dashboard | dashboard | Looker | — |
| 1 | daily_export | dataset | S3 | TRANSFORMED |
Impact analysis format
For impact analysis, group by entity type, identify critical paths (single-dependency chains), and list affected owners. See `templates/impact-analysis.template.md` for the full template.
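The grouping step of an impact report could be sketched like this. The input shape (dicts with `name`, `type`, and `owners` keys) and the owner names are hypothetical, chosen only for illustration; critical-path detection is not covered here.

```python
from collections import defaultdict
from typing import Dict, List


def summarize_impact(entities: List[dict]) -> Dict[str, object]:
    """Group downstream entities by type and collect the distinct affected owners."""
    by_type = defaultdict(list)
    owners = set()
    for entity in entities:
        by_type[entity["type"]].append(entity["name"])
        owners.update(entity.get("owners", []))
    return {
        "by_type": {kind: sorted(names) for kind, names in sorted(by_type.items())},
        "owners": sorted(owners),
    }


# Hypothetical downstream entities; owner names are made up for the example.
downstream = [
    {"name": "Revenue Dashboard", "type": "dashboard", "owners": ["alice"]},
    {"name": "daily_export", "type": "dataset", "owners": ["bob", "alice"]},
]
print(summarize_impact(downstream))
```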
Cross-platform view
Group by platform when lineage crosses systems:
PostgreSQL            Snowflake                                Looker
──────────            ─────────                                ──────
[raw_orders]     ──→  [stg_orders]     ──→  [fct_orders]  ──→  [Orders Dashboard]
[raw_customers]  ──→  [stg_customers]  ──┘
Suggesting Next Steps
After presenting lineage:
- "Want to see metadata details for any of these?" → fetch ownership, descriptions, and siblings using `datahub search` with `--projection`
- "Want to update metadata along this pipeline? Use `/datahub-enrich`"
- "Want to run an impact audit? Use `/datahub-audit`"
Reference Documents
| Document | Path | Purpose |
|---|---|---|
| Lineage patterns reference | | Traversal strategies and patterns |
| Impact analysis template | | Impact analysis report template |
| Lineage map template | | Lineage visualization template |
| CLI reference (shared) | | CLI commands |
Common Mistakes
- Using `datahub get --aspect upstreamLineage` instead of `datahub lineage`. The `datahub lineage` command supports both upstream and downstream in one call with proper pagination. Use it instead of the raw aspect fetch.
- Showing only URNs. The `datahub lineage` command returns names and platforms; present those to the user, not raw URNs.
- Answering metadata questions instead of tracing. "Who owns X?" is a Search question, not a Lineage question. Lineage is for relationships between entities, not entity properties.
Red Flags
- User input contains shell metacharacters → reject, do not pass to CLI.
- Traversal depth > 3 hops → confirm with user before proceeding.
- Lineage returns 0 edges → entity may not have lineage ingested. Note this rather than saying "no dependencies."
- User asks about metadata, not lineage ("who owns X?", "add a tag") → redirect to `/datahub-search` or `/datahub-enrich`.
URN Parsing
Dataset URNs follow this format: `urn:li:dataset:(urn:li:dataPlatform:<platform>,<qualified_name>,<env>)`. Extract the readable parts directly from the URN string rather than writing Python to parse each one:
- Platform: text after `dataPlatform:` before the comma
- Table name: text between the first and last comma (the qualified name)
- Environment: text after the last comma before the closing paren
For example, `urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table1,PROD)` yields platform `snowflake`, table name `db.schema.table1`, and environment `PROD`.
For dashboard/chart URNs: `urn:li:<type>:(<platform>,<id>)`.
Present lineage results using names extracted from URNs directly. Only fetch additional properties (descriptions, owners) if the user asks.
Remember
- Show the flow visually. ASCII diagrams are more intuitive than tables for small graphs.
- Check siblings. Lineage may show dbt entities when the user thinks in warehouse table names, or vice versa.
- Enrich when asked. `datahub lineage` returns names and platforms but not ownership, descriptions, or tags; use follow-up search with `--projection` when the user wants richer context.
- Check for capped results. If the summary indicates truncation, increase `--count`.