refactoring-dbt-models

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

dbt Refactoring

dbt 重构

Find ALL downstream dependencies before changing. Refactor in small steps. Verify output after each change.

在进行更改前找出所有下游依赖关系。分步进行重构。每次更改后验证输出。

Workflow

工作流

1. Analyze Current Model

1. 分析当前模型

bash

cat models/<path>/<model_name>.sql

Identify refactoring opportunities:

CTEs longer than 50 lines → extract to intermediate model
Logic repeated across models → extract to macro
Multiple joins in sequence → split into steps
Complex WHERE clauses → extract to staging filter

bash

cat models/<path>/<model_name>.sql

识别重构机会：

超过50行的CTE → 提取为中间模型
跨模型重复的逻辑 → 提取为宏
连续多次关联 → 拆分为多个步骤
复杂的WHERE子句 → 提取为 staging 过滤器

2. Find All Downstream Dependencies

2. 找出所有下游依赖关系

CRITICAL: Never refactor without knowing impact.

bash

undefined

关键：绝不要在不了解影响范围的情况下进行重构。

bash

undefined

Get full dependency tree (model and all its children)

获取完整依赖树（模型及其所有子模型）

dbt ls --select model_name+ --output list

Find all models referencing this one

找出所有引用此模型的模型

grep -r "ref('model_name')" models/ --include="*.sql"


**Report to user:** "Found X downstream models: [list]. These will be affected by changes."

grep -r "ref('model_name')" models/ --include="*.sql"


**向用户报告：“发现X个下游模型：[列表]。这些模型会受到更改的影响。”**

3. Check What Columns Downstream Models Use

3. 检查下游模型使用的列

BEFORE changing any columns, check what downstream models reference:

bash

undefined

在更改任何列之前，检查下游模型引用了哪些列：

bash

undefined

For each downstream model, check what columns it uses

针对每个下游模型，检查它使用的列

cat models/<path>/<downstream_model>.sql | grep -E "model_name.\w+|alias.\w+"


If downstream models reference specific columns, you MUST ensure those columns remain available after refactoring.

cat models/<path>/<downstream_model>.sql | grep -E "model_name.\w+|alias.\w+"


如果下游模型引用了特定列，你必须确保这些列在重构后仍然可用。

4. Plan Refactoring Strategy

4. 制定重构策略

Opportunity	Strategy
Long CTE	Extract to intermediate model
Repeated logic	Create macro in `macros/`
Complex join	Split into intermediate models
Multiple concerns	Separate into focused models

重构机会	策略
长CTE	提取为中间模型
重复逻辑	在 `macros/` 目录下创建宏
复杂关联	拆分为中间模型
多关注点	拆分为专注单一职责的模型

5. Execute Refactoring

5. 执行重构

Pattern: Extract CTE to Model

模式：将CTE提取为模型

Before:

sql

-- orders.sql (200 lines)
with customer_metrics as (
    -- 50 lines of complex logic
),
order_enriched as (
    select ...
    from orders
    join customer_metrics on ...
)
select * from order_enriched

After:

sql

-- customer_metrics.sql (new file)
select
    customer_id,
    -- complex logic here
from {{ ref('customers') }}

-- orders.sql (simplified)
with order_enriched as (
    select ...
    from {{ ref('raw_orders') }} orders
    join {{ ref('customer_metrics') }} cm on ...
)
select * from order_enriched

重构前：

sql

-- orders.sql（200行）
with customer_metrics as (
    -- 50行复杂逻辑
),
order_enriched as (
    select ...
    from orders
    join customer_metrics on ...
)
select * from order_enriched

重构后：

sql

-- customer_metrics.sql（新文件）
select
    customer_id,
    -- 复杂逻辑写在这里
from {{ ref('customers') }}

-- orders.sql（简化后）
with order_enriched as (
    select ...
    from {{ ref('raw_orders') }} orders
    join {{ ref('customer_metrics') }} cm on ...
)
select * from order_enriched

Pattern: Extract to Macro

模式：提取为宏

Before (repeated in multiple models):

sql

case
    when amount < 0 then 'refund'
    when amount = 0 then 'zero'
    else 'positive'
end as amount_category

After:

sql

-- macros/categorize_amount.sql
{% macro categorize_amount(column_name) %}
case
    when {{ column_name }} < 0 then 'refund'
    when {{ column_name }} = 0 then 'zero'
    else 'positive'
end
{% endmacro %}

-- In models:
{{ categorize_amount('amount') }} as amount_category

重构前（在多个模型中重复出现）：

sql

case
    when amount < 0 then 'refund'
    when amount = 0 then 'zero'
    else 'positive'
end as amount_category

重构后：

sql

-- macros/categorize_amount.sql
{% macro categorize_amount(column_name) %}
case
    when {{ column_name }} < 0 then 'refund'
    when {{ column_name }} = 0 then 'zero'
    else 'positive'
end
{% endmacro %}

-- 在模型中使用：
{{ categorize_amount('amount') }} as amount_category

6. Validate Changes

6. 验证更改

bash

undefined

bash

undefined

Compile to check syntax

编译以检查语法

dbt compile --select +model_name+

Build entire lineage

构建整个依赖链

dbt build --select +model_name+

Check row counts (manual)

手动检查行数

Before: Record expected counts

重构前：记录预期行数

After: Verify counts match

重构后：验证行数是否匹配

undefined

undefined

7. Verify Output Matches Original

7. 验证输出与原版本一致

CRITICAL: Refactoring should not change output.

bash

undefined

关键：重构不应改变输出结果。

bash

undefined

Compare row counts before and after

比较重构前后的行数

dbt show --inline "select count(*) from {{ ref('model_name') }}"

Spot check key values

抽样检查关键值

dbt show --select <model_name> --limit 10

undefined

dbt show --select <model_name> --limit 10

undefined

8. Update Downstream Models

8. 更新下游模型

If changing output columns:

Update all downstream refs
Update schema.yml documentation
Re-run downstream tests

如果更改了输出列：

更新所有下游引用
更新schema.yml文档
重新运行下游测试

Refactoring Checklist

重构检查清单

Common Refactoring Triggers

常见重构触发因素

Symptom	Refactoring
Model > 200 lines	Extract CTEs to models
Same logic in 3+ models	Extract to macro
5+ joins in one model	Create intermediate models
Hard to understand	Add CTEs with clear names
Slow performance	Split to allow parallelization

症状	重构方式
模型超过200行	将CTE提取为独立模型
相同逻辑出现在3个及以上模型中	提取为宏
单个模型中有5次及以上关联	创建中间模型
难以理解	添加命名清晰的CTE
性能缓慢	拆分以支持并行处理

Anti-Patterns

反模式

Refactoring without checking downstream impact
Making multiple changes at once
Not validating output matches after refactoring
Extracting prematurely (wait for 3+ uses)
Breaking existing tests without updating them

未检查下游影响就进行重构
同时进行多项更改
重构后未验证输出是否与原版本一致
过早提取（等出现3次及以上使用场景再操作）
破坏现有测试但未更新测试用例