refactoring-dbt-models

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

dbt Refactoring

dbt 重构

Find ALL downstream dependencies before changing. Refactor in small steps. Verify output after each change.
在进行更改前找出所有下游依赖关系。分步进行重构。每次更改后验证输出。

Workflow

工作流

1. Analyze Current Model

1. 分析当前模型

bash
cat models/<path>/<model_name>.sql
Identify refactoring opportunities:
  • CTEs longer than 50 lines → extract to intermediate model
  • Logic repeated across models → extract to macro
  • Multiple joins in sequence → split into steps
  • Complex WHERE clauses → extract to staging filter
bash
cat models/<path>/<model_name>.sql
识别重构机会:
  • 超过50行的CTE → 提取为中间模型
  • 跨模型重复的逻辑 → 提取为宏
  • 连续多次关联 → 拆分为多个步骤
  • 复杂的WHERE子句 → 提取为 staging 过滤器

2. Find All Downstream Dependencies

2. 找出所有下游依赖关系

CRITICAL: Never refactor without knowing impact.
bash
undefined
关键:绝不要在不了解影响范围的情况下进行重构。
bash
undefined

Get full dependency tree (model and all its children)

获取完整依赖树(模型及其所有子模型)

dbt ls --select model_name+ --output list
dbt ls --select model_name+ --output list

Find all models referencing this one

找出所有引用此模型的模型

grep -r "ref('model_name')" models/ --include="*.sql"

**Report to user:** "Found X downstream models: [list]. These will be affected by changes."
grep -r "ref('model_name')" models/ --include="*.sql"

**向用户报告:“发现X个下游模型:[列表]。这些模型会受到更改的影响。”**

3. Check What Columns Downstream Models Use

3. 检查下游模型使用的列

BEFORE changing any columns, check what downstream models reference:
bash
undefined
在更改任何列之前,检查下游模型引用了哪些列:
bash
undefined

For each downstream model, check what columns it uses

针对每个下游模型,检查它使用的列

cat models/<path>/<downstream_model>.sql | grep -E "model_name.\w+|alias.\w+"

If downstream models reference specific columns, you MUST ensure those columns remain available after refactoring.
cat models/<path>/<downstream_model>.sql | grep -E "model_name.\w+|alias.\w+"

如果下游模型引用了特定列,你必须确保这些列在重构后仍然可用。

4. Plan Refactoring Strategy

4. 制定重构策略

OpportunityStrategy
Long CTEExtract to intermediate model
Repeated logicCreate macro in
macros/
Complex joinSplit into intermediate models
Multiple concernsSeparate into focused models
重构机会策略
长CTE提取为中间模型
重复逻辑
macros/
目录下创建宏
复杂关联拆分为中间模型
多关注点拆分为专注单一职责的模型

5. Execute Refactoring

5. 执行重构

Pattern: Extract CTE to Model

模式:将CTE提取为模型

Before:
sql
-- orders.sql (200 lines)
with customer_metrics as (
    -- 50 lines of complex logic
),
order_enriched as (
    select ...
    from orders
    join customer_metrics on ...
)
select * from order_enriched
After:
sql
-- customer_metrics.sql (new file)
select
    customer_id,
    -- complex logic here
from {{ ref('customers') }}

-- orders.sql (simplified)
with order_enriched as (
    select ...
    from {{ ref('raw_orders') }} orders
    join {{ ref('customer_metrics') }} cm on ...
)
select * from order_enriched
重构前:
sql
-- orders.sql(200行)
with customer_metrics as (
    -- 50行复杂逻辑
),
order_enriched as (
    select ...
    from orders
    join customer_metrics on ...
)
select * from order_enriched
重构后:
sql
-- customer_metrics.sql(新文件)
select
    customer_id,
    -- 复杂逻辑写在这里
from {{ ref('customers') }}

-- orders.sql(简化后)
with order_enriched as (
    select ...
    from {{ ref('raw_orders') }} orders
    join {{ ref('customer_metrics') }} cm on ...
)
select * from order_enriched

Pattern: Extract to Macro

模式:提取为宏

Before (repeated in multiple models):
sql
case
    when amount < 0 then 'refund'
    when amount = 0 then 'zero'
    else 'positive'
end as amount_category
After:
sql
-- macros/categorize_amount.sql
{% macro categorize_amount(column_name) %}
case
    when {{ column_name }} < 0 then 'refund'
    when {{ column_name }} = 0 then 'zero'
    else 'positive'
end
{% endmacro %}

-- In models:
{{ categorize_amount('amount') }} as amount_category
重构前(在多个模型中重复出现):
sql
case
    when amount < 0 then 'refund'
    when amount = 0 then 'zero'
    else 'positive'
end as amount_category
重构后:
sql
-- macros/categorize_amount.sql
{% macro categorize_amount(column_name) %}
case
    when {{ column_name }} < 0 then 'refund'
    when {{ column_name }} = 0 then 'zero'
    else 'positive'
end
{% endmacro %}

-- 在模型中使用:
{{ categorize_amount('amount') }} as amount_category

6. Validate Changes

6. 验证更改

bash
undefined
bash
undefined

Compile to check syntax

编译以检查语法

dbt compile --select +model_name+
dbt compile --select +model_name+

Build entire lineage

构建整个依赖链

dbt build --select +model_name+
dbt build --select +model_name+

Check row counts (manual)

手动检查行数

Before: Record expected counts

重构前:记录预期行数

After: Verify counts match

重构后:验证行数是否匹配

undefined
undefined

7. Verify Output Matches Original

7. 验证输出与原版本一致

CRITICAL: Refactoring should not change output.
bash
undefined
关键:重构不应改变输出结果。
bash
undefined

Compare row counts before and after

比较重构前后的行数

dbt show --inline "select count(*) from {{ ref('model_name') }}"
dbt show --inline "select count(*) from {{ ref('model_name') }}"

Spot check key values

抽样检查关键值

dbt show --select <model_name> --limit 10
undefined
dbt show --select <model_name> --limit 10
undefined

8. Update Downstream Models

8. 更新下游模型

If changing output columns:
  1. Update all downstream refs
  2. Update schema.yml documentation
  3. Re-run downstream tests
如果更改了输出列:
  1. 更新所有下游引用
  2. 更新schema.yml文档
  3. 重新运行下游测试

Refactoring Checklist

重构检查清单

  • All downstream dependencies identified
  • User informed of impact scope
  • One change at a time
  • Compile passes after each change
  • Build passes after each change
  • Output validated (row counts match)
  • Documentation updated
  • Tests still pass
  • 已识别所有下游依赖关系
  • 已告知用户影响范围
  • 每次只进行一项更改
  • 每次更改后编译通过
  • 每次更改后构建通过
  • 已验证输出(行数匹配)
  • 已更新文档
  • 测试仍能通过

Common Refactoring Triggers

常见重构触发因素

SymptomRefactoring
Model > 200 linesExtract CTEs to models
Same logic in 3+ modelsExtract to macro
5+ joins in one modelCreate intermediate models
Hard to understandAdd CTEs with clear names
Slow performanceSplit to allow parallelization
症状重构方式
模型超过200行将CTE提取为独立模型
相同逻辑出现在3个及以上模型中提取为宏
单个模型中有5次及以上关联创建中间模型
难以理解添加命名清晰的CTE
性能缓慢拆分以支持并行处理

Anti-Patterns

反模式

  • Refactoring without checking downstream impact
  • Making multiple changes at once
  • Not validating output matches after refactoring
  • Extracting prematurely (wait for 3+ uses)
  • Breaking existing tests without updating them
  • 未检查下游影响就进行重构
  • 同时进行多项更改
  • 重构后未验证输出是否与原版本一致
  • 过早提取(等出现3次及以上使用场景再操作)
  • 破坏现有测试但未更新测试用例