refactoring-dbt-models
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
Chinesedbt Refactoring
dbt 重构
Find ALL downstream dependencies before changing. Refactor in small steps. Verify output after each change.
在进行更改前找出所有下游依赖关系。分步进行重构。每次更改后验证输出。
Workflow
工作流
1. Analyze Current Model
1. 分析当前模型
bash
cat models/<path>/<model_name>.sqlIdentify refactoring opportunities:
- CTEs longer than 50 lines → extract to intermediate model
- Logic repeated across models → extract to macro
- Multiple joins in sequence → split into steps
- Complex WHERE clauses → extract to staging filter
bash
cat models/<path>/<model_name>.sql识别重构机会:
- 超过50行的CTE → 提取为中间模型
- 跨模型重复的逻辑 → 提取为宏
- 连续多次关联 → 拆分为多个步骤
- 复杂的WHERE子句 → 提取为 staging 过滤器
2. Find All Downstream Dependencies
2. 找出所有下游依赖关系
CRITICAL: Never refactor without knowing impact.
bash
undefined关键:绝不要在不了解影响范围的情况下进行重构。
bash
undefinedGet full dependency tree (model and all its children)
获取完整依赖树(模型及其所有子模型)
dbt ls --select model_name+ --output list
dbt ls --select model_name+ --output list
Find all models referencing this one
找出所有引用此模型的模型
grep -r "ref('model_name')" models/ --include="*.sql"
**Report to user:** "Found X downstream models: [list]. These will be affected by changes."grep -r "ref('model_name')" models/ --include="*.sql"
**向用户报告:“发现X个下游模型:[列表]。这些模型会受到更改的影响。”**3. Check What Columns Downstream Models Use
3. 检查下游模型使用的列
BEFORE changing any columns, check what downstream models reference:
bash
undefined在更改任何列之前,检查下游模型引用了哪些列:
bash
undefinedFor each downstream model, check what columns it uses
针对每个下游模型,检查它使用的列
cat models/<path>/<downstream_model>.sql | grep -E "model_name.\w+|alias.\w+"
If downstream models reference specific columns, you MUST ensure those columns remain available after refactoring.cat models/<path>/<downstream_model>.sql | grep -E "model_name.\w+|alias.\w+"
如果下游模型引用了特定列,你必须确保这些列在重构后仍然可用。4. Plan Refactoring Strategy
4. 制定重构策略
| Opportunity | Strategy |
|---|---|
| Long CTE | Extract to intermediate model |
| Repeated logic | Create macro in |
| Complex join | Split into intermediate models |
| Multiple concerns | Separate into focused models |
| 重构机会 | 策略 |
|---|---|
| 长CTE | 提取为中间模型 |
| 重复逻辑 | 在 |
| 复杂关联 | 拆分为中间模型 |
| 多关注点 | 拆分为专注单一职责的模型 |
5. Execute Refactoring
5. 执行重构
Pattern: Extract CTE to Model
模式:将CTE提取为模型
Before:
sql
-- orders.sql (200 lines)
with customer_metrics as (
-- 50 lines of complex logic
),
order_enriched as (
select ...
from orders
join customer_metrics on ...
)
select * from order_enrichedAfter:
sql
-- customer_metrics.sql (new file)
select
customer_id,
-- complex logic here
from {{ ref('customers') }}
-- orders.sql (simplified)
with order_enriched as (
select ...
from {{ ref('raw_orders') }} orders
join {{ ref('customer_metrics') }} cm on ...
)
select * from order_enriched重构前:
sql
-- orders.sql(200行)
with customer_metrics as (
-- 50行复杂逻辑
),
order_enriched as (
select ...
from orders
join customer_metrics on ...
)
select * from order_enriched重构后:
sql
-- customer_metrics.sql(新文件)
select
customer_id,
-- 复杂逻辑写在这里
from {{ ref('customers') }}
-- orders.sql(简化后)
with order_enriched as (
select ...
from {{ ref('raw_orders') }} orders
join {{ ref('customer_metrics') }} cm on ...
)
select * from order_enrichedPattern: Extract to Macro
模式:提取为宏
Before (repeated in multiple models):
sql
case
when amount < 0 then 'refund'
when amount = 0 then 'zero'
else 'positive'
end as amount_categoryAfter:
sql
-- macros/categorize_amount.sql
{% macro categorize_amount(column_name) %}
case
when {{ column_name }} < 0 then 'refund'
when {{ column_name }} = 0 then 'zero'
else 'positive'
end
{% endmacro %}
-- In models:
{{ categorize_amount('amount') }} as amount_category重构前(在多个模型中重复出现):
sql
case
when amount < 0 then 'refund'
when amount = 0 then 'zero'
else 'positive'
end as amount_category重构后:
sql
-- macros/categorize_amount.sql
{% macro categorize_amount(column_name) %}
case
when {{ column_name }} < 0 then 'refund'
when {{ column_name }} = 0 then 'zero'
else 'positive'
end
{% endmacro %}
-- 在模型中使用:
{{ categorize_amount('amount') }} as amount_category6. Validate Changes
6. 验证更改
bash
undefinedbash
undefinedCompile to check syntax
编译以检查语法
dbt compile --select +model_name+
dbt compile --select +model_name+
Build entire lineage
构建整个依赖链
dbt build --select +model_name+
dbt build --select +model_name+
Check row counts (manual)
手动检查行数
Before: Record expected counts
重构前:记录预期行数
After: Verify counts match
重构后:验证行数是否匹配
undefinedundefined7. Verify Output Matches Original
7. 验证输出与原版本一致
CRITICAL: Refactoring should not change output.
bash
undefined关键:重构不应改变输出结果。
bash
undefinedCompare row counts before and after
比较重构前后的行数
dbt show --inline "select count(*) from {{ ref('model_name') }}"
dbt show --inline "select count(*) from {{ ref('model_name') }}"
Spot check key values
抽样检查关键值
dbt show --select <model_name> --limit 10
undefineddbt show --select <model_name> --limit 10
undefined8. Update Downstream Models
8. 更新下游模型
If changing output columns:
- Update all downstream refs
- Update schema.yml documentation
- Re-run downstream tests
如果更改了输出列:
- 更新所有下游引用
- 更新schema.yml文档
- 重新运行下游测试
Refactoring Checklist
重构检查清单
- All downstream dependencies identified
- User informed of impact scope
- One change at a time
- Compile passes after each change
- Build passes after each change
- Output validated (row counts match)
- Documentation updated
- Tests still pass
- 已识别所有下游依赖关系
- 已告知用户影响范围
- 每次只进行一项更改
- 每次更改后编译通过
- 每次更改后构建通过
- 已验证输出(行数匹配)
- 已更新文档
- 测试仍能通过
Common Refactoring Triggers
常见重构触发因素
| Symptom | Refactoring |
|---|---|
| Model > 200 lines | Extract CTEs to models |
| Same logic in 3+ models | Extract to macro |
| 5+ joins in one model | Create intermediate models |
| Hard to understand | Add CTEs with clear names |
| Slow performance | Split to allow parallelization |
| 症状 | 重构方式 |
|---|---|
| 模型超过200行 | 将CTE提取为独立模型 |
| 相同逻辑出现在3个及以上模型中 | 提取为宏 |
| 单个模型中有5次及以上关联 | 创建中间模型 |
| 难以理解 | 添加命名清晰的CTE |
| 性能缓慢 | 拆分以支持并行处理 |
Anti-Patterns
反模式
- Refactoring without checking downstream impact
- Making multiple changes at once
- Not validating output matches after refactoring
- Extracting prematurely (wait for 3+ uses)
- Breaking existing tests without updating them
- 未检查下游影响就进行重构
- 同时进行多项更改
- 重构后未验证输出是否与原版本一致
- 过早提取(等出现3次及以上使用场景再操作)
- 破坏现有测试但未更新测试用例