perf-hierarchical-context-parallel

Hierarchical Context Parallel Skill

This skill covers hierarchical context parallelism: nested context-parallel process groups used by `cp_comm_type="a2a+p2p"` and configured with `hierarchical_context_parallel_sizes`.
For what hierarchical CP is, when to use it, and the decision tree (`a2a+p2p` vs pure `a2a` vs `p2p`), see:
  • @docs/training/hierarchical-context-parallel.md
  • @skills/perf-hierarchical-context-parallel/card.yaml

Enablement

Minimal Bridge override:
python
cfg.model.context_parallel_size = 4
cfg.model.cp_comm_type = "a2a+p2p"
cfg.model.hierarchical_context_parallel_sizes = [2, 2]
cfg.dist.use_decentralized_pg = False
Required constraints:
  • prod(hierarchical_context_parallel_sizes) == context_parallel_size
  • seq_length % (2 * context_parallel_size) == 0
  • Transformer Engine >= 1.12.0
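
The two numeric constraints above can be checked before launching a job. The following is a minimal sketch, not part of Bridge or MCore: `check_hcp_constraints` is a hypothetical helper name introduced here for illustration, and it does not check the Transformer Engine version.

```python
from math import prod


def check_hcp_constraints(context_parallel_size, hcp_sizes, seq_length):
    """Return a list of violated constraints (empty if none).

    Hypothetical helper for illustration; not a Bridge or MCore API.
    """
    problems = []
    # Constraint 1: the product of the per-level sizes must equal the CP size.
    if prod(hcp_sizes) != context_parallel_size:
        problems.append(
            f"prod({hcp_sizes}) = {prod(hcp_sizes)} "
            f"!= context_parallel_size = {context_parallel_size}"
        )
    # Constraint 2: the sequence length must be divisible by 2 * CP size.
    if seq_length % (2 * context_parallel_size) != 0:
        problems.append(
            f"seq_length = {seq_length} is not divisible by "
            f"2 * context_parallel_size = {2 * context_parallel_size}"
        )
    return problems


print(check_hcp_constraints(4, [2, 2], 8192))  # no violations
print(check_hcp_constraints(4, [2, 4], 8190))  # both constraints violated
```

Running a check like this on the resolved config before launch surfaces a mismatch earlier than the assertion raised during initialization.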

Code Anchors

Upstream config and validation:
context_parallel_size: int = 1
"""Splits network input along sequence dimension across GPU ranks."""

hierarchical_context_parallel_sizes: Optional[list[int]] = None
"""Degrees of the hierarchical context parallelism. Users should provide a list to specify 
   the sizes for different levels. Taking the a2a+p2p cp comm type as example, it contains
   groups of two levels, so the first value of the list indicates the group size of the a2a
   communication type, and the second value indicates the group size of the p2p communication
   type.
"""
if args.hierarchical_context_parallel_sizes:
    from numpy import prod
    assert args.context_parallel_size == prod(args.hierarchical_context_parallel_sizes)
if "a2a+p2p" in args.cp_comm_type:
    assert args.hierarchical_context_parallel_sizes is not None, \
    "--hierarchical-context-parallel-sizes must be set when a2a+p2p is used in cp comm"
Bridge MPU path:
parallel_state.initialize_model_parallel(
    ...
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    ...
)
...
return ProcessGroupCollection.use_mpu_process_groups()
Bridge decentralized-PG path:
pg_collection = ProcessGroupCollection(
    ...
    cp=cp_pg,
    tp_cp=tp_cp_pg,
    hcp=None,
    ep=ep_pg,
    ...
)

Implementation Map

Config definition

`hierarchical_context_parallel_sizes` is declared in `ModelParallelConfig`:

3rdparty/Megatron-LM/megatron/core/model_parallel_config.py

hierarchical_context_parallel_sizes: Optional[list[int]] = None

For `a2a+p2p`, the first value is the a2a group size and the second value is the p2p group size.
The product of the values must equal `context_parallel_size`.


`cp_comm_type` is declared in `TransformerConfig`:

3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py

cp_comm_type: Optional[Union[str, List[str]]] = None

Can be per-layer (`List[str]`) or uniform (`str`).
Values: "p2p", "all_gather", "a2a", "a2a+p2p"

Validation (MCore)

`TransformerConfig.__post_init__` enforces that `a2a+p2p` requires HCP sizes and that their product matches the CP size.

Process group creation

When HCP sizes are provided, `parallel_state.initialize_model_parallel` creates hierarchical CP sub-groups via `create_hierarchical_groups`. Bridge currently gets those groups through the MPU-backed `ProcessGroupCollection`.
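
To make the idea of nested sub-groups concrete, here is a minimal pure-Python sketch of how one CP group's ranks could be split into per-level sub-groups. It is an illustration under stated assumptions, not the `create_hierarchical_groups` implementation: the function name `hierarchical_rank_groups` is hypothetical, and the layout (the first level groups consecutive ranks, each later level strides over the previous ones) is assumed for the example rather than confirmed by the source.

```python
from math import prod


def hierarchical_rank_groups(ranks, sizes):
    """Split one CP group's ranks into sub-groups for each hierarchy level.

    Hypothetical illustration. Assumed layout: level 0 groups consecutive
    ranks; each later level strides over all earlier levels.
    """
    if prod(sizes) != len(ranks):
        raise ValueError("product of sizes must equal the number of CP ranks")
    levels = []
    stride = 1  # distance between neighbouring ranks inside a sub-group
    for size in sizes:
        block = stride * size  # ranks spanned by one set of sub-groups
        groups = []
        for start in range(0, len(ranks), block):
            for offset in range(stride):
                groups.append(
                    [ranks[start + offset + i * stride] for i in range(size)]
                )
        levels.append(groups)
        stride = block
    return levels


a2a_groups, p2p_groups = hierarchical_rank_groups([0, 1, 2, 3], [2, 2])
print(a2a_groups)  # [[0, 1], [2, 3]]
print(p2p_groups)  # [[0, 2], [1, 3]]
```

With `context_parallel_size = 4` and sizes `[2, 2]`, every rank belongs to exactly one sub-group per level, which is why the product of the sizes has to equal the CP size.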

TE integration

`TEDotProductAttention` passes the hierarchical groups to Transformer Engine when `a2a+p2p` is used. Requires Transformer Engine >= 1.12.0.

Pitfalls

  1. Bridge HCP is MPU-only today: If `use_decentralized_pg=True`, Bridge initializes flat CP groups and leaves HCP unset.
  2. No checked-in Bridge recipe currently exercises HCP directly.
  3. Single-GPU load helpers clear `hierarchical_context_parallel_sizes`.
  4. Silently broken training on old stacks: If you use `a2a+p2p` without setting `hierarchical_context_parallel_sizes`, MCore now asserts. Older versions would silently disable CP communication, so each rank attended only to its local chunk, producing artificially high throughput with broken gradients.
  5. Product must match: `prod(hierarchical_context_parallel_sizes)` must exactly equal `context_parallel_size`. A mismatch triggers an assertion.
  6. Verify in logs: Look for the process group initialization output. You should see `HIERARCHICAL_CONTEXT_PARALLEL_GROUPS` being created. If you only see `CONTEXT_PARALLEL_GROUP`, HCP is not active.
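
Pitfalls 1 and 4 can be caught from the config values alone. The sketch below is a hypothetical pre-flight helper (the name `hcp_preflight` and its return format are assumptions, not Bridge code); it accepts `cp_comm_type` in either its uniform or per-layer form.

```python
def hcp_preflight(cp_comm_type, hcp_sizes, use_decentralized_pg):
    """Return warnings for HCP-related misconfigurations.

    Hypothetical helper for illustration; not a Bridge or MCore API.
    """
    # cp_comm_type may be None, a single string, or a per-layer list.
    if isinstance(cp_comm_type, str):
        comm_types = [cp_comm_type]
    else:
        comm_types = list(cp_comm_type or [])
    wants_hcp = "a2a+p2p" in comm_types

    warnings = []
    if wants_hcp and not hcp_sizes:
        warnings.append("a2a+p2p requires hierarchical_context_parallel_sizes")
    if wants_hcp and use_decentralized_pg:
        warnings.append("use_decentralized_pg=True leaves HCP unset (MPU-only today)")
    return warnings


print(hcp_preflight("a2a+p2p", [2, 2], use_decentralized_pg=False))
print(hcp_preflight(["p2p", "a2a+p2p"], None, use_decentralized_pg=True))
```

The first call returns an empty list; the second reports both the missing sizes and the decentralized-PG conflict.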

Verification

No dedicated Bridge end-to-end test exists yet for HCP (see `follow_up_validation` in @skills/perf-hierarchical-context-parallel/card.yaml). Use the existing unit tests and log inspection instead.
Run the decentralized-PG unit test to confirm the flat-CP behavior is preserved:
bash
uv run python -m pytest tests/unit_tests/training/test_decentralized_pg.py -q
For a manual smoke check, launch a 4-GPU run with a small recipe and `cp_comm_type=a2a+p2p` plus `hierarchical_context_parallel_sizes=[2,2]`:
bash
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
  scripts/training/run_recipe.py \
  --recipe llama32_1b_pretrain_config \
  model.context_parallel_size=4 \
  model.cp_comm_type=a2a+p2p \
  "model.hierarchical_context_parallel_sizes=[2,2]" \
  train.train_iters=2
Success criteria:
  • Logs show `HIERARCHICAL_CONTEXT_PARALLEL_GROUPS` being created
  • Training completes at least one step without error
  • If you only see `CONTEXT_PARALLEL_GROUP`, HCP is not active
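
The log inspection can be scripted. This is a minimal sketch that assumes only that the two group names above appear somewhere in the captured log text; the exact log line format is not specified here, and `classify_cp_groups` is a hypothetical helper. Because `CONTEXT_PARALLEL_GROUP` is a substring of `HIERARCHICAL_CONTEXT_PARALLEL_GROUPS`, the hierarchical marker has to be tested first.

```python
HCP_MARKER = "HIERARCHICAL_CONTEXT_PARALLEL_GROUPS"
FLAT_MARKER = "CONTEXT_PARALLEL_GROUP"


def classify_cp_groups(log_text):
    """Classify captured log text as 'hierarchical', 'flat', or 'unknown'.

    Hypothetical helper; it only does substring matching on the group names.
    """
    # Check the hierarchical marker first: the flat marker is a substring of it.
    if HCP_MARKER in log_text:
        return "hierarchical"
    if FLAT_MARKER in log_text:
        return "flat"
    return "unknown"


# Placeholder strings, not real log lines.
print(classify_cp_groups(f"placeholder mentioning {HCP_MARKER}"))
print(classify_cp_groups(f"placeholder mentioning {FLAT_MARKER}"))
```

A result of "flat" or "unknown" on a run configured with `a2a+p2p` indicates that HCP is not active.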