nemo-mbridge-perf-hierarchical-context-parallel

Hierarchical Context Parallel Skill

This skill covers hierarchical context parallelism (HCP): nested context-parallel process groups used by `cp_comm_type="a2a+p2p"` and configured with `hierarchical_context_parallel_sizes`.

For what hierarchical CP is, when to use it, and the decision tree (`a2a+p2p` vs pure `a2a` / `p2p`), see:

@docs/training/hierarchical-context-parallel.md
@skills/nemo-mbridge-perf-hierarchical-context-parallel/card.yaml

Enablement

Minimal Bridge override:

```python
cfg.model.context_parallel_size = 4
cfg.model.cp_comm_type = "a2a+p2p"
cfg.model.hierarchical_context_parallel_sizes = [2, 2]
cfg.dist.use_decentralized_pg = False
```

Required constraints:

- `prod(hierarchical_context_parallel_sizes) == context_parallel_size`
- `seq_length % (2 * context_parallel_size) == 0`
- Transformer Engine `>= 1.12.0`
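
The first two constraints can be checked before launching a job. The following is a minimal sketch, not part of Bridge or MCore: the helper name `check_hcp_constraints` and its return shape are assumptions made for illustration, and only the two arithmetic rules listed above are encoded.

```python
from math import prod
from typing import Optional, Sequence


def check_hcp_constraints(
    context_parallel_size: int,
    hierarchical_sizes: Optional[Sequence[int]],
    seq_length: int,
) -> list[str]:
    """Return a list of violated constraints (empty means the settings look consistent).

    Hypothetical pre-flight helper; it mirrors the constraints listed in this
    skill and does not call into Bridge, MCore, or Transformer Engine.
    """
    problems = []
    if hierarchical_sizes is None:
        problems.append("hierarchical_context_parallel_sizes must be set for a2a+p2p")
    elif prod(hierarchical_sizes) != context_parallel_size:
        problems.append(
            f"prod({list(hierarchical_sizes)}) = {prod(hierarchical_sizes)} "
            f"!= context_parallel_size = {context_parallel_size}"
        )
    if seq_length % (2 * context_parallel_size) != 0:
        problems.append(
            f"seq_length = {seq_length} is not divisible by "
            f"2 * context_parallel_size = {2 * context_parallel_size}"
        )
    return problems
```

With the override shown above, `check_hcp_constraints(4, [2, 2], 8192)` returns an empty list, while `[2, 4]` with the same CP size reports a product mismatch.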

Code Anchors

Upstream config and validation:

```python
context_parallel_size: int = 1
"""Splits network input along sequence dimension across GPU ranks."""

hierarchical_context_parallel_sizes: Optional[list[int]] = None
"""Degrees of the hierarchical context parallelism. Users should provide a list to specify
   the sizes for different levels. Taking the a2a+p2p cp comm type as example, it contains
   groups of two levels, so the first value of the list indicates the group size of the a2a
   communication type, and the second value indicates the group size of the p2p communication
   type.
"""
```

```python
if args.hierarchical_context_parallel_sizes:
    from numpy import prod
    assert args.context_parallel_size == prod(args.hierarchical_context_parallel_sizes)
if "a2a+p2p" in args.cp_comm_type:
    assert args.hierarchical_context_parallel_sizes is not None, \
    "--hierarchical-context-parallel-sizes must be set when a2a+p2p is used in cp comm"
```

Bridge MPU path:

```python
parallel_state.initialize_model_parallel(
    ...
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    ...
)
...
return ProcessGroupCollection.use_mpu_process_groups()
```

Bridge decentralized-PG path:

```python
pg_collection = ProcessGroupCollection(
    ...
    cp=cp_pg,
    tp_cp=tp_cp_pg,
    hcp=None,
    ep=ep_pg,
    ...
)
```

Implementation Map

The code anchors above show the config declarations and argument validation.

Validation (MCore)

`TransformerConfig.__post_init__` enforces that `a2a+p2p` requires HCP sizes and that their product matches the CP size.

Process group creation

When HCP sizes are provided, `parallel_state.initialize_model_parallel` creates the hierarchical CP sub-groups via `create_hierarchical_groups`. Bridge currently gets those groups through the MPU-backed `ProcessGroupCollection`.
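
To build intuition for what "nested" groups means, here is a pure-Python sketch of one plausible partitioning. It assumes that the first level (a2a) groups adjacent ranks and each later level strides across the groups of the previous one. The function `hierarchical_rank_groups` is hypothetical and is not the MCore implementation; the actual layout is whatever `create_hierarchical_groups` produces.

```python
from math import prod
from typing import Sequence


def hierarchical_rank_groups(
    ranks: Sequence[int], sizes: Sequence[int]
) -> list[list[list[int]]]:
    """Partition one CP group's ranks into nested sub-groups, one list per level.

    Assumption for illustration only: level 0 takes adjacent ranks, and each
    later level steps over the ranks already combined by earlier levels.
    """
    if len(ranks) != prod(sizes):
        raise ValueError("prod(sizes) must equal the number of CP ranks")
    levels = []
    stride = 1  # distance between neighbouring members at the current level
    for size in sizes:
        block = stride * size  # ranks covered once this level is combined
        groups = []
        for start in range(0, len(ranks), block):
            for offset in range(stride):
                groups.append(
                    [ranks[start + offset + i * stride] for i in range(size)]
                )
        levels.append(groups)
        stride = block
    return levels
```

Under this assumption, four CP ranks with sizes `[2, 2]` give a2a groups `[0, 1]` and `[2, 3]`, and p2p groups `[0, 2]` and `[1, 3]`, so every rank belongs to exactly one group per level.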

TE integration

`TEDotProductAttention` passes the hierarchical groups to Transformer Engine when `a2a+p2p` is used. Requires Transformer Engine >= 1.12.0.
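
Version strings must be compared numerically, since a plain string comparison would rank `"1.9.0"` above `"1.12.0"`. The sketch below shows one way to test the minimum version using only the standard library. The helper names are hypothetical, and pre-release suffixes such as `rc1` are ignored, which is a simplification.

```python
import re

MIN_TE_VERSION = (1, 12, 0)  # minimum Transformer Engine version stated in this skill


def parse_release(version: str) -> tuple[int, ...]:
    """Extract the leading numeric components, e.g. '2.1.0+abc' -> (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def te_supports_hcp(version: str) -> bool:
    """Return True if the given version string meets the stated minimum."""
    parts = parse_release(version)
    padded = parts + (0,) * (3 - len(parts))  # '1.12' is treated as '1.12.0'
    return padded[:3] >= MIN_TE_VERSION
```

The input string would typically come from the installed Transformer Engine package; how it is obtained is left to the caller.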

Pitfalls

- Bridge HCP is MPU-only today: If `use_decentralized_pg=True`, Bridge initializes flat CP groups and leaves HCP unset.
- No checked-in Bridge recipe currently exercises HCP directly.
- Single-GPU load helpers clear `hierarchical_context_parallel_sizes`.
- Silent broken training on old stacks: If you use `a2a+p2p` without setting `hierarchical_context_parallel_sizes`, MCore now asserts. Older versions would silently disable CP communication, so each rank attended only to its local chunk and produced artificially high throughput with broken gradients.
- Product must match: `prod(hierarchical_context_parallel_sizes)` must exactly equal `context_parallel_size`. A mismatch triggers an assertion.
- Verify in logs: Look for the process group initialization output. You should see `HIERARCHICAL_CONTEXT_PARALLEL_GROUPS` being created. If you only see `CONTEXT_PARALLEL_GROUP`, HCP is not active.
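
The MPU-only pitfall can be turned into an explicit guard that fails fast instead of training with flat CP groups. This is a sketch under assumptions: `assert_hcp_reachable` is a hypothetical helper, the config is mocked with `SimpleNamespace` rather than a real Bridge config object, and it allows `cp_comm_type` to be either a single string or a list of strings.

```python
from types import SimpleNamespace


def uses_hierarchical_cp(cp_comm_type) -> bool:
    """True if the CP comm type (a string or a list of strings) includes a2a+p2p."""
    if cp_comm_type is None:
        return False
    if isinstance(cp_comm_type, str):
        return cp_comm_type == "a2a+p2p"
    return "a2a+p2p" in cp_comm_type


def assert_hcp_reachable(cfg) -> None:
    """Raise if a2a+p2p is requested on the decentralized-PG path.

    Hypothetical guard: only the field names shown in the override
    (cfg.model.cp_comm_type, cfg.dist.use_decentralized_pg) are assumed.
    """
    if uses_hierarchical_cp(cfg.model.cp_comm_type) and cfg.dist.use_decentralized_pg:
        raise ValueError(
            "cp_comm_type includes 'a2a+p2p' but use_decentralized_pg=True; "
            "Bridge would initialize flat CP groups and leave HCP unset"
        )


# Stand-in config mirroring the minimal override; not a real Bridge object.
example_cfg = SimpleNamespace(
    model=SimpleNamespace(cp_comm_type="a2a+p2p"),
    dist=SimpleNamespace(use_decentralized_pg=False),
)
assert_hcp_reachable(example_cfg)  # passes: the MPU path is selected
```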

Verification

No dedicated Bridge end-to-end test exists yet for HCP (see `follow_up_validation` in @skills/nemo-mbridge-perf-hierarchical-context-parallel/card.yaml). Use the existing unit tests and log inspection instead.

Run the decentralized-PG unit test to confirm the flat-CP behavior is preserved:

```bash
uv run python -m pytest tests/unit_tests/training/test_decentralized_pg.py -q
```

For a manual smoke check, launch a 4-GPU run with a small recipe and `cp_comm_type=a2a+p2p` plus `hierarchical_context_parallel_sizes=[2,2]`:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
  scripts/training/run_recipe.py \
  --recipe llama32_1b_pretrain_config \
  model.context_parallel_size=4 \
  model.cp_comm_type=a2a+p2p \
  "model.hierarchical_context_parallel_sizes=[2,2]" \
  train.train_iters=2
```

Success criteria:

- Logs show `HIERARCHICAL_CONTEXT_PARALLEL_GROUPS` being created
- Training completes at least one step without error
- If you only see `CONTEXT_PARALLEL_GROUP`, HCP is not active
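
The log check can be scripted. One detail matters: `CONTEXT_PARALLEL_GROUP` is a substring of `HIERARCHICAL_CONTEXT_PARALLEL_GROUPS`, so a naive search for the flat marker also matches hierarchical runs. The sketch below checks the hierarchical marker first. The function name and return values are hypothetical, and it assumes only that the two marker strings named above appear somewhere in the captured log text.

```python
HCP_MARKER = "HIERARCHICAL_CONTEXT_PARALLEL_GROUPS"
FLAT_MARKER = "CONTEXT_PARALLEL_GROUP"


def hcp_status(log_text: str) -> str:
    """Classify a training log as 'hierarchical', 'flat', or 'unknown'.

    The hierarchical marker is tested first because the flat marker is a
    substring of it and would otherwise match in both cases.
    """
    if HCP_MARKER in log_text:
        return "hierarchical"
    if FLAT_MARKER in log_text:
        return "flat"
    return "unknown"
```

A result of `"flat"` corresponds to the failure case in the criteria above, and `"unknown"` suggests the process group initialization output was not captured at all.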
