perf-sequence-packing

Sequence Packing Skill

For stable background and recommendation level, see:

@docs/training/packed-sequences.md
@skills/perf-sequence-packing/card.yaml

Enablement

Offline packed SFT for LLM finetuning:

python

from megatron.bridge.data.datasets.packed_sequence import PackedSequenceSpecs

cfg.train.micro_batch_size = 1
cfg.dataset.seq_length = 4096
cfg.model.seq_length = 4096
cfg.dataset.dataset_kwargs = {"pad_to_max_length": True}
cfg.dataset.packed_sequence_specs = PackedSequenceSpecs(
    packed_sequence_size=4096,
    pad_seq_to_mult=1,
)
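
To make the two `PackedSequenceSpecs` fields concrete, here is a minimal sketch of offline packing. It is an illustration only, not Megatron Bridge's packing code: the helper names `round_up` and `first_fit_pack` and the first-fit strategy are assumptions chosen for brevity. Each sample length is rounded up to `pad_seq_to_mult`, then placed into the first pack that still has room under `packed_sequence_size`.

```python
from typing import List


def round_up(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple."""
    return ((n + multiple - 1) // multiple) * multiple


def first_fit_pack(
    lengths: List[int], packed_sequence_size: int, pad_seq_to_mult: int = 1
) -> List[List[int]]:
    """Group padded sample lengths into packs no longer than packed_sequence_size.

    Hypothetical helper for illustration; the real packer may use another strategy.
    """
    packs: List[List[int]] = []
    for length in lengths:
        padded = round_up(length, pad_seq_to_mult)
        if padded > packed_sequence_size:
            raise ValueError(f"sample of length {length} does not fit in a pack")
        for pack in packs:
            # Reuse the first pack that still has enough free space.
            if sum(pack) + padded <= packed_sequence_size:
                pack.append(padded)
                break
        else:
            # No existing pack had room, so start a new one.
            packs.append([padded])
    return packs


packs = first_fit_pack([1500, 3000, 900, 2500, 1000], packed_sequence_size=4096)
print(packs)  # [[1500, 900, 1000], [3000], [2500]]
```

Each pack becomes a single training sample, which is consistent with the `micro_batch_size = 1` setting above.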

If CP is enabled:

python

cfg.model.context_parallel_size = 2
cfg.model.calculate_per_token_loss = True
cfg.ddp.average_in_collective = False
cfg.dataset.packed_sequence_specs.pad_seq_to_mult = cfg.model.context_parallel_size * 2

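If sequence_parallel is also enabled, use lcm(2 * CP, CP * TP):

import math

# CP and TP are the context-parallel and tensor-parallel sizes.
CP = cfg.model.context_parallel_size
TP = cfg.model.tensor_model_parallel_size
cfg.dataset.packed_sequence_specs.pad_seq_to_mult = math.lcm(2 * CP, CP * TP)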

See src/megatron/bridge/training/vlm_step.py for reference logic.
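
The padding rules in this section can be collected into one helper. This is a sketch: `packed_pad_multiple` is a hypothetical name, not a Megatron Bridge function, and it only restates the rules given here (1 without CP, 2 * CP with CP, and lcm(2 * CP, CP * TP) when sequence parallelism is also on). The behavior for CP = 1 is an assumption, since this section only specifies padding for the CP case.

```python
import math


def packed_pad_multiple(cp: int, tp: int = 1, sequence_parallel: bool = False) -> int:
    """Return a pad_seq_to_mult value following the rules stated in this section.

    Hypothetical helper; cp and tp are the context- and tensor-parallel sizes.
    """
    if cp <= 1:
        # Assumption: without CP the default multiple of 1 is kept.
        return 1
    if sequence_parallel:
        return math.lcm(2 * cp, cp * tp)
    return 2 * cp


print(packed_pad_multiple(cp=2))                                # 4
print(packed_pad_multiple(cp=2, tp=4, sequence_parallel=True))  # 8
print(packed_pad_multiple(cp=2, tp=3, sequence_parallel=True))  # 12
```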

If CUDA graphs are enabled for this packed path:

```python
cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True
cfg.dataset.dataset_kwargs["pad_to_max_length"] = True
```

Note: `pad_cu_seqlens = True` also requires a metadata JSON file alongside the packed dataset (asserted in `src/megatron/bridge/data/datasets/sft.py`). Custom packed datasets that omit the metadata file will hit an assertion at dataset initialization.

In-batch packing for VLM finetuning:

python

cfg.dataset.pack_sequences_in_batch = True
cfg.train.micro_batch_size = 2

Long-context baseline:

python

cfg.model.seq_length = 16384
cfg.dataset.seq_length = 16384
cfg.model.context_parallel_size = 2


Code Anchors

LLM packed SFT config surface:

if packed_sequence:
    dataset_kwargs = {"pad_to_max_length": True}
    packed_sequence_specs = PackedSequenceSpecs(packed_sequence_size=seq_length, pad_seq_to_mult=pad_seq_to_mult)
else:
    dataset_kwargs = {}
    packed_sequence_specs = None

Bridge validation:

if self.model.context_parallel_size > 1:
    assert self.model.seq_length % (self.model.context_parallel_size * 2) == 0, ...
    if isinstance(self.dataset, FinetuningDatasetConfig):
        assert self.model.calculate_per_token_loss, ...
        assert not self.ddp.average_in_collective, ...
...
if ... packed_sequence_size > 0 and self.train.micro_batch_size > 1:
    raise ValueError(...)
...
if getattr(self.dataset, "pack_sequences_in_batch", False) and self.train.micro_batch_size == 1:
    raise ValueError(...)
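
The checks above can be mirrored in a standalone pre-flight function. This is a sketch under stated assumptions: `PackingSettings` and `find_problems` are hypothetical names, the dataclass is a flattened stand-in for the real config object, its defaults are chosen for the example, and the elided parts of the original conditions (`...`) are not reproduced.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PackingSettings:
    """Hypothetical flat view of the fields the checks above read."""

    seq_length: int
    micro_batch_size: int
    context_parallel_size: int = 1
    packed_sequence_size: int = 0  # > 0 means offline packed SFT
    pack_sequences_in_batch: bool = False
    calculate_per_token_loss: bool = False
    average_in_collective: bool = True  # assumed default for this sketch
    finetuning: bool = True


def find_problems(s: PackingSettings) -> List[str]:
    """Return readable violations instead of raising on the first one."""
    problems = []
    cp = s.context_parallel_size
    if cp > 1:
        if s.seq_length % (2 * cp) != 0:
            problems.append(f"seq_length must be divisible by 2 * CP = {2 * cp}")
        if s.finetuning and not s.calculate_per_token_loss:
            problems.append("calculate_per_token_loss must be True")
        if s.finetuning and s.average_in_collective:
            problems.append("ddp.average_in_collective must be False")
    if s.packed_sequence_size > 0 and s.micro_batch_size > 1:
        problems.append("offline packing requires micro_batch_size == 1")
    if s.pack_sequences_in_batch and s.micro_batch_size == 1:
        problems.append("in-batch packing requires micro_batch_size > 1")
    return problems


ok = PackingSettings(seq_length=4096, micro_batch_size=1, context_parallel_size=2,
                     packed_sequence_size=4096, calculate_per_token_loss=True,
                     average_in_collective=False)
bad = PackingSettings(seq_length=4098, micro_batch_size=2, context_parallel_size=2,
                      packed_sequence_size=4098)
print(find_problems(ok))        # []
print(len(find_problems(bad)))  # 4
```

Collecting every violation in one pass is a design choice of this sketch; the validation shown above fails on the first one.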

VLM in-batch runtime:

if enable_packing:
    ...
    ) = pack_batch_sequences(
        ...
        pad_token_id=0,
        pad_to_multiple_of=cp_size * 2 if cp_size > 1 else 1,
    )

Packed THD runtime constraint:

if cu_seqlens.dim() > 1 and cu_seqlens.size(0) != 1:
    raise ValueError("Packed THD batches expect micro-batch size 1 for context-parallel slicing (THD layout)")
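
For context, `cu_seqlens` holds the cumulative boundaries of the sub-sequences inside one packed sample. The sketch below uses plain Python lists rather than tensors, and `build_cu_seqlens` is a hypothetical helper, not part of the runtime. It assumes each sub-sequence is padded to the given multiple before packing, mirroring the `pad_to_multiple_of=cp_size * 2` argument shown earlier.

```python
from itertools import accumulate
from typing import List


def build_cu_seqlens(lengths: List[int], pad_to_multiple_of: int = 1) -> List[int]:
    """Cumulative boundaries [0, l0, l0 + l1, ...] after padding each length."""
    # Ceiling division rounds each length up to the requested multiple.
    padded = [-(-n // pad_to_multiple_of) * pad_to_multiple_of for n in lengths]
    return [0] + list(accumulate(padded))


cp_size = 2
cu_seqlens = build_cu_seqlens([5, 11, 8], pad_to_multiple_of=cp_size * 2)
print(cu_seqlens)  # [0, 8, 20, 28]
```

A single one-dimensional vector like this describes one packed micro-batch, which is the shape the check above accepts.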

Pitfalls

Offline packed SFT and VLM in-batch packing are different features with opposite micro-batch rules: offline packing requires `micro_batch_size` of 1, while in-batch packing requires `micro_batch_size` greater than 1.

When CP is enabled, packed sequence lengths must be divisible by `2 * context_parallel_size`.

For finetuning with CP, `calculate_per_token_loss=True` and `ddp.average_in_collective=False` are required.

`pad_cu_seqlens=True` also requires `pad_to_max_length=True`.

Packing support is model-family-specific. `Qwen3-Next`, `GLM-4.5`, and `Qwen3.5-VL` contain explicit opt-outs in different paths.

MTP finetuning is documented as incompatible with packed sequences.

Verification

Use the checked-in unit coverage:

bash

uv run python -m pytest tests/unit_tests/training/utils/test_packed_seq_utils.py -v && \
uv run python -m pytest tests/unit_tests/training/test_config.py -k "packed_sequence or pack_sequences_in_batch or context_parallel_seq_length_divisibility or context_parallel_finetuning_validations" -v && \
uv run python -m pytest tests/unit_tests/training/test_vlm_step.py -k "enable_packing" -v

Success criteria:

The first command reports `8 passed`, the second reports `14 passed`, and the third reports `2 passed`.
