perf-sequence-packing
Sequence Packing Skill
For stable background and recommendation level, see:
- @docs/training/packed-sequences.md
- @skills/perf-sequence-packing/card.yaml
Enablement
Offline packed SFT for LLM finetuning:
```python
from megatron.bridge.data.datasets.packed_sequence import PackedSequenceSpecs

# Offline packed SFT requires a micro-batch size of 1.
cfg.train.micro_batch_size = 1
cfg.dataset.seq_length = 4096
cfg.model.seq_length = 4096
cfg.dataset.dataset_kwargs = {"pad_to_max_length": True}
cfg.dataset.packed_sequence_specs = PackedSequenceSpecs(
    packed_sequence_size=4096,
    pad_seq_to_mult=1,
)
```
If CP is enabled:
```python
cfg.model.context_parallel_size = 2
# Finetuning with CP requires per-token loss and no averaging in the collective.
cfg.model.calculate_per_token_loss = True
cfg.ddp.average_in_collective = False
# Pad each packed sequence to a multiple of 2 * CP.
cfg.dataset.packed_sequence_specs.pad_seq_to_mult = cfg.model.context_parallel_size * 2
```
If `sequence_parallel` is also enabled, use `lcm(2 * CP, CP * TP)`:
```python
import math

# CP and TP are the context-parallel and tensor-parallel sizes.
cfg.dataset.packed_sequence_specs.pad_seq_to_mult = math.lcm(2 * CP, CP * TP)
```
See src/megatron/bridge/training/vlm_step.py for reference logic.
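The padding rules above can be collected into one small helper. This is a minimal sketch and is not copied from `vlm_step.py`; the function name `pad_seq_to_mult_for` and its arguments are hypothetical. The text above only states the rule for runs with CP enabled, so the sketch refuses to guess for sequence parallelism without CP.

```python
import math


def pad_seq_to_mult_for(cp: int, tp: int = 1, sequence_parallel: bool = False) -> int:
    """Return a padding multiple following the rules stated above (hypothetical helper)."""
    if cp <= 1:
        if sequence_parallel:
            # Not covered by this document, so do not guess a value.
            raise NotImplementedError("sequence_parallel without CP is not covered here")
        # Matches the no-CP example above, which uses pad_seq_to_mult=1.
        return 1
    if sequence_parallel:
        # CP and sequence parallelism together: lcm(2 * CP, CP * TP).
        return math.lcm(2 * cp, cp * tp)
    # CP only: 2 * CP.
    return 2 * cp
```

For example, `pad_seq_to_mult_for(2)` gives 4, and `pad_seq_to_mult_for(2, tp=4, sequence_parallel=True)` gives `lcm(4, 8) = 8`.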
If CUDA graphs are enabled for this packed path:
```python
cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True
cfg.dataset.dataset_kwargs["pad_to_max_length"] = True
```
Note: `pad_cu_seqlens = True` also requires a metadata JSON file alongside
the packed dataset (asserted in `src/megatron/bridge/data/datasets/sft.py`).
Custom packed datasets that omit the metadata file will hit an assertion at
dataset initialization.
In-batch packing for VLM finetuning:
```python
cfg.dataset.pack_sequences_in_batch = True
cfg.train.micro_batch_size = 2
```
Long-context baseline:
```python
cfg.model.seq_length = 16384
cfg.dataset.seq_length = 16384
cfg.model.context_parallel_size = 2
```
Code Anchors
LLM packed SFT config surface:
```python
if packed_sequence:
    dataset_kwargs = {"pad_to_max_length": True}
    packed_sequence_specs = PackedSequenceSpecs(packed_sequence_size=seq_length, pad_seq_to_mult=pad_seq_to_mult)
else:
    dataset_kwargs = {}
    packed_sequence_specs = None
```
Bridge validation:
```python
if self.model.context_parallel_size > 1:
    assert self.model.seq_length % (self.model.context_parallel_size * 2) == 0, ...
    if isinstance(self.dataset, FinetuningDatasetConfig):
        assert self.model.calculate_per_token_loss, ...
        assert not self.ddp.average_in_collective, ...
...
if ... packed_sequence_size > 0 and self.train.micro_batch_size > 1:
    raise ValueError(...)
...
if getattr(self.dataset, "pack_sequences_in_batch", False) and self.train.micro_batch_size == 1:
    raise ValueError(...)
```
VLM in-batch runtime:
```python
if enable_packing:
    ...
    ) = pack_batch_sequences(
        ...
        pad_token_id=0,
        pad_to_multiple_of=cp_size * 2 if cp_size > 1 else 1,
    )
```
Packed THD runtime constraint:
```python
if cu_seqlens.dim() > 1 and cu_seqlens.size(0) != 1:
    raise ValueError("Packed THD batches expect micro-batch size 1 for context-parallel slicing (THD layout)")
```
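To make the `pad_token_id` and `pad_to_multiple_of` arguments in the VLM in-batch anchor more concrete, here is a minimal pure-Python sketch of packing a micro-batch into one row with cumulative sequence boundaries. It is not the library's `pack_batch_sequences`; the name `pack_batch_sketch` is hypothetical. The sketch also assumes that each sequence is padded individually to the multiple, which the excerpt above does not confirm.

```python
def pack_batch_sketch(sequences, pad_token_id=0, pad_to_multiple_of=1):
    """Concatenate token lists into one packed row and record boundaries.

    Returns (packed_tokens, cu_seqlens), where cu_seqlens[i] is the start
    offset of sequence i in the packed row and cu_seqlens[-1] is the total length.
    """
    packed = []
    cu_seqlens = [0]
    for seq in sequences:
        # Round this sequence's length up to the next multiple (assumption).
        padded_len = -(-len(seq) // pad_to_multiple_of) * pad_to_multiple_of
        packed.extend(seq)
        packed.extend([pad_token_id] * (padded_len - len(seq)))
        cu_seqlens.append(len(packed))
    return packed, cu_seqlens


# With CP = 2 the anchor above pads to a multiple of cp_size * 2 = 4.
tokens, boundaries = pack_batch_sketch([[1, 2, 3], [4, 5]], pad_to_multiple_of=4)
```

Here `tokens` is `[1, 2, 3, 0, 4, 5, 0, 0]` and `boundaries` is `[0, 4, 8]`. Because the batch collapses into a single packed row, in-batch packing only pays off when the micro-batch holds more than one sequence.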
Pitfalls
- Offline packed SFT and VLM in-batch packing are different features with opposite micro-batch rules.
- When CP is enabled, packed sequence lengths must respect `2 * context_parallel_size` divisibility.
- For finetuning with CP, `calculate_per_token_loss=True` and `ddp.average_in_collective=False` are required.
- `pad_cu_seqlens=True` also requires `pad_to_max_length=True`.
- Packing support is model-family-specific. `Qwen3-Next`, `GLM-4.5`, and `Qwen3.5-VL` contain explicit opt-outs in different paths.
- MTP finetuning is documented as incompatible with packed sequences.
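The micro-batch and CP rules in this list can be checked before launching a run. The following is a minimal sketch that mirrors the rules on a plain dataclass. `PackingSettings` and `packing_problems` are hypothetical names, not Bridge classes, and the real validation lives in the Bridge config code rather than here.

```python
from dataclasses import dataclass


@dataclass
class PackingSettings:
    """Hypothetical flat view of the settings discussed above (not a library class)."""
    seq_length: int
    context_parallel_size: int = 1
    micro_batch_size: int = 1
    packed_sequence_size: int = 0  # > 0 means offline packed SFT
    pack_sequences_in_batch: bool = False
    is_finetuning: bool = True
    calculate_per_token_loss: bool = False
    average_in_collective: bool = True


def packing_problems(s: PackingSettings) -> list[str]:
    """Return human-readable violations of the rules listed above."""
    problems = []
    cp = s.context_parallel_size
    if cp > 1:
        if s.seq_length % (2 * cp) != 0:
            problems.append(f"seq_length must be divisible by 2 * CP = {2 * cp}")
        if s.is_finetuning:
            if not s.calculate_per_token_loss:
                problems.append("finetuning with CP needs calculate_per_token_loss=True")
            if s.average_in_collective:
                problems.append("finetuning with CP needs ddp.average_in_collective=False")
    if s.packed_sequence_size > 0 and s.micro_batch_size > 1:
        problems.append("offline packed SFT needs micro_batch_size == 1")
    if s.pack_sequences_in_batch and s.micro_batch_size == 1:
        problems.append("in-batch packing needs micro_batch_size > 1")
    return problems
```

An empty list means none of these rules is violated. It does not cover the model-family opt-outs or the MTP incompatibility, which depend on the model being trained.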
Verification
Use the checked-in unit coverage:
```bash
uv run python -m pytest tests/unit_tests/training/utils/test_packed_seq_utils.py -v && \
uv run python -m pytest tests/unit_tests/training/test_config.py -k "packed_sequence or pack_sequences_in_batch or context_parallel_seq_length_divisibility or context_parallel_finetuning_validations" -v && \
uv run python -m pytest tests/unit_tests/training/test_vlm_step.py -k "enable_packing" -v
```
Success criteria:
- first command reports `8 passed`
- second command reports `14 passed`
- third command reports `2 passed`
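If the three commands are run separately and their output is captured, the success criteria can be checked mechanically. This is a small sketch under that assumption. `passed_count` and `meets_criteria` are hypothetical helpers that only parse pytest's standard `N passed` summary text.

```python
import re


def passed_count(pytest_output: str) -> int:
    """Return N from the last 'N passed' in pytest output, or 0 if absent."""
    matches = re.findall(r"(\d+) passed", pytest_output)
    return int(matches[-1]) if matches else 0


def meets_criteria(outputs: list[str], expected: tuple[int, ...] = (8, 14, 2)) -> bool:
    """Compare per-command pass counts against the success criteria above."""
    return len(outputs) == len(expected) and all(
        passed_count(out) == want for out, want in zip(outputs, expected)
    )
```

The expected counts are the ones listed above and will need updating whenever tests are added to or removed from those files.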