# NeMo Curator - GPU-Accelerated Data Curation

NVIDIA's toolkit for preparing high-quality training data for LLMs.

## When to use NeMo Curator

Use NeMo Curator when:

- Preparing LLM training data from web scrapes (Common Crawl)
- Deduplicating large corpora fast (16× faster than CPU)
- Curating multi-modal datasets (text, images, video, audio)
- Filtering low-quality or toxic content
- Scaling data processing across a GPU cluster

Performance:

- 16× faster fuzzy deduplication (8TB RedPajama v2)
- 40% lower TCO than CPU alternatives
- Near-linear scaling across GPU nodes

Use alternatives instead:

- datatrove: CPU-based, open-source data processing
- dolma: Allen AI's data toolkit
- Ray Data: general ML data processing (no curation focus)

## Quick start

### Installation

```bash
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"
```

### Basic text curation pipeline

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.datasets import DocumentDataset
import pandas as pd

# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset(df)

# Quality filtering
def quality_score(doc):
    return len(doc["text"].split()) > 5  # Filter short docs

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
from nemo_curator.modules import ExactDuplicates
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")
```

## Data curation pipeline

### Stage 1: Quality filtering

```python
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter
)

# Apply 30+ heuristic filters
from nemo_curator import ScoreFilter

# Word count filter
dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))

# Remove repetitive content
dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))

# URL ratio filter
dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
```

### Stage 2: Deduplication

**Exact deduplication**:

```python
from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
```

**Fuzzy deduplication** (16× faster on GPU):

```python
from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,  # MinHash parameters
    num_buckets=20,
    hash_method="md5"
)
deduped = fuzzy_dedup(dataset)
```

**Semantic deduplication**:

```python
from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8  # Cosine similarity threshold
)
deduped = semantic_dedup(dataset)
```
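For intuition on what `num_hashes` and `num_buckets` control, here is a minimal, library-independent sketch of the MinHash + LSH idea (an illustration of the technique, not NeMo Curator's actual implementation): near-duplicate documents share most of their shingles, so they agree on most MinHash values, and splitting the signature into bands means they collide in at least one band with high probability.

```python
import hashlib

def shingles(text, k=5):
    # Character k-shingles of a document
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(text, num_hashes=260):
    # One value per seed: the minimum hash over all shingles
    sh = shingles(text)
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in sh)
        for seed in range(num_hashes)
    ]

def bands(signature, num_buckets=20):
    # Split the signature into bands; documents sharing any full band
    # become candidate duplicate pairs (260 / 20 = 13 rows per band)
    rows = len(signature) // num_buckets
    return [tuple(signature[i * rows:(i + 1) * rows]) for i in range(num_buckets)]

a = minhash_signature("the quick brown fox jumps over the lazy dog")
b = minhash_signature("the quick brown fox jumped over the lazy dog")
# Near-duplicates agree on at least one band with high probability
print(any(x == y for x, y in zip(bands(a), bands(b))))
```

More hashes sharpen the similarity estimate; fewer rows per band makes candidate matching more permissive.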

### Stage 3: PII redaction

```python
from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace"  # or "redact"
)
redacted = Modify(pii_redactor)(dataset)
```

### Stage 4: Classifier filtering

```python
from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda"
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
```

## GPU acceleration

### GPU vs CPU performance

| Operation | CPU (16 cores) | GPU (A100) | Speedup |
|---|---|---|---|
| Fuzzy dedup (8TB) | 120 hours | 7.5 hours | 16× |
| Exact dedup (1TB) | 8 hours | 0.5 hours | 16× |
| Quality filtering | 2 hours | 0.2 hours | 10× |

### Multi-GPU scaling

```python
from nemo_curator import get_client
import dask_cuda

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)
```

## Multi-modal curation

### Image curation

```python
from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)
```

### Video curation

```python
from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder
)

# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)
```

### Audio curation

```python
from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)
```

## Common patterns

### Web scrape curation (Common Crawl)

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.datasets import DocumentDataset

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")

# Pipeline
pipeline = [
    # 1. Quality filtering
    WordCountFilter(min_words=100, max_words=50000),
    RepeatedLinesFilter(max_repeated_line_fraction=0.2),
    SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3),
    UrlRatioFilter(max_url_ratio=0.3),

    # 2. Language filtering
    LanguageIdentificationFilter(target_languages=["en"]),

    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),

    # 4. PII redaction
    PIIRedactor(),

    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8)
]

# Execute
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")
```

### Distributed processing

```python
from nemo_curator import get_client
from dask_cuda import LocalCUDACluster

# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)

# Process large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()
```

## Performance benchmarks

### Fuzzy deduplication (8TB RedPajama v2)

- CPU (256 cores): 120 hours
- GPU (8× A100): 7.5 hours
- Speedup: 16×

### Exact deduplication (1TB)

- CPU (64 cores): 8 hours
- GPU (4× A100): 0.5 hours
- Speedup: 16×

### Quality filtering (100GB)

- CPU (32 cores): 2 hours
- GPU (2× A100): 0.2 hours
- Speedup: 10×

## Cost comparison

CPU-based curation (AWS c5.18xlarge × 10):

- Cost: $3.60/hour × 10 = $36/hour
- Time for 8TB: 120 hours
- Total: $4,320

GPU-based curation (AWS p4d.24xlarge × 2):

- Cost: $32.77/hour × 2 = $65.54/hour
- Time for 8TB: 7.5 hours
- Total: $491.55

Savings: 89% reduction ($3,828 saved)
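The totals follow directly from hourly rate × wall-clock time; a quick check of the arithmetic above:

```python
# Sanity-check the cost figures: total = hourly rate * wall-clock hours
cpu_rate = 3.60 * 10    # 10x c5.18xlarge, $/hour
gpu_rate = 32.77 * 2    # 2x p4d.24xlarge, $/hour

cpu_total = cpu_rate * 120    # 120 hours for 8TB on CPU
gpu_total = gpu_rate * 7.5    # 7.5 hours for 8TB on GPU

savings = cpu_total - gpu_total
print(f"CPU: ${cpu_total:,.2f}  GPU: ${gpu_total:,.2f}")
print(f"Saved: ${savings:,.2f} ({savings / cpu_total:.0%})")
# CPU: $4,320.00  GPU: $491.55
# Saved: $3,828.45 (89%)
```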

## Supported data formats

- Input: Parquet, JSONL, CSV
- Output: Parquet (recommended), JSONL
- WebDataset: TAR archives for multi-modal data
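A sketch of the typical round trip, reading JSONL in and writing Parquet out. The `read_json` reader is assumed here by analogy with the `read_parquet` calls shown elsewhere on this page; check the API reference for your installed version:

```python
from nemo_curator.datasets import DocumentDataset

# Read JSONL shards (read_json assumed; see note above)
dataset = DocumentDataset.read_json("raw_data/*.jsonl")

# ...curation stages...

# Write Parquet, the recommended output format
dataset.to_parquet("curated_data/")
```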

## Use cases

Production deployments:

- NVIDIA used NeMo Curator to prepare Nemotron-4 training data
- Open-source datasets curated: RedPajama v2, The Pile

## References

- Filtering Guide - 30+ quality filters, heuristics
- Deduplication Guide - Exact, fuzzy, semantic methods

## Resources