nemo-curator
# NeMo Curator - GPU-Accelerated Data Curation
NVIDIA's toolkit for preparing high-quality training data for LLMs.
## When to use NeMo Curator
Use NeMo Curator when:
- Preparing LLM training data from web scrapes (Common Crawl)
- Running fast deduplication (16× faster than CPU)
- Curating multi-modal datasets (text, images, video, audio)
- Filtering low-quality or toxic content
- Scaling data processing across a GPU cluster
Performance:
- 16× faster fuzzy deduplication (8TB RedPajama v2)
- 40% lower TCO vs CPU alternatives
- Near-linear scaling across GPU nodes
Use alternatives instead:
- datatrove: CPU-based, open-source data processing
- dolma: Allen AI's data toolkit
- Ray Data: General ML data processing (no curation focus)
## Quick start
### Installation
```bash
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"
```
### Basic text curation pipeline
```python
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules import ExactDuplicates
import pandas as pd

# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset(df)

# Quality filtering
def quality_score(doc):
    return len(doc["text"].split()) > 5  # Filter short docs

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")
```
## Data curation pipeline
### Stage 1: Quality filtering
```python
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter,
)

# Apply heuristic filters (30+ are available)

# Word count filter
dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))

# Remove repetitive content
dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))

# URL ratio filter
dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
```
### Stage 2: Deduplication
**Exact deduplication**:

```python
from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
```

**Fuzzy deduplication** (16× faster on GPU):

```python
from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,  # MinHash parameters
    num_buckets=20,
    hash_method="md5",
)
deduped = fuzzy_dedup(dataset)
```

**Semantic deduplication**:

```python
from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8,  # Cosine similarity threshold
)
deduped = semantic_dedup(dataset)
```
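For intuition on the MinHash half of fuzzy deduplication, here is a toy, pure-Python sketch (not NeMo Curator code): documents whose word-shingle sets overlap heavily agree on most per-seed minimum hashes, so comparing short signatures approximates Jaccard similarity without comparing full texts. The shingle size and hash count here are arbitrary illustration choices.

```python
import hashlib

def shingles(text, n=3):
    """Split text into overlapping word n-grams (shingles)."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(doc_shingles, num_hashes=64):
    """One minimum per seeded hash function; the fraction of equal
    positions between two signatures estimates Jaccard similarity."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumps over the lazy cat"))
c = minhash_signature(shingles("completely unrelated text about data curation"))
print(estimated_jaccard(a, b) > estimated_jaccard(a, c))  # True: near-duplicates share most positions
```

The LSH half (banding signatures into buckets, as `num_buckets` suggests) then avoids the all-pairs comparison by only comparing documents that collide in at least one bucket.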
### Stage 3: PII redaction
```python
from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace",  # or "redact"
)
redacted = Modify(pii_redactor)(dataset)
```
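For intuition only, here is a toy stand-in for the `replace`-style anonymization shown above, using two invented regex patterns. This is not how `PIIRedactor` works internally (it relies on NER models and supports many more entity types); it only illustrates the replace-vs-redact distinction.

```python
import re

# Hypothetical patterns for two entity types, for illustration only.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text, action="replace"):
    """Replace each match with an entity token, or strip it entirely."""
    for entity, pattern in PATTERNS.items():
        token = f"<{entity}>" if action == "replace" else ""
        text = pattern.sub(token, text)
    return text

print(redact("Mail alice@example.com or call +1 555-123-4567."))
# → Mail <EMAIL_ADDRESS> or call <PHONE_NUMBER>.
```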
### Stage 4: Classifier filtering
```python
from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda",
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
```
## GPU acceleration
### GPU vs CPU performance
| Operation | CPU | GPU (A100) | Speedup |
|---|---|---|---|
| Fuzzy dedup (8TB) | 120 hours | 7.5 hours | 16× |
| Exact dedup (1TB) | 8 hours | 0.5 hours | 16× |
| Quality filtering | 2 hours | 0.2 hours | 10× |
### Multi-GPU scaling
```python
from nemo_curator import get_client
from nemo_curator.modules import FuzzyDuplicates

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)
```
## Multi-modal curation
### Image curation
```python
from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder,
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)
```
### Video curation
```python
from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder,
)

# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)
```
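As a rough intuition for threshold-based scene detection of the kind `SceneDetector(threshold=...)` suggests: flag a cut wherever a per-frame content score jumps by more than the threshold between consecutive frames. The brightness sequence and scoring below are invented for illustration; a real detector computes a much richer frame-difference metric.

```python
def detect_cuts(frame_scores, threshold):
    """Mark a scene cut wherever the change between consecutive
    frame scores exceeds the threshold (toy stand-in for the
    content metric a real scene detector computes per frame)."""
    return [i for i in range(1, len(frame_scores))
            if abs(frame_scores[i] - frame_scores[i - 1]) > threshold]

# Mean frame brightness for a clip: stable, hard cut, stable, hard cut.
brightness = [10, 11, 10, 80, 81, 80, 30, 31]
print(detect_cuts(brightness, threshold=27.0))  # → [3, 6]
```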
### Audio curation
```python
from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter,
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)
```
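`WERFilter(max_wer=0.3)` keeps samples whose transcript stays within a word error rate bound. WER is the word-level Levenshtein edit distance (substitutions + insertions + deletions) divided by the reference length; a self-contained sketch of the metric (not the library's implementation):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # 1 sub / 6 words
```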
## Common patterns
### Web scrape curation (Common Crawl)
```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.datasets import DocumentDataset

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")

# Pipeline
pipeline = [
    # 1. Quality filtering
    WordCountFilter(min_words=100, max_words=50000),
    RepeatedLinesFilter(max_repeated_line_fraction=0.2),
    SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3),
    UrlRatioFilter(max_url_ratio=0.3),
    # 2. Language filtering
    LanguageIdentificationFilter(target_languages=["en"]),
    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),
    # 4. PII redaction
    PIIRedactor(),
    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8),
]

# Execute
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")
```
### Distributed processing
```python
from nemo_curator import get_client
from dask_cuda import LocalCUDACluster

# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)

# Process large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()
```
## Performance benchmarks
### Fuzzy deduplication (8TB RedPajama v2)
- CPU (256 cores): 120 hours
- GPU (8× A100): 7.5 hours
- Speedup: 16×
### Exact deduplication (1TB)
- CPU (64 cores): 8 hours
- GPU (4× A100): 0.5 hours
- Speedup: 16×
### Quality filtering (100GB)
- CPU (32 cores): 2 hours
- GPU (2× A100): 0.2 hours
- Speedup: 10×
### Cost comparison
CPU-based curation (AWS c5.18xlarge × 10):
- Cost: $3.60/hour × 10 = $36/hour
- Time for 8TB: 120 hours
- Total: $4,320
GPU-based curation (AWS p4d.24xlarge × 2):
- Cost: $32.77/hour × 2 = $65.54/hour
- Time for 8TB: 7.5 hours
- Total: $491.55
Savings: 89% reduction ($3,828 saved)
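The totals above follow directly from the hourly rates and runtimes; a quick arithmetic check:

```python
# Reproduce the cost comparison from the rates and runtimes above.
cpu_cost = 3.60 * 10 * 120   # $/hr per node × nodes × hours
gpu_cost = 32.77 * 2 * 7.5
savings = cpu_cost - gpu_cost
print(f"CPU: ${cpu_cost:,.2f}  GPU: ${gpu_cost:,.2f}")
print(f"Saved: ${savings:,.2f} ({savings / cpu_cost:.0%})")
```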
## Supported data formats
- Input: Parquet, JSONL, CSV
- Output: Parquet (recommended), JSONL
- WebDataset: TAR archives for multi-modal
## Use cases
Production deployments:
- NVIDIA used NeMo Curator to prepare Nemotron-4 training data
- Open-source datasets curated: RedPajama v2, The Pile
## References
- Filtering Guide - 30+ quality filters, heuristics
- Deduplication Guide - Exact, fuzzy, semantic methods
## Resources
- GitHub: https://github.com/NVIDIA/NeMo-Curator ⭐ 500+
- Docs: https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/
- Version: 0.4.0+
- License: Apache 2.0