# AudioCraft: Audio Generation

Comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.
## When to use AudioCraft
Use AudioCraft when:
- Generating music from text descriptions
- Creating sound effects and environmental audio
- Building music generation applications
- Conditioning music generation on a melody
- Producing stereo audio output
- Controlling music generation with style transfer
Key features:
- MusicGen: Text-to-music generation with melody conditioning
- AudioGen: Text-to-sound effects generation
- EnCodec: High-fidelity neural audio codec
- Multiple model sizes: Small (300M) to Large (3.3B)
- Stereo support: Full stereo audio generation
- Style conditioning: MusicGen-Style for reference-based generation
Use alternatives instead:
- Stable Audio: For longer commercial music generation
- Bark: For text-to-speech with music/sound effects
- Riffusion: For spectrogram-based music generation
- OpenAI Jukebox: For raw audio generation with lyrics
## Quick start
### Installation
```bash
# From PyPI
pip install audiocraft

# From GitHub (latest)
pip install git+https://github.com/facebookresearch/audiocraft.git

# Or use HuggingFace Transformers
pip install transformers torch torchaudio
```
### Basic text-to-music (AudioCraft)
```python
import torchaudio
from audiocraft.models import MusicGen

# Load model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Set generation parameters
model.set_generation_params(
    duration=8,        # seconds
    top_k=250,
    temperature=1.0
)

# Generate from text
descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

# Save audio
torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)
```
### Using HuggingFace Transformers
```python
import scipy
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Load model and processor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to("cuda")

# Generate music
inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt"
).to("cuda")

audio_values = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256
)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
```
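How long a clip `max_new_tokens=256` buys depends on the codec's token rate: MusicGen's EnCodec runs at roughly 50 tokens per second per codebook, so 256 tokens is about 5 seconds of audio. A small estimation helper (hypothetical, for back-of-envelope sizing only):

```python
def tokens_for_seconds(seconds, frame_rate_hz=50):
    """Approximate max_new_tokens needed for a target clip length."""
    return int(seconds * frame_rate_hz)

print(tokens_for_seconds(5))   # 250
print(tokens_for_seconds(30))  # 1500
```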
### Text-to-sound with AudioGen
```python
import torchaudio
from audiocraft.models import AudioGen

# Load AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)

# Generate sound effects
descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)
torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)
```
## Core concepts
### Architecture overview
AudioCraft Architecture:

```
┌──────────────────────────────────────────────────────────────┐
│                     Text Encoder (T5)                        │
│                           │                                  │
│                     Text Embeddings                          │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                Transformer Decoder (LM)                      │
│         Auto-regressively generates audio tokens             │
│         using efficient token interleaving patterns          │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                 EnCodec Audio Decoder                        │
│          Converts tokens back to audio waveform              │
└──────────────────────────────────────────────────────────────┘
```

### Model variants
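To get a feel for the sequence lengths the transformer decoder handles, note that EnCodec at 32 kHz emits about 50 frames per second across 4 residual codebooks (figures from the MusicGen paper; treat them as stated assumptions here). A back-of-envelope sketch:

```python
def musicgen_token_count(duration_s, frame_rate_hz=50, n_codebooks=4):
    """Estimate LM timesteps and total audio tokens for one clip."""
    steps = int(duration_s * frame_rate_hz)   # autoregressive timesteps
    return steps, steps * n_codebooks         # (steps, tokens across codebooks)

print(musicgen_token_count(8))   # (400, 1600) for an 8-second clip
```

The delay-based interleaving pattern adds only a few extra steps on top of this, which is why MusicGen needs a single LM pass rather than one pass per codebook.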
| Model | Size | Description | Use Case |
|---|---|---|---|
| musicgen-small | 300M | Text-to-music | Quick generation |
| musicgen-medium | 1.5B | Text-to-music | Balanced |
| musicgen-large | 3.3B | Text-to-music | Best quality |
| musicgen-melody | 1.5B | Text + melody | Melody conditioning |
| musicgen-melody-large | 3.3B | Text + melody | Best melody |
| musicgen-stereo-* | Varies | Stereo output | Stereo generation |
| musicgen-style | 1.5B | Style transfer | Reference-based |
| audiogen-medium | 1.5B | Text-to-sound | Sound effects |
### Generation parameters
| Parameter | Default | Description |
|---|---|---|
| `duration` | 8.0 | Length in seconds (1-120) |
| `top_k` | 250 | Top-k sampling |
| `top_p` | 0.0 | Nucleus sampling (0 = disabled) |
| `temperature` | 1.0 | Sampling temperature |
| `cfg_coef` | 3.0 | Classifier-free guidance |
## MusicGen usage
### Text-to-music generation
```python
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Configure generation
model.set_generation_params(
    duration=30,       # Up to 30 seconds
    top_k=250,         # Sampling diversity
    top_p=0.0,         # 0 = use top_k only
    temperature=1.0,   # Creativity (higher = more varied)
    cfg_coef=3.0       # Text adherence (higher = stricter)
)

# Generate multiple samples
descriptions = [
    "epic orchestral soundtrack with strings and brass",
    "chill lo-fi hip hop beat with jazzy piano",
    "energetic rock song with electric guitar"
]

# Generate (returns [batch, channels, samples])
wav = model.generate(descriptions)

# Save each
for i, audio in enumerate(wav):
    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)
```
### Melody-conditioned generation
```python
import torchaudio
from audiocraft.models import MusicGen

# Load melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Load melody audio
melody, sr = torchaudio.load("melody.wav")

# Generate with melody conditioning
descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody, sr)
torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)
```
### Stereo generation
```python
import torchaudio
from audiocraft.models import MusicGen

# Load stereo model
model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

# wav shape: [batch, 2, samples] for stereo
print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000]
torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
```
### Audio continuation
```python
import torchaudio
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Load audio to continue
audio, sr = torchaudio.load("intro.wav")

# Process with text and audio
inputs = processor(
    audio=audio.squeeze().numpy(),
    sampling_rate=sr,
    text=["continue with an epic chorus"],
    padding=True,
    return_tensors="pt"
)

# Generate continuation
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)
```
## MusicGen-Style usage
### Style-conditioned generation
```python
import torchaudio
from audiocraft.models import MusicGen

# Load style model
model = MusicGen.get_pretrained('facebook/musicgen-style')

# Configure generation with style
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=5.0   # Style influence
)

# Configure style conditioner
model.set_style_conditioner_params(
    eval_q=3,            # RVQ quantizers (1-6)
    excerpt_length=3.0   # Style excerpt length in seconds
)

# Load style reference
style_audio, sr = torchaudio.load("reference_style.wav")

# Generate with text + style
descriptions = ["upbeat dance track"]
wav = model.generate_with_style(descriptions, style_audio, sr)
```
### Style-only generation (no text)
```python
# Generate matching style without a text prompt
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=None   # Disable double CFG for style-only
)
wav = model.generate_with_style([None], style_audio, sr)
```
## AudioGen usage
### Sound effect generation
```python
import torchaudio
from audiocraft.models import AudioGen

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

# Generate various sounds
descriptions = [
    "thunderstorm with heavy rain and lightning",
    "busy city traffic with car horns",
    "ocean waves crashing on rocks",
    "crackling campfire in forest"
]
wav = model.generate(descriptions)
for i, audio in enumerate(wav):
    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)
```
## EnCodec usage
### Audio compression
```python
import torch
import torchaudio
from audiocraft.models import CompressionModel

# Load EnCodec
model = CompressionModel.get_pretrained('facebook/encodec_32khz')

# Load audio
wav, sr = torchaudio.load("audio.wav")

# Ensure correct sample rate
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)

# Encode to tokens
with torch.no_grad():
    encoded = model.encode(wav.unsqueeze(0))
    codes = encoded[0]  # Audio codes

# Decode back to audio
with torch.no_grad():
    decoded = model.decode(codes)

torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
```
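The token stream is far smaller than raw PCM. Assuming 4 codebooks of 2048 entries at 50 frames per second (the configuration commonly cited for the 32 kHz EnCodec used by MusicGen; treat these numbers as assumptions), the arithmetic works out to roughly 2.2 kbps:

```python
import math

def encodec_bitrate_bps(frame_rate_hz=50, n_codebooks=4, codebook_size=2048):
    """Bitrate of the quantized token stream, in bits per second."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)

raw_bps = 32_000 * 16              # 32 kHz, 16-bit PCM
token_bps = encodec_bitrate_bps()  # 2200.0 (11 bits per code)
print(f"~{raw_bps / token_bps:.0f}x compression")  # ~233x compression
```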
## Common workflows
### Workflow 1: Music generation pipeline
```python
import torch
import torchaudio
from audiocraft.models import MusicGen

class MusicGenerator:
    def __init__(self, model_name="facebook/musicgen-medium"):
        self.model = MusicGen.get_pretrained(model_name)
        self.sample_rate = 32000

    def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
        self.model.set_generation_params(
            duration=duration,
            top_k=250,
            temperature=temperature,
            cfg_coef=cfg
        )
        with torch.no_grad():
            wav = self.model.generate([prompt])
        return wav[0].cpu()

    def generate_batch(self, prompts, duration=30):
        self.model.set_generation_params(duration=duration)
        with torch.no_grad():
            wav = self.model.generate(prompts)
        return wav.cpu()

    def save(self, audio, path):
        torchaudio.save(path, audio, sample_rate=self.sample_rate)

# Usage
generator = MusicGenerator()
audio = generator.generate(
    "epic cinematic orchestral music",
    duration=30,
    temperature=1.0
)
generator.save(audio, "epic_music.wav")
```
### Workflow 2: Sound design batch processing
```python
import json
from pathlib import Path

import torchaudio
from audiocraft.models import AudioGen

def batch_generate_sounds(sound_specs, output_dir):
    """
    Generate multiple sounds from specifications.

    Args:
        sound_specs: list of {"name": str, "description": str, "duration": float}
        output_dir: output directory path
    """
    model = AudioGen.get_pretrained('facebook/audiogen-medium')
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    results = []
    for spec in sound_specs:
        model.set_generation_params(duration=spec.get("duration", 5))
        wav = model.generate([spec["description"]])
        output_path = output_dir / f"{spec['name']}.wav"
        torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)
        results.append({
            "name": spec["name"],
            "path": str(output_path),
            "description": spec["description"]
        })
    return results

# Usage
sounds = [
    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
    {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
    {"name": "door", "description": "wooden door creaking and closing", "duration": 2}
]
results = batch_generate_sounds(sounds, "sound_effects/")
```
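The `json` import in this workflow suggests persisting the returned manifest alongside the audio files; a minimal sketch, using a hypothetical `results` list in the same shape as the function's return value:

```python
import json

# Hypothetical manifest, shaped like batch_generate_sounds' return value
results = [
    {"name": "explosion", "path": "sound_effects/explosion.wav",
     "description": "massive explosion with debris"},
    {"name": "door", "path": "sound_effects/door.wav",
     "description": "wooden door creaking and closing"},
]

with open("manifest.json", "w") as f:
    json.dump(results, f, indent=2)

# Round-trip check
with open("manifest.json") as f:
    assert json.load(f) == results
```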
### Workflow 3: Gradio demo
```python
import gradio as gr
import torch
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

def generate_music(prompt, duration, temperature, cfg_coef):
    model.set_generation_params(
        duration=duration,
        temperature=temperature,
        cfg_coef=cfg_coef
    )
    with torch.no_grad():
        wav = model.generate([prompt])
    # Save to temp file
    path = "temp_output.wav"
    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
    return path

demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
        gr.Slider(1, 30, value=8, label="Duration (seconds)"),
        gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
        gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient")
    ],
    outputs=gr.Audio(label="Generated Music"),
    title="MusicGen Demo"
)
demo.launch()
```

## Performance optimization
### Memory optimization
```python
import torch
from audiocraft.models import MusicGen

# Use smaller model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Clear cache between generations
torch.cuda.empty_cache()

# Generate shorter durations
model.set_generation_params(duration=10)  # Instead of 30

# Use half precision
model = model.half()
```
### Batch processing efficiency
```python
# Process multiple prompts at once (more efficient)
descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
wav = model.generate(descriptions)  # Single batch

# Instead of
for desc in descriptions:
    wav = model.generate([desc])  # Multiple batches (slower)
```
### GPU memory requirements
| Model | FP32 VRAM | FP16 VRAM |
|---|---|---|
| musicgen-small | ~4GB | ~2GB |
| musicgen-medium | ~8GB | ~4GB |
| musicgen-large | ~16GB | ~8GB |
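The VRAM figures above roughly track parameter count times bytes per element for the weights alone (activations, the T5 encoder, and EnCodec add overhead on top). As arithmetic, using the doc's own model sizes:

```python
def approx_weight_gb(n_params_billion, bytes_per_param):
    """Approximate memory for the model weights alone, in GiB."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# musicgen-large (3.3B parameters)
print(round(approx_weight_gb(3.3, 4), 1))  # 12.3 GiB of FP32 weights
print(round(approx_weight_gb(3.3, 2), 1))  # 6.1 GiB of FP16 weights
```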
## Common issues
| Issue | Solution |
|---|---|
| CUDA OOM | Use smaller model, reduce duration |
| Poor quality | Increase cfg_coef, better prompts |
| Generation too short | Check max duration setting |
| Audio artifacts | Try different temperature |
| Stereo not working | Use stereo model variant |
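For the CUDA OOM row, one defensive pattern is to retry with progressively shorter durations; calling `torch.cuda.empty_cache()` between attempts also helps. A sketch with a hypothetical helper, where `model` is any loaded MusicGen or AudioGen instance:

```python
def generate_with_fallback(model, prompt, durations=(30, 15, 8)):
    """Retry generation with shorter durations when CUDA reports OOM."""
    for duration in durations:
        try:
            model.set_generation_params(duration=duration)
            return model.generate([prompt])
        except RuntimeError as e:
            # CUDA OOM surfaces as a RuntimeError mentioning "out of memory"
            if "out of memory" not in str(e).lower():
                raise
    raise RuntimeError("out of memory even at the shortest duration")
```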
## References
- Advanced Usage - Training, fine-tuning, deployment
- Troubleshooting - Common issues and solutions
## Resources
- GitHub: https://github.com/facebookresearch/audiocraft
- Paper (MusicGen): https://arxiv.org/abs/2306.05284
- Paper (AudioGen): https://arxiv.org/abs/2209.15352
- HuggingFace: https://huggingface.co/facebook/musicgen-small
- Demo: https://huggingface.co/spaces/facebook/MusicGen