audiocraft-audio-generation


AudioCraft: Audio Generation


Comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.

When to use AudioCraft


Use AudioCraft when:
  • Need to generate music from text descriptions
  • Creating sound effects and environmental audio
  • Building music generation applications
  • Need melody-conditioned music generation
  • Want stereo audio output
  • Require controllable music generation with style transfer
Key features:
  • MusicGen: Text-to-music generation with melody conditioning
  • AudioGen: Text-to-sound effects generation
  • EnCodec: High-fidelity neural audio codec
  • Multiple model sizes: Small (300M) to Large (3.3B)
  • Stereo support: Full stereo audio generation
  • Style conditioning: MusicGen-Style for reference-based generation
Use alternatives instead:
  • Stable Audio: For longer commercial music generation
  • Bark: For text-to-speech with music/sound effects
  • Riffusion: For spectrogram-based music generation
  • OpenAI Jukebox: For raw audio generation with lyrics

Quick start


Installation


```bash
# From PyPI
pip install audiocraft

# From GitHub (latest)
pip install git+https://github.com/facebookresearch/audiocraft

# Or use HuggingFace Transformers
pip install transformers torch torchaudio
```

Basic text-to-music (AudioCraft)


```python
import torchaudio
from audiocraft.models import MusicGen

# Load model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Set generation parameters
model.set_generation_params(
    duration=8,  # seconds
    top_k=250,
    temperature=1.0
)

# Generate from text
descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

# Save audio
torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)
```

Using HuggingFace Transformers


```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy

# Load model and processor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to("cuda")

# Generate music
inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt"
).to("cuda")
audio_values = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256
)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
```

Text-to-sound with AudioGen


```python
import torchaudio
from audiocraft.models import AudioGen

# Load AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)

# Generate sound effects
descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)
torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)
```

Core concepts


Architecture overview


```
AudioCraft Architecture:
┌──────────────────────────────────────────────────────────────┐
│                    Text Encoder (T5)                          │
│                         │                                     │
│                    Text Embeddings                            │
└────────────────────────┬─────────────────────────────────────┘
┌────────────────────────▼─────────────────────────────────────┐
│              Transformer Decoder (LM)                         │
│     Auto-regressively generates audio tokens                  │
│     Using efficient token interleaving patterns               │
└────────────────────────┬─────────────────────────────────────┘
┌────────────────────────▼─────────────────────────────────────┐
│                EnCodec Audio Decoder                          │
│        Converts tokens back to audio waveform                 │
└──────────────────────────────────────────────────────────────┘
```
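The decoder's sequence length grows linearly with the requested duration. A back-of-envelope sketch (assuming the 32 kHz codec's nominal 50 Hz frame rate and 4 codebooks; treat these as approximations, not library constants):

```python
# Rough token-budget arithmetic for MusicGen's language model.
# Assumed values: 50 Hz EnCodec frame rate, 4 interleaved codebooks.
FRAME_RATE_HZ = 50
NUM_CODEBOOKS = 4

def decoder_steps(duration_s: float) -> int:
    """Autoregressive steps needed for a clip of the given length."""
    return int(duration_s * FRAME_RATE_HZ)

def total_tokens(duration_s: float) -> int:
    """Total discrete audio tokens across all codebooks."""
    return decoder_steps(duration_s) * NUM_CODEBOOKS

print(decoder_steps(8))   # 400 steps for the 8-second default
print(total_tokens(8))    # 1600 tokens across 4 codebooks
```

This is why longer durations cost proportionally more generation time and memory.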

Model variants


| Model | Size | Description | Use Case |
|---|---|---|---|
| `musicgen-small` | 300M | Text-to-music | Quick generation |
| `musicgen-medium` | 1.5B | Text-to-music | Balanced |
| `musicgen-large` | 3.3B | Text-to-music | Best quality |
| `musicgen-melody` | 1.5B | Text + melody | Melody conditioning |
| `musicgen-melody-large` | 3.3B | Text + melody | Best melody |
| `musicgen-stereo-*` | Varies | Stereo output | Stereo generation |
| `musicgen-style` | 1.5B | Style transfer | Reference-based |
| `audiogen-medium` | 1.5B | Text-to-sound | Sound effects |

Generation parameters


| Parameter | Default | Description |
|---|---|---|
| `duration` | 8.0 | Length in seconds (1-120) |
| `top_k` | 250 | Top-k sampling |
| `top_p` | 0.0 | Nucleus sampling (0 = disabled) |
| `temperature` | 1.0 | Sampling temperature |
| `cfg_coef` | 3.0 | Classifier-free guidance |
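Note that `duration` applies to the native AudioCraft API; the Transformers API controls length via `max_new_tokens` instead. Assuming MusicGen's nominal rate of roughly 50 audio tokens per second (an approximation, not an exact library constant), a small hypothetical helper converts between the two:

```python
TOKENS_PER_SECOND = 50  # assumed nominal MusicGen frame rate

def seconds_to_tokens(duration_s: float) -> int:
    """Approximate max_new_tokens for a target clip length."""
    return int(duration_s * TOKENS_PER_SECOND)

def tokens_to_seconds(max_new_tokens: int) -> float:
    """Approximate clip length produced by a token budget."""
    return max_new_tokens / TOKENS_PER_SECOND

print(seconds_to_tokens(8))    # 400
print(tokens_to_seconds(256))  # 5.12
```

This explains why `max_new_tokens=256` in the quick-start Transformers example yields roughly five seconds of audio.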

MusicGen usage


Text-to-music generation


```python
from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Configure generation
model.set_generation_params(
    duration=30,      # Up to 30 seconds
    top_k=250,        # Sampling diversity
    top_p=0.0,        # 0 = use top_k only
    temperature=1.0,  # Creativity (higher = more varied)
    cfg_coef=3.0      # Text adherence (higher = stricter)
)

# Generate multiple samples
descriptions = [
    "epic orchestral soundtrack with strings and brass",
    "chill lo-fi hip hop beat with jazzy piano",
    "energetic rock song with electric guitar",
]

# Generate (returns [batch, channels, samples])
wav = model.generate(descriptions)

# Save each
for i, audio in enumerate(wav):
    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)
```

Melody-conditioned generation


```python
from audiocraft.models import MusicGen
import torchaudio

# Load melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Load melody audio
melody, sr = torchaudio.load("melody.wav")

# Generate with melody conditioning
descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody, sr)
torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)
```

Stereo generation


```python
from audiocraft.models import MusicGen
import torchaudio

# Load stereo model
model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

# wav shape: [batch, 2, samples] for stereo
print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000]
torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
```

Audio continuation


```python
import torchaudio
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Load audio to continue
audio, sr = torchaudio.load("intro.wav")

# Process with text and audio
inputs = processor(
    audio=audio.squeeze().numpy(),
    sampling_rate=sr,
    text=["continue with an epic chorus"],
    padding=True,
    return_tensors="pt"
)

# Generate continuation
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)
```

MusicGen-Style usage


Style-conditioned generation


```python
from audiocraft.models import MusicGen
import torchaudio

# Load style model
model = MusicGen.get_pretrained('facebook/musicgen-style')

# Configure generation with style
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=5.0  # Style influence
)

# Configure style conditioner
model.set_style_conditioner_params(
    eval_q=3,            # RVQ quantizers (1-6)
    excerpt_length=3.0   # Style excerpt length in seconds
)

# Load style reference
style_audio, sr = torchaudio.load("reference_style.wav")

# Generate with text + style
descriptions = ["upbeat dance track"]
wav = model.generate_with_style(descriptions, style_audio, sr)
```

Style-only generation (no text)


```python
# Generate matching style without text prompt
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=None  # Disable double CFG for style-only
)
wav = model.generate_with_style([None], style_audio, sr)
```

AudioGen usage


Sound effect generation


```python
from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

# Generate various sounds
descriptions = [
    "thunderstorm with heavy rain and lightning",
    "busy city traffic with car horns",
    "ocean waves crashing on rocks",
    "crackling campfire in forest",
]
wav = model.generate(descriptions)
for i, audio in enumerate(wav):
    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)
```

EnCodec usage


Audio compression


```python
from audiocraft.models import CompressionModel
import torch
import torchaudio

# Load EnCodec
model = CompressionModel.get_pretrained('facebook/encodec_32khz')

# Load audio
wav, sr = torchaudio.load("audio.wav")

# Ensure correct sample rate
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)

# Encode to tokens
with torch.no_grad():
    encoded = model.encode(wav.unsqueeze(0))
    codes = encoded[0]  # Audio codes

# Decode back to audio
with torch.no_grad():
    decoded = model.decode(codes)
torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
```
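To see why the token representation is so compact, compare bitrates. A rough calculation (assumed values: 4 codebooks of 2048 entries each at 50 frames per second; these match the commonly cited MusicGen codec configuration but are approximations here):

```python
import math

# Back-of-envelope compression ratio for EnCodec at 32 kHz.
# Assumed: 4 codebooks x 2048 entries, 50 frames per second.
SAMPLE_RATE = 32_000
BITS_PER_SAMPLE = 16
CODEBOOKS = 4
CODEBOOK_SIZE = 2048
FRAME_RATE = 50

raw_bps = SAMPLE_RATE * BITS_PER_SAMPLE                        # 512000 bits/s raw PCM
coded_bps = CODEBOOKS * FRAME_RATE * math.log2(CODEBOOK_SIZE)  # 2200.0 bits/s of codes
print(raw_bps, coded_bps, round(raw_bps / coded_bps))          # ratio ≈ 233x
```

This two-orders-of-magnitude reduction is what makes autoregressive generation over audio tokens tractable.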

Common workflows


Workflow 1: Music generation pipeline


```python
import torch
import torchaudio
from audiocraft.models import MusicGen

class MusicGenerator:
    def __init__(self, model_name="facebook/musicgen-medium"):
        self.model = MusicGen.get_pretrained(model_name)
        self.sample_rate = 32000

    def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
        self.model.set_generation_params(
            duration=duration,
            top_k=250,
            temperature=temperature,
            cfg_coef=cfg
        )

        with torch.no_grad():
            wav = self.model.generate([prompt])

        return wav[0].cpu()

    def generate_batch(self, prompts, duration=30):
        self.model.set_generation_params(duration=duration)

        with torch.no_grad():
            wav = self.model.generate(prompts)

        return wav.cpu()

    def save(self, audio, path):
        torchaudio.save(path, audio, sample_rate=self.sample_rate)
```

Usage


```python
generator = MusicGenerator()
audio = generator.generate(
    "epic cinematic orchestral music",
    duration=30,
    temperature=1.0
)
generator.save(audio, "epic_music.wav")
```

Workflow 2: Sound design batch processing


```python
import json
from pathlib import Path
from audiocraft.models import AudioGen
import torchaudio

def batch_generate_sounds(sound_specs, output_dir):
    """
    Generate multiple sounds from specifications.

    Args:
        sound_specs: list of {"name": str, "description": str, "duration": float}
        output_dir: output directory path
    """
    model = AudioGen.get_pretrained('facebook/audiogen-medium')
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    results = []

    for spec in sound_specs:
        model.set_generation_params(duration=spec.get("duration", 5))

        wav = model.generate([spec["description"]])

        output_path = output_dir / f"{spec['name']}.wav"
        torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)

        results.append({
            "name": spec["name"],
            "path": str(output_path),
            "description": spec["description"]
        })

    return results
```

Usage


```python
sounds = [
    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
    {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
    {"name": "door", "description": "wooden door creaking and closing", "duration": 2},
]
results = batch_generate_sounds(sounds, "sound_effects/")
```

Workflow 3: Gradio demo


```python
import gradio as gr
import torch
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

def generate_music(prompt, duration, temperature, cfg_coef):
    model.set_generation_params(
        duration=duration,
        temperature=temperature,
        cfg_coef=cfg_coef
    )

    with torch.no_grad():
        wav = model.generate([prompt])

    # Save to temp file
    path = "temp_output.wav"
    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
    return path

demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
        gr.Slider(1, 30, value=8, label="Duration (seconds)"),
        gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
        gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient")
    ],
    outputs=gr.Audio(label="Generated Music"),
    title="MusicGen Demo"
)

demo.launch()
```

Performance optimization


Memory optimization


```python
import torch
from audiocraft.models import MusicGen

# Use smaller model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Clear cache between generations
torch.cuda.empty_cache()

# Generate shorter durations
model.set_generation_params(duration=10)  # Instead of 30

# Use half precision
model = model.half()
```

Batch processing efficiency


```python
# Process multiple prompts at once (more efficient)
descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
wav = model.generate(descriptions)  # Single batch

# Instead of looping one prompt at a time
for desc in descriptions:
    wav = model.generate([desc])  # Multiple batches (slower)
```

GPU memory requirements


| Model | FP32 VRAM | FP16 VRAM |
|---|---|---|
| musicgen-small | ~4GB | ~2GB |
| musicgen-medium | ~8GB | ~4GB |
| musicgen-large | ~16GB | ~8GB |
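A small helper (hypothetical, derived directly from the FP16 column above; the figures are approximate requirements, not hard limits) can pick the largest checkpoint that fits in available VRAM:

```python
# Approximate FP16 VRAM needs (GB), taken from the table above.
FP16_VRAM_GB = {
    "facebook/musicgen-small": 2,
    "facebook/musicgen-medium": 4,
    "facebook/musicgen-large": 8,
}

def pick_model(available_vram_gb: float) -> str:
    """Return the largest MusicGen checkpoint that fits in FP16."""
    fitting = [(vram, name) for name, vram in FP16_VRAM_GB.items()
               if vram <= available_vram_gb]
    if not fitting:
        raise ValueError("Not enough VRAM for any MusicGen model in FP16")
    return max(fitting)[1]

print(pick_model(6))   # facebook/musicgen-medium
print(pick_model(24))  # facebook/musicgen-large
```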

Common issues


| Issue | Solution |
|---|---|
| CUDA OOM | Use a smaller model, reduce duration |
| Poor quality | Increase `cfg_coef`, write more specific prompts |
| Generation too short | Check the max duration setting |
| Audio artifacts | Try a different temperature |
| Stereo not working | Use a stereo model variant |
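For CUDA OOM in particular, one practical mitigation is retrying at a shorter duration. A framework-agnostic sketch (the halving schedule and the `generate_fn` callback are illustrative assumptions; with PyTorch you would catch `torch.cuda.OutOfMemoryError` rather than `MemoryError`):

```python
def generate_with_fallback(generate_fn, duration, min_duration=5.0):
    """Retry generation with halved duration on out-of-memory errors."""
    while duration >= min_duration:
        try:
            return generate_fn(duration)
        except MemoryError:  # stand-in for torch.cuda.OutOfMemoryError
            duration /= 2
    raise RuntimeError("Could not generate even at minimum duration")

# Stub that "runs out of memory" above 10 seconds, standing in for a model call
def fake_generate(duration):
    if duration > 10:
        raise MemoryError
    return f"audio[{duration}s]"

print(generate_with_fallback(fake_generate, 30))  # audio[7.5s]
```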

References


  • Advanced Usage - Training, fine-tuning, deployment
  • Troubleshooting - Common issues and solutions

Resources
