acestep

ACE-Step 1.5 Music Generation

Open-source music generation (MIT license) via `tools/music_gen.py`. Runs on RunPod serverless. Requires `RUNPOD_API_KEY` and `RUNPOD_ACESTEP_ENDPOINT_ID` in `.env` (run `--setup` to create the endpoint).
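
Before the first run, it can help to confirm that both variables are actually present in `.env`. The following is a minimal sketch, not part of `tools/music_gen.py`: the helper name `missing_keys` and the simple `KEY=VALUE` parsing are assumptions made for illustration.

```python
REQUIRED_KEYS = ("RUNPOD_API_KEY", "RUNPOD_ACESTEP_ENDPOINT_ID")


def missing_keys(dotenv_text: str) -> list[str]:
    """Return the required keys that are absent or empty in .env-style text."""
    found = {}
    for raw in dotenv_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not KEY=VALUE pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        found[key.strip()] = value.strip().strip("'\"")
    return [key for key in REQUIRED_KEYS if not found.get(key)]
```

Calling `missing_keys(open(".env").read())` returns an empty list when both values are set. If only the endpoint ID is reported missing, running the tool with `--setup` is the step the text above points to.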

Quick Reference

bash
# Basic generation
python tools/music_gen.py --prompt "Upbeat tech corporate" --duration 60 --output bg.mp3

# With musical control
python tools/music_gen.py --prompt "Calm ambient piano" --duration 30 --bpm 72 --key "D Major" --output ambient.mp3

# Scene presets (video production)
python tools/music_gen.py --preset corporate-bg --duration 60 --output bg.mp3
python tools/music_gen.py --preset tension --duration 20 --output problem.mp3
python tools/music_gen.py --preset cta --brand digital-samba --duration 15 --output cta.mp3

# Vocals with lyrics
python tools/music_gen.py --prompt "Indie pop jingle" --lyrics "[verse]\nBuild it better\nShip it faster" --duration 30 --output jingle.mp3

# Cover / style transfer
python tools/music_gen.py --cover --reference theme.mp3 --prompt "Jazz piano version" --duration 60 --output jazz_cover.mp3

# Stem extraction
python tools/music_gen.py --extract vocals --input mixed.mp3 --output vocals.mp3

# List presets
python tools/music_gen.py --list-presets

Creating a Song (Step by Step)

1. Instrumental background track (simplest)

bash
python tools/music_gen.py --prompt "Upbeat indie rock, driving drums, jangly guitar" --duration 60 --bpm 120 --key "G Major" --output track.mp3

2. Song with vocals and lyrics

Write lyrics to a temp file or pass them inline. Use structure tags to control song sections.
bash
# Write lyrics to a file first (recommended for longer songs)
cat > /tmp/lyrics.txt << 'LYRICS'
[Verse 1]
Walking through the morning light
Coffee in my hand feels right
Another day to build and dream
Nothing's ever what it seems

[Chorus - anthemic]
WE KEEP MOVING FORWARD
Through the noise and doubt
We keep moving forward
That's what it's about

[Verse 2]
Screens are glowing late at night
Shipping code until it's right
The deadline's close but so are we
Almost there, just wait and see

[Chorus - bigger]
WE KEEP MOVING FORWARD
Through the noise and doubt
We keep moving forward
That's what it's about

[Outro - fade]
(Moving forward...)
LYRICS

# Generate the song
python tools/music_gen.py \
  --prompt "Upbeat indie rock anthem, male vocal, driving drums, electric guitar, studio polish" \
  --lyrics "$(cat /tmp/lyrics.txt)" \
  --duration 60 \
  --bpm 128 \
  --key "G Major" \
  --output my_song.mp3
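
The same call can be scripted from Python, which sidesteps shell quoting because the lyrics travel as a single argument. This is a minimal sketch that assumes the flags behave as in the shell example; `build_song_command` is a hypothetical helper, not part of the tool.

```python
import subprocess


def build_song_command(prompt, lyrics, duration, bpm, key, output):
    """Build the argv list for a vocal track; lyrics keep their newlines."""
    return [
        "python", "tools/music_gen.py",
        "--prompt", prompt,
        "--lyrics", lyrics,
        "--duration", str(duration),
        "--bpm", str(bpm),
        "--key", key,
        "--output", output,
    ]


lyrics = (
    "[Verse 1]\nWalking through the morning light\n\n"
    "[Chorus - anthemic]\nWE KEEP MOVING FORWARD"
)
cmd = build_song_command(
    "Upbeat indie rock anthem, male vocal", lyrics, 60, 128, "G Major", "my_song.mp3"
)
# Uncomment to run; this needs the RunPod credentials described earlier.
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids a shell entirely, so brackets, quotes, and blank lines in the lyrics need no escaping.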

3. Using a preset for video background

bash
python tools/music_gen.py --preset tension --duration 20 --output problem_scene.mp3

Key tips for good results

  • Caption = overall style (genre, instruments, mood, production quality)
  • Lyrics = temporal structure (verse/chorus flow, vocal delivery)
  • UPPERCASE in lyrics = high vocal intensity
  • Parentheses = background vocals: "We rise (together)"
  • Keep 6-10 syllables per line for natural rhythm
  • Don't describe the melody in the caption — describe the sound and feeling
  • Use `--seed` to lock randomness when iterating on prompt/lyrics

Scene Presets

| Preset | BPM | Key | Use Case |
|---|---|---|---|
| corporate-bg | 110 | C Major | Professional background, presentations |
| upbeat-tech | 128 | G Major | Product launches, tech demos |
| ambient | 72 | D Major | Overview slides, reflective content |
| dramatic | 90 | D Minor | Reveals, announcements |
| tension | 85 | A Minor | Problem statements, challenges |
| hopeful | 120 | C Major | Solution reveals, resolutions |
| cta | 135 | E Major | Call to action, closing energy |
| lofi | 85 | F Major | Screen recordings, coding demos |

Task Types

text2music (default)

Generate music from text prompt + optional lyrics.

cover

Style transfer from reference audio. Control blend with `--cover-strength` (0.0-1.0):
  • 0.2 — Loose style inspiration (more creative freedom)
  • 0.5 — Balanced style transfer
  • 0.7 — Close to original structure (default)
  • 1.0 — Maximum fidelity to source
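
To hear how the strength setting changes a cover, one option is to render the same reference at each of the listed values. The sketch below only builds the command lines; `cover_command` and the output naming scheme are hypothetical, and the flag combination is assumed to work as in the Quick Reference.

```python
def cover_command(reference, prompt, duration, output, strength=0.7):
    """Build argv for a cover, rejecting strengths outside 0.0-1.0."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError(f"cover strength must be within 0.0-1.0, got {strength}")
    return [
        "python", "tools/music_gen.py",
        "--cover", "--reference", reference,
        "--prompt", prompt,
        "--cover-strength", str(strength),
        "--duration", str(duration),
        "--output", output,
    ]


# One command per strength level described above.
commands = [
    cover_command("theme.mp3", "Jazz piano version", 60,
                  f"jazz_{round(s * 10):02d}.mp3", s)
    for s in (0.2, 0.5, 0.7, 1.0)
]
```

Adding a fixed `--seed` to each command would keep the comparison focused on the strength value rather than on random variation.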

extract

Stem separation: isolate individual tracks from mixed audio. Tracks: `vocals`, `drums`, `bass`, `guitar`, `piano`, `keyboard`, `strings`, `brass`, `woodwinds`, `other`.
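
Extracting several stems from one mix can be sketched as a loop over the track names. This assumes one track per invocation, as in the Quick Reference example; `extract_commands` and the output naming are hypothetical.

```python
TRACKS = (
    "vocals", "drums", "bass", "guitar", "piano",
    "keyboard", "strings", "brass", "woodwinds", "other",
)


def extract_commands(source, tracks):
    """Build one extract command per requested track name."""
    unknown = [t for t in tracks if t not in TRACKS]
    if unknown:
        raise ValueError(f"unsupported track name(s): {unknown}")
    stem = source.rsplit(".", 1)[0]  # "mixed.mp3" -> "mixed"
    return [
        ["python", "tools/music_gen.py", "--extract", track,
         "--input", source, "--output", f"{stem}_{track}.mp3"]
        for track in tracks
    ]
```

Validating the name locally catches typos such as `vocal` before a serverless job is started.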

repaint (future)

Regenerate a specific time segment within existing audio while preserving the rest.

lego (future, requires base model)

Generate individual instrument tracks within an existing audio context.

complete (future, requires base model)

Extend partial compositions by adding specified instruments.

Prompt Engineering

Caption Writing — Layer Dimensions

Write captions by layering multiple descriptive dimensions rather than single-word descriptions.
Dimensions to include:
  • Genre/Style: pop, rock, jazz, electronic, lo-fi, synthwave, orchestral
  • Emotion/Mood: melancholic, euphoric, dreamy, nostalgic, intimate, tense
  • Instruments: acoustic guitar, synth pads, 808 drums, strings, brass, piano
  • Timbre: warm, crisp, airy, punchy, lush, polished, raw
  • Era: "80s synth-pop", "modern indie", "classical romantic"
  • Production: lo-fi, studio-polished, live recording, cinematic
  • Vocal: breathy, powerful, falsetto, raspy, spoken word (or "instrumental")
Good: "Slow melancholic piano ballad with intimate female vocal, warm strings building to powerful chorus, studio-polished production"
Bad: "Sad song"
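
One way to make the layering habitual is to assemble the caption from named dimensions. The helper below is a hypothetical sketch: `build_caption` and its parameter names are illustrative, and the ordering of the dimensions is a stylistic assumption rather than something the model requires.

```python
def build_caption(genre, mood=None, instruments=(), timbre=None,
                  era=None, production=None, vocal="instrumental"):
    """Join the caption dimensions that were supplied into one string."""
    # Era, mood, and genre read naturally as a single leading phrase.
    lead = " ".join(part for part in (era, mood, genre) if part)
    details = list(instruments) + [p for p in (timbre, vocal, production) if p]
    return ", ".join([lead] + details)


caption = build_caption(
    "piano ballad",
    mood="melancholic",
    instruments=["warm strings"],
    vocal="intimate female vocal",
    production="studio-polished production",
)
```

Leaving a dimension unset simply omits it, which matches the idea that sparser captions give the model more freedom.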

Key Principles

  1. Specificity over vagueness: describe instruments, mood, production style
  2. Avoid contradictions: don't request "classical strings" and "hardcore metal" simultaneously
  3. Repetition reinforces priority: repeat important elements for emphasis
  4. Sparse captions = more creative freedom: detailed captions constrain the model
  5. Use metadata params for BPM/key: don't write "120 BPM" in the caption, use `--bpm 120`
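
Principle 5 can be applied mechanically when captions come from users or templates. The sketch below moves an inline tempo out of the caption so it can be passed through `--bpm` instead; `split_bpm` is a hypothetical helper and the regular expression covers only the simple "120 BPM" pattern.

```python
import re

_BPM = re.compile(r"\b(\d{2,3})\s*bpm\b", re.IGNORECASE)


def split_bpm(caption):
    """Return (caption without the tempo, bpm as int or None)."""
    match = _BPM.search(caption)
    if match is None:
        return caption, None
    cleaned = _BPM.sub("", caption)
    # Tidy the doubled commas and spaces left behind by the removal.
    cleaned = re.sub(r"\s*,\s*(,\s*)+", ", ", cleaned)
    cleaned = re.sub(r"\s{2,}", " ", cleaned).strip(" ,")
    return cleaned, int(match.group(1))
```

For example, "Upbeat indie rock, 120 BPM, driving drums" becomes the caption "Upbeat indie rock, driving drums" plus the value 120 for `--bpm`.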

Lyrics Formatting

Structure tags (use in lyrics, not caption):
[Intro]
[Verse]
[Chorus]
[Bridge]
[Outro]
[Instrumental]
[Guitar Solo]
[Build]
[Drop]
[Breakdown]
Vocal control (prefix lines or sections):
[raspy vocal]
[whispered]
[falsetto]
[powerful belting]
[harmonies]
[ad-lib]
Energy indicators:
  • UPPERCASE = high intensity ("WE RISE ABOVE")
  • Parentheses = background vocals ("We rise (together)")
  • Keep 6-10 syllables per line within sections for natural rhythm
Example — Tech Product Jingle:
[Verse]
Build it better, ship it faster
Every feature tells a story

[Chorus - anthemic]
THIS IS YOUR PLATFORM
Your vision, your stage
Digital Samba, every page

[Outro - fade]
(Build it better...)
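
The 6-10 syllable guideline can be checked roughly before generating. The sketch below counts vowel groups, which is only a heuristic (it miscounts silent letters and some diphthongs); both function names are hypothetical.

```python
import re


def rough_syllables(line):
    """Estimate syllables by counting vowel groups; not a dictionary lookup."""
    return len(re.findall(r"[aeiouy]+", line.lower()))


def lines_outside_range(lyrics, low=6, high=10):
    """Return (line, estimate) pairs whose estimate falls outside low-high."""
    flagged = []
    for raw in lyrics.splitlines():
        line = raw.strip()
        # Skip blanks, [tags], and fully parenthesized background vocals.
        if not line or line.startswith("[") or line.startswith("("):
            continue
        count = rough_syllables(line)
        if not low <= count <= high:
            flagged.append((line, count))
    return flagged
```

A flagged line is a prompt to reread it aloud, not an error: short shouted chorus lines often fall below six syllables on purpose.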

Video Production Integration

Music for Scene Types

| Scene | Preset | Duration | Notes |
|---|---|---|---|
| Title | dramatic or ambient | 3-5s | Short, mood-setting |
| Problem | tension | 10-15s | Dark, unsettling |
| Solution | hopeful | 10-15s | Relief, optimism |
| Demo | lofi or corporate-bg | 30-120s | Non-distracting, matches demo length |
| Stats | upbeat-tech | 8-12s | Building credibility |
| CTA | cta | 5-10s | Maximum energy, punchy |
| Credits | ambient | 5-10s | Gentle fade-out |

Timing Workflow

  1. Plan scene durations first (from voiceover script)
  2. Generate music to match: `--duration <scene_seconds>`
  3. Music duration is precise (within 0.1s of requested)
  4. For background music spanning multiple scenes: generate one long track
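
The workflow above can be sketched as a small planning step: list the scenes with their durations, then derive either one command per scene or the total length for a single long track. The scene plan, helper names, and output file names here are all hypothetical.

```python
def scene_commands(scenes):
    """scenes: (name, preset, seconds) tuples taken from the voiceover plan."""
    return [
        ["python", "tools/music_gen.py", "--preset", preset,
         "--duration", str(seconds), "--output", f"{name}.mp3"]
        for name, preset, seconds in scenes
    ]


def total_seconds(scenes):
    """Length of one continuous track that spans every scene."""
    return sum(seconds for _name, _preset, seconds in scenes)


plan = [
    ("problem", "tension", 12),
    ("solution", "hopeful", 14),
    ("demo", "lofi", 45),
]
per_scene = scene_commands(plan)
one_long_track = total_seconds(plan)
```

Per-scene commands suit videos where the mood changes sharply between scenes; the total is what to pass as `--duration` when a single background track runs underneath all of them.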

Combining with Voiceover

Background music should be mixed at 10-20% volume in Remotion:
tsx
<Audio src={staticFile('voiceover.mp3')} volume={1} />
<Audio src={staticFile('bg-music.mp3')} volume={0.15} />
For music under narration: use instrumental presets (`corporate-bg`, `ambient`, `lofi`). For music-forward scenes (title, CTA): can use higher volume or vocal tracks.
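
For readers used to thinking in decibels, the 10-20% range can be converted with the standard amplitude formula. This sketch assumes the `volume` prop is a linear amplitude multiplier, which is an assumption about the player rather than something stated here; `gain_to_db` is a hypothetical helper.

```python
import math


def gain_to_db(volume):
    """Convert a linear amplitude multiplier to decibels (20 * log10)."""
    if volume <= 0:
        raise ValueError("volume must be positive to express in dB")
    return 20 * math.log10(volume)


levels = {v: round(gain_to_db(v), 1) for v in (0.1, 0.15, 0.2)}
```

Under that assumption, the suggested range sits roughly 14 to 20 dB below the voiceover.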

Brand Consistency

Use `--brand <name>` to load hints from `brands/<name>/brand.json`. Use `--cover --reference brand_theme.mp3` to create variations of a brand's sonic identity. For consistent sound across a project: fix the seed (`--seed 42`) and vary only duration/prompt.

Technical Details

  • Output: 48kHz MP3/WAV/FLAC
  • Duration range: 10-600 seconds
  • BPM range: 30-300
  • Inference: ~2-3s on GPU (turbo, 8 steps), ~40-60s on Mac MPS
  • Turbo model: 8 steps, no CFG needed, fast and good quality
  • Shift parameter: 3.0 recommended for turbo (improves quality)
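
The documented ranges can be checked locally before a job is submitted. This is a sketch under stated assumptions: `request_problems` is hypothetical, the limits are copied from the list above, and choosing the format by file extension is a guess about how the tool behaves. Whether the tool itself rejects out-of-range values is not stated here.

```python
from pathlib import Path

FORMATS = {".mp3", ".wav", ".flac"}


def request_problems(duration, bpm=None, output="out.mp3"):
    """Return human-readable problems; an empty list means within range."""
    problems = []
    if not 10 <= duration <= 600:
        problems.append(f"duration {duration}s is outside 10-600 seconds")
    if bpm is not None and not 30 <= bpm <= 300:
        problems.append(f"bpm {bpm} is outside 30-300")
    if Path(output).suffix.lower() not in FORMATS:
        problems.append(f"output {output!r} is not MP3, WAV, or FLAC")
    return problems
```

Returning a list rather than raising on the first issue lets a caller report every problem in one pass.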

When NOT to use ACE-Step

  • Voice cloning: use Qwen3-TTS or ElevenLabs instead
  • Sound effects: use ElevenLabs SFX (`tools/sfx.py`)
  • Speech/narration: use voiceover tools, not music gen
  • Stem extraction from video: extract audio first with FFmpeg, then use `--extract`
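
The last point describes a two-step pipeline, which might be sketched as follows. The file names and the helper `video_stem_commands` are hypothetical; `-vn` is FFmpeg's option for dropping the video stream, and the audio format is left to FFmpeg to infer from the output extension.

```python
def video_stem_commands(video, track, audio="audio.mp3"):
    """Return [ffmpeg command, extract command] for pulling a stem from video."""
    extract_audio = ["ffmpeg", "-i", video, "-vn", audio]
    separate = [
        "python", "tools/music_gen.py", "--extract", track,
        "--input", audio, "--output", f"{track}.mp3",
    ]
    return [extract_audio, separate]


steps = video_stem_commands("demo.mp4", "vocals")
```

Running the two commands in order, for example with `subprocess.run(step, check=True)`, stops the pipeline if the audio extraction fails.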