acestep
ACE-Step 1.5 Music Generation
Open-source music generation (MIT license) via `tools/music_gen.py`. Runs on RunPod serverless.
Requires `RUNPOD_API_KEY` and `RUNPOD_ACESTEP_ENDPOINT_ID` in `.env` (run `--setup` to create the endpoint).
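Before the first run it can help to confirm that both variables are actually present in `.env`. The sketch below is a hypothetical standalone check, not part of `tools/music_gen.py`. It assumes `.env` uses plain `KEY=VALUE` lines and does not try to mirror however the tool itself loads the file.

```python
from pathlib import Path

REQUIRED_KEYS = ("RUNPOD_API_KEY", "RUNPOD_ACESTEP_ENDPOINT_ID")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Assumption: values may be wrapped in single or double quotes.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_keys(env_text: str) -> list[str]:
    """Return the required keys that are absent or empty."""
    values = parse_env_text(env_text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text(encoding="utf-8") if env_path.exists() else ""
    missing = missing_keys(text)
    if missing:
        print("Missing in .env:", ", ".join(missing))
    else:
        print("Both RunPod variables are set.")
```

If either key is reported missing, creating the endpoint with `--setup` and adding the values to `.env` is the step the text above describes.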
Quick Reference
```bash
# Basic generation
python tools/music_gen.py --prompt "Upbeat tech corporate" --duration 60 --output bg.mp3

# With musical control
python tools/music_gen.py --prompt "Calm ambient piano" --duration 30 --bpm 72 --key "D Major" --output ambient.mp3

# Scene presets (video production)
python tools/music_gen.py --preset corporate-bg --duration 60 --output bg.mp3
python tools/music_gen.py --preset tension --duration 20 --output problem.mp3
python tools/music_gen.py --preset cta --brand digital-samba --duration 15 --output cta.mp3

# Vocals with lyrics
python tools/music_gen.py --prompt "Indie pop jingle" --lyrics "[verse]\nBuild it better\nShip it faster" --duration 30 --output jingle.mp3

# Cover / style transfer
python tools/music_gen.py --cover --reference theme.mp3 --prompt "Jazz piano version" --duration 60 --output jazz_cover.mp3

# Stem extraction
python tools/music_gen.py --extract vocals --input mixed.mp3 --output vocals.mp3

# List presets
python tools/music_gen.py --list-presets
```
Creating a Song (Step by Step)
1. Instrumental background track (simplest)
```bash
python tools/music_gen.py --prompt "Upbeat indie rock, driving drums, jangly guitar" --duration 60 --bpm 120 --key "G Major" --output track.mp3
```
2. Song with vocals and lyrics
Write lyrics in a temp file or pass inline. Use structure tags to control song sections.
```bash
# Write lyrics to a file first (recommended for longer songs)
cat > /tmp/lyrics.txt << 'LYRICS'
[Verse 1]
Walking through the morning light
Coffee in my hand feels right
Another day to build and dream
Nothing's ever what it seems
[Chorus - anthemic]
WE KEEP MOVING FORWARD
Through the noise and doubt
We keep moving forward
That's what it's about
[Verse 2]
Screens are glowing late at night
Shipping code until it's right
The deadline's close but so are we
Almost there, just wait and see
[Chorus - bigger]
WE KEEP MOVING FORWARD
Through the noise and doubt
We keep moving forward
That's what it's about
[Outro - fade]
(Moving forward...)
LYRICS

# Generate the song
python tools/music_gen.py \
  --prompt "Upbeat indie rock anthem, male vocal, driving drums, electric guitar, studio polish" \
  --lyrics "$(cat /tmp/lyrics.txt)" \
  --duration 60 \
  --bpm 128 \
  --key "G Major" \
  --output my_song.mp3
```
3. Using a preset for video background
```bash
python tools/music_gen.py --preset tension --duration 20 --output problem_scene.mp3
```
Key tips for good results
- Caption = overall style (genre, instruments, mood, production quality)
- Lyrics = temporal structure (verse/chorus flow, vocal delivery)
- UPPERCASE in lyrics = high vocal intensity
- Parentheses = background vocals: "We rise (together)"
- Keep 6-10 syllables per line for natural rhythm
- Don't describe the melody in the caption — describe the sound and feeling
- Use `--seed` to lock randomness when iterating on prompt/lyrics
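The seed tip can be turned into a small iteration loop. The sketch below uses a hypothetical helper, `build_command`, which is not part of the tool. It assembles argument lists from flags shown in this document so that only the prompt changes between runs while `--seed` stays fixed. The seed value, prompts, and output names are placeholders.

```python
import shlex


def build_command(prompt, output, duration=60, seed=None, bpm=None, key=None):
    """Assemble a tools/music_gen.py invocation as an argv list."""
    argv = ["python", "tools/music_gen.py", "--prompt", prompt,
            "--duration", str(duration)]
    if bpm is not None:
        argv += ["--bpm", str(bpm)]
    if key is not None:
        argv += ["--key", key]
    if seed is not None:
        argv += ["--seed", str(seed)]
    argv += ["--output", output]
    return argv


# Same seed for every take, so only the caption differs between runs.
variants = [
    "Upbeat indie rock, driving drums, jangly guitar",
    "Upbeat indie rock, driving drums, jangly guitar, studio polish",
]
commands = [
    build_command(p, f"take_{i}.mp3", seed=42, bpm=120, key="G Major")
    for i, p in enumerate(variants, start=1)
]

for argv in commands:
    # Shell-quoted form for pasting; the list itself can go to subprocess.run.
    print(shlex.join(argv))
```

Keeping the commands as lists avoids quoting problems with prompts that contain commas or spaces.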
Scene Presets
| Preset | BPM | Key | Use Case |
|---|---|---|---|
| | 110 | C Major | Professional background, presentations |
| | 128 | G Major | Product launches, tech demos |
| | 72 | D Major | Overview slides, reflective content |
| | 90 | D Minor | Reveals, announcements |
| | 85 | A Minor | Problem statements, challenges |
| | 120 | C Major | Solution reveals, resolutions |
| | 135 | E Major | Call to action, closing energy |
| | 85 | F Major | Screen recordings, coding demos |
Task Types
text2music (default)
Generate music from text prompt + optional lyrics.
cover
Style transfer from reference audio. Control blend with `--cover-strength` (0.0-1.0):
- 0.2: Loose style inspiration (more creative freedom)
- 0.5: Balanced style transfer
- 0.7: Close to original structure (default)
- 1.0: Maximum fidelity to source
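To hear how the strength levels differ on a given reference, one option is to render the same cover at each value. This is a minimal sketch with a hypothetical helper, `cover_command`, which is not part of the tool. It uses only flags that appear in this document, and the file names are placeholders.

```python
def cover_command(reference, prompt, strength, output, duration=60):
    """Build a cover-mode argv list; strength must lie in [0.0, 1.0]."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError(f"--cover-strength must be within 0.0-1.0, got {strength}")
    return [
        "python", "tools/music_gen.py", "--cover",
        "--reference", reference,
        "--prompt", prompt,
        "--cover-strength", str(strength),
        "--duration", str(duration),
        "--output", output,
    ]


# One command per strength level listed above, keyed by strength.
sweep = {
    s: cover_command("theme.mp3", "Jazz piano version", s,
                     f"jazz_cover_{round(s * 100):03d}.mp3")
    for s in (0.2, 0.5, 0.7, 1.0)
}
```

Each value in `sweep` is an argument list that can be passed to `subprocess.run`.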
extract
Stem separation: isolate individual tracks from mixed audio.
Tracks: `vocals`, `drums`, `bass`, `guitar`, `piano`, `keyboard`, `strings`, `brass`, `woodwinds`, `other`
repaint (future)
Regenerate a specific time segment within existing audio while preserving the rest.
lego (future, requires base model)
Generate individual instrument tracks within an existing audio context.
complete (future, requires base model)
Extend partial compositions by adding specified instruments.
Prompt Engineering
Caption Writing — Layer Dimensions
Write captions by layering multiple descriptive dimensions rather than single-word descriptions.
Dimensions to include:
- Genre/Style: pop, rock, jazz, electronic, lo-fi, synthwave, orchestral
- Emotion/Mood: melancholic, euphoric, dreamy, nostalgic, intimate, tense
- Instruments: acoustic guitar, synth pads, 808 drums, strings, brass, piano
- Timbre: warm, crisp, airy, punchy, lush, polished, raw
- Era: "80s synth-pop", "modern indie", "classical romantic"
- Production: lo-fi, studio-polished, live recording, cinematic
- Vocal: breathy, powerful, falsetto, raspy, spoken word (or "instrumental")
Good: "Slow melancholic piano ballad with intimate female vocal, warm strings building to powerful chorus, studio-polished production"
Bad: "Sad song"
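Layering can be made mechanical by filling in one slot per dimension and joining the results. The helper below is a hypothetical sketch: the keyword names are invented for illustration and the tool itself only receives the final string via `--prompt`.

```python
# Order follows the dimension list above; the key names are hypothetical.
DIMENSIONS = ("genre", "mood", "instruments", "timbre", "era", "production", "vocal")


def build_caption(**layers: str) -> str:
    """Join the supplied dimensions into one comma-separated caption."""
    unknown = set(layers) - set(DIMENSIONS)
    if unknown:
        raise KeyError(f"unknown dimension(s): {sorted(unknown)}")
    parts = [layers[name].strip() for name in DIMENSIONS if layers.get(name, "").strip()]
    return ", ".join(parts)


caption = build_caption(
    genre="slow piano ballad",
    mood="melancholic",
    instruments="piano, warm strings",
    production="studio-polished production",
    vocal="intimate female vocal",
)
print(caption)
```

Dimensions that are left out are skipped, which leaves the model more freedom on those axes.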
Key Principles
- Specificity over vagueness: describe instruments, mood, production style
- Avoid contradictions: don't request "classical strings" and "hardcore metal" simultaneously
- Repetition reinforces priority: repeat important elements for emphasis
- Sparse captions = more creative freedom: detailed captions constrain the model
- Use metadata params for BPM/key: don't write "120 BPM" in the caption, use `--bpm 120`
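The last principle is easy to check automatically. The sketch below uses a hypothetical helper that is not part of the tool. It looks for a tempo written into a caption, removes it, and returns the number so it can be passed with `--bpm` instead. The pattern only covers forms like "120 BPM" or "90bpm", and key names are not handled.

```python
import re

# Matches tempo mentions such as "120 BPM" or "90bpm" inside a caption.
TEMPO_RE = re.compile(r"\b(\d{2,3})\s*bpm\b", re.IGNORECASE)


def split_tempo(caption: str):
    """Return (clean_caption, bpm); bpm is None if no tempo was found."""
    match = TEMPO_RE.search(caption)
    if match is None:
        return caption, None
    bpm = int(match.group(1))
    cleaned = TEMPO_RE.sub("", caption)
    # Drop the empty list item and stray whitespace left by the removal.
    cleaned = re.sub(r"\s*,\s*(?=,|$)", "", cleaned)
    cleaned = re.sub(r"\s{2,}", " ", cleaned).strip(" ,")
    return cleaned, bpm


caption, bpm = split_tempo("Upbeat synthwave, 120 BPM, punchy drums")
print(caption, bpm)
```

The cleaned caption then goes to `--prompt` and the extracted number to `--bpm`.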
Lyrics Formatting
Structure tags (use in lyrics, not caption): `[Intro]`, `[Verse]`, `[Chorus]`, `[Bridge]`, `[Outro]`, `[Instrumental]`, `[Guitar Solo]`, `[Build]`, `[Drop]`, `[Breakdown]`
Vocal control (prefix lines or sections): `[raspy vocal]`, `[whispered]`, `[falsetto]`, `[powerful belting]`, `[harmonies]`, `[ad-lib]`
Energy indicators:
- UPPERCASE = high intensity ("WE RISE ABOVE")
- Parentheses = background vocals ("We rise (together)")
- Keep 6-10 syllables per line within sections for natural rhythm
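These conventions are regular enough to check before sending lyrics to the model. The sketch below is a hypothetical pre-flight parser and makes no claim about how ACE-Step itself reads lyrics. It groups lines under their bracketed tags, flags all-uppercase lines as high intensity, and collects parenthesised background vocals. It treats any bracketed line as a tag, so vocal-control tags on their own line also start a new entry.

```python
import re

TAG_RE = re.compile(r"^\[(?P<tag>[^\]]+)\]$")
BACKING_RE = re.compile(r"\(([^)]*)\)")


def parse_lyrics(text: str):
    """Split lyrics into sections keyed by their [Tag] lines."""
    sections = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        tag = TAG_RE.match(line)
        if tag:
            current = {"tag": tag.group("tag"), "lines": []}
            sections.append(current)
            continue
        if current is None:  # lyric text that appears before any tag
            current = {"tag": None, "lines": []}
            sections.append(current)
        letters = [c for c in line if c.isalpha()]
        current["lines"].append({
            "text": line,
            "high_intensity": bool(letters) and all(c.isupper() for c in letters),
            "backing": BACKING_RE.findall(line),
        })
    return sections


sample = "[Chorus - anthemic]\nWE RISE ABOVE\nWe rise (together)"
for section in parse_lyrics(sample):
    print(section["tag"], [(l["high_intensity"], l["backing"]) for l in section["lines"]])
```

A syllable check is left out on purpose, since counting syllables reliably needs a pronunciation dictionary rather than a regular expression.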
Example — Tech Product Jingle:
[Verse]
Build it better, ship it faster
Every feature tells a story
[Chorus - anthemic]
THIS IS YOUR PLATFORM
Your vision, your stage
Digital Samba, every page
[Outro - fade]
(Build it better...)
Video Production Integration
Music for Scene Types
| Scene | Preset | Duration | Notes |
|---|---|---|---|
| Title | | 3-5s | Short, mood-setting |
| Problem | | 10-15s | Dark, unsettling |
| Solution | | 10-15s | Relief, optimism |
| Demo | | 30-120s | Non-distracting, matches demo length |
| Stats | | 8-12s | Building credibility |
| CTA | | 5-10s | Maximum energy, punchy |
| Credits | | 5-10s | Gentle fade-out |
Timing Workflow
- Plan scene durations first (from voiceover script)
- Generate music to match: `--duration <scene_seconds>`
- Music duration is precise (within 0.1s of requested)
- For background music spanning multiple scenes: generate one long track
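The workflow above can be sketched as a scene plan that is turned into one command per scene. `Scene` and `scene_commands` are hypothetical names for this example, and the durations and output paths are placeholders. The preset names are ones that appear in the commands earlier in this document.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    name: str
    seconds: int
    preset: str  # a preset name as reported by --list-presets


def scene_commands(scenes, out_dir="music"):
    """Return one music_gen.py argv list per scene, matching each duration."""
    commands = []
    for index, scene in enumerate(scenes, start=1):
        output = f"{out_dir}/{index:02d}_{scene.name}.mp3"
        commands.append([
            "python", "tools/music_gen.py",
            "--preset", scene.preset,
            "--duration", str(scene.seconds),
            "--output", output,
        ])
    return commands


# Durations would normally come from the voiceover script.
plan = [
    Scene("problem", 20, "tension"),
    Scene("demo", 60, "corporate-bg"),
    Scene("cta", 15, "cta"),
]
per_scene = scene_commands(plan)

# For a single bed under all scenes, request the summed length instead.
total_seconds = sum(scene.seconds for scene in plan)
```

Numbering the output files keeps them in timeline order when they are imported into the video project.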
Combining with Voiceover
Background music should be mixed at 10-20% volume in Remotion:
```tsx
<Audio src={staticFile('voiceover.mp3')} volume={1} />
<Audio src={staticFile('bg-music.mp3')} volume={0.15} />
```
For music under narration: use instrumental presets (`corporate-bg`, `ambient`, `lofi`).
For music-forward scenes (title, CTA): can use higher volume or vocal tracks.
Brand Consistency
Use `--brand <name>` to load hints from `brands/<name>/brand.json`.
Use `--cover --reference brand_theme.mp3` to create variations of a brand's sonic identity.
For consistent sound across a project: fix the seed (`--seed 42`) and vary only duration/prompt.
Technical Details
- Output: 48kHz MP3/WAV/FLAC
- Duration range: 10-600 seconds
- BPM range: 30-300
- Inference: ~2-3s on GPU (turbo, 8 steps), ~40-60s on Mac MPS
- Turbo model: 8 steps, no CFG needed, fast and good quality
- Shift parameter: 3.0 recommended for turbo (improves quality)
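The documented limits can be checked before a job is submitted to the endpoint. This is a hypothetical pre-flight check and not the tool's own validation. It assumes the ranges are inclusive and that the output format is selected by file extension, neither of which is stated above.

```python
from pathlib import PurePath

DURATION_RANGE = (10, 600)  # seconds, from the list above
BPM_RANGE = (30, 300)
OUTPUT_EXTENSIONS = {".mp3", ".wav", ".flac"}  # assumption: format follows extension


def check_request(duration, output, bpm=None):
    """Return a list of problems; an empty list means the request looks fine."""
    problems = []
    if not DURATION_RANGE[0] <= duration <= DURATION_RANGE[1]:
        problems.append(f"duration {duration}s is outside 10-600s")
    if bpm is not None and not BPM_RANGE[0] <= bpm <= BPM_RANGE[1]:
        problems.append(f"bpm {bpm} is outside 30-300")
    suffix = PurePath(output).suffix.lower()
    if suffix not in OUTPUT_EXTENSIONS:
        problems.append(f"unsupported output extension {suffix!r}")
    return problems


for problem in check_request(5, "bg.ogg", bpm=400):
    print(problem)
```

Returning every problem at once, rather than raising on the first, makes it easier to fix a whole batch of scene requests in one pass.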
When NOT to use ACE-Step
- Voice cloning: use Qwen3-TTS or ElevenLabs instead
- Sound effects: use ElevenLabs SFX (`tools/sfx.py`)
- Speech/narration: use voiceover tools, not music gen
- Stem extraction from video: extract audio first with FFmpeg, then use `--extract`
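The two-step route for video can be sketched as follows. The helper names are hypothetical. The FFmpeg options used (`-y`, `-i`, `-vn`, `-q:a`) are standard ones for dropping the video stream and writing MP3, and the `music_gen.py` flags are those shown in the Quick Reference. The file naming scheme is an assumption made for this example.

```python
import subprocess
from pathlib import Path


def extraction_commands(video, stem="vocals"):
    """Return (ffmpeg_argv, music_gen_argv) for pulling one stem out of a video."""
    video = Path(video)
    audio = video.with_suffix(".mp3")
    stem_out = video.with_name(f"{video.stem}_{stem}.mp3")
    # Step 1: strip the video stream and write the audio as MP3.
    ffmpeg_argv = ["ffmpeg", "-y", "-i", str(video), "-vn", "-q:a", "2", str(audio)]
    # Step 2: run stem extraction on the audio file.
    music_gen_argv = [
        "python", "tools/music_gen.py",
        "--extract", stem,
        "--input", str(audio),
        "--output", str(stem_out),
    ]
    return ffmpeg_argv, music_gen_argv


def run_extraction(video, stem="vocals"):
    """Run both steps in order, stopping if either command fails."""
    for argv in extraction_commands(video, stem):
        subprocess.run(argv, check=True)
```

Calling `run_extraction("demo.mp4")` would write `demo.mp3` and then `demo_vocals.mp3` next to the source file, provided FFmpeg is installed and the RunPod endpoint is configured.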