acestep

ACE-Step 1.5 Music Generation

Open-source music generation (MIT license) via `tools/music_gen.py`. Runs on RunPod serverless. Requires `RUNPOD_API_KEY` and `RUNPOD_ACESTEP_ENDPOINT_ID` in `.env` (run `--setup` to create the endpoint).
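
Before the first run it can help to confirm that both variables are visible to the process. The sketch below is a minimal check and is not part of `tools/music_gen.py`; the helper name `missing_runpod_vars` is hypothetical, and it assumes the values from `.env` have already been loaded into the environment (for example by your shell or a dotenv loader).

```python
import os

# Variable names as listed in this document.
REQUIRED_VARS = ("RUNPOD_API_KEY", "RUNPOD_ACESTEP_ENDPOINT_ID")


def missing_runpod_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; pass a dict to check something
    other than the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_runpod_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("RunPod configuration looks complete.")
```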

Quick Reference

```bash
# Basic generation
python tools/music_gen.py --prompt "Upbeat tech corporate" --duration 60 --output bg.mp3

# With musical control
python tools/music_gen.py --prompt "Calm ambient piano" --duration 30 --bpm 72 --key "D Major" --output ambient.mp3

# Scene presets (video production)
python tools/music_gen.py --preset corporate-bg --duration 60 --output bg.mp3
python tools/music_gen.py --preset tension --duration 20 --output problem.mp3
python tools/music_gen.py --preset cta --brand digital-samba --duration 15 --output cta.mp3

# Vocals with lyrics
python tools/music_gen.py --prompt "Indie pop jingle" --lyrics "[verse]\nBuild it better\nShip it faster" --duration 30 --output jingle.mp3

# Cover / style transfer
python tools/music_gen.py --cover --reference theme.mp3 --prompt "Jazz piano version" --duration 60 --output jazz_cover.mp3

# Stem extraction
python tools/music_gen.py --extract vocals --input mixed.mp3 --output vocals.mp3

# List presets
python tools/music_gen.py --list-presets
```

Creating a Song (Step by Step)

1. Instrumental background track (simplest)

```bash
python tools/music_gen.py --prompt "Upbeat indie rock, driving drums, jangly guitar" --duration 60 --bpm 120 --key "G Major" --output track.mp3
```

2. Song with vocals and lyrics

Write lyrics in a temp file or pass inline. Use structure tags to control song sections.

```bash
# Write lyrics to a file first (recommended for longer songs)
cat > /tmp/lyrics.txt << 'LYRICS'
[Verse 1]
Walking through the morning light
Coffee in my hand feels right
Another day to build and dream
Nothing's ever what it seems

[Chorus - anthemic]
WE KEEP MOVING FORWARD
Through the noise and doubt
We keep moving forward
That's what it's about

[Verse 2]
Screens are glowing late at night
Shipping code until it's right
The deadline's close but so are we
Almost there, just wait and see

[Chorus - bigger]
WE KEEP MOVING FORWARD
Through the noise and doubt
We keep moving forward
That's what it's about

[Outro - fade]
(Moving forward...)
LYRICS

# Generate the song
python tools/music_gen.py \
  --prompt "Upbeat indie rock anthem, male vocal, driving drums, electric guitar, studio polish" \
  --lyrics "$(cat /tmp/lyrics.txt)" \
  --duration 60 \
  --bpm 128 \
  --key "G Major" \
  --output my_song.mp3
```

3. Using a preset for video background

```bash
python tools/music_gen.py --preset tension --duration 20 --output problem_scene.mp3
```

Key tips for good results

Caption = overall style (genre, instruments, mood, production quality)
Lyrics = temporal structure (verse/chorus flow, vocal delivery)
UPPERCASE in lyrics = high vocal intensity
Parentheses = background vocals: "We rise (together)"
Keep 6-10 syllables per line for natural rhythm
Don't describe the melody in the caption — describe the sound and feeling
Use `--seed` to lock randomness when iterating on prompt/lyrics
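
The last tip can be sketched as a small helper that builds one command line per prompt variant while holding the seed constant, so differences between takes come from the prompt rather than from randomness. The function name `seeded_commands` and the `take_N.mp3` output names are hypothetical; only flags shown in this document are used, and the commands are printed rather than executed.

```python
import shlex


def seeded_commands(prompts, seed, duration=30):
    """Build one music_gen.py command line per prompt, all sharing one seed.

    Hypothetical helper: it only formats command strings and never runs them.
    """
    commands = []
    for index, prompt in enumerate(prompts, start=1):
        args = [
            "python", "tools/music_gen.py",
            "--prompt", prompt,
            "--duration", str(duration),
            "--seed", str(seed),
            "--output", f"take_{index}.mp3",  # illustrative naming only
        ]
        commands.append(shlex.join(args))  # quotes arguments for the shell
    return commands


if __name__ == "__main__":
    variants = ["Calm ambient piano", "Calm ambient piano, warm strings"]
    for command in seeded_commands(variants, seed=42):
        print(command)
```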

Scene Presets

| Preset | BPM | Key | Use Case |
|---|---|---|---|
| `corporate-bg` | 110 | C Major | Professional background, presentations |
| `upbeat-tech` | 128 | G Major | Product launches, tech demos |
| `ambient` | 72 | D Major | Overview slides, reflective content |
| `dramatic` | 90 | D Minor | Reveals, announcements |
| `tension` | 85 | A Minor | Problem statements, challenges |
| `hopeful` | 120 | C Major | Solution reveals, resolutions |
| `cta` | 135 | E Major | Call to action, closing energy |
| `lofi` | 85 | F Major | Screen recordings, coding demos |
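
For scripts that pick a preset programmatically, the table can be mirrored as a lookup. This is only a sketch of the values listed here: the names `PRESETS` and `preset_info` are hypothetical, the tool's own preset definitions may carry more than BPM and key, and `--list-presets` remains the authoritative source.

```python
# BPM and key per preset, copied from the Scene Presets table.
PRESETS = {
    "corporate-bg": (110, "C Major"),
    "upbeat-tech": (128, "G Major"),
    "ambient": (72, "D Major"),
    "dramatic": (90, "D Minor"),
    "tension": (85, "A Minor"),
    "hopeful": (120, "C Major"),
    "cta": (135, "E Major"),
    "lofi": (85, "F Major"),
}


def preset_info(name):
    """Return (bpm, key) for a preset, or raise ValueError listing valid names."""
    try:
        return PRESETS[name]
    except KeyError:
        valid = ", ".join(sorted(PRESETS))
        raise ValueError(f"Unknown preset {name!r}; expected one of: {valid}") from None


if __name__ == "__main__":
    bpm, key = preset_info("tension")
    print(f"tension: {bpm} BPM in {key}")
```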

Task Types

text2music (default)

Generate music from text prompt + optional lyrics.

cover

Style transfer from reference audio. Control blend with `--cover-strength` (0.0-1.0):

0.2 — Loose style inspiration (more creative freedom)
0.5 — Balanced style transfer
0.7 — Close to original structure (default)
1.0 — Maximum fidelity to source
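
A cover request with an explicit strength might be assembled as below. This is a sketch under assumptions: `cover_args` is a hypothetical helper, the default of 0.7 is taken from the list above, and the range check mirrors the documented 0.0-1.0 bounds rather than whatever validation the tool itself performs.

```python
def cover_args(reference, prompt, strength=0.7, duration=60, output="cover.mp3"):
    """Build the argument list for a cover / style-transfer run.

    Hypothetical helper; it uses only flags that appear in this document.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("cover strength must be between 0.0 and 1.0")
    return [
        "python", "tools/music_gen.py",
        "--cover",
        "--reference", reference,
        "--cover-strength", str(strength),
        "--prompt", prompt,
        "--duration", str(duration),
        "--output", output,
    ]


if __name__ == "__main__":
    # A loose reinterpretation of the reference track.
    print(" ".join(cover_args("theme.mp3", "Jazz piano version", strength=0.2)))
```

The returned list can be passed to `subprocess.run(args, check=True)` when you want to execute it.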

extract

Stem separation: isolate individual tracks from mixed audio. Tracks: `vocals`, `drums`, `bass`, `guitar`, `piano`, `keyboard`, `strings`, `brass`, `woodwinds`, `other`.
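
To pull several stems from one mix, one `--extract` call per track can be generated from the list above. The helper name `extract_commands` and the `<input stem>_<track>.mp3` output naming are assumptions made for illustration; whether the tool accepts more than one track per call is not stated here, so the sketch issues one command per stem.

```python
from pathlib import Path

# Track names as listed in this section.
TRACKS = ("vocals", "drums", "bass", "guitar", "piano",
          "keyboard", "strings", "brass", "woodwinds", "other")


def extract_commands(input_path, tracks):
    """Build one argument list per requested stem.

    Output names follow an illustrative convention, not one the tool prescribes.
    """
    unknown = [track for track in tracks if track not in TRACKS]
    if unknown:
        raise ValueError(f"Unsupported track(s): {', '.join(unknown)}")
    base = Path(input_path).stem  # "mixed.mp3" -> "mixed"
    return [
        ["python", "tools/music_gen.py", "--extract", track,
         "--input", input_path, "--output", f"{base}_{track}.mp3"]
        for track in tracks
    ]


if __name__ == "__main__":
    for args in extract_commands("mixed.mp3", ["vocals", "drums"]):
        print(" ".join(args))
```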

repaint (future)

Regenerate a specific time segment within existing audio while preserving the rest.

lego (future, requires base model)

Generate individual instrument tracks within an existing audio context.

complete (future, requires base model)

Extend partial compositions by adding specified instruments.

Prompt Engineering

Caption Writing — Layer Dimensions

Write captions by layering multiple descriptive dimensions rather than single-word descriptions.

Dimensions to include:

Genre/Style: pop, rock, jazz, electronic, lo-fi, synthwave, orchestral
Emotion/Mood: melancholic, euphoric, dreamy, nostalgic, intimate, tense
Instruments: acoustic guitar, synth pads, 808 drums, strings, brass, piano
Timbre: warm, crisp, airy, punchy, lush, polished, raw
Era: "80s synth-pop", "modern indie", "classical romantic"
Production: lo-fi, studio-polished, live recording, cinematic
Vocal: breathy, powerful, falsetto, raspy, spoken word (or "instrumental")

Good: "Slow melancholic piano ballad with intimate female vocal, warm strings building to powerful chorus, studio-polished production"
Bad: "Sad song"
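
One way to make the layering habitual is to assemble captions from named dimensions instead of typing them free-form. The helper below is a hypothetical sketch: `build_caption` and its parameter names are not part of the tool, and the ordering of the parts is an arbitrary choice, not a documented requirement of the model.

```python
def build_caption(genre, mood=None, instruments=(), timbre=None,
                  era=None, production=None, vocal=None):
    """Join whichever dimensions were provided into one comma-separated caption."""
    parts = [era, mood, genre, *instruments, timbre, production, vocal]
    return ", ".join(part for part in parts if part)


if __name__ == "__main__":
    caption = build_caption(
        "piano ballad",
        mood="slow melancholic",
        instruments=["warm strings"],
        production="studio-polished production",
        vocal="intimate female vocal",
    )
    print(caption)
```

The resulting string is what you would pass to `--prompt`.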

Key Principles

Specificity over vagueness: describe instruments, mood, production style
Avoid contradictions: don't request "classical strings" and "hardcore metal" simultaneously
Repetition reinforces priority: repeat important elements for emphasis
Sparse captions = more creative freedom: detailed captions constrain the model
Use metadata params for BPM/key: don't write "120 BPM" in the caption, use `--bpm 120`
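
The last principle lends itself to a quick pre-flight check. The sketch below scans a caption for tempo mentions so they can be moved into `--bpm`; the helper name `bpm_mentions` and the pattern are assumptions for illustration and will not catch every phrasing (for example "one hundred twenty beats per minute").

```python
import re

# Matches mentions such as "120 BPM" or "90bpm" (case-insensitive).
BPM_MENTION = re.compile(r"\b(\d{2,3})\s*bpm\b", re.IGNORECASE)


def bpm_mentions(caption):
    """Return the tempo values written into a caption, as integers."""
    return [int(value) for value in BPM_MENTION.findall(caption)]


if __name__ == "__main__":
    caption = "Upbeat rock at 120 BPM, driving drums"
    found = bpm_mentions(caption)
    if found:
        print(f"Move tempo out of the caption and pass --bpm {found[0]} instead.")
```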

Lyrics Formatting

Structure tags (use in lyrics, not caption):

[Intro]
[Verse]
[Chorus]
[Bridge]
[Outro]
[Instrumental]
[Guitar Solo]
[Build]
[Drop]
[Breakdown]

Vocal control (prefix lines or sections):

[raspy vocal]
[whispered]
[falsetto]
[powerful belting]
[harmonies]
[ad-lib]

Energy indicators:

UPPERCASE = high intensity ("WE RISE ABOVE")
Parentheses = background vocals ("We rise (together)")
Keep 6-10 syllables per line within sections for natural rhythm

Example — Tech Product Jingle:

[Verse]
Build it better, ship it faster
Every feature tells a story

[Chorus - anthemic]
THIS IS YOUR PLATFORM
Your vision, your stage
Digital Samba, every page

[Outro - fade]
(Build it better...)
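
To sanity-check lyrics before sending them, the bracketed tags and uppercase lines can be picked out with a few lines of code. This is a rough sketch, not the model's own parser: `split_sections` and `shouted_lines` are hypothetical names, and any line consisting solely of `[...]` is treated as a section boundary, which means vocal-control tags such as `[whispered]` on their own line also start a new section here.

```python
import re

# A line that consists only of a bracketed tag, e.g. "[Chorus - anthemic]".
TAG_LINE = re.compile(r"^\[(?P<tag>[^\]]+)\]$")


def split_sections(lyrics):
    """Return (tag, lines) pairs in order of appearance."""
    sections = []
    for raw in lyrics.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = TAG_LINE.match(line)
        if match:
            sections.append((match.group("tag"), []))
        elif sections:
            sections[-1][1].append(line)
        else:
            sections.append(("", [line]))  # text before the first tag
    return sections


def shouted_lines(lyrics):
    """Lines written entirely in uppercase (the high-intensity marker)."""
    return [line for _, lines in split_sections(lyrics)
            for line in lines if line.isupper()]


if __name__ == "__main__":
    jingle = "[Verse]\nBuild it better, ship it faster\n\n[Chorus - anthemic]\nTHIS IS YOUR PLATFORM\n"
    print([tag for tag, _ in split_sections(jingle)])
    print(shouted_lines(jingle))
```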

Video Production Integration

Music for Scene Types

| Scene | Preset | Duration | Notes |
|---|---|---|---|
| Title | `dramatic` or `ambient` | 3-5s | Short, mood-setting |
| Problem | `tension` | 10-15s | Dark, unsettling |
| Solution | `hopeful` | 10-15s | Relief, optimism |
| Demo | `lofi` or `corporate-bg` | 30-120s | Non-distracting, matches demo length |
| Stats | `upbeat-tech` | 8-12s | Building credibility |
| CTA | `cta` | 5-10s | Maximum energy, punchy |
| Credits | `ambient` | 5-10s | Gentle fade-out |

Timing Workflow

Plan scene durations first (from voiceover script)
Generate music to match: `--duration <scene_seconds>`
Music duration is precise (within 0.1s of requested)
For background music spanning multiple scenes: generate one long track
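
The first two steps can be sketched as a small planner that turns voiceover cue points into per-scene durations and then into commands. Everything here is an assumption made for illustration: the helper names, the cue format, the output file names, and the choice to round durations up to whole seconds (this document does not say whether `--duration` accepts fractions).

```python
import math
import shlex


def scene_durations(cues, total_seconds):
    """Turn (scene, start_seconds) cues into (scene, whole_seconds) pairs.

    Each scene lasts until the next cue; the last one lasts until total_seconds.
    Durations are rounded up so the music never ends before the scene does.
    """
    starts = [start for _, start in cues] + [total_seconds]
    return [
        (name, math.ceil(starts[i + 1] - start))
        for i, (name, start) in enumerate(cues)
    ]


def scene_commands(cues, total_seconds, presets):
    """Build one command line per scene; presets maps scene name to preset."""
    return [
        shlex.join(["python", "tools/music_gen.py",
                    "--preset", presets[name],
                    "--duration", str(seconds),
                    "--output", f"{name}.mp3"])
        for name, seconds in scene_durations(cues, total_seconds)
    ]


if __name__ == "__main__":
    cues = [("problem", 0.0), ("solution", 12.5), ("demo", 25.0)]
    presets = {"problem": "tension", "solution": "hopeful", "demo": "lofi"}
    for command in scene_commands(cues, 70.0, presets):
        print(command)
```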

Combining with Voiceover

Background music should be mixed at 10-20% volume in Remotion:

```tsx
<Audio src={staticFile('voiceover.mp3')} volume={1} />
<Audio src={staticFile('bg-music.mp3')} volume={0.15} />
```

For music under narration, use instrumental presets (`corporate-bg`, `ambient`, `lofi`). For music-forward scenes (title, CTA), you can use a higher volume or vocal tracks.

Brand Consistency

Use `--brand <name>` to load hints from `brands/<name>/brand.json`. Use `--cover --reference brand_theme.mp3` to create variations of a brand's sonic identity. For consistent sound across a project, fix the seed (`--seed 42`) and vary only duration/prompt.

Technical Details

Output: 48kHz MP3/WAV/FLAC
Duration range: 10-600 seconds
BPM range: 30-300
Inference: ~2-3s on GPU (turbo, 8 steps), ~40-60s on Mac MPS
Turbo model: 8 steps, no CFG needed, fast and good quality
Shift parameter: 3.0 recommended for turbo (improves quality)
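
The documented ranges can be checked before a request is sent. This is a sketch only: `request_problems` is a hypothetical helper that mirrors the duration and BPM limits listed above, and the tool may enforce them differently. The scene table earlier in this document lists some clips shorter than 10 seconds; whether those are accepted is not stated, so the check simply reports them.

```python
DURATION_RANGE = (10, 600)  # seconds, from the list above
BPM_RANGE = (30, 300)       # from the list above


def request_problems(duration, bpm=None):
    """Return human-readable problems; an empty list means the values are in range."""
    problems = []
    low, high = DURATION_RANGE
    if not low <= duration <= high:
        problems.append(f"duration {duration}s is outside {low}-{high}s")
    low, high = BPM_RANGE
    if bpm is not None and not low <= bpm <= high:
        problems.append(f"bpm {bpm} is outside {low}-{high}")
    return problems


if __name__ == "__main__":
    for problem in request_problems(duration=5, bpm=128):
        print("Warning:", problem)
```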

When NOT to use ACE-Step

Voice cloning: use Qwen3-TTS or ElevenLabs instead
Sound effects: use ElevenLabs SFX (`tools/sfx.py`)
Speech/narration: use voiceover tools, not music gen
Stem extraction from video: extract audio first with FFmpeg, then use `--extract`
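
The two-step route for video sources might look like the sketch below. The helper name `video_stem_commands` and the file names are hypothetical. The FFmpeg call uses `-vn` to drop the video stream and write audio only; using MP3 as the intermediate format is an assumption based on the `--input mixed.mp3` example earlier, since accepted input formats are not listed here.

```python
def video_stem_commands(video_path, track, audio_path="audio.mp3"):
    """Return [ffmpeg_args, extract_args] for pulling one stem out of a video.

    Hypothetical helper: it only builds the argument lists and runs nothing.
    """
    ffmpeg_args = ["ffmpeg", "-i", video_path, "-vn", audio_path]  # -vn: no video
    extract_args = [
        "python", "tools/music_gen.py",
        "--extract", track,
        "--input", audio_path,
        "--output", f"{track}.mp3",  # illustrative output name
    ]
    return [ffmpeg_args, extract_args]


if __name__ == "__main__":
    for args in video_stem_commands("demo.mp4", "vocals"):
        print(" ".join(args))
```

To execute the steps in order, pass each list to `subprocess.run(args, check=True)`.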
