stepfun-tts

StepFun stepaudio-2.5-tts

Generate Chinese / Japanese speech with `stepaudio-2.5-tts` (released 2026-04, verified 2026-04-23). This is contextual TTS: emotion and prosody go through natural-language description, not fixed labels.

Companion: for transcription with `stepaudio-2.5-asr` (the sibling model), use the `stepfun-asr` skill. The two share an API key but live on different endpoints with different body shapes.

Why this skill exists: StepAudio 2.5 has two non-obvious pitfalls that cost hours if you don't know them:

- `stepaudio-2.5-tts` rejects `voice_label` (the step-tts-2 way). Emotion/prosody now goes through `instruction` (natural-language description, ≤200 chars) and inline `()` parentheses inside the text itself.
- Censorship is stricter: anything containing 死 / 消失 / sensitive political terms returns `censorship_block`. Your rewrite options are in `references/migration_from_v2.md`.

Config and auth

API key lives in `$STEPFUN_API_KEY` (preferred) or `${CLAUDE_PLUGIN_DATA}/config.json` (fallback for cross-session persistence). All bundled scripts try env first, then config.

First-time setup (one-liner):

```bash
mkdir -p "${CLAUDE_PLUGIN_DATA}" && cat > "${CLAUDE_PLUGIN_DATA}/config.json" <<EOF
{"api_key": "<paste key here>"}
EOF
```

If the user hasn't set a key, ask them to paste it (don't guess / don't use a placeholder). StepFun API keys are available at https://platform.stepfun.com/ → API Keys. Use a Normal key, not a Plan key (Plan keys are restricted to text models and silently fail on audio endpoints).
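The env-first, config-second lookup described above can be sketched as follows. This is a minimal illustration, not the bundled scripts' actual code: the function name `resolve_api_key` is hypothetical, and the only details taken from this document are the `STEPFUN_API_KEY` variable, the `${CLAUDE_PLUGIN_DATA}/config.json` path, and the `api_key` field.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_api_key(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the StepFun API key, or None if it is not configured anywhere.

    Hypothetical helper: mirrors the "env first, then config" order only.
    """
    env = os.environ if env is None else env

    # 1. Preferred source: the environment variable.
    key = env.get("STEPFUN_API_KEY", "").strip()
    if key:
        return key

    # 2. Fallback: config.json under the plugin data directory.
    data_dir = env.get("CLAUDE_PLUGIN_DATA", "").strip()
    if not data_dir:
        return None
    try:
        config = json.loads((Path(data_dir) / "config.json").read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing file, unreadable file, or invalid JSON
        return None

    key = str(config.get("api_key", "")).strip() if isinstance(config, dict) else ""
    # Treat the unedited "<paste key here>" template as "no key" instead of sending it.
    if not key or key.startswith("<"):
        return None
    return key
```

A caller that gets `None` back should ask the user for a key rather than continue with a guess or a placeholder.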

Common tasks — decision tree

| User wants... | Script | Key detail |
|---|---|---|
| Synthesize 1–500 char Chinese with emotion | `scripts/tts_generate.py` | Use `instruction` for mood, `()` for inline prosody |
| Synthesize long text (500–1000 char) | `scripts/tts_generate.py` | 1000 char is the hard cap; split at semantic boundaries above that |
| Batch-generate game/app voice lines | `scripts/tts_generate.py --batch <jsonl>` | Handle `censorship_block` fallback individually |
| A/B compare two TTS models | `scripts/ab_compare.sh` | Compares duration/size across two directories |
| Migrate from `step-tts-2` | see `references/migration_from_v2.md` | `voice_label.emotion` → `instruction` rewrite + censorship list |

Starting points

- Synthesize a single line: run `python3 scripts/tts_generate.py --text "你好" --out /tmp/hello.mp3 --instruction "温暖的希望感"`. For fine-grained control, read the "Contextual TTS" section below.
- A full migration from `step-tts-2` → `stepaudio-2.5-tts`: read `references/migration_from_v2.md` end-to-end before touching code. It has the `INSTRUCTION_MAP`, the SKIP_CENSORED list pattern, and the output-directory strategy for non-destructive A/B.

Contextual TTS — beyond emotion labels

The headline feature of `stepaudio-2.5-tts` is that you stop mapping emotions to fixed tags and start describing what you want in natural language. Two layers:

Global context (`instruction` parameter): sets the overall tone for the entire utterance. ≤200 chars. Think of it like giving stage direction to a voice actor.

`instruction: "克制的悲伤，语气低沉柔弱，像快要消失一样"`

Inline context (`()` parentheses inside `input`): in-sentence directives. Parenthesised content is consumed as directions and is NOT read aloud. Use for precise control of pauses, breath, emphasis, or mid-sentence emotion shifts.

`input: "(试探着问)你好吗？(开心地)太好了！(突然沉下来)不过...我快要消失了。"`

Examples that worked in practice (from 2026-04-23 verification):

- `instruction: "活泼俏皮，像是在撒娇，带点嘴硬"`: noticeably speeds up delivery vs neutral
- `instruction: "耳语声，气声很重，几乎听不清"`: produces audible whisper/breath
- `input: "你好(停顿一下)我是蕾格(轻声)今天(加重)的天气真不错。"`: inline directives all respected

What `stepaudio-2.5-tts` will NOT accept: the `voice_label` parameter. Error: `voice_label is not supported for v2 models`. This is the #1 migration gotcha from step-tts-2.
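The two layers and the `voice_label` restriction can be enforced before a request ever leaves the client. The sketch below is an illustration under stated assumptions: `input` and `instruction` are the field names used in this document, while the `model` field, the helper name `build_tts_payload`, and the choice to count parenthesised directives toward the 1000-character cap are assumptions. Check `references/api_reference.md` for the exact body shape.

```python
MODEL = "stepaudio-2.5-tts"
MAX_INSTRUCTION_CHARS = 200   # limit stated in this document
MAX_INPUT_CHARS = 1000        # hard cap stated in this document


def build_tts_payload(text: str, instruction: str = "", **extra) -> dict:
    """Build a request body for stepaudio-2.5-tts (hypothetical helper).

    `text` may carry inline "(...)" directives; `instruction` sets the global tone.
    """
    # voice_label is the step-tts-2 mechanism and is rejected by this model.
    if "voice_label" in extra:
        raise ValueError(
            "voice_label is not accepted by stepaudio-2.5-tts; "
            "describe the intent in `instruction` instead"
        )
    if len(instruction) > MAX_INSTRUCTION_CHARS:
        raise ValueError(f"instruction is {len(instruction)} chars, limit is {MAX_INSTRUCTION_CHARS}")
    # Assumption: directives in parentheses count toward the cap (the conservative choice).
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input is {len(text)} chars, limit is {MAX_INPUT_CHARS}; split it first")

    payload = {"model": MODEL, "input": text, **extra}  # "model" field name is assumed
    if instruction:
        payload["instruction"] = instruction
    return payload


payload = build_tts_payload(
    "你好(停顿一下)我是蕾格(轻声)今天(加重)的天气真不错。",
    instruction="活泼俏皮，像是在撒娇，带点嘴硬",
)
```

Failing locally on `voice_label` keeps old step-tts-2 call sites from reaching the API and returning the error above.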

Common error patterns (real errors, real fixes)

| Error response | Actual cause | Fix |
|---|---|---|
| `"voice_label is not supported for v2 models"` | Sent `voice_label` to `stepaudio-2.5-tts` | Remove `voice_label`; put the same intent into `instruction` as natural language |
| `"The content you provided or machine outputted is blocked." type: censorship_block` | Sensitive word (死 / 消失 / etc.) | Rewrite the phrase OR fall back to `step-tts-2` for that specific line (mixed-model is fine) |
| Silent audio truncation (input > 1000 chars) | Hard cap exceeded | Split at semantic boundaries; don't truncate mid-sentence |
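For the truncation case, one way to split at semantic boundaries is to cut only after sentence-ending punctuation and pack whole sentences into chunks under the cap. This is a rough sketch, not the bundled script's logic: the function name `split_for_tts` and the punctuation set are assumptions, and a real corpus may need extra boundaries such as paragraph or dialogue markers.

```python
import re
from typing import List

MAX_INPUT_CHARS = 1000  # hard cap stated in this document

# Zero-width split point right after Chinese/Western sentence enders or a newline.
_SENTENCE_END = re.compile(r"(?<=[。！？!?；;\n])")


def split_for_tts(text: str, limit: int = MAX_INPUT_CHARS) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `limit` characters."""
    sentences = [s for s in _SENTENCE_END.split(text) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if len(sentence) > limit:
            # Refuse to cut mid-sentence; the caller must rewrite this sentence.
            raise ValueError(f"single sentence of {len(sentence)} chars exceeds limit {limit}")
        if len(current) + len(sentence) > limit:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Joining the returned chunks reproduces the original text exactly, so each chunk can be synthesized separately and the audio concatenated in order.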

When to read references

- `references/api_reference.md`: exact request/response JSON for `/v1/audio/speech`, all fields, error responses. Read when writing raw HTTP calls instead of using the bundled scripts.
- `references/migration_from_v2.md`: complete playbook for moving a step-tts-2 project to stepaudio-2.5-tts. Has the emotion→instruction rewrite table, the A/B directory strategy, decision checkpoints, and the 2026-04 speed/quality trade-off data (`stepaudio-2.5-tts` is ~20% slower than step-tts-2; audible prosody improvement). Read before any migration work.
- `references/known_issues.md`: censorship patterns, TTS duration inflation, v2-family parameter naming gotcha, 1000-char hard cap. Read when debugging anomalous output or evaluating whether to adopt.

Design invariants (don't break these)

- Non-destructive A/B output: when regenerating a corpus with a new model, write to a parallel directory (e.g. `voice/zh_v25/`), never overwrite the production corpus. The migration playbook shows why.
- Per-line censorship handling: if 2/29 lines get `censorship_block`, don't fail the batch. Log the skipped IDs and continue. Mixed-model fallback (step-tts-2 for the skipped 2) is normal.
- Don't duplicate voice_label logic in new code: any new TTS code targeting stepaudio-2.5-tts should only use `instruction` + inline `()`. Do not write a branch that conditionally emits `voice_label`.
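The first two invariants can be combined in one batch loop: write into a parallel directory and record censored lines instead of aborting. The sketch below is illustrative only. `CensorshipBlocked`, `run_batch`, the `synthesize` callable, and the `id` / `text` / `instruction` record fields are all hypothetical names, not the interface of `scripts/tts_generate.py`.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Mapping, Tuple


class CensorshipBlocked(Exception):
    """Hypothetical: raised by `synthesize` when the API answers censorship_block."""


def run_batch(
    lines: Iterable[Mapping[str, str]],
    synthesize: Callable[[str, str], bytes],
    out_dir: str,
) -> Tuple[List[str], List[str]]:
    """Synthesize every line into `out_dir`; return (written_ids, skipped_ids)."""
    out = Path(out_dir)  # a parallel directory, e.g. voice/zh_v25/, never the production corpus
    out.mkdir(parents=True, exist_ok=True)

    written: List[str] = []
    skipped: List[str] = []
    for line in lines:
        try:
            audio = synthesize(line["text"], line.get("instruction", ""))
        except CensorshipBlocked:
            # Per-line handling: note the ID and keep going.
            skipped.append(line["id"])
            continue
        (out / f"{line['id']}.mp3").write_bytes(audio)
        written.append(line["id"])
    return written, skipped
```

The returned `skipped` list is what a follow-up pass would hand to a rewrite step or to a step-tts-2 fallback for just those lines.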

Pricing (verified 2026-04-23, volatile)

- `stepaudio-2.5-tts` contextual synthesis: ~5.8 CNY (元) per 10,000 characters
- Zero-shot voice cloning: ~9.9 CNY (元) per voice

Re-verify at https://platform.stepfun.com/docs/zh/guides/pricing/details before quoting to stakeholders.
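For a rough budget check, the per-character rate above turns into a simple calculation. This sketch hard-codes the approximate 2026-04-23 figure, which the pricing note marks as volatile. Whether parenthesised directives or `instruction` text are billed is not stated here, so the character count passed in is the caller's assumption. The function name is hypothetical.

```python
from decimal import Decimal

# Approximate rate from this document (verified 2026-04-23, volatile).
CNY_PER_10K_CHARS = Decimal("5.8")


def estimate_synthesis_cost_cny(total_chars: int, rate: Decimal = CNY_PER_10K_CHARS) -> Decimal:
    """Estimate contextual-synthesis cost in CNY, rounded to 0.01."""
    if total_chars < 0:
        raise ValueError("total_chars must be non-negative")
    return (Decimal(total_chars) * rate / Decimal(10000)).quantize(Decimal("0.01"))


# Example: 29 voice lines of 500 characters each = 14,500 characters.
estimate = estimate_synthesis_cost_cny(29 * 500)  # Decimal("8.41")
```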
