stepfun-tts


StepFun stepaudio-2.5-tts


Generate Chinese / Japanese speech with stepaudio-2.5-tts (released 2026-04, verified 2026-04-23). This is contextual TTS: emotion and prosody are controlled through natural-language description, not fixed labels.

Companion: for transcription with stepaudio-2.5-asr (the sibling model), use the stepfun-asr skill. The two share an API key but live on different endpoints with different body shapes.

Why this skill exists: StepAudio 2.5 has two non-obvious pitfalls that cost hours if you don't know them:
  1. stepaudio-2.5-tts rejects voice_label (the step-tts-2 way). Emotion/prosody now goes through instruction (natural-language description, ≤200 chars) and inline () parentheses inside the text itself.
  2. Censorship is stricter: anything containing 死 ("die") / 消失 ("disappear") / sensitive political terms returns censorship_block. Your rewrite options are in references/migration_from_v2.md.

Config and auth


The API key lives in $STEPFUN_API_KEY (preferred) or ${CLAUDE_PLUGIN_DATA}/config.json (fallback for cross-session persistence). All bundled scripts try env first, then config.

First-time setup (one-liner):

```bash
mkdir -p "${CLAUDE_PLUGIN_DATA}" && cat > "${CLAUDE_PLUGIN_DATA}/config.json" <<EOF
{"api_key": "<paste key here>"}
EOF
```

If the user hasn't set a key, ask them to paste it (don't guess, don't use a placeholder). StepFun API keys are available at https://platform.stepfun.com/ → API Keys. Use a Normal key, not a Plan key (Plan keys are restricted to text models and silently fail on audio endpoints).
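The env-first, config-second lookup described above can be sketched as follows. This is a minimal illustration, not the code in the bundled scripts: the function name resolve_api_key is hypothetical, and treating a value that still starts with "<" as "no key" is an assumption made here so the unedited template is never sent to the API.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_api_key(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the StepFun API key: environment first, then config.json."""
    env = os.environ if env is None else env

    # Preferred source: the STEPFUN_API_KEY environment variable.
    key = env.get("STEPFUN_API_KEY", "").strip()
    if key:
        return key

    # Fallback: ${CLAUDE_PLUGIN_DATA}/config.json.
    data_dir = env.get("CLAUDE_PLUGIN_DATA", "").strip()
    if not data_dir:
        return None
    config_path = Path(data_dir) / "config.json"
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(config, dict):
        return None

    key = str(config.get("api_key", "")).strip()
    # Assumption: an unedited "<paste key here>" template counts as no key,
    # so the caller can ask the user instead of sending a placeholder.
    if not key or key.startswith("<"):
        return None
    return key
```

A caller that gets None back should stop and ask the user for a key rather than continue.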

Common tasks — decision tree


| User wants... | Script | Key detail |
| --- | --- | --- |
| Synthesize 1–500 char Chinese with emotion | scripts/tts_generate.py | Use instruction for mood, () for inline prosody |
| Synthesize long text (500–1000 char) | scripts/tts_generate.py | 1000 char is the hard cap; split at semantic boundaries above that |
| Batch-generate game/app voice lines | scripts/tts_generate.py --batch <jsonl> | Handle censorship_block fallback individually |
| A/B compare two TTS models | scripts/ab_compare.sh | Compares duration/size across two directories |
| Migrate from step-tts-2 | see references/migration_from_v2.md | voice_label.emotion → instruction rewrite + censorship list |
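For text above the 1000-character hard cap, "split at semantic boundaries" could look like the sketch below. It is an illustration under stated assumptions, not the logic in scripts/tts_generate.py: the names split_for_tts and _pack are hypothetical, the punctuation sets are a guess at reasonable boundaries, and every character (including inline directives) is counted toward the limit, which is the conservative reading.

```python
import re
from typing import List

# Sentence-ending punctuation (Chinese and ASCII) treated as semantic boundaries.
_SENTENCE_END = re.compile(r"(?<=[。！？!?；;\n])")
# Weaker boundaries, used only when a single sentence exceeds the limit.
_CLAUSE_END = re.compile(r"(?<=[，,、])")


def _pack(pieces: List[str], limit: int) -> List[str]:
    """Greedily merge consecutive pieces into chunks of at most `limit` chars."""
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > limit:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


def split_for_tts(text: str, limit: int = 1000) -> List[str]:
    """Split `text` into chunks of at most `limit` chars at sentence boundaries."""
    pieces: List[str] = []
    for sentence in _SENTENCE_END.split(text):
        if len(sentence) <= limit:
            pieces.append(sentence)
            continue
        # Oversized sentence: break at commas, then hard-split as a last resort.
        for clause in _CLAUSE_END.split(sentence):
            pieces.extend(clause[i:i + limit] for i in range(0, len(clause), limit))
    return [chunk for chunk in _pack(pieces, limit) if chunk.strip()]
```

Each returned chunk can then be synthesized as its own request. One caveat: a directive in parentheses that itself contains sentence punctuation could be cut in two, so such inputs deserve a manual check.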

Starting points


  • Synthesize a single line: run python3 scripts/tts_generate.py --text "你好" --out /tmp/hello.mp3 --instruction "温暖的希望感". For fine-grained control, read the "Contextual TTS" section below.
  • A full migration from step-tts-2 to stepaudio-2.5-tts: read references/migration_from_v2.md end-to-end before touching code. It has the INSTRUCTION_MAP, the SKIP_CENSORED list pattern, and the output-directory strategy for non-destructive A/B.

Contextual TTS — beyond emotion labels


The headline feature of stepaudio-2.5-tts is that you stop mapping emotions to fixed tags and start describing what you want in natural language. There are two layers:

Global context (instruction parameter): sets the overall tone for the entire utterance. ≤200 chars. Think of it as giving stage direction to a voice actor.
instruction: "克制的悲伤,语气低沉柔弱,像快要消失一样"

Inline context (() parentheses inside input): in-sentence directives. Parenthesised content is consumed as direction and is NOT read aloud. Use it for precise control of pauses, breath, emphasis, or mid-sentence emotion shifts.
input: "(试探着问)你好吗?(开心地)太好了!(突然沉下来)不过...我快要消失了。"

Examples that worked in practice (from 2026-04-23 verification):
  • instruction: "活泼俏皮,像是在撒娇,带点嘴硬" → noticeably faster delivery than neutral
  • instruction: "耳语声,气声很重,几乎听不清" → produces audible whisper/breath
  • input: "你好(停顿一下)我是蕾格(轻声)今天(加重)的天气真不错。" → inline directives all respected

What stepaudio-2.5-tts will NOT accept: the voice_label parameter. Error: "voice_label is not supported for v2 models". This is the #1 migration gotcha from step-tts-2.
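A small request-body builder makes the two layers and both limits concrete. This is a sketch under assumptions: only the input and instruction field names and the 200/1000-character limits come from this document; the model field, the helper names, and the acceptance of full-width parentheses are hypothetical, and a real request may need further fields (see references/api_reference.md).

```python
import re
from typing import Dict, Optional

MAX_INPUT_CHARS = 1000       # hard cap stated in this document
MAX_INSTRUCTION_CHARS = 200  # instruction limit stated in this document

# Inline directives in ASCII or full-width parentheses (full-width is an assumption).
_DIRECTIVE = re.compile(r"[(（][^()（）]*[)）]")


def spoken_text(text: str) -> str:
    """Return `text` with inline directives removed, i.e. what should be read aloud."""
    return _DIRECTIVE.sub("", text)


def build_tts_payload(text: str, instruction: Optional[str] = None,
                      model: str = "stepaudio-2.5-tts") -> Dict[str, str]:
    """Build a request body that uses `instruction` and never `voice_label`."""
    if not spoken_text(text).strip():
        raise ValueError("input contains no text to speak")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input is {len(text)} chars; split it (cap is {MAX_INPUT_CHARS})")

    # "model" is an assumed field name; "input" and "instruction" come from this document.
    payload = {"model": model, "input": text}
    if instruction:
        if len(instruction) > MAX_INSTRUCTION_CHARS:
            raise ValueError(
                f"instruction is {len(instruction)} chars; cap is {MAX_INSTRUCTION_CHARS}"
            )
        payload["instruction"] = instruction
    return payload
```

Validating locally is cheaper than discovering a silent truncation after the audio has been generated.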

Common error patterns (real errors, real fixes)


| Error response | Actual cause | Fix |
| --- | --- | --- |
| "voice_label is not supported for v2 models" | Sent voice_label to stepaudio-2.5-tts | Remove voice_label; put the same intent into instruction as natural language |
| "The content you provided or machine outputted is blocked." (type: censorship_block) | Sensitive word (死 / 消失 / etc.) | Rewrite the phrase OR fall back to step-tts-2 for that specific line (mixed-model is fine) |
| Silent audio truncation (input > 1000 chars) | Hard cap exceeded | Split at semantic boundaries; don't truncate mid-sentence |

More in references/known_issues.md.
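The two explicit errors above can be mapped to their fixes with a small classifier. This is a hedged sketch: the message strings and the censorship_block type are quoted from this document, but how they are laid out in the response JSON is not shown here, so the function takes them as plain arguments. TtsFailure and classify_error are hypothetical names.

```python
from enum import Enum
from typing import Optional


class TtsFailure(Enum):
    """Known failure kinds, each carrying the fix suggested in this document."""
    VOICE_LABEL = "remove voice_label; describe the intent in instruction"
    CENSORED = "rewrite the phrase or fall back to step-tts-2 for this line"
    UNKNOWN = "see references/known_issues.md"


def classify_error(message: str, error_type: Optional[str] = None) -> TtsFailure:
    """Map an error message (and optional type) to a known failure kind."""
    if error_type == "censorship_block" or "is blocked" in message:
        return TtsFailure.CENSORED
    if "voice_label is not supported" in message:
        return TtsFailure.VOICE_LABEL
    return TtsFailure.UNKNOWN
```

Silent truncation never produces an error response, so it cannot be classified this way; the only defence is checking the input length before sending.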

When to read references


  • references/api_reference.md: exact request/response JSON for /v1/audio/speech, all fields, error responses. Read when writing raw HTTP calls instead of using the bundled scripts.
  • references/migration_from_v2.md: complete playbook for moving a step-tts-2 project to stepaudio-2.5-tts. Has the emotion→instruction rewrite table, the A/B directory strategy, decision checkpoints, and the 2026-04 speed/quality trade-off data (stepaudio-2.5-tts is ~20% slower than step-tts-2, with an audible prosody improvement). Read before any migration work.
  • references/known_issues.md: censorship patterns, TTS duration inflation, v2-family parameter naming gotcha, 1000-char hard cap. Read when debugging anomalous output or evaluating whether to adopt.

Design invariants (don't break these)


  1. Non-destructive A/B output: when regenerating a corpus with a new model, write to a parallel directory (e.g. voice/zh_v25/), never overwrite the production corpus. The migration playbook shows why.
  2. Per-line censorship handling: if 2/29 lines get censorship_block, don't fail the batch. Log the skipped IDs and continue. Mixed-model fallback (step-tts-2 for the skipped 2) is normal.
  3. Don't duplicate voice_label logic in new code: any new TTS code targeting stepaudio-2.5-tts should use only instruction + inline (). Do not write a branch that conditionally emits voice_label.
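The three invariants can be read as a batch loop like the one below. It is a minimal sketch, not the --batch implementation in scripts/tts_generate.py: run_batch, CensorshipBlocked, and the injected synthesize callable are hypothetical, and the .mp3 extension simply follows the example command earlier in this document.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple


class CensorshipBlocked(Exception):
    """Hypothetical exception the caller's synthesize() raises on censorship_block."""


def run_batch(lines: Iterable[Tuple[str, Dict[str, str]]],
              synthesize: Callable[[Dict[str, str]], bytes],
              out_dir: Path,
              production_dir: Path) -> List[str]:
    """Synthesize every line into `out_dir`; return the IDs skipped for censorship."""
    # Invariant 1: never write into the production corpus.
    if out_dir.resolve() == production_dir.resolve():
        raise ValueError("out_dir must be a parallel directory, not the production corpus")
    out_dir.mkdir(parents=True, exist_ok=True)

    skipped: List[str] = []
    for line_id, payload in lines:
        # Invariant 3: new code never emits voice_label.
        if "voice_label" in payload:
            raise ValueError(f"{line_id}: voice_label is not accepted; use instruction")
        try:
            audio = synthesize(payload)
        except CensorshipBlocked:
            # Invariant 2: record the ID and keep going instead of failing the batch.
            skipped.append(line_id)
            continue
        (out_dir / f"{line_id}.mp3").write_bytes(audio)
    return skipped
```

The returned list is what a second pass would feed to step-tts-2, or hand to a human for rewriting.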

Pricing (verified 2026-04-23, volatile)


  • stepaudio-2.5-tts contextual synthesis: ~5.8 CNY per 10,000 characters
  • Zero-shot voice cloning: ~9.9 CNY per voice
Re-verify at https://platform.stepfun.com/docs/zh/guides/pricing/details before quoting to stakeholders.
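For a rough budget, the per-character price turns into a one-line estimate. This is only arithmetic on the approximate figure above; the function name is hypothetical, and whether inline directives or the instruction text count as billed characters is not stated here, so treat the result as a ballpark.

```python
def estimate_tts_cost_cny(char_count: int, price_per_10k: float = 5.8) -> float:
    """Estimate synthesis cost in CNY from a character count.

    The default price is the approximate 2026-04-23 figure and is volatile.
    """
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    return char_count / 10_000 * price_per_10k
```

Under these assumptions, a 25,000-character corpus comes to about 14.5 CNY.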