stepfun-asr

StepFun stepaudio-2.5-asr

Transcribe audio with StepFun's `stepaudio-2.5-asr` (released 2026-04, verified 2026-04-23). It handles long audio in one call with no chunking, but only if the request hits the right endpoint with the right body shape. The wrong endpoint returns an error that looks identical to "model doesn't exist", which is the #1 reason this skill exists.

Companion: for TTS with `stepaudio-2.5-tts` (the sibling model), use the `stepfun-tts` skill. The two share an API key but live on different endpoints with different body shapes.

Why this skill exists — three traps that cost hours

1. Wrong endpoint, wrong error. `stepaudio-2.5-asr` does not live on `/v1/audio/transcriptions` (that endpoint serves the older `step-asr` family). It lives on `/v1/audio/asr/sse`: SSE streaming, JSON body, base64 audio. Sending it to the wrong endpoint returns `{"error":{"message":"model stepaudio-2.5-asr not supported"}}`, which is identical in structure to the error for a genuinely nonexistent model name. People waste hours filing whitelist tickets.
2. Plan key vs Normal key, silent failure. StepFun's "Plan" subscription keys (cheap, text-only) cannot call audio endpoints, but the failure manifests as a 4xx with no auth-shaped error message. If your account has a Plan subscription, you need a separate "Normal" key from the same console.
3. SSE error events are real. Censorship can fire on the ASR side too (rarely). Don't assume only `transcript.text.delta` and `transcript.text.done` events arrive; handle `type: error` events in the stream or you'll silently drop them.

Config and auth

配置与认证

API key resolves in this order (fail-fast, no defaults):

1. `$STEPFUN_API_KEY` environment variable
2. `${CLAUDE_PLUGIN_DATA}/config.json` with `{"api_key": "..."}` (cross-session persistence)

First-time setup:

```bash
mkdir -p "${CLAUDE_PLUGIN_DATA}" && cat > "${CLAUDE_PLUGIN_DATA}/config.json" <<EOF
{"api_key": "<paste Normal key here>"}
EOF
```

If the user has not set a key, ask them to paste it — do not guess or use a placeholder. Get keys at https://platform.stepfun.com/ → API Keys. Use a Normal key, not a Plan key.
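The resolution order above can be sketched in a few lines of Python. This is a minimal illustration, not the bundled script's code: the function name `resolve_api_key` and the `env` parameter (injected so the logic can be exercised without touching the real environment) are hypothetical.

```python
import json
import os
from pathlib import Path


def resolve_api_key(env=None):
    """Return the StepFun API key, or raise. Never falls back to a placeholder."""
    env = os.environ if env is None else env

    # 1. The environment variable takes priority.
    key = env.get("STEPFUN_API_KEY", "").strip()
    if key:
        return key

    # 2. Persistent config file under the plugin data directory.
    data_dir = env.get("CLAUDE_PLUGIN_DATA", "").strip()
    if data_dir:
        config_path = Path(data_dir) / "config.json"
        if config_path.is_file():
            config = json.loads(config_path.read_text(encoding="utf-8"))
            key = str(config.get("api_key", "")).strip()
            if key:
                return key

    # Fail fast: an empty or guessed key would only surface later as a confusing 4xx.
    raise RuntimeError(
        "No StepFun API key found: set STEPFUN_API_KEY or write config.json"
    )
```

Raising instead of returning an empty string keeps the failure at the point where it can still be explained to the user.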

Quick start — single file

```bash
python3 scripts/asr_transcribe.py /path/to/audio.mp3
```

Output: plain text transcription on stdout.

For machine-readable output with usage / timing:

```bash
python3 scripts/asr_transcribe.py /path/to/audio.mp3 --json
```

For non-Chinese audio:

```bash
python3 scripts/asr_transcribe.py /path/to/audio.mp3 --language en
```

The script handles base64 encoding, the nested `{audio: {data, input: {transcription, format}}}` body, SSE parsing, and the misleading-endpoint pitfall. Prefer it over hand-rolled HTTP calls unless integrating into a larger pipeline.
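For readers integrating into a larger pipeline, the nested body might be assembled roughly as follows. Only the outer shape (`audio.data`, `audio.input.transcription`, `audio.input.format`) comes from this document. The keys inside `transcription` and `format` (`model`, `language`, `type`), the `zh` default, and the function name `build_asr_body` are assumptions for illustration; check `references/api_reference.md` for the real field names before sending anything.

```python
import base64
import json


def build_asr_body(audio_bytes, audio_format="mp3", language="zh"):
    """Sketch of the nested request body. Inner key names are ASSUMPTIONS."""
    return {
        "audio": {
            # Audio travels as base64 text so the whole body stays valid JSON.
            "data": base64.b64encode(audio_bytes).decode("ascii"),
            "input": {
                # ASSUMPTION: "model" and "language" are illustrative key names.
                "transcription": {
                    "model": "stepaudio-2.5-asr",
                    "language": language,
                },
                # ASSUMPTION: "type" is a hypothetical key for the format flag.
                "format": {"type": audio_format},
            },
        }
    }


body = build_asr_body(b"\x00\x01\x02", audio_format="wav", language="en")
print(json.dumps(body, indent=2))
```

Base64 inflates the payload by about a third, which is one reason a compact codec pays off on long recordings.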

Decision table

| Scenario | Action |
| --- | --- |
| Short clip (< 5 min), Chinese or English, mp3/wav/ogg/opus | `python3 scripts/asr_transcribe.py audio.mp3` |
| Long audio (5-30 min) | Same script; the 32K context handles it in a single call, no chunking needed |
| Audio > 30 min | Split with ffmpeg before sending; the API rejects oversized payloads |
| Need usage/billing data | Add `--json` to capture `usage.input_tokens` / `usage.total_tokens` from `transcript.text.done` |
| Highly repetitive content (same phrase 5+ times, > 90s) | Cross-validate with `step-asr-1.1`; see repetition hallucination in `references/known_issues.md` |
| Hit `model stepaudio-2.5-asr not supported` | Wrong endpoint. Switch from `/v1/audio/transcriptions` to `/v1/audio/asr/sse` |
| Hit silent 4xx auth failure | Verify your key is "Normal" not "Plan"; Plan keys cannot call audio endpoints |
| Need to write raw HTTP (no Python) | Read `references/api_reference.md` for exact JSON body and SSE event shapes |
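For the over-30-minute row, one way to split is ffmpeg's segment muxer. The sketch below only builds the command so it can be inspected before running. The helper name `ffmpeg_split_command`, the 25-minute segment length (chosen to leave headroom under the 30-minute limit), and the `_partNNN` naming are assumptions, not something the bundled script provides.

```python
import shlex
from pathlib import Path


def ffmpeg_split_command(src, segment_seconds=25 * 60):
    """Build (not run) an ffmpeg command that cuts src into fixed-length parts."""
    src = Path(src)
    # %03d is expanded by ffmpeg itself: meeting_part000.mp3, meeting_part001.mp3, ...
    pattern = src.with_name(f"{src.stem}_part%03d{src.suffix}")
    return [
        "ffmpeg", "-i", str(src),
        "-f", "segment",                        # segment muxer: one output per chunk
        "-segment_time", str(segment_seconds),  # target chunk length in seconds
        "-c", "copy",                           # no re-encode, so cuts are approximate
        str(pattern),
    ]


cmd = ffmpeg_split_command("meeting.mp3")
print(shlex.join(cmd))
```

Pass the list to `subprocess.run(cmd, check=True)` to execute it, then transcribe each part in order and join the texts. A fixed-length cut can land mid-word, so expect a small seam at each boundary.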

Supported audio formats

The script auto-detects the format from the file extension; pass `--format` to override:

| Extension | Format flag | Notes |
| --- | --- | --- |
| `.mp3` | `mp3` | Most common, default |
| `.wav` | `wav` | Lossless |
| `.ogg` | `ogg` | OGG container |
| `.opus` | `ogg` | Opus codec in OGG container; pass through unchanged |
| `.pcm` | `pcm` | Raw PCM; also requires `format.rate`, `format.channel`, `format.bits` (see API reference) |

For mp4/m4a/webm/etc., transcode to one of the above first via ffmpeg. Production pipelines often pre-transcode everything to OGG/Opus 16kHz mono to minimize base64 payload size.
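The extension mapping and the pre-transcode step could look like the sketch below. It mirrors the table above but is not the bundled script's implementation; `detect_format` and `opus_transcode_command` are hypothetical names, and the ffmpeg flags (`-ac 1` for mono, `-ar 16000` for 16 kHz, `-c:a libopus`) assume an ffmpeg build with libopus available.

```python
from pathlib import Path

# Extension -> format flag, as listed in the table above.
EXTENSION_TO_FORMAT = {
    ".mp3": "mp3",
    ".wav": "wav",
    ".ogg": "ogg",
    ".opus": "ogg",  # Opus in an OGG container is sent as "ogg"
    ".pcm": "pcm",
}


def detect_format(path, override=None):
    """Return the format flag for path, honoring an explicit override."""
    if override:
        return override
    ext = Path(path).suffix.lower()
    if ext not in EXTENSION_TO_FORMAT:
        raise ValueError(f"unsupported extension {ext!r}: transcode with ffmpeg first")
    return EXTENSION_TO_FORMAT[ext]


def opus_transcode_command(src):
    """Build (not run) an ffmpeg command producing 16 kHz mono OGG/Opus."""
    dst = Path(src).with_suffix(".ogg")
    return ["ffmpeg", "-i", str(src), "-ac", "1", "-ar", "16000",
            "-c:a", "libopus", str(dst)]


print(detect_format("interview.OPUS"))
print(opus_transcode_command("call.m4a"))
```

Rejecting unknown extensions up front is cheaper than discovering the problem after base64-encoding and uploading a large file.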

Capacity and performance (verified 2026-04-23)

- 32K context window: the single-call upper limit, so no chunking is needed for audio ≤ 30 min
- ~85-101× faster than real time on long audio (17.4 min audio → 10.4 s wall clock)
- ~5.3× speedup vs `step-asr-1.1` at the 100s+ length range
- Only ~2× speedup at the 5-15s range, because the LLM spin-up cost dominates short clips. If your workload is many short clips, the migration ROI is modest
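The long-audio figure can be checked by hand from the one measurement quoted above (17.4 minutes of audio, 10.4 seconds of wall-clock time). The helper name `speed_factor` is illustrative.

```python
def speed_factor(audio_seconds, wall_seconds):
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / wall_seconds


audio_seconds = 17.4 * 60  # 17.4 minutes of audio
wall_seconds = 10.4        # measured processing time

factor = speed_factor(audio_seconds, wall_seconds)
print(f"{factor:.1f}x faster than real time")  # prints "100.4x faster than real time"
```

That single data point sits at the top of the quoted ~85-101× range.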

Common error patterns

| Error response | Actual cause | Fix |
| --- | --- | --- |
| `"model stepaudio-2.5-asr not supported"` on `/v1/audio/transcriptions` | Wrong endpoint | Switch to `/v1/audio/asr/sse` (script does this) |
| Silent 4xx with no auth message | Using a "Plan" key on audio endpoint | Get a "Normal" key from the StepFun console |
| ASR returns 3-4× expected character count | Repetition hallucination on highly-repetitive audio | Cross-validate with `step-asr-1.1`; see `references/known_issues.md` |
| `data: {"type":"error","message":"content blocked..."}` mid-stream | Censorship fired on user-uploaded content | Handle SSE `error` event explicitly; don't assume only `delta` / `done` arrive |

More edge cases in `references/known_issues.md`.
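A wrapper can turn the two HTTP-level patterns above into actionable hints instead of surfacing the raw response. The sketch below is a heuristic under stated assumptions: the function name `diagnose` is hypothetical, and treating any 4xx on an audio path as a possible Plan-key problem is a guess based on this document, not documented API behavior.

```python
def diagnose(endpoint, status, body_text):
    """Map a failed HTTP response to the likely real cause."""
    path = endpoint.rstrip("/")

    # Trap 1: the "not supported" message on the legacy endpoint means
    # wrong endpoint, not a missing model or whitelist entry.
    if "not supported" in body_text and path.endswith("/v1/audio/transcriptions"):
        return "wrong endpoint: use /v1/audio/asr/sse"

    # Trap 2 (HEURISTIC): a 4xx on an audio path with no auth-shaped
    # message is consistent with a Plan key.
    if 400 <= status < 500 and "/audio/" in path:
        return "possible Plan key: audio endpoints need a Normal key"

    return "unrecognized: see references/known_issues.md"


print(diagnose(
    "/v1/audio/transcriptions",
    400,
    '{"error":{"message":"model stepaudio-2.5-asr not supported"}}',
))
```

The endpoint check runs first because the wrong-endpoint response is itself a 4xx and would otherwise be misread as a key problem.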

Design invariants (do not break)

- Always pass through SSE. Don't try to buffer the response with a non-streaming client. The model emits `transcript.text.delta` for long audio; `transcript.text.done` carries the authoritative full text and `usage`. Reject the SSE format entirely and you'll get nothing.
- Take final text from `transcript.text.done.text`. Concatenated deltas can drift on edge cases. Deltas are for progressive UI; the `done` event is the source of truth.
- Handle `error` events in the stream. Don't treat the SSE stream as if only success events arrive. A blocked-content event mid-stream returns `type: error` with no `done` event.
- Fail fast on a missing API key. Never default to a placeholder or empty string. The script does this; preserve the behavior in any wrapper.
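The three stream-handling invariants can be expressed as one small consumer. This is a sketch rather than the bundled script's parser: it assumes each event arrives as a single `data: {...}` line carrying a `type` field (as in the error example earlier in this document), and the names `AsrStreamError` and `read_transcript` are hypothetical.

```python
import json


class AsrStreamError(RuntimeError):
    """The stream carried an error event or ended without a done event."""


def read_transcript(lines):
    """Consume SSE lines; return (final_text, usage) from the done event."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, other SSE fields
        try:
            event = json.loads(line[len("data:"):].strip())
        except json.JSONDecodeError:
            continue  # ASSUMPTION: non-JSON data lines can be skipped
        if not isinstance(event, dict):
            continue

        kind = event.get("type")
        if kind == "error":
            # Surface blocked-content and other errors instead of dropping them.
            raise AsrStreamError(event.get("message", "unknown stream error"))
        if kind == "transcript.text.done":
            # The done event is the source of truth for text and usage.
            return event.get("text", ""), event.get("usage")
        # transcript.text.delta is only useful for progressive display.

    # No done event arrived: treat a truncated stream as a failure.
    raise AsrStreamError("stream ended without transcript.text.done")
```

Any iterable of decoded lines works as input, for example the line iterator of a streaming HTTP response. Deltas are deliberately ignored here, so the returned text can never drift from what the `done` event reports.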

When to read references

- `references/api_reference.md`: exact JSON request body, all fields, all SSE event types, response examples. Read when writing raw HTTP calls instead of using the bundled script.
- `references/known_issues.md`: repetition hallucination details, the wrong-endpoint diagnostic trail, Plan-vs-Normal key gotcha, ASR-side censorship handling, pricing opacity. Read when debugging anomalous output or evaluating whether to migrate from `step-asr-1.1`.

Pricing (verified 2026-04-23, volatile)

`stepaudio-2.5-asr` is in invitation beta as of 2026-04-23, with no public per-minute rate. The `step-asr-1.1` baseline is 2.2 CNY/hour (2.2 元/小时). The invitation PDF mentions "成本直降 80%" ("cost cut by 80%"), implying ~0.4 CNY/hour, but this is not yet on the pricing page. Re-verify at https://platform.stepfun.com/docs/zh/guides/pricing/details before quoting to stakeholders.
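The implied rate is simple arithmetic on the two figures quoted above. It is an inference from marketing copy, not a published price, and the helper name `implied_rate` is illustrative.

```python
def implied_rate(baseline_per_hour, claimed_reduction):
    """Price per hour after applying a fractional cost reduction."""
    return baseline_per_hour * (1 - claimed_reduction)


baseline_cny_per_hour = 2.2  # step-asr-1.1 baseline
claimed_reduction = 0.80     # "cost cut by 80%" from the invitation PDF

rate = implied_rate(baseline_cny_per_hour, claimed_reduction)
print(f"{rate:.2f} CNY/hour")  # prints "0.44 CNY/hour"
```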

Next Step

ASR transcripts almost always benefit from one of two downstream steps. After producing the transcript, suggest the natural next skill:

Transcription complete. The output is raw text from the model — common next steps:

Options:
A) transcript-fixer — clean up ASR errors (homophones, segmentation, filler words). Recommended if the recording is a real-world conversation, podcast, or interview rather than read-aloud text
B) meeting-minutes-taker — turn the transcript into structured minutes with decisions, action items, and speaker attribution. Recommended if the recording is a meeting
C) No thanks — the raw transcript is what I needed

Skip the suggestion when the user has already specified the downstream tool, or when the transcription was clearly a one-off lookup (e.g., "what does this 15-second clip say?").
