stepfun-asr
StepFun stepaudio-2.5-asr
Transcribe audio with StepFun's `stepaudio-2.5-asr` (released 2026-04, verified 2026-04-23). Long audio goes through in one call with no chunking, but only if the request hits the right endpoint with the right body shape. The wrong endpoint returns an error that looks identical to "model doesn't exist", which is the #1 reason this skill exists.
Companion: for TTS with `stepaudio-2.5-tts` (the sibling model), use the `stepfun-tts` skill. The two models share an API key but live on different endpoints with different body shapes.
Why this skill exists — three traps that cost hours
- Wrong endpoint, wrong error. `stepaudio-2.5-asr` does not live on `/v1/audio/transcriptions` (that endpoint serves the older `step-asr` family). It lives on `/v1/audio/asr/sse`: SSE streaming, JSON body, base64 audio. Sending it to the wrong endpoint returns `{"error":{"message":"model stepaudio-2.5-asr not supported"}}`, which is identical in structure to the error for a genuinely nonexistent model name. People waste hours filing whitelist tickets.
- Plan key vs Normal key, silent failure. StepFun's "Plan" subscription keys (cheap, text-only) cannot call audio endpoints, but the failure manifests as a 4xx with no auth-shaped error message. If your account has a Plan subscription, you need a separate "Normal" key from the same console.
- SSE error events are real. Censorship can fire on the ASR side too (rarely). Don't assume only `transcript.text.delta` and `transcript.text.done` events arrive; handle `type: error` events in the stream or you'll silently drop them.
Config and auth
API key resolves in this order (fail-fast, no defaults):
- `$STEPFUN_API_KEY` environment variable
- `${CLAUDE_PLUGIN_DATA}/config.json` with `{"api_key": "..."}` (cross-session persistence)
First-time setup:
```bash
mkdir -p "${CLAUDE_PLUGIN_DATA}" && cat > "${CLAUDE_PLUGIN_DATA}/config.json" <<EOF
{"api_key": "<paste Normal key here>"}
EOF
```
If the user has not set a key, ask them to paste it; do not guess or use a placeholder. Get keys at https://platform.stepfun.com/ → API Keys. Use a Normal key, not a Plan key.
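The resolution order above can be sketched in Python roughly as follows. This is a minimal illustration, not the bundled script's actual code: the function name `resolve_api_key` and the injectable `env` mapping are assumptions made for the example; only the two sources, their order, and the fail-fast behavior come from the text.

```python
import json
import os
from pathlib import Path


def resolve_api_key(env=None):
    """Return the StepFun API key, or raise if neither source provides one.

    Hypothetical helper: `env` defaults to os.environ and exists only so the
    lookup can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env

    # 1. The environment variable wins when it is set and non-empty.
    key = env.get("STEPFUN_API_KEY", "").strip()
    if key:
        return key

    # 2. Fall back to the persisted config file.
    data_dir = env.get("CLAUDE_PLUGIN_DATA", "")
    if data_dir:
        config_path = Path(data_dir) / "config.json"
        if config_path.is_file():
            config = json.loads(config_path.read_text(encoding="utf-8"))
            key = str(config.get("api_key", "")).strip()
            if key:
                return key

    # Fail fast: no placeholder and no empty-string default.
    raise RuntimeError(
        "No StepFun API key found; set STEPFUN_API_KEY or write config.json"
    )
```

Raising instead of returning an empty string keeps a missing key from surfacing later as a confusing 4xx from the API.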
Quick start — single file
```bash
python3 scripts/asr_transcribe.py /path/to/audio.mp3
```
Output: plain text transcription on stdout.
For machine-readable output with usage / timing:
```bash
python3 scripts/asr_transcribe.py /path/to/audio.mp3 --json
```
For non-Chinese audio:
```bash
python3 scripts/asr_transcribe.py /path/to/audio.mp3 --language en
```
The script handles base64 encoding, the nested `{audio: {data, input: {transcription, format}}}` body, SSE parsing, and the misleading-endpoint pitfall. Prefer it over hand-rolled HTTP calls unless integrating into a larger pipeline.
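For readers integrating into a larger pipeline, the nesting the script takes care of might look like the sketch below. Only the key names and their nesting are taken from the shape quoted above; the helper name `build_asr_body` is hypothetical, and the contents of `transcription` and `format` are deliberately left to the caller because they are documented in `references/api_reference.md`, not here.

```python
import base64


def build_asr_body(audio_bytes, transcription, audio_format):
    """Wrap raw audio bytes in the nested body shape named in the text.

    Hypothetical helper: `transcription` and `audio_format` are passed through
    untouched, so this sketch makes no claim about their real contents.
    """
    return {
        "audio": {
            # JSON cannot carry raw bytes, so the audio travels as base64 text.
            "data": base64.b64encode(audio_bytes).decode("ascii"),
            "input": {
                "transcription": transcription,
                "format": audio_format,
            },
        }
    }
```

Keeping the two `input` values opaque means the sketch stays valid even if their schema changes.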
Decision table
| Scenario | Action |
|---|---|
| Short clip (< 5 min), Chinese or English, mp3/wav/ogg/opus | `python3 scripts/asr_transcribe.py /path/to/audio.mp3` |
| Long audio (5-30 min) | Same script: the 32K context handles it in a single call, no chunking needed |
| Audio > 30 min | Split with ffmpeg before sending; the API rejects oversized payloads |
| Need usage/billing data | Add `--json` |
| Highly repetitive content (same phrase 5+ times, > 90s) | Cross-validate the transcript; `references/known_issues.md` covers repetition hallucination |
| Hit `model stepaudio-2.5-asr not supported` | Wrong endpoint. Switch from `/v1/audio/transcriptions` to `/v1/audio/asr/sse` |
| Hit silent 4xx auth failure | Verify your key is "Normal" not "Plan"; Plan keys cannot call audio endpoints |
| Need to write raw HTTP (no Python) | Read `references/api_reference.md` |
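For the "Audio > 30 min" row, one way to plan the split is sketched below. It only builds ffmpeg argument lists and runs nothing. The function names, the equal-length strategy, and the output naming are assumptions for illustration; `-ss`, `-t`, `-i`, and `-c copy` are standard ffmpeg options, and the 30-minute ceiling comes from the table.

```python
import math


def plan_segments(duration_s, max_segment_s=30 * 60):
    """Return (start, length) pairs in seconds covering the whole recording."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    count = math.ceil(duration_s / max_segment_s)
    # Equal-length pieces avoid ending on a tiny trailing segment.
    length = duration_s / count
    return [(i * length, length) for i in range(count)]


def ffmpeg_split_commands(src, duration_s, stem="part", ext="mp3"):
    """Build one ffmpeg argv per segment (hypothetical naming: part_000.mp3, ...)."""
    return [
        ["ffmpeg", "-ss", f"{start:.3f}", "-t", f"{length:.3f}", "-i", src,
         "-c", "copy", f"{stem}_{i:03d}.{ext}"]
        for i, (start, length) in enumerate(plan_segments(duration_s))
    ]
```

A 75-minute recording yields three 25-minute commands. Cuts made this way can land mid-word, so a production pipeline may prefer silence-aware boundaries; that is outside this sketch.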
Supported audio formats
The script auto-detects the format from the file extension; pass `--format` to override:
| Extension | Format flag | Notes |
|---|---|---|
| | | Most common, default |
| | | Lossless |
| | | OGG container |
| | | Opus codec in OGG container; pass through unchanged |
| | | Raw PCM; also requires |
For mp4/m4a/webm/etc., transcode to one of the above first via ffmpeg. Production pipelines often pre-transcode everything to OGG/Opus 16kHz mono to minimize base64 payload size.
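The payload-size argument can be made concrete with a short sketch. The ffmpeg options used (`-ac 1` for mono, `-ar 16000` for 16 kHz, `-c:a libopus`) are standard, but the helper names and the default output filename are assumptions, and the command is only built, not run.

```python
import base64
import math


def transcode_command(src, dst="audio.ogg"):
    """ffmpeg argv for mono, 16 kHz, Opus-in-OGG output (not executed here)."""
    return ["ffmpeg", "-i", src, "-ac", "1", "-ar", "16000", "-c:a", "libopus", dst]


def base64_size(raw_bytes):
    """Length in characters of padded base64 for a payload of `raw_bytes` bytes."""
    return 4 * math.ceil(raw_bytes / 3)
```

Base64 inflates every 3 bytes into 4 characters, so a 3,000,000-byte file becomes 4,000,000 characters of request body; shrinking the audio before encoding saves by the same 4/3 factor.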
Capacity and performance (verified 2026-04-23)
- 32K context window: the single-call upper limit, so no chunking is needed for audio ≤ 30 min
- ~85-101× faster than real time on long audio (17.4 min audio → 10.4 s wall clock)
- ~5.3× speedup vs step-asr-1.1 at the 100s+ length range
- Only ~2× speedup at the 5-15s range, because the LLM spin-up cost dominates short clips. If your workload is many short clips, the migration ROI is modest.
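As a quick check on the long-audio figure, the speed factor is just audio duration divided by wall-clock time. The helper name below is made up for the example; the two input numbers are the ones quoted in the list.

```python
def realtime_speed(audio_seconds, wall_seconds):
    """Seconds of audio transcribed per second of wall-clock time."""
    return audio_seconds / wall_seconds


# 17.4 min of audio is 1044 s; it finished in 10.4 s of wall clock.
speed = realtime_speed(17.4 * 60, 10.4)
print(f"{speed:.1f}x real time")  # prints 100.4x real time
```

That lands at the upper end of the quoted 85-101× range, so the example in parentheses is one of the faster runs rather than a typical one.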
Common error patterns
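| Error response | Actual cause | Fix |
|---|---|---|
| `{"error":{"message":"model stepaudio-2.5-asr not supported"}}` | Wrong endpoint | Switch to `/v1/audio/asr/sse` |
| Silent 4xx with no auth message | Using a "Plan" key on audio endpoint | Get a "Normal" key from the StepFun console |
| ASR returns 3-4× expected character count | Repetition hallucination on highly repetitive audio | Cross-validate the transcript; see `references/known_issues.md` |
| `type: error` event in the SSE stream | Censorship fired on user-uploaded content | Handle SSE `type: error` events explicitly |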
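The table can be turned into a first-pass triage helper along these lines. This is a heuristic sketch, not part of the bundled script: the function names, the return strings, and the 3.0 threshold (taken from the "3-4×" figure) are assumptions, and a 4xx can of course have causes other than a Plan key.

```python
WRONG_ENDPOINT = "/v1/audio/transcriptions"
SSE_ENDPOINT = "/v1/audio/asr/sse"


def triage(status, body_text, endpoint):
    """Map a failed HTTP response to the likely cause from the table."""
    if "not supported" in body_text and endpoint.endswith(WRONG_ENDPOINT):
        return f"wrong endpoint: switch to {SSE_ENDPOINT}"
    if 400 <= status < 500:
        # No auth-shaped message is expected, so the status is the only signal.
        return "possible Plan key: retry with a Normal key"
    return "unrecognized: see references/known_issues.md"


def looks_repetitive(transcript, expected_chars, threshold=3.0):
    """Flag transcripts far longer than expected (hypothetical threshold)."""
    return len(transcript) / expected_chars >= threshold
```

Checking the endpoint before the status code matters because the wrong-endpoint error is also a 4xx-style failure and would otherwise be misread as a key problem.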
More edge cases in `references/known_issues.md`.
Design invariants (do not break)
- Always pass through SSE. Don't try to buffer the response with a non-streaming client. The model emits `transcript.text.delta` events for long audio; `transcript.text.done` carries the authoritative full text and `usage`. Reject the SSE format entirely and you'll get nothing.
- Take final text from `transcript.text.done.text`. Concatenated deltas can drift on edge cases. Deltas are for progressive UI; the `done` event is the source of truth.
- Handle `error` events in the stream. Don't treat the SSE stream as if only success events arrive. A blocked-content event mid-stream returns `type: error` with no `done` event.
- Fail-fast on missing API key. Never default to a placeholder or empty string. The script does this; preserve the behavior in any wrapper.
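A wrapper that respects the first three invariants could be sketched as below. It assumes each event arrives as a single `data:` line holding a JSON object with a `type` field, and that the done event exposes its text as a top-level `text` key; the event names come from the text, but the exact wire format is in `references/api_reference.md`, and `AsrStreamError` and `final_text_from_sse` are hypothetical names.

```python
import json


class AsrStreamError(RuntimeError):
    """Raised when the stream carries an error event or never finishes."""


def final_text_from_sse(lines):
    """Return the done event's text from an iterable of SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON data lines
        if not isinstance(event, dict):
            continue
        kind = event.get("type")
        if kind == "error":
            # Surface blocked-content and other errors instead of dropping them.
            raise AsrStreamError(f"error event in stream: {event}")
        if kind == "transcript.text.done":
            return event["text"]  # authoritative full text
        # transcript.text.delta events are only for progressive display.
    raise AsrStreamError("stream ended without a transcript.text.done event")
```

Raising when the stream ends without a done event covers the blocked-content case, where an error arrives and no done event follows.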
When to read references
- `references/api_reference.md`: exact JSON request body, all fields, all SSE event types, response examples. Read when writing raw HTTP calls instead of using the bundled script.
- `references/known_issues.md`: repetition hallucination details, the wrong-endpoint diagnostic trail, Plan-vs-Normal key gotcha, ASR-side censorship handling, pricing opacity. Read when debugging anomalous output or evaluating whether to migrate from `step-asr-1.1`.
Pricing (verified 2026-04-23, volatile)
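`stepaudio-2.5-asr` is in invite-only testing as of 2026-04-23, with no public per-minute rate. The `step-asr-1.1` baseline price is 2.2 yuan/hour. The invitation document mentions a "direct 80% cost reduction", which implies about 0.4 yuan/hour, but that price is not yet published on the pricing page. Before quoting a price to stakeholders, re-verify at https://platform.stepfun.com/docs/zh/guides/pricing/details.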
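The implied rate is simple arithmetic on the two figures above, shown here as a sketch. The helper names are made up for the example, and the result is an estimate derived from the invitation document's claim, not a published price.

```python
BASELINE_YUAN_PER_HOUR = 2.2  # step-asr-1.1 rate quoted in the text
CLAIMED_REDUCTION = 0.80      # the "80% cost reduction" claim


def implied_rate(baseline, reduction):
    """Hourly price after applying the claimed reduction."""
    return baseline * (1 - reduction)


def estimated_cost(audio_minutes, rate_per_hour):
    """Cost in yuan for a recording of the given length."""
    return audio_minutes / 60 * rate_per_hour
```

With these inputs the implied rate is about 0.44 yuan/hour, which the text rounds to 0.4; treat any budget built on it as provisional until the pricing page confirms it.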
Next Step
ASR transcripts almost always benefit from one of two downstream steps. After producing the transcript, suggest the natural next skill:
Transcription complete. The output is raw text from the model — common next steps:
Options:
A) transcript-fixer — clean up ASR errors (homophones, segmentation, filler words). Recommended if the recording is a real-world conversation, podcast, or interview rather than read-aloud text
B) meeting-minutes-taker — turn the transcript into structured minutes with decisions, action items, and speaker attribution. Recommended if the recording is a meeting
C) No thanks, the raw transcript is what I needed
Skip the suggestion when the user has already specified the downstream tool, or when the transcription was clearly a one-off lookup (e.g., "what does this 15-second clip say?").