speech-to-text

Use this skill whenever the user wants to transcribe audio to text, convert speech to text, or get a transcript from an audio or video file. Triggers include: any mention of 'transcribe', 'transcription', 'speech to text', 'STT', 'convert audio to text', 'what does this audio say', 'get transcript', 'subtitle generation', or requests to extract spoken words from a file. Also use when the user wants speaker identification from audio, timestamps for captions, or multilingual transcription.

Installs: 8

Source: noizai/skills

Added on: 2026-05-08

NPX Install

npx skill4agent add noizai/skills speech-to-text

SKILL.md Content

speech-to-text

Transcribe any audio file to text. Supports multilingual auto-detection, timestamps, and speaker labels.

Triggers

transcribe / transcript / transcription
speech to text / STT / audio to text
what does this audio say / convert audio
转录 / 语音转文字 / 识别音频 (Chinese for: transcribe / speech to text / recognize audio)

Quick Start

bash

# Transcribe with auto language detection
python3 skills/speech-to-text/scripts/stt.py audio.mp3

# Specify language explicitly
python3 skills/speech-to-text/scripts/stt.py interview.wav --language en

# Save transcript to file
python3 skills/speech-to-text/scripts/stt.py podcast.m4a -o transcript.txt

# Output full JSON (with timestamps and speaker labels)
python3 skills/speech-to-text/scripts/stt.py meeting.wav --json -o result.json
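If you want to call the script from another Python program rather than a shell, the commands above can be assembled as an argument list. This is a minimal sketch that uses only the flags documented here. `build_stt_command` and `transcribe` are hypothetical helper names, not part of the skill, and the sketch assumes the script is reachable at the relative path used in the Quick Start commands.

```python
import subprocess
import sys

# Relative path taken from the Quick Start commands.
STT_SCRIPT = "skills/speech-to-text/scripts/stt.py"


def build_stt_command(audio_path, language=None, output=None, as_json=False):
    """Build the argv list for stt.py using only the documented flags."""
    cmd = [sys.executable, STT_SCRIPT, audio_path]
    if language:
        cmd += ["--language", language]
    if as_json:
        cmd.append("--json")
    if output:
        cmd += ["-o", output]
    return cmd


def transcribe(audio_path, **options):
    """Run stt.py and return whatever it prints to stdout (hypothetical helper)."""
    result = subprocess.run(
        build_stt_command(audio_path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.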

Arguments

| Argument | Default | Description |
| --- | --- | --- |
| `file` | required | Audio file to transcribe (mp3, wav, m4a, ogg, flac, aac, webm). Max 50 MB, max 10 min. |
| `--language` / `-l` | auto-detect | BCP-47 language code (e.g. `en`, `zh`, `ja`). Omit to auto-detect. |
| `--output` / `-o` | stdout | Path to save transcript text (or JSON if `--json` is set). |
| `--json` | off | Output full JSON response with timestamps and speaker labels. |
| `--api-key` | from env/config | Noiz API key (overrides stored key). |
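Because the file is uploaded before it is transcribed, it can be worth checking the documented limits locally first. The sketch below is one possible pre-flight check, not something the skill ships. `check_upload_limits` is a hypothetical name, "50 MB" is assumed to mean 50 × 1024 × 1024 bytes (the document does not say), and the duration must be supplied by the caller because measuring it needs an audio library.

```python
import os

# Formats and limits as listed in the Arguments table.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg", ".flac", ".aac", ".webm"}
MAX_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" read as 50 MiB
MAX_SECONDS = 600             # 10 minutes


def check_upload_limits(path, duration_seconds=None):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, limit is {MAX_BYTES}")
    # Duration is optional: it is only checked when the caller provides it.
    if duration_seconds is not None and duration_seconds > MAX_SECONDS:
        problems.append(f"audio is {duration_seconds:.0f}s, limit is {MAX_SECONDS}s")
    return problems
```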

Output Format

Without `--json`, only the transcript text is printed:

Hello, welcome to today's podcast. We have a special guest joining us...

With `--json`, the full structured response is printed:

json

{
  "language": "en",
  "transcript": "Hello, welcome to today's podcast...",
  "duration": 42.5,
  "segments": [
    {"text": "Hello, welcome to today's podcast.", "start": 0.0, "end": 3.2, "spk": 0},
    {"text": "We have a special guest joining us.", "start": 3.5, "end": 6.1, "spk": 0}
  ]
}
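The `segments` array carries enough information to build captions. As an illustration, the sketch below converts a response of the shape shown above into SubRip (SRT) text. It assumes `start` and `end` are in seconds and that `spk` is a speaker index, which the sample suggests but the document does not state. `srt_timestamp` and `segments_to_srt` are hypothetical helper names.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(result):
    """Turn the "segments" array of a --json response into SRT caption text."""
    blocks = []
    for index, seg in enumerate(result["segments"], start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"[Speaker {seg['spk']}] {seg['text']}\n"
        )
    # A blank line separates consecutive caption blocks.
    return "\n".join(blocks)
```

To use it, load the file written by the `--json -o result.json` command with `json.load` and pass the resulting dictionary to `segments_to_srt`.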

Supported Languages

Common codes: `en` (English), `zh` (Chinese), `ja` (Japanese), `ko` (Korean), `es` (Spanish), `fr` (French), `de` (German), `pt` (Portuguese), `ru` (Russian), `ar` (Arabic). Omit `--language` to auto-detect.

Configuration

bash

# Save your API key once
python3 skills/speech-to-text/scripts/stt.py config --set-api-key YOUR_KEY

# Or set via environment variable
export NOIZ_API_KEY=YOUR_KEY

Get your API key at developers.noiz.ai.

Pricing

Billed at $0.0006 per second of audio. A 10-minute file costs ~$0.36. New accounts include 10,000 free TTS characters; STT is billed separately.
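The arithmetic is duration in seconds times the per-second rate, so 600 s × $0.0006 = $0.36. A small estimator, assuming billing is strictly linear with no rounding or minimum charge (the document mentions neither), could look like this. `estimate_cost` is a hypothetical name.

```python
from decimal import Decimal

RATE_PER_SECOND = Decimal("0.0006")  # USD per second, from the Pricing section


def estimate_cost(duration_seconds):
    """Estimated charge in USD for one file of the given length in seconds."""
    # Decimal avoids binary floating-point noise in currency arithmetic.
    return RATE_PER_SECOND * Decimal(str(duration_seconds))
```

The `duration` field of a `--json` response can be passed straight in to estimate the charge for a file that has already been transcribed.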

Security & data disclosure

Credential storage: API key is saved to `~/.config/noiz/api_key` (permissions `0600`). The `NOIZ_API_KEY` env var is also supported.
Network calls: The audio file is uploaded to `https://noiz.ai/v1/speech-to-text` for transcription. No data is sent until you run the command.
File limits: Max 50 MB per file, max 10 minutes (600 seconds) of audio.
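To make the credential handling concrete, here is a sketch of how a key could be stored with owner-only permissions and then looked up. It is not the script's actual code. `save_api_key` and `resolve_api_key` are hypothetical names, and checking the environment variable before the stored file is an assumption, since the document only says that `--api-key` overrides the stored key. The permission handling assumes a POSIX system.

```python
import os
from pathlib import Path

KEY_PATH = Path.home() / ".config" / "noiz" / "api_key"  # location named above


def save_api_key(key, path=KEY_PATH):
    """Write the key to a file that only the owner can read or write."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create with mode 0600 from the start so the key is never briefly readable by others.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(key.strip() + "\n")
    os.chmod(path, 0o600)  # also tighten a file that already existed


def resolve_api_key(cli_key=None, path=KEY_PATH):
    """Pick a key: explicit --api-key value, then NOIZ_API_KEY, then the stored file.

    The env-before-file order is an assumption, not documented behavior.
    """
    if cli_key:
        return cli_key
    env_key = os.environ.get("NOIZ_API_KEY")
    if env_key:
        return env_key
    if path.exists():
        return path.read_text().strip() or None
    return None
```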

Requirements

`requests` package: `pip install requests`
Get your API key at developers.noiz.ai
