audio-transcribe
Transcribes audio to text with timestamps and optional speaker identification. Use when you need to convert speech to text, create subtitles, transcribe meetings, or process voice recordings.
Source: agntswrm/agent-media
NPX Install
npx skill4agent add agntswrm/agent-media audio-transcribe
Audio Transcribe
Transcribes audio files to text with timestamps. Supports automatic language detection, speaker identification (diarization), and outputs structured JSON with segment-level timing.
Command
```bash
agent-media audio transcribe --in <path> [options]
```
Inputs
| Option | Required | Description |
|---|---|---|
| `--in <path>` | Yes | Input audio file path or URL (supports mp3, wav, m4a, ogg) |
| `--diarize` | No | Enable speaker identification |
| `--language <code>` | No | Language code (auto-detected if not provided) |
| `--speakers <n>` | No | Number of speakers hint for diarization |
| `--out <path>` | No | Output path, filename or directory (default: ./) |
| `--provider <name>` | No | Provider to use (local, fal, replicate, runpod) |
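When the command is driven from a script rather than typed by hand, the options in the table can be assembled into an argument list. The following is a minimal Python sketch, not part of agent-media itself: `build_transcribe_args` and `KNOWN_PROVIDERS` are hypothetical names, and the helper only uses flags that appear in the command examples in this document.

```python
from typing import List, Optional

# Provider names as listed in the options table.
KNOWN_PROVIDERS = {"local", "fal", "replicate", "runpod"}


def build_transcribe_args(
    in_path: str,
    diarize: bool = False,
    language: Optional[str] = None,
    speakers: Optional[int] = None,
    provider: Optional[str] = None,
) -> List[str]:
    """Return an argv list for `agent-media audio transcribe` (hypothetical helper)."""
    if provider is not None and provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    if speakers is not None and speakers < 1:
        raise ValueError("speakers must be a positive integer")

    # --in is the only required option.
    args = ["agent-media", "audio", "transcribe", "--in", in_path]
    if diarize:
        args.append("--diarize")
    if language:
        args += ["--language", language]
    if speakers is not None:
        args += ["--speakers", str(speakers)]
    if provider:
        args += ["--provider", provider]
    return args


print(build_transcribe_args("podcast.mp3", diarize=True, language="en", speakers=3))
```

The resulting list can be passed to `subprocess.run(args, capture_output=True, text=True)`. Building a list instead of a shell string avoids quoting problems with paths that contain spaces.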
Output
Returns a JSON object with transcription data:
```json
{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
```
Examples
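Post-processing the JSON output (Python sketch):

Since each segment carries `start`, `end`, `text`, and optionally `speaker`, the output maps naturally onto subtitle cues. The sketch below converts the `segments` array to SRT text. It assumes the JSON shape shown in the Output section; `srt_timestamp`, `segments_from_result`, and `segments_to_srt` are illustrative names, not part of agent-media.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_from_result(result: Dict) -> List[Dict]:
    """Return the segment list, refusing results that are not marked ok."""
    if not result.get("ok"):
        raise ValueError("transcription result is not marked ok")
    return result["transcription"]["segments"]


def segments_to_srt(segments: List[Dict]) -> str:
    """Render segments as numbered SRT cues, prefixing the speaker if present."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        speaker = seg.get("speaker")  # absent when diarization is off
        if speaker:
            text = f"{speaker}: {text}"
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Sample data copied from the output example in this document.
result = {
    "ok": True,
    "transcription": {
        "segments": [
            {"start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0"},
            {"start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1"},
        ]
    },
}
print(segments_to_srt(segments_from_result(result)))
```

In practice `result` would come from `json.load` on the file named in `output_path`.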
Basic transcription (auto-detect language):
```bash
agent-media audio transcribe --in interview.mp3
```
Transcription with speaker identification:
```bash
agent-media audio transcribe --in meeting.wav --diarize
```
Transcription with specific language and speaker count:
```bash
agent-media audio transcribe --in podcast.mp3 --diarize --language en --speakers 3
```
Use specific provider:
```bash
agent-media audio transcribe --in audio.wav --provider replicate
```
Extracting Audio from Video
To transcribe a video file, first extract the audio:
```bash
# Step 1: Extract audio from video
agent-media audio extract --in video.mp4 --format mp3
# Step 2: Transcribe the extracted audio
agent-media audio transcribe --in extracted_xxx.mp3
```
Providers
local
Runs locally on CPU using Transformers.js, no API key required.
- Uses Moonshine model (5x faster than Whisper)
- Models downloaded on first use (~100MB)
- Does NOT support diarization; use fal or replicate for speaker identification
- You may see a `mutex lock failed` error; it can be ignored as long as the output contains `"ok": true`
```bash
agent-media audio transcribe --in audio.mp3 --provider local
```
fal
- Requires `FAL_API_KEY`
- Uses `wizper` model for fast transcription (2x faster) when diarization is disabled
- Uses `whisper` model when diarization is enabled (native support)
replicate
- Requires `REPLICATE_API_TOKEN`
- Uses `whisper-diarization` model with Whisper Large V3 Turbo
- Native diarization support with word-level timestamps
runpod
- Requires `RUNPOD_API_KEY`
- Uses `pruna/whisper-v3-large` model (Whisper Large V3)
- Does NOT support diarization (speaker identification); use fal or replicate for diarization
```bash
agent-media audio transcribe --in audio.mp3 --provider runpod
```
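The provider notes above boil down to two questions: does the provider support diarization, and is its API key set? A script that wants to pass `--provider` explicitly could encode that as a small lookup table. This is a sketch under assumptions: `PROVIDERS` and `pick_provider` are hypothetical names, the preference order is arbitrary, and how the CLI chooses a provider when `--provider` is omitted is not described here.

```python
from typing import Mapping, Optional

# Capability table transcribed from the provider notes in this document.
# env_var is None for the local provider, which needs no API key.
PROVIDERS = {
    "local": {"env_var": None, "diarization": False},
    "fal": {"env_var": "FAL_API_KEY", "diarization": True},
    "replicate": {"env_var": "REPLICATE_API_TOKEN", "diarization": True},
    "runpod": {"env_var": "RUNPOD_API_KEY", "diarization": False},
}


def pick_provider(diarize: bool, env: Mapping[str, str]) -> Optional[str]:
    """Return the first usable provider name, or None if none qualifies.

    The preference order (dict insertion order) is an assumption of this
    sketch, not documented agent-media behavior.
    """
    for name, info in PROVIDERS.items():
        if diarize and not info["diarization"]:
            continue  # provider cannot identify speakers
        env_var = info["env_var"]
        if env_var is not None and not env.get(env_var):
            continue  # required API key is missing or empty
        return name
    return None


print(pick_provider(diarize=False, env={}))
print(pick_provider(diarize=True, env={"REPLICATE_API_TOKEN": "example-token"}))
```

Passing `os.environ` as `env` checks the real environment. Taking the mapping as a parameter keeps the function easy to test.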