Audio to SRT: turn any audio file into a subtitle file (2026)
Upload MP3, WAV, M4A, OGG, OPUS, or FLAC — get a .srt with word-level timestamps in minutes. Plus the downstream workflows every SERP page skips: Podcasting 2.0 transcripts, translation pivots, course platforms.
Upload an audio file and get a .srt subtitle file with word-level timestamps in 1–3 minutes. DeluxeScribe accepts every common audio format, exports .srt / .vtt / .docx / PDF / JSON in one flow, and handles 99 languages with automatic detection. 60 minutes free, no credit card. Below: the upload workflow, the honest free path (self-hosted Whisper), the timing accuracy caveat every vendor hides, and the section every SERP competitor skips: what to do with the .srt when you have audio but no video yet (podcast transcript publishing, translation pivots, course platforms).
- 60 minutes free
- No credit card
- 99 languages
- Speaker labels
Last verified July 4, 2026
TL;DR — pick your path
| Your situation | Best path |
|---|---|
| I have audio + a matching video coming later | Generate .srt now, attach when video’s ready |
| Audio-only, want to publish transcript with a podcast | Podcasting 2.0 workflow |
| Audio in one language, want captions in multiple | Translation pivot |
| Audio for a course platform | Check platform-specific caption spec |
| Just want text (no timestamps) | Wrong page — use format-specific transcription |
| Broadcast-standard captions | Human captioner |
| Sensitive content (can’t upload) | Self-hosted Whisper |
| Own YouTube video with this audio | YouTube Studio auto-caption download |
| I have .srt, want audio (TTS) | Wrong direction — see disambiguation |
Are you sure you want .srt?
Quick disambiguation before you spend a minute on the workflow — four related queries land on this page:
- You want the text without timestamps → you don’t need .srt. Use one of our format-specific transcription pages: for MP3 use MP3 to Text, for WAV use WAV to Text, for M4A use M4A to Text, or the pillar How to Transcribe Audio. You’ll get plain text, .docx, or PDF export instead of timed subtitles.
- You have an .srt and want AUDIO from it (TTS) → this is text-to-speech synthesis, a completely different capability. Use ElevenLabs, Play.ht, or Descript Overdub — those services generate spoken audio from written text. We convert audio into subtitle files; the reverse direction requires TTS.
- You want a video with burned-in captions → generate the .srt from your audio here first, then use Video to SRT §8 burn-in for the FFmpeg command that renders the .srt permanently into video pixels (for TikTok / Reels / Shorts).
- You want .vtt (WebVTT) instead of .srt → DeluxeScribe exports both from the same transcription in one flow. Pick .vtt if you’re publishing to HTML5 web video or Podcasting 2.0 apps; .srt covers 95% of other use cases including desktop video editors.
If none of these apply and you want a timed .srt subtitle file from your audio — keep reading.
Upload workflow
The workflow, in 5 steps:
- Sign up (60 minutes free, no credit card).
- Upload the audio file: MP3, WAV, M4A, OGG, OPUS, FLAC, WMA, AAC, and AIFF are all accepted. Up to 5 GB per file (roughly 87 hours of 128 kbps MP3, 43 hours of 256 kbps M4A, or 8 hours of 16-bit 44.1 kHz stereo WAV).
- Language auto-detects. Leave it unless you want to force a specific dialect. Turn speaker labels off for standard subtitle work: they appear as Speaker 1: prefixes in each cue, which usually gets in the way. Turn them on only for interviews or panel discussions where attribution matters.
- Wait 1–3 minutes for typical files (30–60 min audio). Multi-hour files scale roughly linearly.
- Export as .srt. In the editor, click Export → SRT. Also available: .vtt for web video and Podcasting 2.0, .docx for editing, .json for word-level timestamps (useful if you’re doing custom segmentation with Subtitle Edit or Aegisub).
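For a rough sense of how much audio fits under a size cap, the relationship between file size, bitrate, and duration can be sketched as below. This is a back-of-envelope estimate, not a statement of how DeluxeScribe measures files: it assumes constant bitrate, decimal gigabytes, and no container overhead, and the function names are illustrative.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Bitrate of uncompressed PCM audio (what a plain WAV holds), in kbps."""
    return sample_rate_hz * bit_depth * channels / 1000


def hours_of_audio(size_bytes: float, bitrate_kbps: float) -> float:
    """Approximate hours of constant-bitrate audio that fit in size_bytes."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return size_bytes / bytes_per_second / 3600


FIVE_GB = 5 * 10**9  # assumption: decimal gigabytes, container overhead ignored

examples = {
    "128 kbps MP3": 128,
    "256 kbps M4A": 256,
    "16-bit 44.1 kHz stereo WAV": pcm_bitrate_kbps(44_100, 16, 2),
}
for label, kbps in examples.items():
    print(f"{label}: about {hours_of_audio(FIVE_GB, kbps):.1f} hours in 5 GB")
```

Variable-bitrate files and embedded artwork will shift these figures, so treat them as an upper bound rather than a guarantee.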
Generate .srt from your audio in minutes
60 minutes free, no credit card. Accepts MP3, WAV, M4A, OGG, OPUS, FLAC and all common formats up to 5 GB. Exports .srt, .vtt, .docx, PDF, and JSON with word-level timestamps.
If your audio is larger than 5 GB
Two options:
Convert to a lower-bitrate format first:
ffmpeg -i input.wav -c:a aac -b:a 96k output.m4a
A 5 GB WAV becomes a 300–500 MB M4A with negligible transcription accuracy loss for speech content (96 kbps AAC is above the quality floor AI models care about).
Or split into hour-long segments:
ffmpeg -i input.mp3 -f segment -segment_time 3600 -c copy part_%03d.mp3
This produces one file per hour. Upload each separately, then concatenate the resulting .srt files with timestamp offsets. See the SRT generator page for the offset math.
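The concatenation step can be sketched in a few lines of Python. This is a minimal illustration rather than the method from the SRT generator page: the function names are made up for the example, and it assumes you know each part's real duration (for instance from ffprobe), because stream-copied segments rarely land on exactly 3600 seconds.

```python
import re

TIMESTAMP = re.compile(r"(\d{2,}):(\d{2}):(\d{2}),(\d{3})")


def format_ts(total_ms: int) -> str:
    """Render milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def shift_timing_line(line: str, offset_ms: int) -> str:
    """Add offset_ms to both timestamps on an SRT timing line."""
    def shifted(match: re.Match) -> str:
        h, m, s, ms = (int(group) for group in match.groups())
        return format_ts(((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms)
    return TIMESTAMP.sub(shifted, line)


def concat_srt(parts: list[str], durations_s: list[float]) -> str:
    """Join per-part SRT texts; durations_s[i] is the audio length of part i."""
    blocks = []
    offset_ms = 0
    for text, duration_s in zip(parts, durations_s):
        for block in re.split(r"\n\s*\n", text.strip()):
            lines = block.splitlines()
            if len(lines) < 2 or "-->" not in lines[1]:
                continue  # skip anything that is not a cue
            lines[0] = str(len(blocks) + 1)  # renumber cues across parts
            lines[1] = shift_timing_line(lines[1], offset_ms)
            blocks.append("\n".join(lines))
        offset_ms += round(duration_s * 1000)  # next part starts after this one
    return "\n\n".join(blocks) + "\n"
```

Calling `concat_srt([part0_text, part1_text], [3600.0, 3600.0])` shifts every cue in the second part by one hour and renumbers cues from 1 across the whole file.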
Free extraction paths
Three legitimate free paths, each fitting a different case:
1. Self-hosted Whisper (private, no upload)
Install once, process any audio locally:
pip install openai-whisper
Then:
whisper input.mp3 --model large-v3 --output_format srt
Free forever, no upload, no size limit. Speed:
- CPU: 10–30× slower than real time (1-hour audio takes 10–30 hours)
- Apple Silicon: near real time with MLX Whisper
- NVIDIA GPU (RTX 3060+): real time to about 2× faster than real time
Fits: sensitive audio (medical, legal, PHI), one-time batch of a big archive, learning the underlying model. Downside: no browser editor, no speaker labels out of the box (add WhisperX for that), requires Python setup.
2. YouTube Studio (if you own or will make a matching video)
If your audio has an accompanying video (or you plan to make one), YouTube’s auto-captions download works:
- Upload video to YouTube (unlisted is fine)
- Wait 5–30 minutes for auto-caption processing
- YouTube Studio → Subtitles → three-dot menu on the English (auto-generated) track → Download
- Choose .vtt or .sbv; convert to .srt with any free tool (Subtitle Edit does it in one click)
Free for any video you can upload. Quality: good English, mediocre non-English. Time cost: 5–30 min plus upload time.
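If you would rather script the conversion than open Subtitle Edit, the .sbv case is simple enough to sketch. The example below assumes the usual SBV layout (a `H:MM:SS.mmm,H:MM:SS.mmm` timing line followed by the cue text, with blank lines between cues); `sbv_to_srt` is an illustrative name, and real downloads may contain variations this sketch does not handle.

```python
import re

SBV_TIMING = re.compile(
    r"^(\d+):(\d{2}):(\d{2})\.(\d{3}),(\d+):(\d{2}):(\d{2})\.(\d{3})$"
)


def sbv_to_srt(sbv_text: str) -> str:
    """Convert SBV cues to numbered SRT cues (assumes well-formed SBV input)."""
    cues = []
    for block in re.split(r"\n\s*\n", sbv_text.strip()):
        lines = block.splitlines()
        match = SBV_TIMING.match(lines[0].strip()) if lines else None
        if not match:
            continue  # not a cue block
        h1, m1, s1, ms1, h2, m2, s2, ms2 = match.groups()
        # SRT wants two-digit hours, a comma before milliseconds, and an arrow.
        timing = f"{int(h1):02d}:{m1}:{s1},{ms1} --> {int(h2):02d}:{m2}:{s2},{ms2}"
        cues.append((timing, lines[1:]))
    blocks = [
        "\n".join([str(number), timing, *text_lines])
        for number, (timing, text_lines) in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"
```

The .vtt route needs more care (headers, cue settings, inline tags), which is why a dedicated tool is the safer choice there.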
3. DeluxeScribe free tier
60 minutes of full-app credit, one-time, no credit card. Fits: creators trying the workflow before committing, or occasional users with less than an hour of audio.
AI timing accuracy — the caveat every vendor hides
Two distinct accuracy dimensions that vendor pages conflate:
- Word-level timestamps. Whisper is accurate to ~200 ms per word on clean speech. That’s professional-grade timing.
- Segment (caption cue) boundaries. Whisper groups words into cues using pause detection. These boundaries don’t respect professional captioning rules like BBC’s 42-character line max, 17 CPS reading speed, or Netflix’s Timed Text Style Guide.
For audio-only content, use-case impact varies:
| Use case | AI .srt as-is? | Cleanup time (30-min audio) |
|---|---|---|
| Podcast transcript (Podcasting 2.0) | Yes | 0-5 min |
| Blog post embedded transcript | Yes | 0-5 min |
| Course platform caption (LMS) | Sometimes | 10-15 min |
| Later video release (YouTube) | Yes | 0-10 min |
| Streaming platform QC (Netflix, HBO) | No | 30-60 min or captioner |
| Broadcast delivery (BBC iPlayer) | No | Send to captioner (§8) |
| Translation pivot (source-language clean) | Yes | 10-15 min |
For serious cleanup work, use Subtitle Edit (free) or Aegisub. Both have built-in CPS validators and line-break tooling. See our SRT generator page for the full segmentation rules.
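To make the two limits concrete, here is a rough sketch of the kind of check a CPS validator performs. It is not how Subtitle Edit or Aegisub implement it: `check_cues` is an illustrative name, the 17 CPS and 42-character defaults are the figures quoted above, and counting spaces toward CPS is an assumption (style guides differ on that).

```python
import re

TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(h: str, m: str, s: str, ms: str) -> int:
    """Convert SRT timestamp fields to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def check_cues(srt_text: str, max_cps: float = 17.0, max_line_chars: int = 42):
    """Return (cue_number, message) pairs for cues that break either limit."""
    problems = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        match = TIMING.search(lines[1]) if len(lines) >= 3 else None
        if not match:
            continue  # not a complete cue
        number = lines[0].strip()
        fields = match.groups()
        duration_s = (to_ms(*fields[4:]) - to_ms(*fields[:4])) / 1000
        text_lines = lines[2:]
        chars = sum(len(line) for line in text_lines)  # assumption: spaces count
        cps = chars / duration_s if duration_s > 0 else float("inf")
        if cps > max_cps:
            problems.append((number, f"{cps:.1f} CPS exceeds {max_cps:g}"))
        for line in text_lines:
            if len(line) > max_line_chars:
                problems.append(
                    (number, f"line has {len(line)} chars, max {max_line_chars}")
                )
    return problems
```

An empty result means every cue passes both checks; anything returned is a candidate for re-segmentation in a subtitle editor.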
What to do with the .srt (audio-first workflows)
The section every SERP competitor skips. Six honest downstream paths for audio-first creators:
A. Attach to a matching video (later)
If your workflow is audio-first, video-later (music video, animated explainer, YouTube commentary edited after the recording), keep the .srt for when the video is ready. Options:
- Mux as soft-sub (viewer can toggle on/off):
ffmpeg -i video.mp4 -i subs.srt -c copy -c:s mov_text output.mp4
- Burn-in for social platforms (TikTok, Reels, Shorts):
ffmpeg -i video.mp4 -vf "subtitles=subs.srt" -c:a copy output.mp4
- Import into your NLE. Premiere, DaVinci Resolve, Final Cut Pro, CapCut: see Video to SRT §9 for editor-specific import steps.
B. Publish as podcast transcript (Podcasting 2.0)
The Podcasting 2.0 <podcast:transcript> namespace element lets you publish a transcript file alongside each podcast episode in your RSS feed. Modern independent podcast apps (Podverse, Fountain, Podcast Guru, CurioCaster) render the tag natively. Apple Podcasts generates its own transcripts by default, but since 2024 it has also accepted creator-supplied VTT or SRT files through this tag, so the open standard now reaches most of the ecosystem.
Example RSS snippet:
<item>
  <title>Episode 42 — On Transcripts</title>
  <enclosure url="https://example.com/ep42.mp3" length="42000000" type="audio/mpeg"/>
  <podcast:transcript
    url="https://example.com/ep42.srt"
    type="application/x-subrip"
    language="en"
    rel="captions"/>
  <podcast:transcript
    url="https://example.com/ep42.vtt"
    type="text/vtt"
    language="en"/>
</item>
Multiple <podcast:transcript> elements per episode are supported: different formats, different languages, captions vs full transcript. See Podcast Transcription for the full workflow, format choice (VTT vs SRT vs JSON), and host-side considerations.
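If your feed is generated by a script rather than by a hosting dashboard, the same elements can be emitted with Python's standard library. This is a sketch under assumptions: `add_transcript` is an illustrative helper, the URLs are placeholders, and the namespace URI is the one the Podcast Namespace project publishes as far as I know, so check it against the repository before relying on it.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI as published by the Podcast Namespace project.
PODCAST_NS = "https://podcastindex.org/namespace/1.0"
ET.register_namespace("podcast", PODCAST_NS)


def add_transcript(item, url, mime_type, language=None, rel=None):
    """Append one <podcast:transcript> element to an RSS <item>."""
    attrs = {"url": url, "type": mime_type}
    if language:
        attrs["language"] = language
    if rel:
        attrs["rel"] = rel
    return ET.SubElement(item, f"{{{PODCAST_NS}}}transcript", attrs)


item = ET.Element("item")
ET.SubElement(item, "title").text = "Episode 42"
add_transcript(item, "https://example.com/ep42.srt", "application/x-subrip", "en", "captions")
add_transcript(item, "https://example.com/ep42.vtt", "text/vtt", "en")

xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

In a full feed the `xmlns:podcast` declaration normally sits on the root `<rss>` element; serializing a lone `<item>` as above places it on the item instead.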
C. Translation pivot for multi-language captions
.srt is plain text so any translation workflow preserves timing. Two paths:
- Manual translation (highest quality): copy each cue’s text lines to a translator or LLM, replace, preserve timestamps. Use for accuracy-critical content.
- Automated translation: DeepL, Google Translate, or Subtitle Edit’s built-in translate feature. Fast, less accurate on idiom-heavy content.
Timing gotcha: translated text length varies (Spanish +25% vs English, Japanese -10%). CPS may exceed the 17-characters-per-second limit after translation, so you may need to re-segment or reduce the source content. Subtitle Edit’s CPS validator flags this automatically.
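The "replace text, preserve timestamps" idea can be sketched as follows. The `translate` argument is a stand-in for whatever you actually use (a human-reviewed lookup, an API client, an LLM call); `translate_srt` and `fake_spanish` are illustrative names, and no particular translation service's API is assumed.

```python
import re
from typing import Callable


def translate_srt(srt_text: str, translate: Callable[[str], str]) -> str:
    """Translate cue text only; cue numbers and timing lines pass through."""
    out = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            out.append(block)  # leave anything unexpected untouched
            continue
        # Join wrapped lines so the translator sees the whole cue at once.
        source_text = " ".join(line.strip() for line in lines[2:])
        out.append("\n".join([lines[0], lines[1], translate(source_text)]))
    return "\n\n".join(out) + "\n"


# Stand-in translator for the demo only; swap in a real service or reviewer.
def fake_spanish(text: str) -> str:
    return {"Hello and welcome.": "Hola y bienvenidos."}.get(text, text)


source = "1\n00:00:01,000 --> 00:00:03,000\nHello and\nwelcome.\n"
print(translate_srt(source, fake_spanish))
```

Because the translated cue comes back as a single line, re-wrap long lines and re-check reading speed afterwards.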
D. Embed in a course platform
Major course platform caption support:
- Kajabi: .vtt for HTML5 video player captions
- Teachable: .vtt or .srt via Wistia/YouTube video hosting
- Thinkific: .vtt native support
- Podia: .vtt via video host
- LearnWorlds: .vtt and .srt native support
- Vimeo Business/Premium (common LMS backend): both formats
Rule of thumb: check the platform’s caption docs first. Most modern platforms prefer .vtt (W3C web standard). DeluxeScribe exports both from the same transcription in one flow.
E. Accessibility compliance (WCAG)
WCAG 2.1 Success Criterion 1.2.1 requires a text alternative (a transcript) for prerecorded audio-only content; captions for prerecorded video with audio fall under Success Criterion 1.2.2. A properly timed .srt satisfies the captions requirement when paired with a video; a plain-text version satisfies the audio-only transcript requirement. Accessibility of this kind is a legal requirement in many US states and EU jurisdictions for public-facing content, government websites, and educational institutions.
F. Feed to an LLM (structured input)
Timestamped .srt is a better LLM input than raw text — the timing signals give the model structure and let you cross-reference model output back to the source audio. Common uses:
- Summarization with time-anchored bullet points
- Chapter marker generation from topic transitions
- Quote extraction with timestamp citations
- Show-notes drafting from long-form audio
For structured LLM pipelines specifically, DeluxeScribe also exports word-level JSON — more precise than .srt’s cue-level timing.
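One simple way to prepare an .srt for a prompt is to flatten each cue into a single line prefixed with its start time, so the model can cite timestamps back. The sketch below is one possible layout, not a required format; `srt_to_prompt_lines` is an illustrative name and the `[HH:MM:SS]` prefix is an arbitrary choice.

```python
import re


def srt_to_prompt_lines(srt_text: str) -> str:
    """Flatten SRT cues to '[HH:MM:SS] text' lines for use in an LLM prompt."""
    lines_out = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip anything that is not a complete cue
        start = lines[1].split("-->")[0].strip()  # e.g. "00:01:05,200"
        hhmmss = start.split(",")[0]  # drop milliseconds to save tokens
        text = " ".join(line.strip() for line in lines[2:])
        lines_out.append(f"[{hhmmss}] {text}")
    return "\n".join(lines_out)


sample = (
    "1\n00:00:01,200 --> 00:00:03,000\nHello\nthere\n\n"
    "2\n00:01:05,000 --> 00:01:07,500\nSecond cue\n"
)
print(srt_to_prompt_lines(sample))
```

Dropping milliseconds keeps the prompt compact; keep them if you need to seek to exact positions in the audio.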
Format-specific quirks
Brief format-specific notes. Deeper coverage in the format-specific pages we link to.
- MP3. Below 96 kbps, high-frequency information stripping degrades consonant accuracy on voice content. Above 128 kbps for voice is safe. See MP3 to Text for the full bitrate-vs-accuracy discussion.
- WAV. Highest AI accuracy because uncompressed, but large files. A 1-hour 44.1 kHz stereo WAV = ~600 MB. See WAV to Text for the file-size math and FFmpeg compression command.
- M4A / AAC. Apple’s default AAC container. Usually 128–256 kbps AAC audio, so transcription accuracy matches MP3 at 128+ kbps. Occasional edge case: M4A can contain non-AAC codecs (rare Opus in M4A). See M4A to Text for iPhone Voice Memo specifics.
- OGG. Open-format container, usually holds Vorbis or Opus. Both are voice-friendly codecs; transcription accuracy is good.
- OPUS. Modern speech codec (RFC 6716). Outperforms MP3 at low bitrates — a 32 kbps Opus voice recording transcribes about as accurately as 128 kbps MP3. Common in WhatsApp voice notes and modern messaging. Accepted directly in OGG or WebM containers.
- FLAC. Lossless compressed. Same accuracy as WAV, ~50% smaller files. Good middle-ground for archival + transcription workflows.
- WMA. Windows Media Audio. Older format; accepted but consider converting to MP3 for broader compatibility (ffmpeg -i input.wma -c:a libmp3lame -b:a 128k output.mp3).
- AIFF. Apple’s uncompressed container (like WAV but Mac-native). Same accuracy as WAV, same size penalty.
When a human captioner fits better
Cases where AI-generated .srt (from DeluxeScribe or anyone else) is not the right choice:
- Broadcast / OTT delivery with QC. BBC iPlayer, Netflix, HBO Max, Apple TV+, Disney+ all require specific timing standards that AI doesn’t deliver reliably. Spec compliance is mandatory; use a captioner.
- WCAG AAA compliance for legal or regulatory work (courtroom audio evidence, medical patient education, government training).
- High-visibility marketing audio where mis-captions damage the brand: think investor day keynote audio, a CEO announcement, or a major PR event.
- Non-English content where the client is a native speaker of the target language. AI accuracy on non-English speech is worse than English; a native captioner catches errors AI misses.
- Low-quality audio (poor mic, heavy background noise, distant speakers). AI struggles here; humans do better with context.
Named vendors — no rankings, just honest options:
- Rev human tier: $1.50/audio min plus captioner review, US-based, fast turnaround
- 3Play Media: broadcast-standard, dominant vendor in US streaming and broadcast
- GoTranscript human tier: $1.10–$2.50/audio min, wider language coverage
- Verbit: specialized in education, legal, government; AAA compliance standard
How this page was verified
<podcast:transcript> tag spec verified against the Podcast Namespace repository. Opus codec details come from RFC 6716. WCAG requirements for audio-only content transcripts come from WCAG 2.1 Success Criterion 1.2.1. AI timing accuracy ranges come from our own testing of 10 sample audio files against reference human transcripts. See How Accurate Is Whisper for the broader WER-by-condition breakdown. We don’t cite the “98 languages” and “most accurate AI in the world” claims common on this SERP because we can’t source them to a published benchmark on realistic audio.
Related guides
- Video to SRT (video source): Sibling companion with the same workflow for a video source. Includes FFmpeg mux and burn-in commands and NLE-specific import steps.
- SRT Generator (format explainer): The SRT format itself, covering anatomy, BBC/Netflix timing rules, .srt vs .vtt, and the 4 ways to generate ranked by use case.
- Podcast Transcription: Podcasting 2.0 workflow. Publish .srt or .vtt via the <podcast:transcript> RSS tag, plus the transcript → show notes 20-minute workflow.
- How to Transcribe Audio (pillar): The broader pillar, covering every path across sources and formats and how to pick.