Audio to SRT: turn any audio file into a subtitle file (2026)

Upload MP3, WAV, M4A, OGG, OPUS, or FLAC — get a .srt with word-level timestamps in minutes. Plus the downstream workflows every SERP page skips: Podcasting 2.0 transcripts, translation pivots, course platforms.

Upload any audio file (MP3, WAV, M4A, OGG, OPUS, FLAC, WMA, AIFF, up to 5 GB) and get a .srt subtitle file with word-level timestamps in 1–3 minutes. DeluxeScribe accepts every common audio format, exports .srt / .vtt / .docx / PDF / JSON in one flow, and handles 99 languages with automatic detection. 60 minutes free, no credit card. Below: the upload workflow, the honest free path (self-hosted Whisper), the timing accuracy caveat every vendor hides, and the section every SERP competitor skips: what to do with the .srt when you have audio but no video yet (podcast transcript publishing, translation pivots, course platforms).
  • 60 minutes free
  • No credit card
  • 99 languages
  • Speaker labels

Last verified July 4, 2026

TL;DR — pick your path

| Your situation | Best path |
| --- | --- |
| I have audio + a matching video coming later | Generate .srt now, attach when video’s ready |
| Audio-only, want to publish transcript with a podcast | Podcasting 2.0 workflow |
| Audio in one language, want captions in multiple | Translation pivot |
| Audio for a course platform | Check platform-specific caption spec |
| Just want text (no timestamps) | Wrong page: use format-specific transcription |
| Broadcast-standard captions | Human captioner |
| Sensitive content (can’t upload) | Self-hosted Whisper |
| Own YouTube video with this audio | YouTube Studio auto-caption download |
| I have .srt, want audio (TTS) | Wrong direction: see disambiguation |

Are you sure you want .srt?

Quick disambiguation before you spend a minute on the workflow — four related queries land on this page:

  • You want the text without timestamps → you don’t need .srt. Use one of our format-specific transcription pages: for MP3 use MP3 to Text, for WAV use WAV to Text, for M4A use M4A to Text, or the pillar How to Transcribe Audio. You’ll get plain text, .docx, or PDF export instead of timed subtitles.
  • You have an .srt and want AUDIO from it (TTS) → this is text-to-speech synthesis, a completely different capability. Use ElevenLabs, Play.ht, or Descript Overdub — those services generate spoken audio from written text. We convert audio into subtitle files; the reverse direction requires TTS.
  • You want a video with burned-in captions → generate the .srt from your audio here first, then use Video to SRT §8 burn-in for the FFmpeg command that renders the .srt permanently into video pixels (for TikTok / Reels / Shorts).
  • You want .vtt (WebVTT) instead of .srt → DeluxeScribe exports both from the same transcription in one flow. Pick .vtt if you’re publishing to HTML5 web video or Podcasting 2.0 apps; .srt covers 95% of other use cases including desktop video editors.

If none of these apply and you want a timed .srt subtitle file from your audio — keep reading.

Upload workflow

The workflow, in 5 steps:

  1. Sign up (60 minutes free, no credit card).
  2. Upload the audio file: MP3, WAV, M4A, OGG, OPUS, FLAC, WMA, AAC, AIFF all accepted. Up to 5 GB per file (~87 hours of 128 kbps MP3, ~43 hours of 256 kbps M4A, or ~8 hours of 16-bit 44.1 kHz stereo WAV).
  3. Language auto-detects. Leave it unless you want to force a specific dialect. Turn speaker labels off for standard subtitle work: they appear as Speaker 1: prefixes in each cue, which usually gets in the way. Turn them on only for interviews or panel discussions where attribution matters.
  4. Wait 1–3 minutes for typical files (30-60 min audio). Multi-hour files scale roughly linearly.
  5. Export as .srt. In the editor, click Export → SRT. Also available: .vtt for web video and Podcasting 2.0, .docx for editing, .json for word-level timestamps (useful if you’re doing custom segmentation with Subtitle Edit or Aegisub).

Generate .srt from your audio in minutes

60 minutes free, no credit card. Accepts MP3, WAV, M4A, OGG, OPUS, FLAC and all common formats up to 5 GB. Exports .srt, .vtt, .docx, PDF, and JSON with word-level timestamps.

If your audio is larger than 5 GB

Two options:

Convert to a lower-bitrate format first:

ffmpeg -i input.wav -c:a aac -b:a 96k output.m4a

A 5 GB WAV becomes a 300–500 MB M4A with negligible transcription accuracy loss for speech content (96 kbps AAC is above the quality floor AI models care about).
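The size figures above follow from simple bitrate arithmetic. Here is a back-of-envelope sketch in Python; the function names are illustrative, and it assumes constant bitrate, 16-bit PCM for WAV, and 1 GB = 10^9 bytes, ignoring container overhead:

```python
def pcm_bitrate_kbps(sample_rate_hz=44_100, bit_depth=16, channels=2):
    """Bitrate of uncompressed PCM audio (WAV/AIFF) in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


def size_mb(bitrate_kbps, hours):
    """Approximate file size in MB for constant-bitrate audio."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return bytes_per_second * hours * 3600 / 1e6


def hours_that_fit(limit_gb, bitrate_kbps):
    """How many hours of audio fit under a size limit at a given bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return limit_gb * 1e9 / bytes_per_second / 3600


wav_kbps = pcm_bitrate_kbps()            # 1411.2 kbps for CD-quality stereo
wav_hours = hours_that_fit(5, wav_kbps)  # just under 8 hours in a 5 GB WAV

print(f"1 hour of WAV: ~{size_mb(wav_kbps, 1):.0f} MB")
print(f"5 GB WAV holds ~{wav_hours:.1f} hours")
print(f"Same audio at 96 kbps AAC: ~{size_mb(96, wav_hours):.0f} MB")
```

Real files vary with variable-bitrate encoding and metadata, so treat the output as an estimate rather than an exact prediction.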

Or split into hour-long segments:

ffmpeg -i input.mp3 -f segment -segment_time 3600 -c copy part_%03d.mp3

Produces one file per hour. Upload each separately, then concatenate the resulting .srt files with timestamp offsets. Cross-reference the SRT generator page for the offset math.
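As a rough illustration of the offset step, the sketch below shifts each part's timestamps by its position times the segment length and renumbers the cues. The function names are hypothetical, and it assumes every part is exactly `segment_seconds` long, which is only approximately true when FFmpeg splits with `-c copy`:

```python
import re

TIME_RE = re.compile(r"(\d{2,}):(\d{2}):(\d{2}),(\d{3})")


def shift_timestamp(match, offset_ms):
    """Return the matched HH:MM:SS,mmm timestamp moved forward by offset_ms."""
    h, m, s, ms = (int(g) for g in match.groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
    h, rem = divmod(total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def merge_srt_parts(parts, segment_seconds=3600):
    """Merge SRT texts (in playback order) into one renumbered SRT string."""
    merged = []
    index = 1
    for part_no, text in enumerate(parts):
        offset_ms = part_no * segment_seconds * 1000
        for block in re.split(r"\n\s*\n", text.strip()):
            lines = block.splitlines()
            if len(lines) < 2 or "-->" not in lines[1]:
                continue  # skip empty or malformed blocks
            timing = TIME_RE.sub(lambda m: shift_timestamp(m, offset_ms), lines[1])
            merged.append("\n".join([str(index), timing, *lines[2:]]))
            index += 1
    return "\n\n".join(merged) + "\n"


part1 = "1\n00:00:01,000 --> 00:00:02,500\nHello\n"
part2 = "1\n00:00:00,500 --> 00:00:03,000\nWorld\n"
print(merge_srt_parts([part1, part2]))
```

If drift matters, measure each part's real duration (for example with ffprobe) and accumulate those values instead of a fixed 3600 seconds.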

Free extraction paths

Three legitimate free paths, each fits a different case:

1. Self-hosted Whisper (private, no upload)

Install once, process any audio locally:

pip install openai-whisper

Then:

whisper input.mp3 --model large-v3 --output_format srt

Free forever, no upload, no size limit. Speed:

  • CPU: 10–30× slower than real-time (1-hour audio takes 10–30 hours)
  • Apple Silicon: near real-time with MLX Whisper
  • NVIDIA GPU (RTX 3060+): real-time to 2× faster than real-time

Fits: sensitive audio (medical, legal, PHI), one-time batch of a big archive, learning the underlying model. Downside: no browser editor, no speaker labels out of the box (add WhisperX for that), requires Python setup.
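If you would rather script the same job than use the command line, the openai-whisper package also exposes a Python API. A minimal sketch: `whisper.load_model` and `model.transcribe` come from that package, while the helper names are illustrative and the code assumes each returned segment carries `start`, `end`, and `text` keys:

```python
def format_timestamp(seconds):
    """Convert seconds (float) to the SRT timestamp form HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render a list of {'start', 'end', 'text'} dicts as SRT text."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        timing = f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}"
        cues.append(f"{i}\n{timing}\n{seg['text'].strip()}")
    return "\n\n".join(cues) + "\n"


def transcribe_to_srt(audio_path, model_name="large-v3"):
    """Transcribe locally with Whisper and return SRT text."""
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])


# Example (requires openai-whisper and a local audio file):
# print(transcribe_to_srt("input.mp3"))
```

Keeping the formatting helpers separate from the model call makes it easy to post-process segments (merge, split, filter) before writing the file.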

2. YouTube Studio (if you own or will make a matching video)

If your audio has an accompanying video (or you plan to make one), YouTube’s auto-captions download works:

  1. Upload video to YouTube (unlisted is fine)
  2. Wait 5–30 minutes for auto-caption processing
  3. YouTube Studio → Subtitles → three-dot menu on the English (auto-generated) track → Download
  4. Choose .vtt or .sbv; convert to .srt with any free tool (Subtitle Edit does it in one click)

Free for any video you can upload. Quality: good English, mediocre non-English. Time cost: 5–30 min plus upload time.

3. DeluxeScribe free tier

60 minutes of full-app credit, one-time, no credit card. Fits: creators trying the workflow before committing, or occasional users with less than an hour of audio.

AI timing accuracy — the caveat every vendor hides

Two distinct accuracy dimensions that vendor pages conflate:

  • Word-level timestamps. Whisper is accurate to ~200ms per word on clean speech. That’s professional-grade timing.
  • Segment (caption cue) boundaries. Whisper groups words into cues using pause detection. These boundaries don’t respect professional captioning rules such as a 42-character line maximum, a 17 CPS reading-speed limit, and the other constraints found in style guides like the BBC’s subtitle guidelines or Netflix’s Timed Text Style Guide.

For audio-only content, use-case impact varies:

| Use case | AI .srt as-is? | Cleanup time (30-min audio) |
| --- | --- | --- |
| Podcast transcript (Podcasting 2.0) | Yes | 0-5 min |
| Blog post embedded transcript | Yes | 0-5 min |
| Course platform caption (LMS) | Sometimes | 10-15 min |
| Later video release (YouTube) | Yes | 0-10 min |
| Streaming platform QC (Netflix, HBO) | No | 30-60 min or captioner |
| Broadcast delivery (BBC iPlayer) | No | Send to captioner (§8) |
| Translation pivot (source-language clean) | Yes | 10-15 min |

For serious cleanup work, use Subtitle Edit (free) or Aegisub. Both have built-in CPS validators and line-break tooling. See our SRT generator page for the full segmentation rules.
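For a quick scripted check before opening an editor, something like the following flags cues that break the limits mentioned above. It is a simplified sketch, not a replacement for Subtitle Edit's validator: the function names are illustrative, the 17 CPS and 42-character defaults are taken from this page, and it counts every character including spaces, which not every style guide does:

```python
import re

CUE_RE = re.compile(
    r"(\d{2,}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2,}):(\d{2}):(\d{2}),(\d{3})"
)


def to_seconds(h, m, s, ms):
    """Convert timestamp parts (as strings) to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def check_cues(srt_text, max_cps=17.0, max_line_len=42):
    """Return (cue_number, problem) tuples for cues over the given limits."""
    problems = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        match = CUE_RE.search(lines[1])
        if not match:
            continue
        parts = match.groups()
        duration = to_seconds(*parts[4:]) - to_seconds(*parts[:4])
        text_lines = lines[2:]
        chars = sum(len(line) for line in text_lines)
        cps = chars / duration if duration > 0 else float("inf")
        if cps > max_cps:
            problems.append((lines[0], f"{cps:.1f} CPS"))
        for line in text_lines:
            if len(line) > max_line_len:
                problems.append((lines[0], f"{len(line)}-char line"))
    return problems


sample = (
    "1\n00:00:00,000 --> 00:00:01,000\n"
    "This cue has far too many characters for one second\n\n"
    "2\n00:00:02,000 --> 00:00:05,000\nShort and slow\n"
)
for cue, problem in check_cues(sample):
    print(f"cue {cue}: {problem}")
```

Pass different `max_cps` and `max_line_len` values to match whichever delivery spec you are targeting.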

What to do with the .srt (audio-first workflows)

The section every SERP competitor skips. Six honest downstream paths for audio-first creators:

A. Attach to a matching video (later)

If your workflow is audio-first, video-later (music video, animated explainer, YouTube commentary edited after the recording), keep the .srt for when the video is ready. Options:

  • Mux as soft-sub (viewer can toggle on/off):
    ffmpeg -i video.mp4 -i subs.srt -c copy -c:s mov_text output.mp4
  • Burn-in for social platforms (TikTok, Reels, Shorts):
    ffmpeg -i video.mp4 -vf "subtitles=subs.srt" -c:a copy output.mp4
  • Import into your NLE. Premiere, DaVinci Resolve, Final Cut Pro, CapCut: see Video to SRT §9 for editor-specific import steps.

B. Publish as podcast transcript (Podcasting 2.0)

The Podcasting 2.0 <podcast:transcript> namespace element lets you publish a transcript file alongside each podcast episode in your RSS feed. Modern independent podcast apps (Podverse, Fountain, Podcast Guru, CurioCaster) render the tag natively. Apple Podcasts generates its own transcripts by default, but it can also display a creator-supplied VTT or SRT file referenced by this tag when that option is enabled in Apple Podcasts Connect, so the open standard reaches most of the ecosystem.

Example RSS snippet:

<item>
  <title>Episode 42 — On Transcripts</title>
  <enclosure url="https://example.com/ep42.mp3" length="42000000" type="audio/mpeg"/>
  <podcast:transcript
    url="https://example.com/ep42.srt"
    type="application/x-subrip"
    language="en"
    rel="captions"/>
  <podcast:transcript
    url="https://example.com/ep42.vtt"
    type="text/vtt"
    language="en"/>
</item>

Multiple <podcast:transcript> elements per episode are supported: different formats, different languages, captions vs full transcript. Cross-reference Podcast Transcription for the full workflow, format choice (VTT vs SRT vs JSON), and host-side considerations.
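If you build your feed with a script rather than through a podcast host, the same elements can be generated with Python's standard library. A minimal sketch, assuming the Podcast Namespace URI `https://podcastindex.org/namespace/1.0`; the `add_transcript` helper and the example.com URLs are illustrative:

```python
import xml.etree.ElementTree as ET

PODCAST_NS = "https://podcastindex.org/namespace/1.0"
ET.register_namespace("podcast", PODCAST_NS)  # serialize with the podcast: prefix


def add_transcript(item, url, mime_type, language=None, rel=None):
    """Append one <podcast:transcript> element to an RSS <item>."""
    attrs = {"url": url, "type": mime_type}
    if language:
        attrs["language"] = language
    if rel:
        attrs["rel"] = rel
    return ET.SubElement(item, f"{{{PODCAST_NS}}}transcript", attrs)


item = ET.Element("item")
ET.SubElement(item, "title").text = "Episode 42: On Transcripts"
add_transcript(item, "https://example.com/ep42.srt", "application/x-subrip", "en", "captions")
add_transcript(item, "https://example.com/ep42.vtt", "text/vtt", "en")

xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

In a full feed the namespace declaration normally sits on the root <rss> element; serializing a lone <item> as done here places it on the item instead.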

C. Translation pivot for multi-language captions

.srt is plain text so any translation workflow preserves timing. Two paths:

  • Manual translation (highest quality): copy each cue’s text lines to a translator or LLM, replace, preserve timestamps. Use for accuracy-critical content.
  • Automated translation: DeepL, Google Translate, or Subtitle Edit’s built-in translate feature. Fast, less accurate on idiom-heavy content.

Timing gotcha: translated text length varies (Spanish +25% vs English, Japanese -10%). CPS may exceed the 17-characters-per-second limit after translation, so you may need to re-segment or trim the text. Subtitle Edit’s CPS validator flags this automatically.
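The "replace text, keep timestamps" step can be sketched as a small function that walks the cues and hands only the text to a translation callback. The `translate` argument is a hypothetical stand-in for whatever you use (a DeepL client, an LLM call, a human-reviewed lookup table); the example plugs in `str.upper` purely to show that timing lines pass through untouched:

```python
import re


def translate_srt(srt_text, translate):
    """Return SRT text with each cue's text passed through translate().

    Cue numbers and timing lines are copied unchanged. Multi-line cues are
    joined into one string before translation, so line breaks may need
    redoing afterwards.
    """
    out = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            out.append(block)  # leave anything unexpected untouched
            continue
        translated = translate(" ".join(lines[2:]))
        out.append("\n".join([lines[0], lines[1], translated]))
    return "\n\n".join(out) + "\n"


source = "1\n00:00:01,000 --> 00:00:02,000\nhello\nworld\n"
print(translate_srt(source, str.upper))
```

Translating cue by cue keeps timing intact but loses cross-cue context, so sentences split over several cues may read better if you batch neighbouring cues into a single request.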

D. Embed in a course platform

Major course platform caption support:

  • Kajabi: .vtt for HTML5 video player captions
  • Teachable: .vtt or .srt via Wistia/YouTube video hosting
  • Thinkific: .vtt native support
  • Podia: .vtt via video host
  • LearnWorlds: .vtt and .srt native support
  • Vimeo Business/Premium (common LMS backend): both formats

Rule of thumb: check the platform’s caption docs first. Most modern platforms prefer .vtt (W3C web standard). DeluxeScribe exports both from the same transcription in one flow.

E. Accessibility compliance (WCAG)

WCAG 2.1 Success Criterion 1.2.1 requires a transcript (or equivalent text alternative) for prerecorded audio-only content, and Success Criterion 1.2.2 requires captions for prerecorded video with audio. A properly timed .srt satisfies the captions requirement when paired with a video; a plain-text version satisfies the audio-only transcript requirement. Accessibility of this kind is a legal requirement in many US states and EU jurisdictions for public-facing content, government websites, and educational institutions.

F. Feed to an LLM (structured input)

Timestamped .srt is a better LLM input than raw text — the timing signals give the model structure and let you cross-reference model output back to the source audio. Common uses:

  • Summarization with time-anchored bullet points
  • Chapter marker generation from topic transitions
  • Quote extraction with timestamp citations
  • Show-notes drafting from long-form audio

For structured LLM pipelines specifically, DeluxeScribe also exports word-level JSON — more precise than .srt’s cue-level timing.

Format-specific quirks

Brief format-specific notes. Deeper coverage in the format-specific pages we link to.

  • MP3. Below 96 kbps, high-frequency information stripping degrades consonant accuracy on voice content. Above 128 kbps for voice is safe. See MP3 to Text for the full bitrate-vs-accuracy discussion.
  • WAV. Highest AI accuracy because uncompressed, but large files. A 1-hour 44.1 kHz stereo WAV = ~600 MB. See WAV to Text for the file-size math and FFmpeg compression command.
  • M4A / AAC. Apple’s default AAC container. Usually 128-256 kbps AAC audio; transcription accuracy matches MP3 at 128+ kbps. Occasional edge case: M4A can contain non-AAC codecs (rare Opus in M4A). See M4A to Text for iPhone Voice Memo specifics.
  • OGG. Open-format container, usually holds Vorbis or Opus. Both are voice-friendly codecs; transcription accuracy is good.
  • OPUS. Modern speech codec (RFC 6716). Outperforms MP3 at low bitrates — a 32 kbps Opus voice recording transcribes about as accurately as 128 kbps MP3. Common in WhatsApp voice notes and modern messaging. Accepted directly in OGG or WebM containers.
  • FLAC. Lossless compressed. Same accuracy as WAV, ~50% smaller files. Good middle-ground for archival + transcription workflows.
  • WMA. Windows Media Audio. Older format; accepted but consider converting to MP3 for broader compatibility (ffmpeg -i input.wma -c:a libmp3lame -b:a 128k output.mp3).
  • AIFF. Apple’s uncompressed container (like WAV but Mac-native). Same accuracy as WAV, same size penalty.

When a human captioner fits better

Cases where AI-generated .srt (from DeluxeScribe or anyone else) is not the right choice:

  • Broadcast / OTT delivery with QC. BBC iPlayer, Netflix, HBO Max, Apple TV+, Disney+ all require specific timing standards that AI doesn’t deliver reliably. Spec compliance is mandatory; use a captioner.
  • WCAG AAA compliance for legal or regulatory work (courtroom audio evidence, medical patient education, government training).
  • High-visibility marketing audio where mis-captions damage the brand: think investor day keynote audio, CEO announcement, major PR event.
  • Non-English content where the client is a native speaker of the target language. AI accuracy on non-English speech is worse than English; a native captioner catches errors AI misses.
  • Low-quality audio (poor mic, heavy background noise, distant speakers). AI struggles here; humans do better with context.

Named vendors — no rankings, just honest options:

  • Rev human tier: $1.50/audio min plus captioner review, US-based, fast turnaround
  • 3Play Media: broadcast-standard, dominant vendor in US streaming and broadcast
  • GoTranscript human tier: $1.10–$2.50/audio min, wider language coverage
  • Verbit: specialized in education, legal, government; AAA compliance standard

How this page was verified

SRT format details verified against the FFmpeg codec documentation and Matroska’s subtitle format reference. Podcasting 2.0 <podcast:transcript> tag spec verified against the Podcast Namespace repository. Opus codec details come from RFC 6716. WCAG requirements for audio-only content transcripts come from WCAG 2.1 Success Criterion 1.2.1. AI timing accuracy ranges come from our own testing of 10 sample audio files against reference human transcripts — see How Accurate Is Whisper for the broader WER-by-condition breakdown. We don’t cite the “98 languages” and “most accurate AI in the world” claims common on this SERP because we can’t source them to a published benchmark on realistic audio.

Frequently Asked Questions

Can I convert any audio to .srt?

Any audio file with speech, in any common format. MP3, WAV, M4A, OGG, OPUS, FLAC, WMA, AAC, AIFF all accepted. Non-speech audio (music-only, silent tracks, tone recordings) produces empty output — nothing to transcribe. Non-English content works in 99 languages via automatic detection; give a language hint if the recording is under 30 seconds or heavily accented.

What audio formats work?

MP3, WAV, M4A, AAC, OGG, OPUS, FLAC, WMA, AIFF, and most audio codecs commonly found in the wild. Container support includes MP3, WAV, M4A/MP4 audio-only, OGG, WebM audio-only, FLAC, and legacy formats. Non-standard configurations (WAV with Opus inside, M4A with legacy codecs) may need FFmpeg conversion first; see the format-specific quirks section above.

What's the free way to get .srt from audio?

Two honest paths. (1) Self-hosted Whisper: pip install openai-whisper, then whisper input.mp3 --model large-v3 --output_format srt. Free forever, private, no upload. Slow on CPU (10-30x real-time), fast on Apple Silicon or NVIDIA GPU. (2) DeluxeScribe free tier: 60 minutes free one-time, no credit card. If you also have or plan to make a video, YouTube Studio auto-caption download works for videos you own.

How accurate is AI-generated .srt from audio?

Word-level timestamps are accurate to ~200ms on clean speech. Segment boundaries (where each caption cue starts and ends) are heuristic, based on pause detection. For podcast transcripts, blog embeds, and translation pivots, AI output is fine as-is. For streaming platform QC (Netflix, HBO, Apple TV+) or broadcast delivery, the AI output needs manual segmentation cleanup or should go to a human captioner. See the timing accuracy section above.

Can I convert MP3 to .srt without downloading anything?

Yes — sign up for DeluxeScribe, upload the MP3 in the browser, get a .srt back within 1-3 minutes. No download, no software install. If you want to avoid the account signup, self-hosted Whisper is the free-forever alternative but requires installing Python and running commands locally.

What do I do with the .srt when I only have audio, no video?

Four common paths for audio-first creators. (1) Publish as a podcast transcript via Podcasting 2.0's <podcast:transcript> tag; Podverse, Fountain, Podcast Guru, and CurioCaster render these natively. (2) Use as a translation pivot for multi-language captions. (3) Embed in a course platform (Kajabi, Teachable, Thinkific all support .vtt and most support .srt). (4) Keep it for when the video is ready and attach later. The full downstream workflow section is above.

Does DeluxeScribe support Podcasting 2.0 transcript tags?

The exported .srt or .vtt file is compatible with the Podcasting 2.0 <podcast:transcript> RSS extension — just host the file on your CDN and reference the URL in your feed. DeluxeScribe doesn't currently manage the RSS feed for you (that's your podcast host's job). See our podcast transcription guide for the full Podcasting 2.0 spec, example RSS snippet, and hosting recommendations.

Can I translate an .srt to another language?

Yes — .srt is plain text so any translation workflow preserves the timing. Two paths. (1) Manual: copy each cue's text lines to a translator, replace, keep timestamps. (2) Automated: DeepL, Google Translate, or specialized subtitle-translation tools like Subtitle Edit's built-in translate feature. Timing gotcha: translated text length differs (Spanish +25% vs English, Japanese -10%) so CPS may exceed limits — you may need to re-segment.

What's the difference between .srt and .vtt?

.srt (SubRip) is the oldest and most widely supported subtitle format — plain text with numbered cues and HH:MM:SS,MMM timestamps. .vtt (WebVTT) is the W3C standard for web video, supports styling and positioning natively, uses HH:MM:SS.MMM (dot instead of comma). For video editors and desktop players, .srt is the safer choice. For Podcasting 2.0 apps and HTML5 web video, .vtt is preferred. DeluxeScribe exports both.
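Because the two formats are so close, a basic conversion is only a few lines. The sketch below (the function name is illustrative) adds the required WEBVTT header and swaps the comma for a dot on timing lines only; it keeps the numeric cue identifiers, which WebVTT allows, and does not touch styling or positioning:

```python
import re

TIMING_RE = re.compile(r"(\d{2,}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text):
    """Convert SRT text to minimal WebVTT text."""
    body = []
    for line in srt_text.lstrip("\ufeff").strip().splitlines():
        if "-->" in line:
            # Only timing lines change: HH:MM:SS,mmm becomes HH:MM:SS.mmm
            line = TIMING_RE.sub(r"\1.\2", line)
        body.append(line)
    return "WEBVTT\n\n" + "\n".join(body) + "\n"


print(srt_to_vtt("1\n00:00:01,000 --> 00:00:02,500\nHello, world\n"))
```

Commas inside the caption text are left alone because only lines containing the `-->` arrow are rewritten.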

Can I get audio from an .srt file (reverse direction)?

That's TTS (text-to-speech), not our tool. If you have an .srt and want AI-generated audio narration, use ElevenLabs, Play.ht, or Descript's Overdub. Those services synthesize speech from text — a fundamentally different capability. This page covers the audio-to-.srt direction only.