MP3 to Text: How to convert audio recordings into transcripts
Four ranked methods, honest accuracy expectations, and the bitrate truth most pages get wrong.
- 60 minutes free
- No credit card
- 99 languages
- Speaker labels
Last verified June 23, 2026
The 4 ways to convert MP3 to text (free and paid)
Ranked by what most people actually need: an honest tradeoff between cost, quality, setup time, and privacy.
| Method | Cost | Quality | Setup | Privacy |
|---|---|---|---|---|
| DeluxeScribe (or any modern AI service) | Free trial → $10/mo | High | None | Cloud-encrypted |
| YouTube unlisted upload | Free | Medium | Google account | Public-with-URL |
| Self-hosted Whisper | Free | High | Python install | Local |
| Apple Voice Memos / Live Captions | Free | Medium-high | iPhone or Mac | On-device |
1. DeluxeScribe (or any modern AI transcription service)
Upload the MP3, wait 5–10 minutes per hour of audio, download in any of six formats. Best for: lots of files, multi-language content, anything you need speaker labels for, or files where you want a clickable transcript editor instead of plain text. DeluxeScribe’s free tier (60 minutes, no card) is enough to transcribe a typical interview series or podcast episode without paying.
2. YouTube unlisted upload + caption download
Upload your MP3 to YouTube as “unlisted” (convert to MP4 with a static image first — ffmpeg one-liner below), wait for auto-captions to finish, then download the SRT. Free, useful for short clips, but two catches: “unlisted” videos are still publicly accessible to anyone with the URL, and YouTube’s captions are noticeably less accurate than modern dedicated ASR.
ffmpeg -loop 1 -i still.jpg -i audio.mp3 -c:v libx264 -c:a copy -shortest video.mp4
3. Self-hosted Whisper
OpenAI’s Whisper model is free and open. With Python:
pip install openai-whisper
whisper recording.mp3 --model large-v3 --output_format txt
Fully private (audio never leaves your machine), high accuracy, and supports the same 99 languages as commercial services. The real cost is your own time and hardware — a 1-hour MP3 takes 10–30 hours on a CPU, ~5 minutes on a recent GPU. Worth it for recurring use or sensitive content; overkill for a one-off 10-minute clip.
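If you would rather call Whisper from a script than from the command line, the sketch below shows one way to do it with the `openai-whisper` package's Python API (`whisper.load_model` and `model.transcribe`). The helper names `format_timestamp`, `segments_to_lines`, and `transcribe_file`, and the `[HH:MM:SS] text` line layout, are illustrative choices rather than anything Whisper prescribes.

```python
from pathlib import Path


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_lines(segments) -> list:
    """Turn Whisper-style segments (dicts with 'start' and 'text') into text lines."""
    return [
        f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}"
        for seg in segments
    ]


def transcribe_file(mp3_path: str, model_name: str = "large-v3") -> Path:
    """Transcribe one file locally and write a .txt file next to it."""
    # Imported lazily so the helpers above work without the package installed.
    import whisper  # provided by `pip install openai-whisper`

    model = whisper.load_model(model_name)
    result = model.transcribe(mp3_path)  # dict with "text" and "segments"
    out_path = Path(mp3_path).with_suffix(".txt")
    out_path.write_text(
        "\n".join(segments_to_lines(result["segments"])), encoding="utf-8"
    )
    return out_path
```

Calling `transcribe_file("recording.mp3")` would write `recording.txt` beside the audio file. The first run downloads the model weights, and Whisper needs `ffmpeg` on the PATH to decode MP3.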
4. Apple Voice Memos / Live Captions
On an iPhone running iOS 18 or later, the Voice Memos app transcribes recordings on-device for English. Files made elsewhere can be opened in Voice Memos and transcribed the same way. Limits: only English (officially), no speaker labels, no SRT export. For a quick English-only voice memo transcript, it’s the easiest free option.
Accuracy: what you can actually expect
Every transcription product claims “99% accurate.” None of them are, in the general case. Accuracy depends entirely on the recording conditions — the same ASR model will hit 97% on a podcast and 75% on a phone call. Here’s a realistic range:
| Recording type | Realistic accuracy | Notes |
|---|---|---|
| Studio podcast (single mic, post-production) | 95–98% | Best case for ASR. |
| Zoom meeting (good headset) | 90–95% | Compression artefacts cost a few points. |
| Phone interview / VoIP | 80–90% | Bandlimited audio drops harder consonants. |
| Lecture hall from back row | 75–85% | Reverb and distance hurt more than noise. |
| Outdoor or on-the-street | 60–80% | Wind and traffic dominate. Editing can take longer than typing it out. |
The honest takeaway: AI transcription is great as a first draft for most content. Expect to spend 10–20% of the audio runtime reviewing and correcting — much less than typing from scratch, but not zero.
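To check these numbers on your own recordings, you can hand-correct a short stretch of transcript and compare it with the raw output. The sketch below assumes "accuracy" means one minus word error rate (WER), which is the usual convention for ASR but is not something this page defines. The function name `word_error_rate` and its simple normalization (lowercasing, splitting on whitespace) are illustrative choices.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Accuracy in the sense of the table above would then be roughly `1 - word_error_rate(corrected, raw)`. Punctuation is not stripped here, so differences in punctuation count as errors unless you normalize both strings first.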
Does MP3 bitrate matter? (Mostly no)
The marketing pages that warn you about “quality loss from low bitrate” are mostly trying to upsell you into a premium plan. Here’s the actual engineering:
Modern ASR models (Whisper, Conformer, wav2vec) work on audio resampled to 16 kHz; Whisper and Conformer convert it to mel-spectrograms, while wav2vec reads the raw waveform. A 16 kHz sample rate captures everything from 0 to 8 kHz, which covers nearly all of the information in human speech. MP3 is lossy at every bitrate, but at 64 kbps and above it preserves the speech band well enough that the model sees essentially the same signal. The compression mostly throws away high-frequency content above 12 kHz that the model never looks at anyway.
Practical result: 320 kbps vs 128 kbps vs 64 kbps produces roughly the same transcription accuracy (within 1%) on the same source recording. The thing that actually matters is the source recording quality, not the compression applied afterward.
When to actually worry: MP3s encoded below 32 kbps, often generated by old voicemail systems or analog-to-digital phone recorders. At that bitrate, even mid-frequencies start to degrade. If you have control over the source, record at 96 kbps or higher and don’t re-encode.
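A rough way to see which side of these thresholds a file falls on is to estimate its average bitrate from file size and duration. The sketch below uses the 32 kbps and 64 kbps figures from this section. The "borderline" label for the range in between is an assumption, since the text gives no guidance there, and both function names are illustrative.

```python
def average_bitrate_kbps(file_size_bytes: int, duration_seconds: float) -> float:
    """Estimate average bitrate; ID3 tags and cover art inflate it slightly."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return file_size_bytes * 8 / duration_seconds / 1000


def bitrate_verdict(kbps: float) -> str:
    """Map a bitrate to the guidance given in this section."""
    if kbps < 32:
        return "worry"       # mid-frequencies start to degrade
    if kbps < 64:
        return "borderline"  # assumption: not covered by the text above
    return "fine"            # speech band is well preserved
```

For example, a one-hour file of about 57.6 MB works out to 128 kbps, which is comfortably in the "fine" range.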
Export formats (and which to pick)
| Format | Best for | Includes timestamps |
|---|---|---|
| TXT | Pasting into a doc, search | No (or inline) |
| DOCX | Sharing with non-technical users | Optional |
| PDF | Archive, deliverable | Optional |
| SRT | Video subtitles | Yes (cue-level) |
| VTT | HTML5 video captions | Yes (cue-level) |
| JSON | Programmatic use, custom pipelines | Yes (word-level) |
For most people: DOCX for sharing, JSON if you’re building anything programmatic, SRT/VTT if you’re going to video. TXT and PDF are useful but less flexible.
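As an illustration of why word-level JSON is the most flexible export, the sketch below groups word timings into SRT cues. The input shape (a list of objects with `word`, `start`, and `end` in seconds) is a hypothetical schema, not DeluxeScribe's documented format, and the seven-words-per-cue default is an arbitrary choice.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words: int = 7) -> str:
    """Group word-level timings into numbered SRT cues of up to max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w["word"] for w in group)
        start = srt_time(group[0]["start"])   # cue starts with its first word
        end = srt_time(group[-1]["end"])      # and ends with its last word
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Going the other way, from SRT back to word timings, is not possible, which is the practical argument for keeping the JSON export around even if you only need subtitles today.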
Common failures and how to fix them
“Silent” MP3
The most common cause is a container/codec mismatch — the file has the .mp3 extension but actually contains AAC, Opus, or no audio stream at all. Check with ffprobe yourfile.mp3. If it’s not actually MP3, re-encode: ffmpeg -i input.xxx -acodec libmp3lame output.mp3.
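If you have many files to check, the ffprobe step can be scripted. This is a minimal sketch that asks ffprobe for JSON output and reads the codec of the first audio stream. The function names are illustrative, and `probe_audio_codec` assumes `ffprobe` is installed and on the PATH.

```python
import json
import subprocess


def first_audio_codec(ffprobe_json: str):
    """Return the codec name of the first audio stream, or None if there is none."""
    streams = json.loads(ffprobe_json).get("streams", [])
    for stream in streams:
        if stream.get("codec_type") == "audio":
            return stream.get("codec_name")
    return None


def probe_audio_codec(path: str):
    """Run ffprobe on a file and report its audio codec (e.g. 'mp3', 'aac')."""
    completed = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if ffprobe cannot read the file
    )
    return first_audio_codec(completed.stdout)
```

A result of `None` means the file has no audio stream at all. Anything other than `"mp3"` means the extension is misleading and the re-encode command above applies.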
M4P confusion
.m4p is a different format: AAC audio with FairPlay DRM, used by some legacy iTunes Store purchases. It’s not the same as MP3 or M4A, and most transcription services reject it. Strip the DRM by converting in iTunes (when allowed) or re-record.
File too large
Most services cap individual uploads at 1–5 GB. For longer content, split with ffmpeg into smaller chunks:
ffmpeg -i long.mp3 -c copy -segment_time 1800 -f segment chunk_%03d.mp3
Wrong language detected
Bilingual recordings (English + Spanish in the same file) confuse auto-detection. Either set the language manually in the transcription settings, or split the file by language section first.
When DeluxeScribe is the right tool
The right answer is us when you have multiple files, need speaker labels, want exportable formats beyond plain text, work in a language other than English, or want a transcript editor to clean up the result. Our 60-minute free tier covers a typical interview series or podcast episode without paying anything.
The right answer isn’t us when the content is sensitive enough that an encrypted cloud upload is still too much risk (run Whisper locally), or when it’s a 90-second voice memo (use Apple’s built-in for free).
Related guides
- M4A to Text: iPhone Voice Memos and other M4A files. Covers iOS 18's built-in transcription.
- MP4 to Text: Transcribe video files. Includes the free YouTube workaround and ffmpeg extraction.
- SRT Generator: Generate subtitle files with the timing rules Netflix and BBC use.
- Pricing: 60 minutes free, then $10/month for 600 minutes. Per-minute rates beyond that.