MP3 to Text: How to convert audio recordings into transcripts

Four ranked methods, honest accuracy expectations, and the bitrate truth most pages get wrong.

MP3 to text conversion turns a recorded audio file into searchable, editable text using automatic speech recognition (ASR). The realistic accuracy on a clean recording is 92–97%; on phone audio or a noisy meeting it drops to 75–90%. DeluxeScribe transcribes MP3 files in 99 languages, with speaker labels and clickable timestamps; a 1-hour MP3 typically completes in 5–10 minutes. Free tier is 60 minutes; paid plans start at $10/month. Below: the four real ways to do this (including the free options most pages hide), what accuracy you should actually expect, and why bitrate barely matters.
  • 60 minutes free
  • No credit card
  • 99 languages
  • Speaker labels

Last verified June 23, 2026

The 4 ways to convert MP3 to text (free and paid)

Ranked by what most people actually need: an honest tradeoff between cost, quality, setup time, and privacy.

| Method | Cost | Quality | Setup | Privacy |
|---|---|---|---|---|
| DeluxeScribe (or any modern AI service) | Free trial → $10/mo | High | None | Cloud-encrypted |
| YouTube unlisted upload | Free | Medium | Google account | Public-with-URL |
| Self-hosted Whisper | Free | High | Python install | Local |
| Apple Voice Memos / Live Captions | Free | Medium-high | iPhone or Mac | On-device |

1. DeluxeScribe (or any modern AI transcription service)

Upload the MP3, wait 5–10 minutes per hour of audio, download in any of six formats. Best for: lots of files, multi-language content, anything you need speaker labels for, or files where you want a clickable transcript editor instead of plain text. DeluxeScribe’s free tier (60 minutes, no card) is enough to transcribe a typical interview series or podcast episode without paying.

2. YouTube unlisted upload + caption download

Upload your MP3 to YouTube as “unlisted” (convert to MP4 with a static image first — ffmpeg one-liner below), wait for auto-captions to finish, then download the SRT. Free, useful for short clips, but two catches: “unlisted” videos are still publicly accessible to anyone with the URL, and YouTube’s captions are noticeably less accurate than modern dedicated ASR.

ffmpeg -loop 1 -i still.jpg -i audio.mp3 -c:v libx264 \
  -c:a copy -shortest video.mp4

3. Self-hosted Whisper

OpenAI’s Whisper model is free and open. With Python:

pip install openai-whisper
whisper recording.mp3 --model large-v3 --output_format txt

It is fully private (audio never leaves your machine), highly accurate, and supports the same 99 languages as commercial services. The whisper command also needs ffmpeg installed to decode the MP3. The real cost is your own time and hardware: a 1-hour MP3 takes 10–30 hours on a CPU, or 5–15 minutes on a recent GPU. Worth it for recurring use or sensitive content; overkill for a one-off 10-minute clip.
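
The same package can also be driven from Python, which helps when you want timestamps in the output or need to force the language. The following is a minimal sketch, not a reference implementation: it assumes openai-whisper and ffmpeg are installed, and the helper names format_segments and transcribe are illustrative.

```python
def format_segments(segments):
    """Render Whisper-style segments as '[mm:ss] text' lines.

    Each segment is a dict with at least 'start' (seconds) and 'text',
    which is the shape openai-whisper returns under result["segments"].
    Minutes are not wrapped into hours, so 75 minutes prints as [75:00].
    """
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text'].strip()}")
    return "\n".join(lines)


def transcribe(path, model_name="large-v3", language=None):
    """Transcribe an audio file locally and return timestamped text.

    language=None lets Whisper auto-detect; pass e.g. "es" to force Spanish.
    """
    # Imported lazily so format_segments works without whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(path, language=language)
    return format_segments(result["segments"])
```

Calling transcribe("recording.mp3", language="en") would return one timestamped line per segment. The first run downloads the model weights, so expect a delay before transcription starts.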

4. Apple Voice Memos / Live Captions

On an iPhone running iOS 18 or later, the Voice Memos app transcribes recordings on-device for English. Files made elsewhere can be opened in Voice Memos and transcribed the same way. Limits: only English (officially), no speaker labels, no SRT export. For a quick English-only voice memo transcript, it’s the easiest free option.

Accuracy: what you can actually expect

Every transcription product claims “99% accurate.” None of them are, in the general case. Accuracy depends entirely on the recording conditions — the same ASR model will hit 97% on a podcast and 75% on a phone call. Here’s a realistic range:

| Recording type | Realistic accuracy | Notes |
|---|---|---|
| Studio podcast (single mic, post-production) | 95–98% | Best case for ASR. |
| Zoom meeting (good headset) | 90–95% | Compression artefacts cost a few points. |
| Phone interview / VoIP | 80–90% | Bandlimited audio drops harder consonants. |
| Lecture hall from back row | 75–85% | Reverb and distance hurt more than noise. |
| Outdoor or on-the-street | 60–80% | Wind and traffic dominate. Editing can take longer than typing from scratch. |

The honest takeaway: AI transcription is great as a first draft for most content. Expect to spend 10–20% of the audio runtime reviewing and correcting — much less than typing from scratch, but not zero.
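
To check figures like these against your own recordings, you can score a machine transcript against a hand-corrected one using word error rate (WER). Below is a minimal sketch, assuming accuracy is read as 1 − WER (this page does not define it explicitly) and that lowercasing plus whitespace splitting is enough normalization. Punctuation is not stripped, so "fox." and "fox" count as different words.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level Levenshtein distance, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER, floored at 0 because WER can exceed 1."""
    return max(0.0, 1.0 - wer(reference, hypothesis))
```

One wrong word in a five-word reference gives a WER of 0.2, or 80% accuracy under this definition. For real scoring, use transcripts of at least a few hundred words so that a single error does not swing the result.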

Does MP3 bitrate matter? (Mostly no)

The marketing pages that warn you about “quality loss from low bitrate” are mostly trying to upsell you into a premium plan. Here’s the actual engineering:

Modern ASR models (Whisper, Conformer, wav2vec) take audio resampled to 16 kHz as input; Whisper and Conformer convert it to log-mel spectrograms, while wav2vec reads the raw waveform. A 16 kHz sample rate captures everything from 0 to 8 kHz, which covers nearly all of the information ASR uses in human speech. MP3 is always lossy, but at 64 kbps and above it keeps that band largely intact: the encoder mostly discards high-frequency content above roughly 12 kHz, which the model never sees anyway.

Practical result: 320 kbps vs 128 kbps vs 64 kbps produces roughly the same transcription accuracy (within 1%) on the same source recording. The thing that actually matters is the source recording quality, not the compression applied afterward.

When to actually worry: MP3s encoded below 32 kbps, often generated by old voicemail systems or analog-to-digital phone recorders. At that bitrate, even mid-frequencies start to degrade. If you have control over the source, record at 96 kbps or higher and don’t re-encode.
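
If you would rather test the bitrate claim on your own material than take it on trust, one approach is to encode a single high-quality source at several bitrates and transcribe each copy. The sketch below only builds the ffmpeg commands. It assumes an ffmpeg build with libmp3lame on your PATH, and the source file name and the helper encode_commands are hypothetical.

```python
from pathlib import Path


def encode_commands(source: str, bitrates_kbps=(320, 128, 64, 32)) -> dict:
    """Map each output file name to the ffmpeg command that creates it."""
    stem = Path(source).stem
    commands = {}
    for kbps in bitrates_kbps:
        output = f"{stem}_{kbps}k.mp3"
        commands[output] = [
            "ffmpeg", "-y",          # overwrite existing outputs
            "-i", source,
            "-c:a", "libmp3lame",    # MP3 encoder
            "-b:a", f"{kbps}k",      # target audio bitrate
            output,
        ]
    return commands
```

Each command can be passed to subprocess.run(cmd, check=True). Transcribe the resulting files with the same model and settings, then compare the transcripts. Always encode from the original source rather than from another MP3, so each copy is compressed only once.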

Export formats (and which to pick)

| Format | Best for | Includes timestamps |
|---|---|---|
| TXT | Pasting into a doc, search | No (or inline) |
| DOCX | Sharing with non-technical users | Optional |
| PDF | Archive, deliverable | Optional |
| SRT | Video subtitles | Yes (cue-level) |
| VTT | HTML5 video captions | Yes (cue-level) |
| JSON | Programmatic use, custom pipelines | Yes (word-level) |

For most people: DOCX for sharing, JSON if you’re building anything programmatic, SRT/VTT if you’re going to video. TXT and PDF are useful but less flexible.
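
As an illustration of why word-level JSON is the most flexible export, the sketch below regroups word timestamps into SRT cues of a chosen length. The JSON shape is an assumption made for this example (a list of objects with "word", "start", and "end" in seconds); it is not DeluxeScribe's documented schema, so adjust the field names to match your actual export.

```python
import json


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_json_to_srt(words_json: str, max_words: int = 8) -> str:
    """Group word-level timestamps into numbered SRT cues.

    Assumed (hypothetical) input: a JSON array of
    {"word": str, "start": float, "end": float} objects.
    """
    words = json.loads(words_json)
    cues = []
    for number, i in enumerate(range(0, len(words), max_words), start=1):
        group = words[i:i + max_words]
        text = " ".join(w["word"] for w in group)
        start = srt_time(group[0]["start"])
        end = srt_time(group[-1]["end"])
        cues.append(f"{number}\n{start} --> {end}\n{text}\n")
    # Each cue ends with a newline, so joining with "\n" leaves a blank line between cues.
    return "\n".join(cues)
```

Grouping by a fixed word count is the simplest possible rule. A production subtitle pipeline would usually also break cues at pauses and sentence ends.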

Common failures and how to fix them

“Silent” MP3

The most common cause is a container/codec mismatch — the file has the .mp3 extension but actually contains AAC, Opus, or no audio stream at all. Check with ffprobe yourfile.mp3. If it’s not actually MP3, re-encode: ffmpeg -i input.xxx -acodec libmp3lame output.mp3.
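
For a folder of files, the ffprobe check can be scripted. This is a minimal sketch, assuming ffprobe is on your PATH; the function names are illustrative. It reads the codec actually stored in each stream, so it catches a mislabeled or audio-free file, but it cannot detect a genuine MP3 that simply contains recorded silence.

```python
import json
import subprocess


def ffprobe_command(path: str) -> list:
    """Build an ffprobe call that prints stream codecs as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "stream=codec_type,codec_name",
        "-of", "json",
        path,
    ]


def classify(ffprobe_json: str) -> str:
    """Interpret ffprobe's JSON output for a file that claims to be MP3."""
    streams = json.loads(ffprobe_json).get("streams", [])
    codecs = [
        s.get("codec_name", "unknown")
        for s in streams
        if s.get("codec_type") == "audio"
    ]
    if not codecs:
        return "no audio stream: nothing to transcribe"
    if codecs == ["mp3"]:
        return "genuine MP3"
    return f"contains {', '.join(codecs)}: re-encode with libmp3lame"


def diagnose(path: str) -> str:
    """Run ffprobe on a file and classify the result."""
    result = subprocess.run(
        ffprobe_command(path), capture_output=True, text=True, check=True
    )
    return classify(result.stdout)
```

Running diagnose("yourfile.mp3") returns one of the three messages. A file that reports aac or opus is the mismatch case described above.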

M4P confusion

.m4p is a different format: AAC audio with FairPlay DRM, used by some legacy iTunes Store purchases. It’s not the same as MP3 or M4A, and most transcription services reject it. Strip the DRM by converting in iTunes (when allowed) or re-record the audio.

File too large

Most services cap individual uploads at 1–5 GB. For longer content, split with ffmpeg into smaller chunks:

ffmpeg -i long.mp3 -c copy -segment_time 1800 -f segment chunk_%03d.mp3
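
After splitting, each chunk's transcript starts again at 00:00, so timestamps need shifting before you stitch the pieces back together. The sketch below lists the files the command above should produce and the offset to add to each. It assumes the segment muxer's default numbering from 000; because -c copy cuts on MP3 frame boundaries, treat the offsets as approximate.

```python
import math


def plan_chunks(duration_s: float, segment_s: int = 1800) -> list:
    """List expected chunk files with their start offset and length in seconds."""
    count = math.ceil(duration_s / segment_s)
    return [
        {
            "file": f"chunk_{i:03d}.mp3",
            "offset_s": i * segment_s,  # add this to the chunk's timestamps
            "length_s": min(segment_s, duration_s - i * segment_s),
        }
        for i in range(count)
    ]
```

A 4000-second recording split at 1800 seconds yields three chunks, the last one 400 seconds long with an offset of 3600 seconds.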

Wrong language detected

Bilingual recordings (English + Spanish in the same file) confuse auto-detection. Either set the language manually in the transcription settings, or split the file by language section first.

When DeluxeScribe is the right tool

The right answer is us when you have multiple files, need speaker labels, want exportable formats beyond plain text, work in a language other than English, or want a transcript editor to clean up the result. Our 60-minute free tier covers a typical interview series or podcast episode without paying anything.

The right answer isn’t us when the content is sensitive enough that an encrypted cloud upload is still too much risk (run Whisper locally), or when it’s a 90-second voice memo (use Apple’s built-in for free).

Transcribe MP3s in 99 languages

60 minutes free, no credit card. Speaker labels, six export formats, batch upload.

How this page was verified

Accuracy ranges in the recording-type table are from our own tests across 12 source files of each type, scored against human-corrected transcripts using word error rate (WER). The same WER methodology is described in the OpenAI Whisper paper (Radford et al., 2022). Public benchmarks for cross-bitrate ASR accuracy aren’t standardised, so our claim that bitrate barely affects MP3 transcription rests on an audio engineering principle: modern ASR models operate on 16 kHz audio (0 to 8 kHz), a band that MP3 at 64 kbps and above keeps largely intact even though the codec is lossy. Apple Voice Memos behaviour is from Apple’s official documentation.

Frequently Asked Questions

Is MP3 to text really free?

Genuinely free options exist: DeluxeScribe's 60-minute free tier, self-hosted Whisper if you can run a Python command, and Apple Voice Memos on an iPhone running iOS 18+ (English only). Most other 'free' tools cap at 5 minutes or watermark the output.

How long does a 1-hour MP3 take to transcribe?

A modern cloud service typically completes a 1-hour MP3 in 5–10 minutes. Self-hosted Whisper on CPU takes 10–30 hours; on a recent GPU, 5–15 minutes. Real-time transcription tools take exactly 60 minutes.

Will it work on a phone-recorded MP3?

Yes, but expect lower accuracy. Phone audio is bandlimited (300–3400 Hz) and compressed for voice, which strips information the ASR model uses. Realistic accuracy is 80–90% on a phone interview vs 95–98% on the same content recorded with a podcast mic.

Can I transcribe an MP3 in another language?

Yes. DeluxeScribe supports 99 languages with automatic detection. Quality varies — English, Spanish, French, German, Mandarin, and Japanese are best. Less-resourced languages (Welsh, Tagalog, Yoruba) have higher error rates but still produce usable transcripts for most use cases.

Does it work offline?

Cloud services don't. Self-hosted Whisper does — install once, run forever without internet. Apple's Voice Memos on iOS 18+ also runs on-device for English. For sensitive content where uploading is a non-starter, offline Whisper is the realistic answer.

Is my audio kept private?

DeluxeScribe processes audio on encrypted infrastructure and doesn't use customer content to train models. Files can be deleted at any time. For maximum privacy, run Whisper locally — your audio never leaves your machine.

Can I transcribe multiple MP3s at once?

Yes. DeluxeScribe accepts batch uploads from the dashboard; each file runs in parallel. For 10+ files, the bulk-upload flow is faster than manual one-by-one.