How to Transcribe Audio: Every Way That Actually Works in 2026
Four paths, a decision tree to pick the right one, and honest accuracy expectations by audio condition.
- 60 minutes free
- No credit card
- 99 languages
- Speaker labels
Last verified June 25, 2026
TL;DR — pick your path
Match the row that fits your situation to the recommended path. The sections below explain each path in detail.
| What you have / want | Recommended path | Cost |
|---|---|---|
| An MP3 from a podcast or meeting | SaaS service · see MP3 to Text | Free tier → ~$10/mo |
| iPhone voice memo (recording you made) | iOS 18+ built-in or service · see Voice Memo to Text | Free → $10/mo |
| iPhone voicemail (someone left it for you) | Carrier Visual Voicemail or service · see Voicemail to Text | Free → $10/mo |
| Zoom recording, MP4 video file | SaaS service · see MP4 to Text | Free tier → ~$10/mo |
| M4A file (any source) | SaaS service · see M4A to Text | Free tier → ~$10/mo |
| Podcast episode (yours or someone else’s) | See Podcast Transcription | Free tier → ~$10/mo |
| Need subtitle files (SRT/VTT) | See SRT Generator | Free tier → ~$10/mo |
| Sensitive audio — can’t upload to cloud | Self-hosted Whisper | Free |
| Building transcription into your own product | AssemblyAI or Deepgram API | $0.12-0.43/hr |
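The table can also be read as a small decision function. The sketch below is only an illustration: the function name, the three flags, and the order in which they are checked are assumptions, not something the table specifies.

```python
def recommend_path(sensitive: bool, building_product: bool,
                   needs_subtitles: bool = False) -> str:
    """Return a recommended path, loosely following the TL;DR table.

    The priority order (product first, then privacy, then subtitles)
    is an assumption made for this sketch.
    """
    if building_product:
        return "AssemblyAI or Deepgram API"
    if sensitive:
        return "Self-hosted Whisper"
    if needs_subtitles:
        return "SaaS service with SRT/VTT export"
    return "SaaS service"


print(recommend_path(sensitive=True, building_product=False))
```

Format-specific rows (voice memos, voicemail, podcasts) are left out on purpose; they all lead to a service or a built-in option and are covered by the dedicated guides.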
The four paths, compared
Path 1 — SaaS transcription services
Upload audio in a browser, get a transcript back in minutes. The easiest path for most use cases. Free tiers exist; paid plans start around $10/month.
When SaaS wins: you have multiple files, you need a language other than English, you want speaker labels without setup, you need an in-browser editor to fix errors, or you want exports in specific formats (SRT/VTT for video, DOCX/PDF for sharing).
Honest tool comparison:
| Service | Free tier | Paid from | Best for |
|---|---|---|---|
| DeluxeScribe | 60 min one-time | $10/mo · 1,200 min | Multi-language, cheapest per minute, batch jobs |
| Otter | 300 min/mo | $17/mo | Meeting-style use, calendar integration |
| Rev (AI tier) | None | $0.25/min | Pay-per-minute, optional human review tier ($1.50/min) |
| Descript | 1 hour/mo | $24/mo · 30 hours | Edit audio by editing the transcript text |
| Trint | None | $48/mo · 7 hours | Newsroom-style editor for journalists |
Pricing captured June 2026. Verify on each vendor’s pricing page before committing.
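To compare plans on equal footing, it helps to convert each one to an effective price per minute. The sketch below uses only the figures from the table above and assumes you use the full monthly allowance; Otter is left out because the table gives no minute allowance for its paid plan.

```python
# (monthly price in USD, included minutes per month), from the table above.
PLANS = {
    "DeluxeScribe": (10.00, 1200),
    "Descript": (24.00, 30 * 60),
    "Trint": (48.00, 7 * 60),
}
REV_PER_MINUTE = 0.25  # Rev's AI tier is already priced per minute


def cost_per_minute(monthly_price: float, included_minutes: int) -> float:
    """Effective per-minute price if the whole allowance is used."""
    return monthly_price / included_minutes


def effective_rates() -> dict:
    """Per-minute rates for every plan, cheapest first."""
    rates = {name: round(cost_per_minute(price, minutes), 4)
             for name, (price, minutes) in PLANS.items()}
    rates["Rev (AI tier)"] = REV_PER_MINUTE
    return dict(sorted(rates.items(), key=lambda item: item[1]))


for name, rate in effective_rates().items():
    print(f"{name}: ${rate:.4f}/min")
```

If you use only part of an allowance, the effective rate of a subscription rises accordingly, which is when pay-per-minute pricing can come out ahead.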
Path 2 — Free web tools
Sites like audiototext.com, freepodcasttranscription.com, audioconvert.ai, and dozens of others offer browser-based transcription with no signup. They’re convenient for a one-off 5-minute file.
The catch: “free” almost always means a paywall the upload widget doesn’t mention. Common patterns include a 10-30 minute file-length cap, a transcript-length truncation after the free portion, watermarked output, or a paywall that appears only after you’ve waited for processing.
When free web tools work: short English recordings under 10 minutes, you don’t need exports beyond plain text, you don’t care about speaker labels, and you’ll only do this occasionally.
Path 3 — Self-hosted Whisper
OpenAI’s Whisper model runs locally with one command if you have Python. The audio never leaves your machine.
pip install openai-whisper
whisper recording.mp3 --model large-v3 --output_format srt
Model size vs accuracy tradeoff: tiny and base are fast but rough; medium is a good default; large-v3 is state-of-the-art accuracy but ~10× slower on CPU.
Speed reality: on a typical CPU, expect processing to take 10-30× the length of the audio (a 1-hour file takes 10-30 hours). On a recent GPU it runs in real time or faster. Worth it for sensitive content; not worth it for a single 5-minute clip unless you already have the setup.
When Whisper wins: medical, legal, or confidential audio that can’t be uploaded; you process many files and have hardware; you need a specific Whisper variant (WhisperX for word-level timestamps + diarization, faster-whisper for speed).
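If you prefer to script Whisper rather than use the command line, the openai-whisper package also exposes a small Python API. The sketch below pairs it with a rough time estimate based on the 10-30× CPU figure quoted above. The helper names and the `medium` default are choices made for this example, and the package needs ffmpeg installed to decode audio.

```python
def estimate_cpu_hours(audio_minutes: float, slowdown_range=(10, 30)):
    """Rough (low, high) CPU processing time in hours.

    Uses the 10-30x figure quoted above; real speed depends on the
    model size and the hardware.
    """
    low, high = slowdown_range
    return (audio_minutes * low / 60, audio_minutes * high / 60)


def transcribe_locally(path: str, model_name: str = "medium"):
    """Transcribe one file on this machine and return (text, segments)."""
    # Imported lazily so the estimator works without the package installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return result["text"], result["segments"]


low, high = estimate_cpu_hours(60)
print(f"A 1-hour file may take {low:.0f}-{high:.0f} hours on CPU")
```

Calling `transcribe_locally("recording.mp3")` downloads the model weights on first use and then runs entirely offline.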
Path 4 — Native OS options
Your operating system probably has built-in transcription you don’t have to pay for. These built-in options are language-limited and don’t do everything a SaaS service does, but for simple cases they’re free.
- iPhone Voice Memos (iOS 18+) — on-device transcription of new recordings, 10 supported languages, iPhone 12 or newer. See our voice memo guide.
- macOS Voice Memos — same engine as iOS, syncs via iCloud, works on Apple Silicon Macs.
- Microsoft Word Transcribe — included with a Microsoft 365 subscription, web-only, supports MP3/WAV/M4A/MP4, English plus several other languages. Records live or accepts file upload.
- Google Docs voice typing — live transcription only (no file upload), 100+ languages, free with a Google account. Best for live note-taking, not recorded files.
- Pixel Recorder (Android) — on-device transcription on Pixel 3+, English plus other languages depending on the Pixel generation.
How accurate is AI audio transcription, really?
Every vendor’s landing page claims “99% accuracy.” That number is real — on the easiest possible condition. On everything else, accuracy drops in predictable ways.
Word Error Rate (WER) is the standard metric: the percentage of words the transcript gets wrong (substitutions, insertions, and deletions combined). A 5% WER means 5 out of every 100 words are wrong — readable, but you’ll see errors. A 25% WER means 1 in 4 words is wrong — you have to re-listen to half the recording to be sure of anything.
| Audio condition | Typical WER | Reads like |
|---|---|---|
| Studio mic, one speaker, no music, clear English | 2-5% | Almost perfect; light proofreading |
| Conference room, 2-3 speakers, good mics | 5-10% | Useful as-is; spot-check names |
| Zoom call, mixed mics, 4+ speakers | 10-20% | Readable, requires editing for publication |
| Phone call (8 kHz narrow-band codec) | 15-30% | Gist is clear, details unreliable |
| Heavy accent + background noise | 20-40% | Need to re-listen to verify key points |
| Music + speech overlap | Often fails | Model may hallucinate lyrics or skip sections |
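You can measure WER on your own audio by correcting a short passage by hand and comparing it with the machine output. The sketch below is a minimal word-level edit distance. Lowercasing and whitespace splitting are simplifying assumptions; published benchmarks usually apply more careful text normalization (punctuation, numerals) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("take fifteen milligrams twice a day",
                      "take fifty milligrams twice a day")
print(f"WER: {wer:.1%}")
```

One wrong word out of six gives a WER of about 17%, yet the error changes the dose. The metric counts mistakes and says nothing about which ones matter. Because insertions are counted against the reference length, WER can also exceed 100% on very bad output.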
The errors look the same across providers
Whatever service you use, the same categories of words get mis-heard:
- Proper nouns — people’s names, business names, street names, product names
- Phone numbers — particularly when spoken fast or with non-standard groupings
- Technical jargon — drug names, legal terms, scientific terminology
- Homophones — their/there/they’re, two/to/too, principal/principle
- Numbers with units — “15 milligrams” vs “50 milligrams,” “eighty” vs “18”
Always spot-check these regardless of how accurate the provider claims to be.
A 5-step workflow that works
1. Pick the right tool for the audio condition
Use the decision table above. A tool that doesn’t match the audio (a free web tool on a 2-hour multi-speaker meeting; SaaS for a 30-second voice memo) wastes time and money.
2. Upload the original file, not a compressed re-export
If you exported the recording from one app and re-encoded it (e.g., made an MP3 of an MP3), accuracy drops measurably. Use the original file when possible. If you have to convert, convert to lossless WAV or FLAC rather than re-encoding to another lossy format.
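When you do have to convert, ffmpeg can write FLAC without adding another lossy generation. The sketch below builds and runs the command from Python. The helper names are made up for this example, and it assumes ffmpeg is installed and on your PATH.

```python
import subprocess
from pathlib import Path


def build_flac_command(source: str) -> list[str]:
    """Build an ffmpeg command that writes <source name>.flac next to the source."""
    src = Path(source)
    dst = src.with_suffix(".flac")
    # -n: never overwrite an existing file; -vn: drop any video stream;
    # -c:a flac: encode the audio losslessly.
    return ["ffmpeg", "-n", "-i", str(src), "-vn", "-c:a", "flac", str(dst)]


def convert_to_flac(source: str) -> Path:
    """Run the conversion and return the path of the new file."""
    command = build_flac_command(source)
    subprocess.run(command, check=True)
    return Path(command[-1])


print(" ".join(build_flac_command("meeting.m4a")))
```

Converting a lossy file to FLAC does not restore quality that was already lost; it only prevents a second round of lossy compression.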
3. Enable speaker labels only if you have 2+ speakers
Diarization (figuring out who said what) costs a small amount of accuracy and adds processing time. For solo recordings — voice memos, narration, single-speaker lectures — skip it. For interviews, meetings, and panel discussions, you need it, but expect a 5-15% speaker error rate (wrong attributions) even on good services.
4. Spot-check the high-error parts
Read through the transcript looking specifically at proper nouns, phone numbers, technical terms, and numbers with units. These are the most-mis-heard categories regardless of provider. Fix them in the editor before relying on the transcript for anything that matters.
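Part of this spot-check can be automated by flagging the words most worth a second listen. The sketch below is a rough heuristic, not a vendor feature: the word lists are small illustrative samples, and treating any capitalized word after the start of a sentence as a possible proper noun is an assumption that will produce some false positives.

```python
import re

# Easily confused number words, e.g. "fifteen" vs "fifty".
NUMBER_WORDS = {
    "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
    "nineteen", "thirty", "forty", "fifty", "sixty", "seventy", "eighty",
    "ninety",
}
HOMOPHONES = {"their", "there", "they're", "principal", "principle"}


def flag_for_review(transcript: str) -> list[str]:
    """Return words worth re-checking against the audio, in reading order."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", transcript.strip()):
        words = re.findall(r"[A-Za-z0-9'’-]+", sentence)
        for position, word in enumerate(words):
            if re.fullmatch(r"I(['’]\w+)?", word):
                continue  # the pronoun "I" is capitalized but not a name
            lowered = word.lower()
            if any(ch.isdigit() for ch in word):
                flagged.append(word)  # digits: phone numbers, doses, dates
            elif lowered in NUMBER_WORDS or lowered in HOMOPHONES:
                flagged.append(word)
            elif position > 0 and word[0].isupper():
                flagged.append(word)  # possible proper noun
    return flagged


print(flag_for_review("Please send 15 milligrams to Maria at their office."))
```

The output is a checklist for listening, not a list of errors. A flagged word may be perfectly correct, and names at the start of a sentence are missed by this heuristic.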
5. Export to the right format
- TXT — simple sharing, copy-paste, email
- DOCX — editing in Word, formatted quoting, handing off to an editor
- PDF — distribution where formatting should be locked
- SRT or VTT — subtitle files for video. See our SRT generator guide.
- JSON — structured data with word-level timestamps, for tooling pipelines or building your own UI
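If your tool gives you JSON with timestamps but no subtitle export, converting it to SRT takes only a few lines. The sketch below assumes each segment is a dictionary with `start` and `end` in seconds plus a `text` field, which is the shape of Whisper's `segments` output; other tools may name these fields differently.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Turn timestamped segments into the text of an SRT file."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Cues are separated by one blank line.
    return "\n".join(blocks)


example = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 5.0, "text": " Thanks for joining."},
]
print(segments_to_srt(example))
```

VTT is close to this format: it starts with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds.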
Privacy and where your audio is processed
What happens to your audio depends entirely on which path you choose:
- Cloud SaaS — audio uploaded to the vendor’s servers, processed there, transcript returned. Some vendors retain the audio and transcripts indefinitely; some delete after processing; some let you configure retention. Always check the vendor’s data retention policy.
- Free web tools — same as SaaS in terms of upload, but data handling is often less transparent. If the site doesn’t publish a privacy policy, assume the worst.
- Self-hosted Whisper — nothing leaves your machine. The only privacy concern is local file access and disk encryption.
- Native OS — on-device options (iPhone, Pixel, M-series Mac) process locally. Microsoft Word Transcribe uploads to Microsoft servers; Google Docs voice typing processes through Google. Check each platform’s docs.
DeluxeScribe’s privacy stance
Audio is encrypted in transit (TLS) and at rest. We don’t use your audio or transcripts to train models. You can delete recordings and transcripts at any time. We are not HIPAA-compliant — do not upload Protected Health Information. For PHI, use a vendor with a signed BAA, or self-host Whisper. For attorney-client privileged content, check with your firm before uploading to any cloud service.
GDPR considerations
If your audio contains personal data of EU residents, you’re a data controller and need a lawful basis for processing. Most cloud transcription services act as processors under your control. Some publish a Data Processing Addendum (DPA); ask if you can’t find one.
For developers — APIs to know
If you’re building transcription into your own product rather than transcribing for personal use, the SaaS landing pages aren’t what you want. Here are the APIs to evaluate:
| Provider | Pricing (June 2026) | Notable strength |
|---|---|---|
| AssemblyAI | $0.12-0.37/hr | Generous free credits, strong diarization, good DX |
| Deepgram | $0.12-0.43/hr (Nova-3) | Lowest latency, real-time streaming, telephony focus |
| OpenAI Whisper API | $0.006/min (≈ $0.36/hr) | Whisper hosted; simple drop-in |
| Speechmatics | Custom pricing | Best-in-class accuracy on accented English |
| AWS Transcribe | $1.44/hr standard | Native AWS integration, large enterprise contracts |
| Google Cloud Speech-to-Text | $0.024/min Chirp-2 (≈ $1.44/hr) | Integrates with Google Cloud stack |
| Azure Speech | $1.00/hr standard | Microsoft enterprise integration |
DeluxeScribe is not an API. If you’re building a product, use AssemblyAI or Deepgram. If you need real-time streaming for telephony, Deepgram. If you want the simplest possible integration and don’t care about advanced features, the OpenAI Whisper API. For everything else, AssemblyAI is the safe default.
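Because the table mixes per-minute and per-hour prices, a quick way to compare providers is to convert everything to dollars per hour and multiply by your expected volume. The sketch below uses only the list prices from the table (June 2026) and ignores free credits, volume discounts, and add-on features. Speechmatics is omitted because its pricing is custom.

```python
# USD per audio hour as (low, high); per-minute prices are multiplied by 60.
HOURLY_PRICES = {
    "AssemblyAI": (0.12, 0.37),
    "Deepgram (Nova-3)": (0.12, 0.43),
    "OpenAI Whisper API": (0.006 * 60, 0.006 * 60),
    "AWS Transcribe": (1.44, 1.44),
    "Google Cloud Speech-to-Text (Chirp-2)": (0.024 * 60, 0.024 * 60),
    "Azure Speech": (1.00, 1.00),
}


def monthly_cost(hours_per_month: float) -> dict:
    """Estimated (low, high) monthly cost in USD for each provider."""
    return {
        name: (round(low * hours_per_month, 2), round(high * hours_per_month, 2))
        for name, (low, high) in HOURLY_PRICES.items()
    }


for name, (low, high) in monthly_cost(100).items():
    print(f"{name}: ${low:.2f} to ${high:.2f} per month")
```

At 100 hours a month the spread runs from about $12 to $144, so it is worth re-running the numbers with current prices before picking a provider.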
By source — specific guides
This page is the overview. For specific file formats or recording sources, we have dedicated guides with the exact steps, gotchas, and tools for each:
- MP3 to Text — the most common audio format; accuracy by recording type; free options
- M4A to Text — Apple’s default format; iOS 18 built-in vs export path
- MP4 to Text — video files; ffmpeg audio extraction; YouTube workaround
- Voice Memo to Text — iPhone Voice Memos app; iOS 18 built-in; older device workarounds
- Voicemail to Text — incoming voicemails; Visual Voicemail; carrier-specific setup
- Podcast Transcription — listener and podcaster workflows; Podcasting 2.0 spec; show notes
- SRT Generator — subtitle file creation; timing rules; SRT vs VTT
Related guides
- MP3 to Text: Format-specific guide for the most common audio format. Free options that actually work.
- M4A to Text: Apple's format. Covers iOS 18 built-in transcription and the export path for older devices.
- Voice Memo to Text: Recordings you made yourself on iPhone or Android. Different gotchas than file-format transcription.
- Podcast Transcription: Listener and podcaster workflows, the Podcasting 2.0 spec, and the show-notes pipeline.