How to Transcribe Audio: Every Way That Actually Works in 2026

Four paths, a decision tree to pick the right one, and honest accuracy expectations by audio condition.

Four paths exist: upload to a SaaS service (easiest, paid above a free tier), use a free web tool (good for one-off short files, hidden paywalls common), run Whisper on your own machine (free, requires Python), or use your operating system’s built-in transcription (macOS Voice Memos, Microsoft Word, Pixel Recorder). Pick based on the file you have, the accuracy you need, and whether you can upload the audio. DeluxeScribe is one of the SaaS options: 99 languages, 60 minutes free, no credit card. Below: the decision tree, honest accuracy by condition (not the usual “99%” blanket claim), the 5-step workflow, and links to format-specific guides for your exact case.
  • 60 minutes free
  • No credit card
  • 99 languages
  • Speaker labels

Last verified June 25, 2026

TL;DR — pick your path

Find the row that fits your situation and follow the recommended path. The sections below explain each path in detail.

What you have / want | Recommended path | Cost
An MP3 from a podcast or meeting | SaaS service · see MP3 to Text | Free tier → ~$10/mo
iPhone voice memo (recording you made) | iOS 18+ built-in or service · see Voice Memo to Text | Free → $10/mo
iPhone voicemail (someone left it for you) | Carrier Visual Voicemail or service · see Voicemail to Text | Free → $10/mo
Zoom recording, MP4 video file | SaaS service · see MP4 to Text | Free tier → ~$10/mo
M4A file (any source) | SaaS service · see M4A to Text | Free tier → ~$10/mo
Podcast episode (yours or someone else’s) | See Podcast Transcription | Free tier → ~$10/mo
Need subtitle files (SRT/VTT) | See SRT Generator | Free tier → ~$10/mo
Sensitive audio (can’t upload to cloud) | Self-hosted Whisper | Free
Building transcription into your own product | AssemblyAI or Deepgram API | $0.12-0.40/hr

The four paths, compared

Path 1 — SaaS transcription services

Upload audio in a browser, get a transcript back in minutes. The easiest path for most use cases. Free tiers exist; paid plans start around $10/month.

When SaaS wins: you have multiple files, you need a language other than English, you want speaker labels without setup, you need an in-browser editor to fix errors, or you want exports in specific formats (SRT/VTT for video, DOCX/PDF for sharing).

Honest tool comparison:

Service | Free tier | Paid from | Best for
DeluxeScribe | 60 min one-time | $10/mo · 1,200 min | Multi-language, cheapest per minute, batch jobs
Otter | 300 min/mo | $17/mo | Meeting-style use, calendar integration
Rev (AI tier) | None | $0.25/min | Pay-per-minute, optional human review tier ($1.50/min)
Descript | 1 hour/mo | $24/mo · 30 hours | Edit audio by editing the transcript text
Trint | None | $48/mo · 7 hours | Newsroom-style editor for journalists

Pricing captured June 2026. Verify on each vendor’s pricing page before committing.
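To put the plans on one scale, divide the monthly price by the included minutes. The sketch below uses only the figures from the table above; the `plans` dictionary and the function name are illustrative, and Otter is left out because the table lists no minute allowance for its paid plan.

```python
# Effective price per included minute, using the June 2026 figures from the table.
# Otter is omitted: the table lists no minute allowance for its paid plan.
plans = {
    "DeluxeScribe": (10.00, 1200),   # ($ per month, included minutes)
    "Descript": (24.00, 30 * 60),    # 30 hours
    "Trint": (48.00, 7 * 60),        # 7 hours
}


def price_per_minute(monthly_price: float, included_minutes: int) -> float:
    """Monthly price divided by included minutes, rounded to 4 decimals."""
    return round(monthly_price / included_minutes, 4)


per_minute = {name: price_per_minute(*plan) for name, plan in plans.items()}
per_minute["Rev (AI tier)"] = 0.25  # already priced per minute

# Print cheapest first.
for name, cost in sorted(per_minute.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:.4f}/min")
```

These are best-case numbers that assume you use the whole monthly allowance; a light user pays more per minute actually transcribed.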

Path 2 — Free web tools

Sites like audiototext.com, freepodcasttranscription.com, audioconvert.ai, and dozens of others offer browser-based transcription with no signup. They’re convenient for a one-off 5-minute file.

The catch: “free” almost always means a paywall the upload widget doesn’t mention. Common patterns include a 10-30 minute file-length cap, a transcript-length truncation after the free portion, watermarked output, or a paywall that appears only after you’ve waited for processing.

When free web tools work: short English recordings under 10 minutes, you don’t need exports beyond plain text, you don’t care about speaker labels, and you’ll only do this occasionally.

Path 3 — Self-hosted Whisper

OpenAI’s Whisper model runs locally with two commands if you have Python and ffmpeg installed. The audio never leaves your machine.

pip install openai-whisper
whisper recording.mp3 --model large-v3 --output_format srt

Model size vs accuracy tradeoff: tiny and base are fast but rough; medium is a good default; large-v3 is state-of-the-art accuracy but ~10× slower on CPU.

Speed reality: on a typical CPU, expect processing to run 10-30× slower than real time (a 1-hour file takes 10-30 hours). On a recent GPU it runs at real time or faster. Worth it for sensitive content; not worth it for a single 5-minute clip unless you already have the setup.
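The arithmetic behind that estimate is simple enough to script when you are budgeting a batch. This is a rough sketch: the function name is made up for illustration, and the slowdown factors are the article's CPU range, not measurements from your hardware.

```python
def local_processing_hours(audio_minutes: float, slowdown: float) -> float:
    """Hours of compute for a file; slowdown=10 means 10x slower than real time."""
    return audio_minutes * slowdown / 60


# The article's CPU range applied to a 1-hour recording.
for slowdown in (10, 30):
    hours = local_processing_hours(60, slowdown)
    print(f"1-hour file at {slowdown}x slower than real time: {hours:.0f} hours")
```

Time a short sample file on your own machine first, then plug the measured slowdown into the function before committing to a long queue.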

When Whisper wins: medical, legal, or confidential audio that can’t be uploaded; you process many files and have hardware; you need a specific Whisper variant (WhisperX for word-level timestamps + diarization, faster-whisper for speed).

Path 4 — Native OS options

Your operating system probably has built-in transcription you don’t have to pay for. They’re language-limited and don’t do everything a SaaS service does, but for simple cases they’re free.

  • iPhone Voice Memos (iOS 18+) — on-device transcription of new recordings, 10 supported languages, iPhone 12 or newer. See our voice memo guide.
  • macOS Voice Memos — same engine as iOS, syncs via iCloud, works on Apple Silicon Macs.
  • Microsoft Word Transcribe — included with Microsoft 365 subscription, web-only, supports MP3/WAV/M4A/MP4, English plus several other languages. Records live or accepts file upload.
  • Google Docs voice typing — live transcription only (no file upload), 100+ languages, free with a Google account. Best for live note-taking, not recorded files.
  • Pixel Recorder (Android) — on-device transcription on Pixel 3+, English plus other languages depending on Pixel gen.

How accurate is AI audio transcription, really?

Every vendor’s landing page claims “99% accuracy.” That number is real — on the easiest possible condition. On everything else, accuracy drops in predictable ways.

Word Error Rate (WER) is the standard metric: the number of word-level errors in the transcript (substitutions, insertions, and deletions combined) as a percentage of the words actually spoken. A 5% WER means 5 out of every 100 words are wrong: readable, but you’ll see errors. A 25% WER means 1 in 4 words is wrong, and you have to re-listen to half the recording to be sure of anything.
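As a formula, WER = (S + D + I) / N, where S, D, and I count substituted, deleted, and inserted words and N is the number of words in the reference (human-checked) transcript. A minimal sketch of that calculation using a word-level edit distance follows. The function name and the lowercase-and-split normalization are choices made for this illustration; published benchmarks usually also normalize punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("fifteen" heard as "fifty") in a six-word reference.
wer = word_error_rate(
    "take fifteen milligrams twice a day",
    "take fifty milligrams twice a day",
)
print(f"WER: {wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% on a transcript that contains many words nobody said.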

Audio condition | Typical WER | Reads like
Studio mic, one speaker, no music, clear English | 2-5% | Almost perfect; light proofreading
Conference room, 2-3 speakers, good mics | 5-10% | Useful as-is; spot-check names
Zoom call, mixed mics, 4+ speakers | 10-20% | Readable, requires editing for publication
Phone call (8 kHz narrow-band codec) | 15-30% | Gist is clear, details unreliable
Heavy accent + background noise | 20-40% | Need to re-listen to verify key points
Music + speech overlap | Often fails | Model may hallucinate lyrics or skip sections

The errors look the same across providers

Whatever service you use, the same categories of words get mis-heard:

  • Proper nouns — people’s names, business names, street names, product names
  • Phone numbers — particularly when spoken fast or with non-standard groupings
  • Technical jargon — drug names, legal terms, scientific terminology
  • Homophones — their/there/they’re, two/to/too, principal/principle
  • Numbers with units — “15 milligrams” vs “50 milligrams,” “eighty” vs “eighteen”

Always spot-check these regardless of how accurate the provider claims to be.

A 5-step workflow that works

1. Pick the right tool for the audio condition

Use the decision table above. A mismatched tool (a free web tool on a 2-hour multi-speaker meeting; SaaS for a 30-second voice memo) wastes time and money.

2. Upload the original file, not a compressed re-export

If you exported the recording from one app and re-encoded it (e.g., made an MP3 of an MP3), accuracy drops measurably. Use the original file when possible. If you have to convert, convert to lossless WAV or FLAC rather than re-encoding to another lossy format.
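If a conversion is unavoidable, ffmpeg can write a FLAC copy without another lossy encode. The sketch below only builds and runs the command. It assumes ffmpeg is installed and on your PATH, and the helper names and the file name `interview.m4a` are hypothetical.

```python
import subprocess
from pathlib import Path


def flac_command(source: str) -> list[str]:
    """Build an ffmpeg command that writes a FLAC file next to `source`."""
    target = Path(source).with_suffix(".flac")
    # -vn drops any video stream; -c:a flac selects the lossless FLAC encoder.
    return ["ffmpeg", "-i", source, "-vn", "-c:a", "flac", str(target)]


def convert_to_flac(source: str) -> Path:
    """Run the conversion; raises CalledProcessError if ffmpeg exits with an error."""
    command = flac_command(source)
    subprocess.run(command, check=True)
    return Path(command[-1])


# Show the command without running it.
print(" ".join(flac_command("interview.m4a")))
```

Converting to a lossless format does not restore detail the original encode already discarded; it only avoids losing more.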

3. Enable speaker labels only if you have 2+ speakers

Diarization (figuring out who said what) costs a small amount of accuracy and adds processing time. For solo recordings — voice memos, narration, single-speaker lectures — skip it. For interviews, meetings, and panel discussions, you need it, but expect 5-15% speaker error rate (wrong attributions) even on good services.

4. Spot-check the high-error parts

Read through the transcript looking specifically at proper nouns, phone numbers, technical terms, and numbers with units. These are the most-mis-heard categories regardless of provider. Fix them in the editor before relying on the transcript for anything that matters.
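Part of that spot-check can be pre-sorted with a crude script that lists numbers and likely proper nouns so you know where to look first. This is a rough heuristic for English text, not a replacement for listening: the patterns and the function name are illustrative, it misses names at the start of a sentence, and it cannot detect homophones or jargon.

```python
import re

# Digits, optionally joined by separators (phone numbers, times, decimals).
NUMBER = re.compile(r"\b\d[\d,.:-]*\b")
# A capitalized word that follows a lowercase word: a crude proper-noun guess.
MID_SENTENCE_CAPITAL = re.compile(r"(?<=[a-z,;] )[A-Z][a-z]+")


def spot_check_candidates(transcript: str) -> list[str]:
    """Return numbers and likely proper nouns in order of appearance, no duplicates."""
    hits = [
        (match.start(), match.group())
        for pattern in (NUMBER, MID_SENTENCE_CAPITAL)
        for match in pattern.finditer(transcript)
    ]
    seen: set[str] = set()
    ordered: list[str] = []
    for _, text in sorted(hits):
        if text not in seen:
            seen.add(text)
            ordered.append(text)
    return ordered


sample = "Please call Maria Alvarez at 555-0142 about the 15 milligrams dose."
for item in spot_check_candidates(sample):
    print("check:", item)
```

Treat the output as a reading list for the audio, not as a list of confirmed errors.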

5. Export to the right format

  • TXT — simple sharing, copy-paste, email
  • DOCX — editing in Word, formatted quoting, handing off to an editor
  • PDF — distribution where formatting should be locked
  • SRT or VTT — subtitle files for video. See our SRT generator guide.
  • JSON — structured data with word-level timestamps, for tooling pipelines or building your own UI
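If a tool gives you only timestamped segments, an SRT file is easy to assemble yourself. The sketch below assumes each segment is a dictionary with `start` and `end` in seconds plus a `text` string, which is the shape the openai-whisper Python package returns; other tools may use different keys, so treat that as an assumption. The function names are illustrative.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before the milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{timing}\n{seg['text'].strip()}\n")
    return "\n".join(cues)


segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome back to the show."},
    {"start": 2.5, "end": 6.0, "text": " Today we are talking about transcription."},
]
print(to_srt(segments))
```

WebVTT is close to this but not identical: the file starts with a `WEBVTT` header line and timestamps use a period instead of a comma before the milliseconds.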

Run this workflow on a real file

60 minutes free, no credit card. Drop any MP3, M4A, MP4, WAV, or 17 other formats and get a transcript with speaker labels, timestamps, and 6 export options.

Privacy and where your audio is processed

What happens to your audio depends entirely on which path you choose:

  • Cloud SaaS — audio uploaded to the vendor’s servers, processed there, transcript returned. Some vendors retain the audio and transcripts indefinitely; some delete after processing; some let you configure retention. Always check the vendor’s data retention policy.
  • Free web tools — same as SaaS in terms of upload, but data handling is often less transparent. If the site doesn’t publish a privacy policy, assume the worst.
  • Self-hosted Whisper — nothing leaves your machine. The only privacy concern is local file access and disk encryption.
  • Native OS — on-device options (iPhone, Pixel, M-series Mac) process locally. Microsoft Word Transcribe uploads to Microsoft servers; Google Docs voice typing processes through Google. Check each platform’s docs.

DeluxeScribe’s privacy stance

Audio is encrypted in transit (TLS) and at rest. We don’t use your audio or transcripts to train models. You can delete recordings and transcripts at any time. We are not HIPAA-compliant — do not upload Protected Health Information. For PHI, use a vendor with a signed BAA, or self-host Whisper. For attorney-client privileged content, check with your firm before uploading to any cloud service.

GDPR considerations

If your audio contains personal data of EU residents, you’re a data controller and need a lawful basis for processing. Most cloud transcription services act as processors under your control. Some publish a Data Processing Addendum (DPA); ask if you can’t find one.

For developers — APIs to know

If you’re building transcription into your own product rather than transcribing for personal use, the SaaS landing pages aren’t what you want. Here are the APIs to evaluate:

Provider | Pricing (June 2026) | Notable strength
AssemblyAI | $0.12-0.37/hr | Generous free credits, strong diarization, good DX
Deepgram | $0.12-0.43/hr (Nova-3) | Lowest latency, real-time streaming, telephony focus
OpenAI Whisper API | $0.006/min | Whisper hosted; simple drop-in
Speechmatics | Custom pricing | Best-in-class accuracy on accented English
AWS Transcribe | $1.44/hr standard | Native AWS integration, large enterprise contracts
Google Cloud Speech-to-Text | $0.024/min Chirp-2 | Integrates with Google Cloud stack
Azure Speech | $1.00/hr standard | Microsoft enterprise integration

DeluxeScribe is not an API. If you’re building a product, use AssemblyAI or Deepgram. If you need real-time streaming for telephony, Deepgram. If you want the simplest possible integration and don’t care about advanced features, the OpenAI Whisper API. For everything else, AssemblyAI is the safe default.
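The table mixes per-minute and per-hour prices, which makes quick comparison awkward. The sketch below converts everything to dollars per hour of audio and estimates a monthly bill. It uses only the June 2026 list prices from the table (the low end where a range is given), skips Speechmatics because its pricing is custom, and the names `PRICES`, `per_hour`, and `monthly_cost` are illustrative.

```python
# (price, unit) pairs from the table; ranges use their low end.
PRICES = {
    "AssemblyAI": (0.12, "hour"),
    "Deepgram (Nova-3)": (0.12, "hour"),
    "OpenAI Whisper API": (0.006, "minute"),
    "AWS Transcribe": (1.44, "hour"),
    "Google Cloud Speech-to-Text (Chirp-2)": (0.024, "minute"),
    "Azure Speech": (1.00, "hour"),
}


def per_hour(price: float, unit: str) -> float:
    """Convert a list price to dollars per hour of audio."""
    if unit == "minute":
        return round(price * 60, 2)
    if unit == "hour":
        return round(price, 2)
    raise ValueError(f"unknown unit: {unit}")


def monthly_cost(provider: str, audio_hours: float) -> float:
    """Estimated bill for a month with `audio_hours` of audio at list price."""
    return round(per_hour(*PRICES[provider]) * audio_hours, 2)


for name, (price, unit) in sorted(PRICES.items(), key=lambda kv: per_hour(*kv[1])):
    print(f"{name}: ${per_hour(price, unit):.2f}/hr, "
          f"${monthly_cost(name, 100):.2f} for 100 hours")
```

On the same scale, the OpenAI Whisper API's $0.006/min works out to $0.36/hr, and Google's $0.024/min works out to the same $1.44/hr as AWS Transcribe's standard rate. Real invoices depend on volume tiers and add-on features, so check each pricing page.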

By source — specific guides

This page is the overview. For specific file formats or recording sources, we have dedicated guides with the exact steps, gotchas, and tools for each:

  • MP3 to Text — the most common audio format; accuracy by recording type; free options
  • M4A to Text — Apple’s default format; iOS 18 built-in vs export path
  • MP4 to Text — video files; ffmpeg audio extraction; YouTube workaround
  • Voice Memo to Text — iPhone Voice Memos app; iOS 18 built-in; older device workarounds
  • Voicemail to Text — incoming voicemails; Visual Voicemail; carrier-specific setup
  • Podcast Transcription — listener and podcaster workflows; Podcasting 2.0 spec; show notes
  • SRT Generator — subtitle file creation; timing rules; SRT vs VTT

How this page was verified

Word Error Rate (WER) ranges in the accuracy table come from published benchmarks on the LibriSpeech corpus (clean studio audio) and the AMI Meeting Corpus (multi-speaker meeting audio), generalized across modern transformer-based ASR models. Whisper command-line behavior comes from the OpenAI Whisper GitHub repository. Apple Voice Memos transcription requirements come from Apple Support. Microsoft Word Transcribe is documented at Microsoft Support. API pricing was captured June 2026 from AssemblyAI and Deepgram pricing pages. We don’t cite the vendor “99% accuracy” claims that appear across competitor copy because they aren’t sourced to a published study.

Frequently Asked Questions

What's the best way to transcribe audio?

It depends on your file and your needs. For a one-off short English recording, your device probably has it built in (Apple Voice Memos on iOS 18+ or an Apple Silicon Mac, Microsoft Word Transcribe for 365 subscribers, Pixel Recorder on Pixel phones). For multi-language, multi-speaker, or longer files, a SaaS service is faster and more accurate. For sensitive content you don't want to upload, run Whisper locally. For building a product, use an API like AssemblyAI or Deepgram.

How accurate is AI audio transcription?

On clean studio audio with one English speaker, modern services hit 95-98% word accuracy. On a Zoom call with 4 speakers and mixed mics, expect 80-90%. On phone call audio (narrow-band codec), 70-85%. On audio with music or heavy background noise, often much worse. The '99% accurate' marketing claim is real only on the easiest condition — published WER benchmarks back this up.

Can I transcribe audio for free?

Yes. Apple Voice Memos on iOS 18+ or an Apple Silicon Mac is free and on-device. Pixel Recorder on Pixel phones is free. Microsoft 365 subscribers get Word's Transcribe feature included. Self-hosted Whisper is free (requires Python). Most SaaS services offer a free tier; DeluxeScribe gives 60 minutes free, no credit card. Free web tools commonly cap files at 5-10 minutes before paywalling, which their landing pages don't make obvious.

What audio formats can be transcribed?

Most modern services accept MP3, WAV, M4A, AAC, OGG, FLAC, OPUS for audio, and MP4, MOV, AVI, MKV, WebM for video (audio track extracted automatically). DeluxeScribe accepts 20+ formats. The original file format barely affects accuracy — what matters is the audio quality at the time of recording, not the container.

Does my audio get sent to a third party?

With any cloud SaaS service, yes — your audio is uploaded and processed on the vendor's servers. Some store the audio and transcripts; some delete after processing. Check each vendor's data retention policy. For audio you can't upload (medical records, attorney-client privileged content, NDA-covered recordings), use self-hosted Whisper — nothing leaves your machine. DeluxeScribe encrypts uploads in transit, doesn't use your audio to train models, but is not HIPAA-compliant for clinical use.

How long does AI transcription take?

Cloud services typically finish a 1-hour file in 3-10 minutes. Self-hosted Whisper on a CPU is much slower (10-30× slower than real time, so a 1-hour file takes 10-30 hours). On a recent GPU it's near real time. Native OS transcription on iPhone/Pixel happens in the background while you continue using your phone, taking roughly as long as the audio itself.

Do I need speaker labels (diarization)?

Only if you have 2 or more speakers. Speaker labels cost a small amount of accuracy and add processing time. For solo recordings (voice memos, lectures with a single speaker, narration), skip them. For interviews, meetings, and panel discussions, you need them — but expect 5-15% speaker error rate (wrong attributions) even on good services, higher on hybrid in-room / remote recordings.

What's the difference between SaaS, free tools, and self-hosted Whisper?

SaaS (DeluxeScribe, Otter, Rev): easiest, fastest, supports many languages, costs money beyond a free tier. Free web tools (audiototext.com and similar): work for one-off short files, usually have hidden paywalls and shorter limits. Self-hosted Whisper: free forever, fully private (no upload), requires a Python command and decent hardware. Native OS (Apple Voice Memos, Microsoft Word, Pixel Recorder): free, easy, but limited language support and bound to specific platforms.