How Accurate Is OpenAI Whisper in 2026? (Real-World WER by Condition and Language)
Whisper's accuracy varies by 10× between conditions. Here's the table nobody publishes — plus when a commercial ASR is the right call instead.
Last verified June 29, 2026
TL;DR — the headline numbers
One number can’t describe Whisper’s accuracy honestly. Here are the ranges by audio condition for English:
| Audio condition | Typical WER | Reads like |
|---|---|---|
| Studio English (LibriSpeech audiobook) | 2-5% | Near-perfect; light proofreading only |
| Clean podcast / well-mic'd single speaker | 3-7% | Publication-grade after light cleanup |
| Conference room meeting, good mics | 5-10% | Useful as-is; spot-check names |
| Zoom / Teams call with mixed mics | 10-15% | Readable; edit before publication |
| Phone call audio (8 kHz narrow-band codec) | 15-25% | Gist clear; numbers and names unreliable |
| Heavy accent + background noise | 20-30% | Verify key points against audio |
| Music overlapping with speech | 30%+ (often fails) | Model may hallucinate lyrics |
| Multilingual: top languages | 5-15% | Comparable to English on clean audio |
| Multilingual: low-resource languages | 30%+ | Often unusable; check alternatives |
What “accurate” means — Word Error Rate explained
Word Error Rate (WER) is the standard accuracy metric for speech-to-text. It measures the percentage of words the transcript gets wrong, counting substitutions, insertions, and deletions equally.
- 5% WER = 5 wrong words per 100 = reads cleanly; light proofreading
- 10% WER = 1 in 10 wrong = readable but edit-required before publication
- 25% WER = 1 in 4 wrong = unreliable for direct quoting; you need to re-listen
The metric’s blind spot: WER weights every error equally. One missed phone number in a voicemail is functionally worse than 10 missed filler words, but WER doesn’t see that. Always spot-check proper nouns, numbers, and named entities: they’re the highest-impact error sites regardless of which ASR you use.
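The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference scorer: the function name `word_error_rate` is ours, and the only normalization is lowercasing and whitespace splitting (real evaluations usually also normalize punctuation and number formatting).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
wer = word_error_rate(reference, hypothesis)  # 1 substitution + 1 deletion over 9 words
```

Because insertions count as errors, WER can exceed 100% on a transcript with many invented words. Note also that a single wrong digit in a phone number scores exactly the same as a single wrong filler word.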
Whisper model sizes — accuracy/speed tradeoff
OpenAI publishes Whisper in six sizes. Production deployments mostly use large-v3 or large-v3-turbo; smaller sizes are for edge deployment or hardware-constrained scenarios.
| Model | Parameters | VRAM | Relative speed | Use when |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | ~32× real-time on CPU | Toy demos; not production |
| base | 74M | ~1 GB | ~16× real-time on CPU | Edge / mobile prototypes |
| small | 244M | ~2 GB | ~6× real-time on CPU | Resource-constrained server |
| medium | 769M | ~5 GB | ~2× real-time on CPU | Acceptable production fallback |
| large-v3 | 1.55B | ~10 GB | ~1× real-time on GPU | Maximum accuracy; use when accuracy outranks speed |
| large-v3-turbo | 809M | ~6 GB | ~8× real-time on GPU | Best general-purpose; ~1% WER trade for 8× speed |
Most production systems use large-v3-turbo: the ~8× speed gain justifies the small accuracy cost (about 1 percentage point of WER) in almost every batch-transcription scenario.
WER by audio condition (the table nobody else publishes)
Whisper’s headline benchmark numbers are honest but misleading — they reflect the LibriSpeech corpus (clean-read audiobook English). Real audio looks very different. Here’s the breakdown by realistic condition, with the reason each one degrades accuracy:
| Condition | Typical WER | Why it degrades |
|---|---|---|
| Studio podcast (single speaker, $300+ mic) | 2-5% | Baseline — closest to LibriSpeech training distribution |
| Conference room meeting (good ceiling mics) | 5-10% | Multi-speaker turn-taking; some cross-talk |
| Zoom / Teams call (consumer mics) | 10-15% | Codec compression, mixed mic quality, occasional overlap |
| Cellular phone audio (G.711, AMR) | 15-25% | 8 kHz narrow-band sampling strips consonant frequencies |
| Voice memo on iPhone (close-mic, single speaker) | 5-12% | Close-mic helps; phone's ambient noise hurts |
| Field recording with wind / traffic | 15-30% | Low signal-to-noise ratio confuses the model |
| Lecture hall (distant mic, large room) | 10-20% | Reverberation and audience noise |
| Music + speech overlap (interview with score) | 30%+ | Model often hallucinates lyrics or skips sections |
| Multi-speaker overlap (3+ talking simultaneously) | 20-40% | Whisper has no diarization; output garbles |
The takeaway: if your audio is studio quality, expect Whisper’s headline numbers. If it’s anything else, plan for 2-5× higher error rates. Phone numbers, named entities, and technical jargon are mis-heard at higher rates than common words regardless of condition.
WER by language
Whisper supports 99 languages, but accuracy varies sharply. The OpenAI Whisper paper publishes WER on the Fleurs benchmark by language; the grouped summary:
Top tier (under 5% WER on Fleurs)
English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, Russian, Chinese (Mandarin). These languages have abundant training data and Whisper performs near-English levels on clean audio.
Mid tier (5-15% WER)
Arabic, Hindi, Turkish, Dutch, Polish, Vietnamese, Indonesian, Hebrew, Greek, Czech, Swedish, Danish, Finnish, Norwegian, Romanian, Bulgarian, Ukrainian, Thai, Catalan. Production-usable with editing.
Low tier (15-30%+ WER)
Lower-resource languages — many African languages, several Southeast Asian dialects, less-represented Indian languages, Welsh, Basque. Whisper covers them in name but results are often unusable without significant cleanup or domain-specific fine-tuning.
The multilingual leadership claim
On most multilingual benchmarks (Fleurs, Common Voice), Whisper outperforms Deepgram Nova-3 and AssemblyAI Universal-2 — both of those are English-first models that added multilingual support later. If your use case is multilingual, Whisper is typically the right starting point.
Whisper variants — which to actually use
“Whisper” is several things. The original OpenAI release is one implementation; the community has built faster, smaller, and more feature-rich variants on top of the same models.
| Variant | License | Best for | Speed vs reference | Notes |
|---|---|---|---|---|
| OpenAI Whisper API | Closed (hosted) | Simplest integration | Hosted (varies) | API runs large-v2 typically; not always large-v3 |
| OpenAI reference (open source) | MIT | Research / reference | 1× (baseline) | Reference implementation; not optimized for speed |
| faster-whisper (SYSTRAN) | MIT | Production self-hosting | ~4× faster | CTranslate2 backend; same accuracy as reference |
| whisper.cpp (ggerganov) | MIT | CPU / edge / mobile | ~2× faster on CPU | Pure C++; runs on phones via quantization |
| WhisperX (m-bain) | BSD-2-Clause | Production with diarization | ~7× faster | Adds word-level timestamps + pyannote diarization |
| distil-whisper | MIT | Batch processing at scale | ~6× faster | Distilled student model; ~1.5% WER trade-off |
| MLX Whisper | MIT | Apple Silicon Macs | ~3× faster on M-series | Optimized for Apple’s Metal performance shaders |
What most production systems actually use: faster-whisper or WhisperX, not the OpenAI reference implementation. The reference is for research; the optimized variants are for shipping. WhisperX is the right choice if you need speaker diarization without integrating pyannote separately.
Whisper vs commercial alternatives
Honest comparison across the dominant ASR options in 2026. WER numbers are from each vendor’s published benchmarks plus the Open ASR Leaderboard.
| Provider | LibriSpeech WER | Real-world meeting WER | Languages | Diarization | Self-host | Cost / min |
|---|---|---|---|---|---|---|
| Whisper large-v3 | ~2.7% | ~8-12% | 99 | Add WhisperX or pyannote | Yes (MIT) | $0.006 API / lower self-hosted |
| Deepgram Nova-3 | ~2.5% | ~6-10% | ~40 | Built-in | No (managed) | $0.0043 |
| AssemblyAI Universal-2 | ~2.4% | ~6-9% | ~35 | Built-in (strong) | No (managed) | $0.0062 |
| Google STT Chirp-2 | ~3.0% | ~9-12% | 125+ | Built-in | No (managed) | $0.024 |
| Speechmatics | ~2.6% | ~7-10% | ~50 | Built-in | Yes (enterprise) | $0.012 cloud |
The honest verdict
- Multilingual or self-host needed → Whisper. Nothing else covers 99 languages with an open-source license.
- Real-time streaming + telephony → Deepgram. Built for sub-300ms latency and phone-audio codecs.
- English meeting audio with strong speaker labels → AssemblyAI. Their diarization consistently outperforms WhisperX on multi-speaker real-world meetings.
- Already on Google Cloud → Chirp-2, despite higher cost — integration savings often dominate.
When Whisper fails — the failure modes
Hallucination during silence
Whisper has a documented and well-reproduced failure mode: during long silences or very low-volume audio, it invents text, often a repeated phrase like “Thank you for watching,” “Subtitles by the Amara.org community,” or a foreign-language sentence. The cause is the model’s training on YouTube-style content where these phrases follow silent sections in subtitle tracks.
A 2024 Stanford study documented hallucinations in 1.4% of Whisper transcripts of clinical audio, with some inventing entire fabricated medical content — a serious concern for healthcare use.
Mitigation: run Voice Activity Detection (VAD) preprocessing to skip silent segments, or use faster-whisper’s no_speech_threshold + vad_filter parameters to flag and drop low-confidence segments. WhisperX bundles VAD by default.
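A minimal sketch of that mitigation, assuming faster-whisper is installed. The `vad_filter` and `no_speech_threshold` arguments are faster-whisper's own; the helper names (`drop_suspect_segments`, `transcribe_with_vad`), the `Seg` stand-in type, the 0.6 cutoff, and the phrase list are illustrative assumptions, not a recommended configuration.

```python
from dataclasses import dataclass

# Phrases named above as typical silence hallucinations (assumed list; extend as needed).
KNOWN_HALLUCINATIONS = (
    "thank you for watching",
    "subtitles by the amara.org community",
)


@dataclass
class Seg:
    """Stand-in with the same fields we read from a faster-whisper segment."""
    start: float
    end: float
    text: str
    no_speech_prob: float


def drop_suspect_segments(segments, max_no_speech_prob=0.6):
    """Keep only segments that look like real speech."""
    kept = []
    for seg in segments:
        if seg.no_speech_prob > max_no_speech_prob:
            continue  # the model itself rates this window as probably silence
        normalized = seg.text.strip().lower().rstrip(".!")
        if normalized in KNOWN_HALLUCINATIONS:
            continue  # whole segment is a known boilerplate phrase
        kept.append(seg)
    return kept


def transcribe_with_vad(path, model_size="large-v3"):
    # Imported lazily so the filter above works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(
        path,
        vad_filter=True,          # skip non-speech regions before decoding
        no_speech_threshold=0.6,  # assumed value; tune on your own audio
    )
    return drop_suspect_segments(segments)
```

The phrase filter is deliberately blunt: it will also drop a speaker who genuinely says "Thank you for watching," so it suits meeting or clinical audio better than video content.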
Music + speech overlap
When music plays under speech, Whisper often produces lyrics-as-transcript or skips the segment entirely. The model wasn’t trained to separate sources.
Mitigation: source separation preprocessing (Spleeter, demucs) to isolate vocals before transcription.
Multi-speaker overlap (no built-in diarization)
Whisper produces a single text stream regardless of how many people are talking. When two speakers overlap, output garbles or drops one speaker entirely.
Mitigation: pair Whisper with pyannote-audio for speaker diarization, or use WhisperX which bundles both.
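Pairing the two usually comes down to running diarization separately and attaching a speaker label to each transcript segment by time overlap. The sketch below shows only that merge step in plain Python; the tuple layouts stand in for Whisper segments and diarization speaker turns and are assumptions for illustration, as is the function name `assign_speakers`.

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    segments: list of (start_sec, end_sec, text) from the transcriber
    turns:    list of (start_sec, end_sec, speaker) from the diarizer
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the intersection of the two time intervals (negative if disjoint).
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled


segments = [
    (0.0, 4.0, "Hi, thanks for joining."),
    (4.2, 9.0, "Happy to be here."),
    (20.0, 21.0, "Bye."),
]
turns = [(0.0, 4.1, "SPEAKER_00"), (4.1, 9.5, "SPEAKER_01")]
labeled = assign_speakers(segments, turns)
```

Segments with no overlapping turn are labeled `UNKNOWN` rather than guessed. This merge cannot recover words lost when two people talk at once; it only attributes what the transcript already contains.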
Code-switching (mid-sentence language switches)
Speakers who switch languages mid-sentence (common in bilingual conversations) confuse Whisper’s language detection. Output often picks one language and mis-transcribes the other.
Mitigation: chunk audio at language boundaries if possible, or use a specialized code-switching ASR (research models exist, but no commercial offerings yet).
Numbers and named entities
Phone numbers, addresses, drug names, proper nouns — the highest-error categories across all ASR, not just Whisper. Always spot-check these before relying on the transcript.
Mitigation: domain-specific post-processing (e.g., regex validation for phone numbers; lookup tables for known proper nouns) catches the most common errors.
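As a rough illustration of that post-processing, the sketch below applies a proper-noun lookup table and flags number-like spans with an unexpected digit count. The glossary entries, the 10-digit expectation (North American numbers), and both function names are hypothetical; a real deployment would build these from its own domain.

```python
import re

# Hypothetical glossary: common mis-hearings mapped to the correct proper noun.
PROPER_NOUNS = {
    "deluxe scribe": "DeluxeScribe",
    "assembly ai": "AssemblyAI",
}

# Loose pattern for number-like spans: digits separated by spaces, dots, dashes, or parentheses.
PHONE_CANDIDATE = re.compile(r"\+?\d[\d\s().-]{5,}\d")


def fix_proper_nouns(text):
    """Replace known mis-transcriptions with the canonical spelling."""
    for wrong, right in PROPER_NOUNS.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text


def flag_suspect_phone_numbers(text, expected_digits=10):
    """Return number-like spans whose digit count differs from the expected length."""
    suspects = []
    for match in PHONE_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if len(digits) != expected_digits:
            suspects.append(match.group())
    return suspects


cleaned = fix_proper_nouns("We compared deluxe scribe with Assembly AI.")
suspects = flag_suspect_phone_numbers("Call 555-123-4567 or 555-123-456.")
```

Flagging rather than auto-correcting is the safer choice for numbers: the regex can tell that a digit is missing, but only the audio can tell you which one.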
Cost per minute
Pricing captured June 2026. Verify on each vendor’s current page before committing.
| Option | Cost per minute | Cost per hour | Best for |
|---|---|---|---|
| OpenAI Whisper API | $0.006 | $0.36 | Easiest integration; small/medium volume |
| Self-hosted Whisper (rented GPU) | ~$0.0017-0.005 | ~$0.10-0.30 | High volume; full privacy required |
| Self-hosted whisper.cpp (CPU) | Nominal compute | Nominal compute | Edge / mobile / batch with no time pressure |
| Deepgram Nova-3 | $0.0043 | $0.26 | Real-time, telephony, lowest cost on managed |
| AssemblyAI Universal-2 | $0.0062 | $0.37 | English meeting audio with strong diarization |
| Google STT Chirp-2 | $0.024 | $1.44 | Google Cloud ecosystem integration |
At what volume does self-host beat the API? Roughly 1,000-3,000 hours/month, depending on your engineering capacity. Below that, the OpenAI API is cheaper than running your own GPU when you factor in DevOps time.
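The break-even point can be estimated from the per-hour prices in the table above plus a fixed monthly overhead for running your own inference. The overhead figure below is a purely hypothetical placeholder for DevOps time and idle GPU cost, and `break_even_hours` is an illustrative helper, not a pricing model.

```python
def break_even_hours(api_rate, self_host_rate, fixed_monthly_overhead):
    """Monthly audio hours above which self-hosting costs less than the API.

    api_rate, self_host_rate: dollars per audio hour
    fixed_monthly_overhead:   dollars per month that self-hosting adds regardless of volume
    """
    savings_per_hour = api_rate - self_host_rate
    if savings_per_hour <= 0:
        return float("inf")  # self-hosting never catches up
    return fixed_monthly_overhead / savings_per_hour


# $0.36/hour API vs the low end of the self-hosted range ($0.10/hour),
# with an assumed $500/month of fixed overhead.
hours = break_even_hours(api_rate=0.36, self_host_rate=0.10, fixed_monthly_overhead=500)
```

With these inputs the break-even lands near 1,900 hours/month. At the high end of the self-hosted range ($0.30/hour) the per-hour saving shrinks to $0.06, so the same overhead pushes the break-even far higher, which is why the estimate depends so heavily on engineering capacity.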
When to use Whisper, when not to
Use Whisper when
- You need multilingual transcription across more than ~40 languages
- You need self-hosting (privacy, compliance, no cloud upload)
- You’re doing batch processing (latency doesn’t matter)
- You’re budget-sensitive and have GPU capacity
- You’re building a research or academic project where open-source is required
- You’re deploying to edge / mobile devices (whisper.cpp on phones)
Don’t use Whisper when
- You need real-time streaming with sub-300ms latency → Deepgram wins
- You’re transcribing telephony (8 kHz codec) at scale → Deepgram has telephony-tuned models
- You need built-in diarization without the engineering work → AssemblyAI’s diarization is stronger than WhisperX out of the box
- You need a HIPAA-compliant managed service → AssemblyAI Enterprise or Deepgram Enterprise with active BAA; or self-host Whisper on institutional hardware
- You don’t have the engineering capacity to manage GPU inference and the OpenAI API cost isn’t justified by your volume → use a managed alternative
How we use Whisper at DeluxeScribe
DeluxeScribe uses Whisper-family models in production. The specific stack: WhisperX for diarization + word-level timestamps, faster-whisper backend for speed, custom VAD preprocessing to mitigate the silence-hallucination problem, and post-processing for proper nouns and numbers.
Two reasons drove the decision to pick Whisper:
- 99-language coverage: none of the commercial alternatives match this. For a multilingual transcription product it’s essentially the only viable starting point.
- Self-hosting on our own infrastructure: keeps cost per minute predictable as we scale and avoids per-call API charges.
Where DeluxeScribe adds value on top of Whisper: API ergonomics, in-browser editor, six export formats (TXT, DOCX, PDF, SRT, VTT, JSON), speaker label cleanup, and tuned preprocessing for problematic conditions (phone audio, background noise). The transcription quality is what Whisper delivers; the surrounding product is what you pay for.
Related guides
- How to Transcribe Audio: The pillar. Four paths (SaaS, free tools, self-hosted Whisper, native OS) and how to pick.
- Medical Transcription: Self-hosted Whisper is the HIPAA-compatible path for PHI-containing audio when your institution has the hardware.
- Interview Transcription: For IRB-strict qualitative research where audio can't leave your institution, self-hosted Whisper is the option.
- Conversation Intelligence: Most CI platforms use Whisper or AssemblyAI under the hood. The differentiation is the analysis layer, not the transcription.