What is OpenAI Whisper's Word Error Rate?

On the standard LibriSpeech test-clean benchmark (clean studio English audiobook audio), Whisper large-v3 achieves around 2-5% WER. On real-world audio it varies sharply by condition: 5-10% on clean meeting English, 10-15% on Zoom calls, 15-25% on phone audio (narrow-band 8 kHz codec), and 30%+ when music overlaps with speech. The blanket '99% accurate' claim some vendors use is the benchmark number, not what you should expect on your audio.

Is Whisper more accurate than Deepgram or AssemblyAI?

It depends on the benchmark. Whisper large-v3 trails Deepgram Nova-3 and AssemblyAI Universal-2 by 1-3 percentage points on most English benchmarks. On multilingual benchmarks (Fleurs, Common Voice), Whisper leads in most languages. For real-time streaming and telephony, Deepgram wins. For English meeting audio with strong speaker diarization out of the box, AssemblyAI wins. There's no universal winner.

Which Whisper model size should I use?

For production, large-v3 or large-v3-turbo. The turbo variant trades roughly 1% WER for 8× speed and is the best general-purpose choice. medium is acceptable for non-critical use or when GPU memory is tight; tiny and base are too inaccurate for production use beyond toy demos. distil-whisper (a distilled variant) gives 6× speed for ~1.5% WER loss — worth it for batch processing at scale.

What's the difference between OpenAI Whisper API and open-source Whisper?

The OpenAI Whisper API runs large-v2 (as last checked in mid-2026) with OpenAI's preprocessing applied; the open-source release of large-v3 is a newer model. Accuracy is broadly similar but not identical. Open-source variants like faster-whisper and whisper.cpp run the same models with different runtimes: faster-whisper is typically 4× quicker than the reference implementation, and whisper.cpp runs on CPU, including phones. The underlying model is essentially the same; runtime and speed differ.

Does Whisper do speaker diarization?

No, not natively. Whisper transcribes audio but doesn't identify who said what. For speaker labels you need to pair Whisper with a diarization model: pyannote-audio is the open-source standard, and WhisperX bundles both together with word-level timestamps. Commercial ASR services (Deepgram, AssemblyAI) include diarization out of the box.

Why does Whisper sometimes hallucinate text during silence?

Whisper has a documented failure mode where it invents text — often a repeated phrase like 'Thank you for watching' or 'Subtitles by the Amara.org community' — during long silences or low-volume audio. The cause is the model's training on YouTube-style content where such phrases follow silent sections. Mitigations: run Voice Activity Detection (VAD) as preprocessing to skip silent segments, or use faster-whisper's no_speech_threshold parameter to flag and drop low-confidence segments.

How accurate is Whisper for non-English languages?

Top tier (under 5% WER on the Fleurs benchmark): English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, Russian, Chinese. Mid tier (5-15%): Arabic, Hindi, Turkish, Dutch, Polish, Vietnamese, Indonesian, Hebrew. Low tier (15-30%+): low-resource languages with less training data. Even so, Whisper leads most multilingual benchmarks against commercial alternatives.

How much does Whisper cost vs commercial ASR?

OpenAI Whisper API: $0.006/min ($0.36/hour). Self-hosted Whisper on rented GPU: $0.10-0.30/hour effective cost depending on scale. Deepgram Nova-3: $0.0043/min pre-recorded. AssemblyAI Universal-2: $0.0062/min. Google STT Chirp-2: $0.024/min (highest). At low volume the OpenAI API is easiest; at high volume self-hosted Whisper is cheapest if you have the engineering capacity.

How Accurate Is OpenAI Whisper in 2026? (Real-World WER by Condition and Language)

Whisper's accuracy varies by 10× between conditions. Here's the table nobody publishes — plus when a commercial ASR is the right call instead.

OpenAI Whisper large-v3 achieves around 2-5% Word Error Rate on clean studio English (the published LibriSpeech benchmark), but real-world accuracy varies sharply by audio condition and language. Realistic ranges: 5-10% on clean meeting audio, 10-15% on Zoom calls, 15-25% on phone audio, and 30%+ when music overlaps with speech. Whisper trails Deepgram Nova-3 and AssemblyAI Universal-2 by 1-3 percentage points on most English benchmarks but leads on most multilingual benchmarks. Below: WER by audio condition, WER by language, the model variants matrix (faster-whisper, whisper.cpp, WhisperX), the hallucination failure mode, and when to use a commercial ASR instead.

Last verified June 29, 2026

TL;DR — the headline numbers

One number can’t describe Whisper’s accuracy honestly. Here are the ranges by audio condition for English:

Audio condition	Typical WER	Reads like
Studio English (LibriSpeech audiobook)	2-5%	Near-perfect; light proofreading only
Clean podcast / well-mic'd single speaker	3-7%	Publication-grade after light cleanup
Conference room meeting, good mics	5-10%	Useful as-is; spot-check names
Zoom / Teams call with mixed mics	10-15%	Readable; edit before publication
Phone call audio (8 kHz narrow-band codec)	15-25%	Gist clear; numbers and names unreliable
Heavy accent + background noise	20-30%	Verify key points against audio
Music overlapping with speech	30%+ (often fails)	Model may hallucinate lyrics
Multilingual: top languages	5-15%	Comparable to English on clean audio
Multilingual: low-resource languages	30%+	Often unusable; check alternatives

What “accurate” means — Word Error Rate explained

Word Error Rate (WER) is the standard accuracy metric for speech-to-text. It measures the percentage of words the transcript gets wrong, counting substitutions, insertions, and deletions equally.

5% WER = 5 wrong words per 100 = reads cleanly; light proofreading
10% WER = 1 in 10 wrong = readable but edit-required before publication
25% WER = 1 in 4 wrong = unreliable for direct quoting; you need to re-listen
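
To make the definition concrete, here is a minimal sketch of how WER can be computed: a word-level edit distance divided by the number of reference words. The function name and the lowercase-and-split normalization are assumptions of this sketch; real evaluations usually also strip punctuation and normalize numbers before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "please call me back at five five five one two three four"
hypothesis = "please call me back at five five nine one two three"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 16.7%
```

The example has one substituted digit and one dropped digit out of 12 reference words. The score looks moderate, yet the phone number is unusable.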

The metric’s blind spot: WER weights every error equally. One missed phone number in a voicemail is functionally worse than 10 missed filler words, but WER doesn’t see that. Always spot-check proper nouns, numbers, and named entities. They’re the highest-impact error sites regardless of which ASR you use.

Whisper model sizes — accuracy/speed tradeoff

OpenAI publishes Whisper in six sizes. Production deployments mostly use large-v3 or large-v3-turbo; smaller sizes are for edge deployment or hardware-constrained scenarios.

Model	Parameters	VRAM	Relative speed	Use when
tiny	39M	~1 GB	~32× real-time on CPU	Toy demos; not production
base	74M	~1 GB	~16× real-time on CPU	Edge / mobile prototypes
small	244M	~2 GB	~6× real-time on CPU	Resource-constrained server
medium	769M	~5 GB	~2× real-time on CPU	Acceptable production fallback
large-v3	1.55B	~10 GB	~1× real-time on GPU	Maximum accuracy; default for production
large-v3-turbo	809M	~6 GB	~8× real-time on GPU	Best general-purpose; ~1% WER trade for 8× speed

Most production systems use large-v3-turbo — the speed gain (8×) justifies the small accuracy cost (~1% WER) in almost every batch-transcription scenario.
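
If you want to encode the table as a deployment rule, a small helper can pick the largest model that fits a VRAM budget. This is only a sketch: the VRAM figures are the approximate values from the table, the preference order reflects this article's recommendation, and `pick_model` is a hypothetical name rather than part of any Whisper library.

```python
from typing import List, Tuple

# (model name, approximate VRAM in GB), taken from the table above.
# tiny is omitted: it needs the same ~1 GB as base and is less accurate.
DEFAULT_ORDER: List[Tuple[str, float]] = [
    ("large-v3-turbo", 6),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
]
# When accuracy matters more than speed, try large-v3 first.
ACCURACY_FIRST: List[Tuple[str, float]] = [("large-v3", 10)] + DEFAULT_ORDER


def pick_model(vram_gb: float, prefer_accuracy: bool = False) -> str:
    """Return the first model in preference order that fits the VRAM budget."""
    candidates = ACCURACY_FIRST if prefer_accuracy else DEFAULT_ORDER
    for name, needed_gb in candidates:
        if vram_gb >= needed_gb:
            return name
    raise ValueError(f"no listed model fits in {vram_gb} GB of VRAM")


print(pick_model(8))                         # large-v3-turbo
print(pick_model(24, prefer_accuracy=True))  # large-v3
print(pick_model(5))                         # medium
```

Treat the thresholds as starting points. Actual memory use depends on the runtime, batch size, and quantization.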

WER by audio condition (the table nobody else publishes)

Whisper’s headline benchmark numbers are honest but misleading — they reflect the LibriSpeech corpus (clean-read audiobook English). Real audio looks very different. Here’s the breakdown by realistic condition, with the reason each one degrades accuracy:

Condition	Typical WER	Why it degrades
Studio podcast (single speaker, $300+ mic)	2-5%	Baseline; closest to LibriSpeech benchmark conditions
Conference room meeting (good ceiling mics)	5-10%	Multi-speaker turn-taking; some cross-talk
Zoom / Teams call (consumer mics)	10-15%	Codec compression, mixed mic quality, occasional overlap
Cellular phone audio (G.711, AMR)	15-25%	8 kHz narrow-band sampling strips consonant frequencies
Voice memo on iPhone (close-mic, single speaker)	5-12%	Close-mic helps; phone's ambient noise hurts
Field recording with wind / traffic	15-30%	Low signal-to-noise ratio confuses the model
Lecture hall (distant mic, large room)	10-20%	Reverberation and audience noise
Music + speech overlap (interview with score)	30%+	Model often hallucinates lyrics or skips sections
Multi-speaker overlap (3+ talking simultaneously)	20-40%	Whisper has no diarization; output garbles

The takeaway: if your audio is studio quality, expect Whisper’s headline numbers. If it’s anything else, plan for 2-5× higher error rates. Phone numbers, named entities, and technical jargon are mis-heard at higher rates than common words regardless of condition.

WER by language

Whisper supports 99 languages, but accuracy varies sharply. The OpenAI Whisper paper publishes WER on the Fleurs benchmark by language; the grouped summary:

Top tier (under 5% WER on Fleurs)

English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, Russian, Chinese (Mandarin). These languages have abundant training data and Whisper performs near-English levels on clean audio.

Mid tier (5-15% WER)

Arabic, Hindi, Turkish, Dutch, Polish, Vietnamese, Indonesian, Hebrew, Greek, Czech, Swedish, Danish, Finnish, Norwegian, Romanian, Bulgarian, Ukrainian, Thai, Catalan. Production-usable with editing.

Low tier (15-30%+ WER)

Lower-resource languages — many African languages, several Southeast Asian dialects, less-represented Indian languages, Welsh, Basque. Whisper covers them in name but results are often unusable without significant cleanup or domain-specific fine-tuning.

The multilingual leadership claim

On most multilingual benchmarks (Fleurs, Common Voice), Whisper outperforms Deepgram Nova-3 and AssemblyAI Universal-2 — both of those are English-first models that added multilingual support later. If your use case is multilingual, Whisper is typically the right starting point.

Whisper variants — which to actually use

“Whisper” is several things. The original OpenAI release is one implementation; the community has built faster, smaller, and more feature-rich variants on top of the same models.

Variant	License	Best for	Speed vs reference	Notes
OpenAI Whisper API	Closed (hosted)	Simplest integration	Hosted (varies)	API runs large-v2 typically; not always large-v3
OpenAI reference (open source)	MIT	Research / reference	1× (baseline)	Reference implementation; not optimized for speed
faster-whisper (SYSTRAN)	MIT	Production self-hosting	~4× faster	CTranslate2 backend; same accuracy as reference
whisper.cpp (ggerganov)	MIT	CPU / edge / mobile	~2× faster on CPU	Pure C++; runs on phones via quantization
WhisperX (m-bain)	BSD-4-Clause	Production with diarization	~7× faster	Adds word-level timestamps + pyannote diarization
distil-whisper	MIT	Batch processing at scale	~6× faster	Distilled student model; ~1.5% WER trade-off
MLX Whisper	MIT	Apple Silicon Macs	~3× faster on M-series	Optimized for Apple’s Metal performance shaders

What most production systems actually use: faster-whisper or WhisperX, not the OpenAI reference implementation. The reference is for research; the optimized variants are for shipping. WhisperX is the right choice if you need speaker diarization without integrating pyannote separately.

Whisper vs commercial alternatives

Honest comparison across the dominant ASR options in 2026. WER numbers are from each vendor’s published benchmarks plus the Open ASR Leaderboard.

Provider	LibriSpeech WER	Real-world meeting WER	Languages	Diarization	Self-host	Cost / min
Whisper large-v3	~2.7%	~8-12%	99	Add WhisperX or pyannote	Yes (MIT)	$0.006 API / lower self-hosted
Deepgram Nova-3	~2.5%	~6-10%	~40	Built-in	No (managed)	$0.0043
AssemblyAI Universal-2	~2.4%	~6-9%	~35	Built-in (strong)	No (managed)	$0.0062
Google STT Chirp-2	~3.0%	~9-12%	125+	Built-in	No (managed)	$0.024
Speechmatics	~2.6%	~7-10%	~50	Built-in	Yes (enterprise)	$0.012 cloud

The honest verdict

Multilingual or self-host needed → Whisper. Nothing else covers 99 languages with an open-source license.
Real-time streaming + telephony → Deepgram. Built for sub-300ms latency and phone-audio codecs.
English meeting audio with strong speaker labels → AssemblyAI. Their diarization consistently outperforms WhisperX on multi-speaker real-world meetings.
Already on Google Cloud → Chirp-2, despite higher cost — integration savings often dominate.

When Whisper fails — the failure modes

Hallucination during silence

Whisper has a documented and well-reproduced failure mode: during long silences or very low-volume audio, it invents text, often a repeated phrase like “Thank you for watching,” “Subtitles by the Amara.org community,” or a foreign-language sentence. The cause is the model’s training on YouTube-style content where these phrases follow silent sections in subtitle tracks.

A 2024 Stanford study documented hallucinations in 1.4% of Whisper transcripts of clinical audio, with some inventing entire fabricated medical content — a serious concern for healthcare use.

Mitigation: run Voice Activity Detection (VAD) preprocessing to skip silent segments, or use faster-whisper’s no_speech_threshold + vad_filter parameters to flag and drop low-confidence segments. WhisperX bundles VAD by default.
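
A rough sketch of that mitigation with faster-whisper follows. The `vad_filter` and `no_speech_threshold` arguments and the per-segment `no_speech_prob` field come from faster-whisper; the `Segment` tuple, the `drop_suspect_segments` helper, the 0.6 cutoff, and the phrase list are assumptions of this example, not a description of any production pipeline.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    start: float
    end: float
    text: str
    no_speech_prob: float


# Phrases this article names as typical silence hallucinations.
KNOWN_HALLUCINATIONS = {
    "thank you for watching",
    "subtitles by the amara.org community",
}


def drop_suspect_segments(
    segments: Iterable[Segment], max_no_speech_prob: float = 0.6
) -> List[Segment]:
    """Drop segments that look like silence or match a known hallucination."""
    kept = []
    for seg in segments:
        normalized = seg.text.strip().strip(".!?,").lower()
        if seg.no_speech_prob > max_no_speech_prob:
            continue  # the model itself thinks this is probably not speech
        if normalized in KNOWN_HALLUCINATIONS:
            # Caveat: this also drops a genuine sign-off; review before deleting.
            continue
        kept.append(seg)
    return kept


def transcribe_with_vad(path: str, model_size: str = "large-v3") -> List[Segment]:
    """Transcribe with faster-whisper's built-in VAD, then post-filter."""
    from faster_whisper import WhisperModel  # imported lazily; requires faster-whisper

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(path, vad_filter=True, no_speech_threshold=0.6)
    return drop_suspect_segments(
        Segment(s.start, s.end, s.text, s.no_speech_prob) for s in segments
    )
```

Filtering on `no_speech_prob` alone is a deliberately strict rule for this sketch. Tune the cutoff on your own audio, since quiet but real speech can also score high.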

Music + speech overlap

When music plays under speech, Whisper often produces lyrics-as-transcript or skips the segment entirely. The model wasn’t trained to separate sources.

Mitigation: source separation preprocessing (Spleeter, demucs) to isolate vocals before transcription.

Multi-speaker overlap (no built-in diarization)

Whisper produces a single text stream regardless of how many people are talking. When two speakers overlap, output garbles or drops one speaker entirely.

Mitigation: pair Whisper with pyannote-audio for speaker diarization, or use WhisperX which bundles both.
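
The pairing step is conceptually simple: the diarization model emits speaker turns with timestamps, and each transcript segment takes the speaker whose turn overlaps it most. The sketch below uses plain tuples and made-up timings; it is a simplified illustration of that idea, not WhisperX's or pyannote's actual code, and `assign_speakers` is a hypothetical helper.

```python
from typing import List, Tuple

TranscriptSegment = Tuple[float, float, str]  # (start, end, text) from the ASR
SpeakerTurn = Tuple[float, float, str]        # (start, end, speaker) from diarization


def assign_speakers(
    transcript: List[TranscriptSegment], turns: List[SpeakerTurn]
) -> List[Tuple[str, str]]:
    """Label each transcript segment with the speaker of maximum time overlap."""
    labeled = []
    for seg_start, seg_end, text in transcript:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled


# Illustrative data only.
transcript = [
    (0.0, 4.0, "Shall we start?"),
    (4.2, 9.0, "Yes, go ahead."),
    (20.0, 22.0, "Thanks."),
]
turns = [(0.0, 4.1, "SPEAKER_00"), (4.1, 9.5, "SPEAKER_01")]

for speaker, text in assign_speakers(transcript, turns):
    print(f"{speaker}: {text}")
```

Segments that overlap no turn are labeled UNKNOWN rather than guessed. Genuinely simultaneous speech still needs handling upstream, because the transcript itself may already be garbled.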

Code-switching (mid-sentence language switches)

Speakers who switch languages mid-sentence (common in bilingual conversations) confuse Whisper’s language detection. Output often picks one language and mis-transcribes the other.

Mitigation: chunk audio at language boundaries if possible, or use a specialized code-switching ASR model (research models exist, but no commercial offerings yet).

Numbers and named entities

Phone numbers, addresses, drug names, proper nouns — the highest-error categories across all ASR, not just Whisper. Always spot-check these before relying on the transcript.

Mitigation: domain-specific post-processing (e.g., regex validation for phone numbers; lookup tables for known proper nouns) catches the most common errors.
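
As an illustration of that kind of post-processing, the sketch below applies a lookup table for known proper nouns and flags phone-number-like strings with an unexpected digit count. The term table, the regex, and the 10-digit expectation (North American numbers) are all assumptions chosen for the example; adapt them to your domain.

```python
import re
from typing import List

# Hypothetical lookup table: common mis-transcriptions -> canonical spelling.
KNOWN_TERMS = {
    "deep gram": "Deepgram",
    "assembly ai": "AssemblyAI",
}

# Loose pattern for digit runs that look like phone numbers.
PHONE_CANDIDATE = re.compile(r"\+?\d[\d\s().-]{5,}\d")


def fix_known_terms(text: str) -> str:
    """Replace known mis-transcriptions with their canonical form."""
    for wrong, right in KNOWN_TERMS.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text


def flag_suspect_phone_numbers(text: str, expected_digits: int = 10) -> List[str]:
    """Return phone-like strings whose digit count differs from the expected one."""
    suspects = []
    for match in PHONE_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if len(digits) != expected_digits:
            suspects.append(match.group())
    return suspects


raw = "Call deep gram support at 555-123-456 or 555-123-4567."
print(fix_known_terms(raw))
print(flag_suspect_phone_numbers(raw))  # ['555-123-456']
```

Flagging rather than auto-correcting is the safer choice for numbers: the transcript cannot tell you which digit was dropped, so a human has to re-listen.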

Cost per minute

Pricing captured June 2026. Verify on each vendor’s current page before committing.

Option	Cost per minute	Cost per hour	Best for
OpenAI Whisper API	$0.006	$0.36	Easiest integration; small/medium volume
Self-hosted Whisper (rented GPU)	~$0.0017-0.005	~$0.10-0.30	High volume; full privacy required
Self-hosted whisper.cpp (CPU)	Nominal compute	Nominal compute	Edge / mobile / batch with no time pressure
Deepgram Nova-3	$0.0043	$0.26	Real-time, telephony, lowest cost on managed
AssemblyAI Universal-2	$0.0062	$0.37	English meeting audio with strong diarization
Google STT Chirp-2	$0.024	$1.44	Google Cloud ecosystem integration

At what volume does self-hosting beat the API? Roughly 1,000-3,000 hours/month, depending on your engineering capacity. Below that, the OpenAI API is cheaper than running your own GPU once you factor in DevOps time.
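
The break-even arithmetic can be sketched as follows. The $0.36/hour API price and the per-hour GPU range come from the table above; the fixed monthly overhead (DevOps time, monitoring, idle capacity) is a hypothetical input you need to estimate for your own team, and $300 is used purely as an example.

```python
API_COST_PER_HOUR = 0.36  # OpenAI Whisper API, from the pricing table


def monthly_cost_api(audio_hours: float) -> float:
    return audio_hours * API_COST_PER_HOUR


def monthly_cost_self_hosted(
    audio_hours: float, gpu_cost_per_audio_hour: float, fixed_monthly_overhead: float
) -> float:
    return fixed_monthly_overhead + audio_hours * gpu_cost_per_audio_hour


def break_even_hours(gpu_cost_per_audio_hour: float, fixed_monthly_overhead: float) -> float:
    """Audio hours per month at which self-hosting costs the same as the API."""
    saving_per_hour = API_COST_PER_HOUR - gpu_cost_per_audio_hour
    if saving_per_hour <= 0:
        raise ValueError("self-hosting never breaks even at this GPU cost")
    return fixed_monthly_overhead / saving_per_hour


# Example: $0.20 per audio hour on a rented GPU, $300/month assumed overhead.
hours = break_even_hours(gpu_cost_per_audio_hour=0.20, fixed_monthly_overhead=300.0)
print(f"Break-even at about {hours:,.0f} audio hours/month")
```

With those example inputs the break-even lands at about 1,875 hours/month, inside the range quoted above. A higher overhead estimate or a GPU cost near the top of the $0.10-0.30 range pushes it up quickly.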

When to use Whisper, when not to

Use Whisper when

You need multilingual transcription across more than ~40 languages
You need self-hosting (privacy, compliance, no cloud upload)
You’re doing batch processing (latency doesn’t matter)
You’re budget-sensitive and have GPU capacity
You’re building a research or academic project where open-source is required
You’re deploying to edge / mobile devices (whisper.cpp on phones)

Don’t use Whisper when

You need real-time streaming with sub-300ms latency → Deepgram wins
You’re transcribing telephony (8 kHz codec) at scale → Deepgram has telephony-tuned models
You need built-in diarization without the engineering work → AssemblyAI’s diarization is stronger than WhisperX out of the box
You need a HIPAA-compliant managed service → AssemblyAI Enterprise or Deepgram Enterprise with active BAA; or self-host Whisper on institutional hardware
You don’t have the engineering capacity to manage GPU inference and the OpenAI API cost isn’t justified by your volume → use a managed alternative

How we use Whisper at DeluxeScribe

DeluxeScribe uses Whisper-family models in production. The specific stack: WhisperX for diarization + word-level timestamps, faster-whisper backend for speed, custom VAD preprocessing to mitigate the silence-hallucination problem, and post-processing for proper nouns and numbers.

We picked Whisper for two reasons that drove the decision:

99-language coverage: none of the commercial alternatives match this. For a multilingual transcription product it’s essentially the only viable starting point.
Self-hosting on our own infrastructure: keeps cost per minute predictable as we scale and avoids per-call API charges.

Where DeluxeScribe adds value on top of Whisper: API ergonomics, in-browser editor, six export formats (TXT, DOCX, PDF, SRT, VTT, JSON), speaker label cleanup, and tuned preprocessing for problematic conditions (phone audio, background noise). The transcription quality is what Whisper delivers; the surrounding product is what you pay for.

Try Whisper-based transcription without the engineering

60 minutes free, no credit card. The same model class that commercial conversation-intelligence (CI) tools use under the hood, with diarization, six export formats, and 99 languages.

How this page was verified

Benchmark WER numbers come from the OpenAI Whisper paper (Radford et al., 2022), the Whisper large-v3 model card, and the Hugging Face Open ASR Leaderboard. Variant performance references SYSTRAN faster-whisper, ggerganov whisper.cpp, m-bain WhisperX, and distil-whisper. Pricing was captured June 2026 from OpenAI, Deepgram, and AssemblyAI. Real-world WER-by-condition ranges combine published ASR benchmarks (LibriSpeech for clean studio; AMI Meeting Corpus for multi-speaker meetings; CHiME for noisy conditions) with our own observations running Whisper-family models in production on customer audio at DeluxeScribe. Hallucination failure mode is documented in the Whisper paper appendix and the Stanford study on Whisper hallucinations in clinical transcripts.