How Accurate Is OpenAI Whisper in 2026? (Real-World WER by Condition and Language)

Whisper's accuracy varies by 10× between conditions. Here's the table nobody publishes — plus when a commercial ASR is the right call instead.

OpenAI Whisper large-v3 achieves around 2-5% Word Error Rate on clean studio English (the published LibriSpeech benchmark), but real-world accuracy varies sharply by audio condition and language. Realistic ranges: 5-10% on clean meeting audio, 10-15% on Zoom calls, 15-25% on phone audio, and 30%+ when music overlaps speech. Whisper trails Deepgram Nova-3 and AssemblyAI Universal-2 by 1-3 percentage points on most English benchmarks but leads on most multilingual benchmarks. Below: WER by audio condition, WER by language, the model variants matrix (faster-whisper, whisper.cpp, WhisperX), the hallucination failure mode, and when to use a commercial ASR instead.

    Last verified June 29, 2026

    TL;DR — the headline numbers

    One number can’t describe Whisper’s accuracy honestly. Here are the ranges by audio condition for English:

    | Audio condition | Typical WER | Reads like |
    |---|---|---|
    | Studio English (LibriSpeech audiobook) | 2-5% | Near-perfect; light proofreading only |
    | Clean podcast / well-mic'd single speaker | 3-7% | Publication-grade after light cleanup |
    | Conference room meeting, good mics | 5-10% | Useful as-is; spot-check names |
    | Zoom / Teams call with mixed mics | 10-15% | Readable; edit before publication |
    | Phone call audio (8 kHz narrow-band codec) | 15-25% | Gist clear; numbers and names unreliable |
    | Heavy accent + background noise | 20-30% | Verify key points against audio |
    | Music overlapping with speech | 30%+ (often fails) | Model may hallucinate lyrics |
    | Multilingual: top languages | 5-15% | Comparable to English on clean audio |
    | Multilingual: low-resource languages | 30%+ | Often unusable; check alternatives |

    What “accurate” means — Word Error Rate explained

    Word Error Rate (WER) is the standard accuracy metric for speech-to-text. It measures the percentage of words the transcript gets wrong, counting substitutions, insertions, and deletions equally.

    • 5% WER = 5 wrong words per 100 = reads cleanly; light proofreading
    • 10% WER = 1 in 10 wrong = readable but edit-required before publication
    • 25% WER = 1 in 4 wrong = unreliable for direct quoting; you need to re-listen

    The metric’s blind spot: WER weights every error equally. One missed phone number in a voicemail is functionally worse than 10 missed filler words, but WER doesn’t see that. Always spot-check proper nouns, numbers, and named entities: they’re the highest-impact error sites regardless of which ASR you use.
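
    As a rough illustration of how the metric is computed, here is a minimal Python sketch of WER as word-level edit distance divided by the reference word count. The function name and the lowercase-and-split normalization are our own simplifications; published benchmarks normalize text more carefully (punctuation, numerals), so scores from this sketch will not match leaderboard figures exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

    For example, a 10-word reference transcribed with one substituted word and one dropped word scores 2/10, or 20% WER, no matter whether the substituted word was a filler or a surname.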

    Whisper model sizes — accuracy/speed tradeoff

    OpenAI publishes Whisper in six sizes. Production deployments mostly use large-v3 or large-v3-turbo; smaller sizes are for edge deployment or hardware-constrained scenarios.

    | Model | Parameters | VRAM | Relative speed | Use when |
    |---|---|---|---|---|
    | tiny | 39M | ~1 GB | ~32× real-time on CPU | Toy demos; not production |
    | base | 74M | ~1 GB | ~16× real-time on CPU | Edge / mobile prototypes |
    | small | 244M | ~2 GB | ~6× real-time on CPU | Resource-constrained server |
    | medium | 769M | ~5 GB | ~2× real-time on CPU | Acceptable production fallback |
    | large-v3 | 1.55B | ~10 GB | ~1× real-time on GPU | Maximum accuracy; default for production |
    | large-v3-turbo | 809M | ~6 GB | ~8× real-time on GPU | Best general-purpose; ~1% WER trade for 8× speed |

    Most production systems use large-v3-turbo — the speed gain (8×) justifies the small accuracy cost (~1% WER) in almost every batch-transcription scenario.
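
    The size choice can be expressed as a small lookup. The sketch below is a hypothetical helper, not part of any Whisper library: it takes the approximate VRAM figures from the sizes table in this section and returns the most accurate model that fits, preferring large-v3-turbo over large-v3 unless you ask for maximum accuracy. Ranking large-v3-turbo above medium on accuracy is an assumption of the sketch.

```python
from typing import List, Tuple

# (model name, approximate VRAM in GB), ordered from least to most accurate.
# VRAM figures come from the sizes table; the accuracy ordering of
# large-v3-turbo above medium is an assumption made for this sketch.
MODELS_BY_ACCURACY: List[Tuple[str, float]] = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3-turbo", 6),
    ("large-v3", 10),
]


def pick_model(available_vram_gb: float, prefer_speed: bool = True) -> str:
    """Return the most accurate model that fits in the given VRAM budget."""
    fitting = [name for name, vram in MODELS_BY_ACCURACY if vram <= available_vram_gb]
    if not fitting:
        raise ValueError("no Whisper size fits in the available VRAM")
    best = fitting[-1]
    # When large-v3 fits, turbo fits too; take the faster model unless
    # the caller explicitly wants maximum accuracy.
    if best == "large-v3" and prefer_speed:
        return "large-v3-turbo"
    return best
```

    Treat the VRAM numbers as starting points; quantized runtimes and batch size change real memory use considerably.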

    WER by audio condition (the table nobody else publishes)

    Whisper’s headline benchmark numbers are honest but misleading — they reflect the LibriSpeech corpus (clean-read audiobook English). Real audio looks very different. Here’s the breakdown by realistic condition, with the reason each one degrades accuracy:

    | Condition | Typical WER | Why it degrades |
    |---|---|---|
    | Studio podcast (single speaker, $300+ mic) | 2-5% | Baseline: closest to LibriSpeech training distribution |
    | Conference room meeting (good ceiling mics) | 5-10% | Multi-speaker turn-taking; some cross-talk |
    | Zoom / Teams call (consumer mics) | 10-15% | Codec compression, mixed mic quality, occasional overlap |
    | Cellular phone audio (G.711, AMR) | 15-25% | 8 kHz narrow-band sampling strips consonant frequencies |
    | Voice memo on iPhone (close-mic, single speaker) | 5-12% | Close-mic helps; phone's ambient noise hurts |
    | Field recording with wind / traffic | 15-30% | Low signal-to-noise ratio confuses the model |
    | Lecture hall (distant mic, large room) | 10-20% | Reverberation and audience noise |
    | Music + speech overlap (interview with score) | 30%+ | Model often hallucinates lyrics or skips sections |
    | Multi-speaker overlap (3+ talking simultaneously) | 20-40% | Whisper has no diarization; output garbles |

    The takeaway: if your audio is studio quality, expect Whisper’s headline numbers. If it’s anything else, plan for 2-5× higher error rates. Phone numbers, named entities, and technical jargon are misheard at higher rates than common words regardless of condition.

    WER by language

    Whisper supports 99 languages, but accuracy varies sharply. The OpenAI Whisper paper publishes WER on the Fleurs benchmark by language; the grouped summary:

    Top tier (under 5% WER on Fleurs)

    English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, Russian, Chinese (Mandarin). These languages have abundant training data and Whisper performs near-English levels on clean audio.

    Mid tier (5-15% WER)

    Arabic, Hindi, Turkish, Dutch, Polish, Vietnamese, Indonesian, Hebrew, Greek, Czech, Swedish, Danish, Finnish, Norwegian, Romanian, Bulgarian, Ukrainian, Thai, Catalan. Production-usable with editing.

    Low tier (15-30%+ WER)

    Lower-resource languages — many African languages, several Southeast Asian dialects, less-represented Indian languages, Welsh, Basque. Whisper covers them in name but results are often unusable without significant cleanup or domain-specific fine-tuning.

    The multilingual leadership claim

    On most multilingual benchmarks (Fleurs, Common Voice), Whisper outperforms Deepgram Nova-3 and AssemblyAI Universal-2 — both of those are English-first models that added multilingual support later. If your use case is multilingual, Whisper is typically the right starting point.

    Whisper variants — which to actually use

    “Whisper” is several things. The original OpenAI release is one implementation; the community has built faster, smaller, and more feature-rich variants on top of the same models.

    | Variant | License | Best for | Speed vs reference | Notes |
    |---|---|---|---|---|
    | OpenAI Whisper API | Closed (hosted) | Simplest integration | Hosted (varies) | API runs large-v2 typically; not always large-v3 |
    | OpenAI reference (open source) | MIT | Research / reference | 1× (baseline) | Reference implementation; not optimized for speed |
    | faster-whisper (SYSTRAN) | MIT | Production self-hosting | ~4× faster | CTranslate2 backend; same accuracy as reference |
    | whisper.cpp (ggerganov) | MIT | CPU / edge / mobile | ~2× faster on CPU | Pure C++; runs on phones via quantization |
    | WhisperX (m-bain) | BSD-4-Clause | Production with diarization | ~7× faster | Adds word-level timestamps + pyannote diarization |
    | distil-whisper | MIT | Batch processing at scale | ~6× faster | Distilled student model; ~1.5% WER trade-off |
    | MLX Whisper | MIT | Apple Silicon Macs | ~3× faster on M-series | Optimized for Apple’s Metal performance shaders |

    What most production systems actually use: faster-whisper or WhisperX, not the OpenAI reference implementation. The reference is for research; the optimized variants are for shipping. WhisperX is the right choice if you need speaker diarization without integrating pyannote separately.

    Whisper vs commercial alternatives

    Honest comparison across the dominant ASR options in 2026. WER numbers are from each vendor’s published benchmarks plus the Open ASR Leaderboard.

    | Provider | LibriSpeech WER | Real-world meeting WER | Languages | Diarization | Self-host | Cost / min |
    |---|---|---|---|---|---|---|
    | Whisper large-v3 | ~2.7% | ~8-12% | 99 | Add WhisperX or pyannote | Yes (MIT) | $0.006 API / lower self-hosted |
    | Deepgram Nova-3 | ~2.5% | ~6-10% | ~40 | Built-in | No (managed) | $0.0043 |
    | AssemblyAI Universal-2 | ~2.4% | ~6-9% | ~35 | Built-in (strong) | No (managed) | $0.0062 |
    | Google STT Chirp-2 | ~3.0% | ~9-12% | 125+ | Built-in | No (managed) | $0.024 |
    | Speechmatics | ~2.6% | ~7-10% | ~50 | Built-in | Yes (enterprise) | $0.012 cloud |

    The honest verdict

    • Multilingual or self-host needed → Whisper. Nothing else covers 99 languages with an open-source license.
    • Real-time streaming + telephony → Deepgram. Built for sub-300ms latency and phone-audio codecs.
    • English meeting audio with strong speaker labels → AssemblyAI. Their diarization consistently outperforms WhisperX on multi-speaker real-world meetings.
    • Already on Google Cloud → Chirp-2, despite higher cost — integration savings often dominate.

    When Whisper fails — the failure modes

    Hallucination during silence

    Whisper has a documented and well-reproduced failure mode: during long silences or very low-volume audio, it invents text, often a repeated phrase like “Thank you for watching,” “Subtitles by the Amara.org community,” or a foreign-language sentence. The likely cause is the model’s training on YouTube-style content where these phrases follow silent sections in subtitle tracks.

    A 2024 study (Koenecke et al., “Careless Whisper”) documented hallucinations in roughly 1.4% of the Whisper transcripts it examined, some of which invented entire passages, including fabricated medical content. That is a serious concern for healthcare use.

    Mitigation: run Voice Activity Detection (VAD) preprocessing to skip silent segments, or use faster-whisper’s vad_filter and no_speech_threshold parameters to skip silence and drop segments the model itself scores as probable non-speech. WhisperX bundles VAD by default.
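
    A minimal sketch of this mitigation, assuming faster-whisper is installed. The vad_filter and no_speech_threshold arguments and the no_speech_prob segment field belong to faster-whisper; the drop_suspect_segments helper, its threshold, and the stock-phrase list are illustrative assumptions to tune on your own audio.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Stock phrases Whisper is known to emit over silence (extend for your data).
STOCK_PHRASES = {
    "thank you for watching",
    "subtitles by the amara.org community",
}


@dataclass
class Segment:
    """Stand-in carrying the two faster-whisper segment fields used below."""
    text: str
    no_speech_prob: float


def drop_suspect_segments(segments: Iterable, max_no_speech_prob: float = 0.6) -> List:
    """Keep only segments that look like real speech."""
    kept = []
    for seg in segments:
        normalized = seg.text.strip().strip(".!").lower()
        if seg.no_speech_prob > max_no_speech_prob:
            continue  # the model itself rates this window as probable silence
        if normalized in STOCK_PHRASES:
            continue  # blunt rule: this also drops a genuine sign-off
        kept.append(seg)
    return kept


def transcribe_filtered(audio_path: str, model_size: str = "large-v3") -> List:
    """Transcribe with VAD enabled, then apply the post-filter above."""
    # Imported lazily so the filter can be used and tested without the package.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(
        audio_path,
        vad_filter=True,         # skip stretches the VAD marks as non-speech
        no_speech_threshold=0.6,
    )
    return drop_suspect_segments(segments)
```

    The stock-phrase rule is deliberately crude; if your recordings really do end with “Thank you for watching,” restrict it to segments that also have an elevated no_speech_prob.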

    Music + speech overlap

    When music plays under speech, Whisper often produces lyrics-as-transcript or skips the segment entirely. The model wasn’t trained to separate sources.

    Mitigation: source separation preprocessing (Spleeter, demucs) to isolate vocals before transcription.
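
    As one way to wire this in, the sketch below shells out to the Demucs command-line tool in two-stem mode, which splits a file into vocals and everything else. It assumes the demucs executable is installed and on your PATH; the function names and the default output directory are our own. Demucs typically writes vocals.wav into a per-model, per-track subfolder of the output directory, so check the layout produced by your installed version before passing the result to Whisper.

```python
import subprocess
from typing import List


def demucs_vocals_command(audio_path: str, out_dir: str = "separated") -> List[str]:
    """Build the Demucs CLI call that isolates vocals from the accompaniment."""
    return ["demucs", "--two-stems=vocals", "-o", out_dir, audio_path]


def isolate_vocals(audio_path: str, out_dir: str = "separated") -> None:
    """Run Demucs; raises CalledProcessError if separation fails."""
    subprocess.run(demucs_vocals_command(audio_path, out_dir), check=True)
```

    Source separation is slow compared with transcription, so it is usually worth applying only to files you already know contain music.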

    Multi-speaker overlap (no built-in diarization)

    Whisper produces a single text stream regardless of how many people are talking. When two speakers overlap, output garbles or drops one speaker entirely.

    Mitigation: pair Whisper with pyannote-audio for speaker diarization, or use WhisperX which bundles both.
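
    Conceptually, pairing a transcript with a diarizer comes down to matching time ranges. The sketch below is not WhisperX’s or pyannote’s actual code; it assumes you already have transcript segments and speaker turns as plain tuples, and labels each segment with the speaker whose turn overlaps it most. The tuple shapes, the function name, and the "UNKNOWN" fallback label are assumptions for illustration.

```python
from typing import List, Optional, Tuple

TranscriptSegment = Tuple[float, float, str]  # (start_s, end_s, text)
SpeakerTurn = Tuple[float, float, str]        # (start_s, end_s, speaker_label)


def assign_speakers(
    segments: List[TranscriptSegment],
    turns: List[SpeakerTurn],
) -> List[Tuple[str, str]]:
    """Label each transcript segment with the speaker it overlaps most."""
    labelled = []
    for start, end, text in segments:
        best_speaker: Optional[str] = None
        best_overlap = 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the intersection of the two time ranges (negative if disjoint).
            overlap = min(end, turn_end) - max(start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append((best_speaker or "UNKNOWN", text))
    return labelled
```

    This handles clean turn-taking. It does not fix genuinely simultaneous speech, where the transcript itself is already garbled before any labels are attached.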

    Code-switching (mid-sentence language switches)

    Speakers who switch languages mid-sentence (common in bilingual conversations) confuse Whisper’s language detection. Output often picks one language and mis-transcribes the other.

    Mitigation: chunk audio at language boundaries if possible, or use a specialized code-switching ASR model (research models exist, but no commercial offering yet).

    Numbers and named entities

    Phone numbers, addresses, drug names, proper nouns — the highest-error categories across all ASR, not just Whisper. Always spot-check these before relying on the transcript.

    Mitigation: domain-specific post-processing (e.g., regex validation for phone numbers; lookup tables for known proper nouns) catches the most common errors.
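
    A minimal sketch of both ideas, using only the standard library. The 10-digit expectation (North American numbers), the regular expression, and the example correction table are assumptions; adapt them to your own locale and vocabulary.

```python
import re
from typing import Dict, List

# Runs of digits with common phone punctuation between them.
PHONE_CANDIDATE = re.compile(r"\+?\d[\d\s().-]{5,}\d")


def flag_suspect_phone_numbers(text: str, expected_digits: int = 10) -> List[str]:
    """Return phone-like strings whose digit count is wrong, for human review."""
    suspects = []
    for match in PHONE_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if len(digits) != expected_digits:
            suspects.append(match.group())
    return suspects


def fix_known_terms(text: str, corrections: Dict[str, str]) -> str:
    """Replace known mis-hearings with the correct spelling (case-insensitive)."""
    for wrong, right in corrections.items():
        pattern = rf"\b{re.escape(wrong)}\b"
        # A function replacement avoids backslash escapes being interpreted in `right`.
        text = re.sub(pattern, lambda _m, r=right: r, text, flags=re.IGNORECASE)
    return text
```

    Flagging rather than auto-correcting phone numbers is deliberate: a number with a missing digit cannot be repaired from the text alone, only from the audio.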

    Cost per minute

    Pricing captured June 2026. Verify on each vendor’s current page before committing.

    | Option | Cost per minute | Cost per hour | Best for |
    |---|---|---|---|
    | OpenAI Whisper API | $0.006 | $0.36 | Easiest integration; small/medium volume |
    | Self-hosted Whisper (rented GPU) | ~$0.0017-0.005 | ~$0.10-0.30 | High volume; full privacy required |
    | Self-hosted whisper.cpp (CPU) | Nominal compute | Nominal compute | Edge / mobile / batch with no time pressure |
    | Deepgram Nova-3 | $0.0043 | $0.26 | Real-time, telephony, lowest cost on managed |
    | AssemblyAI Universal-2 | $0.0062 | $0.37 | English meeting audio with strong diarization |
    | Google STT Chirp-2 | $0.024 | $1.44 | Google Cloud ecosystem integration |

    At what volume does self-host beat the API? Roughly 1,000-3,000 hours/month, depending on your engineering capacity. Below that, the OpenAI API is cheaper than running your own GPU when you factor in DevOps time.
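
    The break-even point is simple arithmetic once you put a number on the fixed overhead of self-hosting. The sketch below uses the $0.36/hour API price from the pricing table; the fixed monthly overhead (engineering and DevOps time, monitoring, idle GPU) is a hypothetical input you have to estimate yourself, and the function name is our own.

```python
API_COST_PER_HOUR = 0.36  # OpenAI Whisper API: $0.006/min from the pricing table


def break_even_hours(
    fixed_monthly_overhead: float,
    self_host_cost_per_hour: float,
    api_cost_per_hour: float = API_COST_PER_HOUR,
) -> float:
    """Audio hours per month above which self-hosting is cheaper than the API."""
    saving_per_hour = api_cost_per_hour - self_host_cost_per_hour
    if saving_per_hour <= 0:
        raise ValueError("self-hosting never breaks even at this per-hour cost")
    # Self-hosting wins once per-hour savings have paid off the fixed overhead.
    return fixed_monthly_overhead / saving_per_hour
```

    With an assumed $500/month of overhead and $0.15 per audio hour of GPU cost, the break-even lands at roughly 2,400 hours per month, inside the 1,000-3,000 range quoted above. Halve the overhead and the threshold halves with it.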

    When to use Whisper, when not to

    Use Whisper when

    • You need multilingual transcription across more than ~40 languages
    • You need self-hosting (privacy, compliance, no cloud upload)
    • You’re doing batch processing (latency doesn’t matter)
    • You’re budget-sensitive and have GPU capacity
    • You’re building a research or academic project where open-source is required
    • You’re deploying to edge / mobile devices (whisper.cpp on phones)

    Don’t use Whisper when

    • You need real-time streaming with sub-300ms latency → Deepgram wins
    • You’re transcribing telephony (8 kHz codec) at scale → Deepgram has telephony-tuned models
    • You need built-in diarization without the engineering work → AssemblyAI’s diarization is stronger than WhisperX out of the box
    • You need a HIPAA-compliant managed service → AssemblyAI Enterprise or Deepgram Enterprise with active BAA; or self-host Whisper on institutional hardware
    • You don’t have the engineering capacity to manage GPU inference and the OpenAI API cost isn’t justified by your volume → use a managed alternative

    How we use Whisper at DeluxeScribe

    DeluxeScribe uses Whisper-family models in production. The specific stack: WhisperX for diarization + word-level timestamps, faster-whisper backend for speed, custom VAD preprocessing to mitigate the silence-hallucination problem, and post-processing for proper nouns and numbers.

    We picked Whisper for two reasons that drove the decision:

    • 99-language coverage: none of the commercial alternatives match this. For a multilingual transcription product it’s essentially the only viable starting point.
    • Self-hosting on our own infrastructure — keeps cost per minute predictable as we scale and avoids per-call API charges

    Where DeluxeScribe adds value on top of Whisper: API ergonomics, in-browser editor, six export formats (TXT, DOCX, PDF, SRT, VTT, JSON), speaker label cleanup, and tuned preprocessing for problematic conditions (phone audio, background noise). The transcription quality is what Whisper delivers; the surrounding product is what you pay for.

    Try Whisper-based transcription without the engineering

    60 minutes free, no credit card. Same model class commercial CI tools use under the hood, with diarization, six export formats, and 99 languages.

    How this page was verified

    Benchmark WER numbers come from the OpenAI Whisper paper (Radford et al., 2022), the Whisper large-v3 model card, and the Hugging Face Open ASR Leaderboard. Variant performance references SYSTRAN faster-whisper, ggerganov whisper.cpp, m-bain WhisperX, and distil-whisper. Pricing was captured June 2026 from OpenAI, Deepgram, and AssemblyAI. Real-world WER-by-condition ranges combine published ASR benchmarks (LibriSpeech for clean studio; AMI Meeting Corpus for multi-speaker meetings; CHiME for noisy conditions) with our own observations running Whisper-family models in production on customer audio at DeluxeScribe. The hallucination failure mode is documented in the Whisper paper appendix and in the 2024 “Careless Whisper” study (Koenecke et al.) on Whisper hallucinations.

    Frequently Asked Questions

    What is OpenAI Whisper's Word Error Rate?

    On the standard LibriSpeech test-clean benchmark (clean studio English audiobook audio), Whisper large-v3 achieves around 2-5% WER. On real-world audio it varies sharply by condition: 5-10% on clean meeting English, 10-15% on Zoom calls, 15-25% on phone audio (narrow-band 8 kHz codec), and 30%+ when music overlaps with speech. The blanket '99% accurate' claim some vendors use is the benchmark number, not what you should expect on your audio.

    Is Whisper more accurate than Deepgram or AssemblyAI?

    Depends on the benchmark. Whisper large-v3 trails Deepgram Nova-3 and AssemblyAI Universal-2 by 1-3 percentage points on most English benchmarks. On multilingual benchmarks (Fleurs, Common Voice), Whisper leads most languages. For real-time streaming and telephony, Deepgram wins. For English meeting audio with strong speaker diarization out of the box, AssemblyAI wins. There's no universal winner.

    Which Whisper model size should I use?

    For production, large-v3 or large-v3-turbo. The turbo variant trades roughly 1% WER for 8× speed and is the best general-purpose choice. medium is acceptable for non-critical use or when GPU memory is tight; tiny and base are too inaccurate for production use beyond toy demos. distil-whisper (a distilled variant) gives 6× speed for ~1.5% WER loss — worth it for batch processing at scale.

    What's the difference between OpenAI Whisper API and open-source Whisper?

    The OpenAI Whisper API runs large-v2 (as of our last check in mid-2026) with OpenAI's preprocessing applied; the open-source large-v3 release is a newer model. Accuracy is broadly similar but not identical. Open-source variants like faster-whisper and whisper.cpp run the same models on different runtimes: faster-whisper is typically about 4× faster than the reference implementation, and whisper.cpp runs on CPU, including phones. The model weights are essentially the same; runtime and speed differ.

    Does Whisper do speaker diarization?

    No, not natively. Whisper transcribes audio but doesn't identify who said what. For speaker labels you need to pair Whisper with a diarization model: pyannote-audio is the open-source standard, and WhisperX bundles both together with word-level timestamps. Commercial ASR services (Deepgram, AssemblyAI) include diarization out of the box.

    Why does Whisper sometimes hallucinate text during silence?

    Whisper has a documented failure mode where it invents text — often a repeated phrase like 'Thank you for watching' or 'Subtitles by the Amara.org community' — during long silences or low-volume audio. The cause is the model's training on YouTube-style content where such phrases follow silent sections. Mitigations: run Voice Activity Detection (VAD) as preprocessing to skip silent segments, or use faster-whisper's no_speech_threshold parameter to flag and drop low-confidence segments.

    How accurate is Whisper for non-English languages?

    Top tier (under 5% WER on the Fleurs benchmark): English, Spanish, French, German, Italian, Portuguese, Japanese, Korean, Russian, Chinese. Mid tier (5-15%): Arabic, Hindi, Turkish, Dutch, Polish, Vietnamese, Indonesian, Hebrew. Low tier (15-30%+): low-resource languages with less training data. Whisper still leads most multilingual benchmarks compared to commercial alternatives.

    How much does Whisper cost vs commercial ASR?

    OpenAI Whisper API: $0.006/min ($0.36/hour). Self-hosted Whisper on rented GPU: $0.10-0.30/hour effective cost depending on scale. Deepgram Nova-3: $0.0043/min pre-recorded. AssemblyAI Universal-2: $0.0062/min. Google STT Chirp-2: $0.024/min (highest). At low volume the OpenAI API is easiest; at high volume self-hosted Whisper is cheapest if you have the engineering capacity.