WAV to Text: highest-accuracy audio transcription, and the file-size math nobody publishes
WAV is uncompressed audio — the highest-fidelity input for AI transcription. Here's the workflow, why WAV produces more accurate text than MP3, and what to do when the file is too big to upload.
The tradeoff: WAV files are large. A 1-hour recording at 44.1 kHz stereo weighs ~635 MB, and studio recordings at 96 kHz 24-bit run about 2 GB per hour. DeluxeScribe accepts WAV files up to 5 GB with 99-language auto-detection, 60 minutes free, no credit card. Below: the workflow, the file-size math, the FFmpeg compression command for when the file is too big, and the WAV sub-format edge cases that trip up simple parsers.
- 60 minutes free
- No credit card
- 99 languages
- Speaker labels
Last verified July 4, 2026
TL;DR — pick your path
| Your situation | Best path | Cost |
|---|---|---|
| Small WAV (< 100 MB), one-off | Upload to DeluxeScribe | 60 min free, then ~$10/mo |
| Big WAV (> 500 MB) | Compress with FFmpeg to M4A → upload | Free (FFmpeg) + service tier |
| Studio recording (96 kHz 24-bit, multi-hour) | Downsample with FFmpeg, then upload | Free (FFmpeg) + service tier |
| Batch of many WAVs | Sign up, use bulk upload | ~$10/mo subscription |
| Sensitive content (court audio, PHI, unreleased) | Self-hosted Whisper — no upload | Free after Python setup |
| Non-PCM WAV (Opus, µ-law, ADPCM) | Convert to PCM first with FFmpeg → upload | Free (FFmpeg) + service tier |
| Court reporting / evidentiary | Human transcription (Rev human, GoTranscript) | $1.50–$3.00/min |
| Just want text fast, don’t care about accuracy | Any tool — WAV is universally accepted | Varies |
Upload workflow: the 5 steps
Same workflow as any other cloud transcription service, with two WAV-specific notes at the end.
- Sign up (60 minutes free, no credit card).
- Drag the .wav into the upload area, or click to browse. Files up to 5 GB are accepted, which covers roughly 8 hours at CD quality (44.1 kHz stereo, 16-bit) or roughly 30 hours of 22 kHz mono.
- Language auto-detects. Leave it unless you want to force a specific dialect. Speaker labels default on; toggle off for solo recordings.
- Wait 1–3 minutes for typical files (30-60 min recording). Multi-hour files scale roughly linearly.
- Review and export. Fix any mis-heard proper nouns, technical terms, or numbers in the browser editor (usually 2–5 minutes of cleanup on a 30-minute file). Export as TXT, DOCX, PDF, SRT, VTT, or JSON with word-level timestamps.
Two WAV-specific notes
- If your WAV is over 500 MB, consider compressing to M4A first (see the FFmpeg section below) — 10–20× smaller upload, negligible accuracy loss.
- If the WAV came from a non-standard source (some game engines, embedded devices, or telephony systems), it may not be plain PCM audio — see the sub-format section for the FFmpeg command to convert.
Why WAV = higher accuracy than MP3
This is the section nobody in the SERP writes because it requires actually understanding how audio compression interacts with AI transcription. Short version: WAV stores every audio sample at full precision. MP3 (and other lossy formats) throw away information the compressor considers “inaudible.” The problem: AI transcription models rely on some of that discarded information to distinguish similar-sounding consonants.
The mechanism
Lossy audio compression (MP3, AAC, Opus) works by mapping audio to a psychoacoustic model of human hearing and discarding frequencies humans can’t detect or barely perceive. At high bitrates (192+ kbps MP3, 256+ kbps AAC), almost nothing meaningful is lost. At low bitrates (64 kbps voice-optimized MP3), high frequencies above ~5–8 kHz are aggressively stripped.
Transcription models built on architectures such as Whisper and Wav2Vec use those high frequencies to distinguish consonant pairs that sound alike after heavy compression but differ in the raw waveform: f / v, s / z, p / b, t / d, ch / sh. Compressed audio loses subtle distinctions in the fricative and sibilant range around 4–8 kHz.
WER comparison on identical source
| Source format | Typical WER on clean English | vs WAV |
|---|---|---|
| WAV (48 kHz PCM, 16-bit) | 3–5% | — |
| M4A (256 kbps AAC) | 3–6% | +0–1% |
| MP3 (192 kbps) | 4–7% | +1–2% |
| MP3 (128 kbps) | 5–8% | +2–4% |
| MP3 (64 kbps voice) | 8–13% | +4–8% |
| Cellular narrowband (8 kHz) | 15–25% | +10–20% |
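For readers who want to reproduce this kind of comparison, WER (word error rate) is the word-level edit distance between a reference transcript and the machine transcript, divided by the number of reference words. The sketch below is a minimal version that assumes lowercase, whitespace-split words; the normalization behind the table above is not stated, so treat this as an illustration of the metric and not as the exact scoring used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Two consonant confusions (f/v) in a 7-word reference.
reference = "the five vans passed the bus stop"
hypothesis = "the fife fans passed the bus stop"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # prints 28.6%
```

Short clips make the metric jumpy: two wrong words out of seven is already a 28.6% WER, which is why differences of a few points only mean something on longer recordings.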
When WAV’s advantage doesn’t matter
The advantage only matters if the source audio was high-quality to begin with. Cases where WAV vs MP3 makes no practical difference:
- Noisy source recording (laptop mic in a cafe, phone recording on a busy street). Ambient noise dominates whatever accuracy ceiling the format sets.
- Phone-quality source (already 8 kHz narrowband before recording).
- Very short clips (under 30 seconds). Random error dominates the small WER differences.
- Heavily accented or non-native speech where model uncertainty is the dominant error source.
The practical takeaway: if you have a studio recording, a podcast master, a professional interview, or any deliberate recording at 44.1+ kHz, keep it as WAV for transcription when possible. If you have an MP3 at 128+ kbps, don’t bother converting it to WAV: re-encoding cannot restore what the MP3 encoder already discarded.
File size — the honest math
Why WAV files are big: bytes per second = sample_rate × channels × (bit_depth / 8). Plug in the numbers for common WAV configurations:
| Configuration | Bytes/sec | Size per hour | Typical source |
|---|---|---|---|
| 8 kHz mono 16-bit | 16,000 | ~58 MB | Phone-quality voice archive |
| 22 kHz mono 16-bit | 44,000 | ~158 MB | Consumer voice recorder |
| 44.1 kHz mono 16-bit | 88,200 | ~318 MB | Podcast raw voice track |
| 44.1 kHz stereo 16-bit | 176,400 | ~635 MB | CD-quality music, most Zoom recordings |
| 48 kHz stereo 16-bit | 192,000 | ~690 MB | Broadcast standard, DSLR camera audio |
| 48 kHz stereo 24-bit | 288,000 | ~1.0 GB | Professional field recording |
| 96 kHz stereo 24-bit | 576,000 | ~2.1 GB | Studio session master |
| 192 kHz stereo 24-bit | 1,152,000 | ~4.1 GB | Audiophile / archival master |
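The formula above is easy to check for any configuration. A minimal sketch follows; the function names are illustrative, sizes are in decimal megabytes (1 MB = 1,000,000 bytes), and the small WAV header is ignored.

```python
def wav_bytes_per_second(sample_rate_hz: int, channels: int, bit_depth: int) -> int:
    """Uncompressed PCM data rate: sample_rate x channels x (bit_depth / 8)."""
    return sample_rate_hz * channels * bit_depth // 8


def wav_size_per_hour_mb(sample_rate_hz: int, channels: int, bit_depth: int) -> float:
    """Size of one hour of PCM audio in decimal megabytes (header ignored)."""
    return wav_bytes_per_second(sample_rate_hz, channels, bit_depth) * 3600 / 1_000_000


for label, config in [
    ("44.1 kHz stereo 16-bit", (44_100, 2, 16)),
    ("48 kHz stereo 16-bit", (48_000, 2, 16)),
]:
    print(f"{label}: {wav_bytes_per_second(*config):,} B/s, "
          f"{wav_size_per_hour_mb(*config):.0f} MB per hour")
```

Dividing an upload limit by the per-hour figure gives the recording length that fits under it.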
Practical implications:
- A 1-hour meeting recorded on a Zoom H1n at 44.1 kHz stereo = ~635 MB. Uploads fine on typical broadband but is slow on hotel Wi-Fi.
- A 3-hour podcast master = ~1.9 GB. Compress to M4A first or split with FFmpeg.
- A 30-min studio session at 96 kHz 24-bit = ~1 GB. Downsample to 22 kHz mono before upload — transcription doesn’t use anything above ~8 kHz anyway.
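To put rough numbers on the upload-time point, the sketch below compares a 1-hour CD-quality WAV with the same hour encoded as 128 kbps AAC. The two uplink speeds are assumptions chosen for illustration, not measurements, and the estimate ignores protocol overhead.

```python
def aac_size_mb(duration_s: float, bitrate_kbps: float) -> float:
    """Approximate size of an audio stream at a given average bitrate, in decimal MB."""
    return duration_s * bitrate_kbps * 1000 / 8 / 1_000_000


def upload_minutes(size_mb: float, uplink_mbps: float) -> float:
    """Transfer time in minutes, ignoring protocol overhead and congestion."""
    return size_mb * 8 / uplink_mbps / 60


wav_mb = 635.0                   # 1 hour at 44.1 kHz stereo 16-bit
m4a_mb = aac_size_mb(3600, 128)  # the same hour as 128 kbps AAC

# Uplink speeds below are assumed values for illustration only.
for label, uplink_mbps in [("broadband, assumed 20 Mbps up", 20.0),
                           ("hotel Wi-Fi, assumed 2 Mbps up", 2.0)]:
    print(f"{label}: WAV {upload_minutes(wav_mb, uplink_mbps):.1f} min, "
          f"M4A {upload_minutes(m4a_mb, uplink_mbps):.1f} min")
```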
Compress with FFmpeg before upload
Two commands cover 95% of cases. Install FFmpeg first (brew install ffmpeg on Mac, or a Windows build from the FFmpeg downloads page).
Path 1 — Convert to M4A (recommended for most)
ffmpeg -i input.wav -c:a aac -b:a 128k output.m4a
This encodes the audio as AAC at 128 kbps inside an M4A container. Typical result: a file 10–20× smaller than a stereo source WAV, with negligible transcription accuracy loss (128 kbps AAC stays above the quality threshold that AI models care about for speech). M4A is accepted by most cloud transcription services.
Path 2 — Downsample to smaller WAV (if you need WAV format)
ffmpeg -i input.wav -acodec pcm_s16le -ar 22050 -ac 1 output.wav
This keeps the WAV format but downsamples to 22.05 kHz mono 16-bit. Result: a file about 4× smaller than a 48 kHz stereo 16-bit source, or about 6.5× smaller than a 48 kHz stereo 24-bit one. Speech transcription accuracy is preserved because AI models don’t use frequencies above ~8 kHz for speech anyway (the Nyquist frequency for 22.05 kHz sampling is about 11 kHz, well above the speech-relevant range).
When each path fits
- M4A path: general upload optimization, slow connections, batch processing. Default choice.
- Downsampled WAV path: you need to keep WAV format for a downstream tool that requires it (some legacy analysis pipelines).
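For batch jobs, the two paths can be wrapped in a small script. This is a sketch under stated assumptions: it expects ffmpeg on the PATH, and the function names and the keep_wav switch are hypothetical helpers introduced here. The FFmpeg flags are the same ones used in the two commands above.

```python
import shlex
import subprocess


def build_ffmpeg_command(src: str, dst: str, keep_wav: bool = False) -> list[str]:
    """Return the argument list for one of the two compression paths."""
    if keep_wav:
        # Path 2: stay in WAV, downsample to 22.05 kHz mono 16-bit PCM.
        codec_args = ["-acodec", "pcm_s16le", "-ar", "22050", "-ac", "1"]
    else:
        # Path 1 (default): AAC at 128 kbps in an M4A container.
        codec_args = ["-c:a", "aac", "-b:a", "128k"]
    return ["ffmpeg", "-i", src, *codec_args, dst]


def compress(src: str, dst: str, keep_wav: bool = False) -> None:
    """Run FFmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ffmpeg_command(src, dst, keep_wav), check=True)


# Printing the command first is a cheap dry run before touching any files.
print(shlex.join(build_ffmpeg_command("input.wav", "output.m4a")))
print(shlex.join(build_ffmpeg_command("input.wav", "output.wav", keep_wav=True)))
```

Passing the arguments as a list (no shell string) avoids quoting problems with file names that contain spaces.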
Splitting a very long WAV
If your file is over 5 GB or several hours, split into hour-long segments:
ffmpeg -i input.wav -f segment -segment_time 3600 -c copy part_%03d.wav
This produces part_000.wav, part_001.wav, and so on, one file per hour of audio. Upload each separately and concatenate the transcripts.
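When the per-part transcripts carry timestamps, concatenating them also means shifting each part by its position in the original recording. The sketch below assumes each part's transcript has already been parsed into (start_seconds, end_seconds, text) tuples, which is a hypothetical in-memory shape and not a DeluxeScribe export format. It also assumes every part except the last is exactly 3600 seconds long; a stricter version would measure each part's real duration.

```python
SEGMENT_SECONDS = 3600  # matches -segment_time in the split command


def merge_part_transcripts(parts):
    """Shift each part's cues by its offset in the original recording."""
    merged = []
    for index, cues in enumerate(parts):
        offset = index * SEGMENT_SECONDS
        for start, end, text in cues:
            merged.append((start + offset, end + offset, text))
    return merged


def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT-style HH:MM:SS,mmm timestamp."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


parts = [
    [(0.0, 2.5, "Welcome back.")],          # cues from part_000.wav
    [(1.0, 3.0, "Picking up after the break.")],  # cues from part_001.wav
]
for start, end, text in merge_part_transcripts(parts):
    print(f"{format_timestamp(start)} --> {format_timestamp(end)}  {text}")
```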
Free path — self-hosted Whisper
If your content is sensitive (court audio, protected health information, unreleased music masters, confidential business), don’t upload it to a cloud service. Run Whisper locally instead.
Install
pip install openai-whisper
Or on Mac via Homebrew:
brew install openai-whisper
Run
whisper input.wav --model large-v3 --output_format srt
Output formats: txt, srt, vtt, tsv, json, all.
Speed reality
- CPU (typical laptop): 10–30× slower than real time on large-v3. A 1-hour file takes 10–30 hours. Painful.
- Apple Silicon (M1/M2/M3/M4): 1–3× slower than real time on large-v3 using the MLX Whisper port. A 1-hour file takes 1–3 hours.
- NVIDIA GPU (RTX 3060+): Real time to 2× faster. A 1-hour file takes 30–60 minutes.
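To scale these figures to a recording of any length, it helps to express speed as a real-time factor: processing time divided by audio duration, so 1.0 means real time and values above 1.0 are slower. The sketch below uses factor ranges derived from the hour figures in the list above; they are rough ranges and will vary with the exact hardware and model build.

```python
def processing_hours(audio_hours: float, real_time_factor: float) -> float:
    """real_time_factor = processing time / audio duration (1.0 = real time)."""
    return audio_hours * real_time_factor


# (best case, worst case) factors derived from the list above.
HARDWARE_FACTORS = {
    "CPU laptop": (10.0, 30.0),
    "Apple Silicon (MLX)": (1.0, 3.0),
    "NVIDIA GPU": (0.5, 1.0),
}


def estimate(audio_hours: float) -> dict[str, tuple[float, float]]:
    """Return the (fastest, slowest) processing time in hours per hardware tier."""
    return {
        name: (processing_hours(audio_hours, best), processing_hours(audio_hours, worst))
        for name, (best, worst) in HARDWARE_FACTORS.items()
    }


for name, (fastest, slowest) in estimate(2.5).items():
    print(f"{name}: {fastest:g} to {slowest:g} hours for a 2.5-hour recording")
```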
When self-hosted Whisper fits
- Sensitive content that can’t leave your machine
- Airgap environments (courtrooms, sensitive research labs, intelligence work)
- One-time transcription of a large batch you don’t want to pay per-minute for
- Learning / experimentation
When it doesn’t fit
- You need speaker labels out of the box (Whisper alone doesn’t diarize — you’d need to add WhisperX)
- You need an in-browser editor for cleanup
- You need it fast on a machine without a GPU
- You don’t want to install Python + dependencies
WAV sub-format edge cases
The .wav file extension doesn’t guarantee plain PCM audio inside. The RIFF fmt chunk in a WAV file’s header specifies which codec the audio uses. Most WAVs in the wild are standard PCM, but you can encounter these:
| fmt code | Codec | Notes |
|---|---|---|
| 0x0001 | PCM (integer) | Standard; universally supported |
| 0x0003 | IEEE float | Common in DAW exports; most services accept |
| 0x0006 | A-law | Legacy telephony; some services reject |
| 0x0007 | µ-law | Legacy telephony; some services reject |
| 0x0011 | IMA ADPCM | Rare; needs conversion |
| 0x0055 | MPEG Layer 3 (MP3-in-WAV) | Unusual; parse-error risk |
| 0xFFFE | WAVEFORMATEXTENSIBLE | Multi-channel, HD audio; usually supported |
If you hit “unsupported codec” or “could not decode” on a WAV file, this is probably why. Convert to standard 16-bit PCM:
ffmpeg -i input.wav -acodec pcm_s16le output.wav
The pcm_s16le codec name means signed 16-bit little-endian PCM, the standard format virtually every parser recognizes. Converting from higher bit depths (float, 24-bit) discards only detail near the 16-bit noise floor, which doesn’t matter for transcription.
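If FFmpeg isn't installed yet, the fmt code can also be read straight from the file. The sketch below walks the RIFF chunks with Python's standard library and returns the 16-bit format tag; the function names are illustrative, the lookup table covers only the codes listed in this section, and RF64 or otherwise non-standard files are rejected instead of handled.

```python
import struct

FORMAT_TAGS = {
    0x0001: "PCM (integer)",
    0x0003: "IEEE float",
    0x0006: "A-law",
    0x0007: "µ-law",
    0x0011: "IMA ADPCM",
    0x0055: "MPEG Layer 3",
    0xFFFE: "WAVEFORMATEXTENSIBLE",
}


def read_wav_format_tag(path: str) -> int:
    """Return the format tag stored in the fmt chunk of a RIFF/WAVE file."""
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) < 12 or header[:4] != b"RIFF" or header[8:12] != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            chunk_header = f.read(8)
            if len(chunk_header) < 8:
                raise ValueError("no fmt chunk found")
            chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
            if chunk_id == b"fmt ":
                tag_bytes = f.read(2)
                if len(tag_bytes) < 2:
                    raise ValueError("truncated fmt chunk")
                return struct.unpack("<H", tag_bytes)[0]
            # Skip this chunk; chunk bodies are padded to an even length.
            f.seek(chunk_size + (chunk_size & 1), 1)


def describe_wav(path: str) -> str:
    """Return a human-readable label such as '0x0001 PCM (integer)'."""
    tag = read_wav_format_tag(path)
    return f"0x{tag:04X} {FORMAT_TAGS.get(tag, 'unknown codec')}"


# Usage: print(describe_wav("input.wav"))
```

Anything other than 0x0001 is a candidate for the pcm_s16le conversion above before upload.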
Opus audio and WAV containers
Opus (RFC 6716, 2012) is a modern speech and audio codec that outperforms MP3 and AAC at low bitrates — particularly for voice. Opus files usually have .opus or .ogg extensions, but Opus can also live inside a WAV container in edge cases (some VoIP systems, some embedded devices).
How to tell if your WAV contains Opus
Check the fmt chunk with FFmpeg or MediaInfo:
ffprobe input.wav
The output shows the codec. If it says opus instead of pcm_s16le, you have Opus-in-WAV. Some transcription services handle this transparently; others reject it. Convert to standard PCM first if in doubt:
ffmpeg -i input.wav -acodec pcm_s16le -ar 16000 -ac 1 output.wav
Transcribing pure Opus files
If your source is .opus or .ogg (Opus-in-Ogg), DeluxeScribe accepts both directly, with no conversion needed. Speech-optimized Opus at 32 kbps is roughly transcription-equivalent to 128 kbps MP3 (Opus is that much more efficient for voice), so accuracy from a speech-bitrate .opus file lands close to that of a 128 kbps MP3, a small step below WAV.
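The codec check from this section can be scripted when there are many files to sort. The sketch below asks ffprobe for JSON output and reads the first audio stream's codec name; it assumes ffprobe is on the PATH, and the helper names are hypothetical. Treating anything other than pcm_s16le as "convert first" is a deliberately conservative rule for this example, not a statement about what any particular service accepts.

```python
import json
import subprocess


def ffprobe_codec_command(path: str) -> list[str]:
    """Arguments that make ffprobe print the first audio stream's codec as JSON."""
    return ["ffprobe", "-v", "error", "-select_streams", "a:0",
            "-show_entries", "stream=codec_name", "-of", "json", path]


def parse_codec_name(ffprobe_json: str):
    """Extract codec_name from ffprobe's JSON, or None if there is no audio stream."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return streams[0].get("codec_name") if streams else None


def audio_codec(path: str):
    """Run ffprobe on a file and return its audio codec name."""
    result = subprocess.run(ffprobe_codec_command(path),
                            capture_output=True, text=True, check=True)
    return parse_codec_name(result.stdout)


def is_standard_pcm(codec_name) -> bool:
    """Conservative check: only 16-bit little-endian PCM counts as standard."""
    return codec_name == "pcm_s16le"


# Parsing works on captured output too, so it can be tried without ffprobe.
sample = '{"programs": [], "streams": [{"codec_name": "opus"}]}'
print(parse_codec_name(sample), is_standard_pcm(parse_codec_name(sample)))
```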
When another tool fits better
Being honest about the sub-cases where DeluxeScribe isn’t the right pick for your WAV file:
- Court reporting / evidentiary work. Use human transcription (Rev human tier, GoTranscript) — AI is adequate for most content but insufficient when a judge may examine the transcript. Human review at $1.50–$3.00 per audio minute is the correct choice here.
- Extremely sensitive content (unreleased music masters, PHI, classified). Use self-hosted Whisper to avoid upload entirely (see the self-hosted Whisper section above).
- Very large batches (hundreds of files, thousands of hours). At that scale, OpenAI Whisper API directly or a self-hosted batch pipeline is cheaper per minute than any consumer transcription service.
- Different source format. If your file is actually MP3, use MP3 to Text — same tool, format-specific notes. If it’s an M4A from iPhone Voice Memos, see M4A to Text or iPhone Voice Memo Transcription.
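- Video source. If you have a video file with WAV-quality audio embedded, extract the audio first (ffmpeg -i input.mp4 -vn -acodec copy output.m4a) and upload the audio-only track. The stream copy works when the embedded track is AAC, the usual case for MP4; for other codecs, re-encode with -c:a aac instead. See Video to Text for the full workflow.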
How this page was verified
fmt chunk values) verified against Microsoft’s WAV format documentation and the WAVE format reference. FFmpeg command syntax comes from the official FFmpeg manual. Accuracy comparisons come from our own testing of 8 sample recordings encoded to WAV, high-bitrate MP3 (192/128/64 kbps), and M4A (128 kbps AAC), transcribed through DeluxeScribe against reference human transcripts; see How Accurate Is Whisper for the broader WER-by-condition breakdown. We don’t cite the “99% accuracy” and “98% accurate in 150+ languages” claims common on this SERP because we can’t source them to a published benchmark on WAV audio specifically. Opus codec details come from RFC 6716.
Related guides
- MP3 to Text: Sibling format guide. MP3 is the compressed counterpart to WAV. Covers bitrate reality, accuracy tradeoffs, and free options.
- M4A to Text: Apple's default audio container. Best for iPhone Voice Memos and AAC-encoded recordings.
- How Accurate Is Whisper: WER benchmarks by audio condition and format. Why WAV, MP3, and phone-audio accuracy differ.
- How to Transcribe Audio (pillar): The broader pillar, covering every path across formats and providers and how to pick.