WAV to Text: highest-accuracy audio transcription, and the file-size math nobody publishes

WAV is uncompressed audio — the highest-fidelity input for AI transcription. Here's the workflow, why WAV produces more accurate text than MP3, and what to do when the file is too big to upload.

WAV is universally supported: every cloud transcription service accepts it, and no format yields more accurate transcription, because no lossy compression has stripped high-frequency information. Upload the file, wait 1–3 minutes for typical durations, and get a transcript with speaker labels and .srt / .docx export. The tradeoff: WAV files are large. A 1-hour recording at 44.1 kHz stereo weighs ~635 MB, and studio recordings at 96 kHz 24-bit can hit 1.9 GB per hour. DeluxeScribe accepts WAV files up to 5 GB with 99-language auto-detection, 60 minutes free, no credit card. Below: the workflow, the file-size math, the FFmpeg compression command for when the file is too big, and the WAV sub-format edge cases that trip up simple parsers.

Last verified July 4, 2026

TL;DR — pick your path

Your situation	Best path	Cost
Small WAV (< 100 MB), one-off	Upload to DeluxeScribe	60 min free, then ~$10/mo
Big WAV (> 500 MB)	Compress with FFmpeg to M4A → upload	Free (FFmpeg) + service tier
Studio recording (96 kHz 24-bit, multi-hour)	Downsample with FFmpeg, then upload	Free (FFmpeg) + service tier
Batch of many WAVs	Sign up, use bulk upload	~$10/mo subscription
Sensitive content (court audio, PHI, unreleased)	Self-hosted Whisper — no upload	Free after Python setup
Non-PCM WAV (Opus, µ-law, ADPCM)	Convert to PCM first with FFmpeg → upload	Free (FFmpeg) + service tier
Court reporting / evidentiary	Human transcription (Rev human, GoTranscript)	$1.50–$3.00/min
Just want text fast, don’t care about accuracy	Any tool — WAV is universally accepted	Varies

Upload workflow: the 5 steps

Same workflow as any other cloud transcription service, with two WAV-specific notes at the end.

Sign up (60 minutes free, no credit card).
Drag the .wav into the upload area, or click to browse. Files up to 5 GB are accepted, which covers roughly 8 hours at CD quality (44.1 kHz stereo 16-bit) or about 31 hours of 22 kHz mono 16-bit.
Language auto-detects. Leave it unless you want to force a specific dialect. Speaker labels default on; toggle off for solo recordings.
Wait 1–3 minutes for typical files (30–60 minute recordings). Multi-hour files scale roughly linearly.
Review and export. Fix any mis-heard proper nouns, technical terms, or numbers in the browser editor (usually 2–5 minutes of cleanup on a 30-minute file). Export as TXT, DOCX, PDF, SRT, VTT, or JSON with word-level timestamps.

Two WAV-specific notes

If your WAV is over 500 MB, consider compressing to M4A first (see the FFmpeg section below) — 10–20× smaller upload, negligible accuracy loss.
If the WAV came from a non-standard source (some game engines, embedded devices, or telephony systems), it may not be plain PCM audio — see the sub-format section for the FFmpeg command to convert.

Why WAV = higher accuracy than MP3

This is the section nobody in the SERP writes because it requires actually understanding how audio compression interacts with AI transcription. Short version: WAV stores every audio sample at full precision. MP3 (and other lossy formats) throw away information the compressor considers “inaudible.” The problem: AI transcription models rely on some of that discarded information to distinguish similar-sounding consonants.

The mechanism

Lossy audio compression (MP3, AAC, Opus) works by mapping audio to a psychoacoustic model of human hearing and discarding frequencies humans can’t detect or barely perceive. At high bitrates (192+ kbps MP3, 256+ kbps AAC), almost nothing meaningful is lost. At low bitrates (64 kbps voice-optimized MP3), high frequencies above ~5–8 kHz are aggressively stripped.

AI transcription models built on Whisper, Wav2Vec, and similar architectures use those high frequencies to distinguish consonant pairs that blur together in heavily compressed audio but remain distinct in the raw waveform: f / v, s / z, p / b, t / d, ch / sh. Compressed audio loses subtle distinctions in the fricative and sibilant range around 4–8 kHz.

WER comparison on identical source

Source format	Typical WER on clean English	vs WAV
WAV (48 kHz PCM, 16-bit)	3–5%	—
M4A (256 kbps AAC)	3–6%	+0–1%
MP3 (192 kbps)	4–7%	+1–2%
MP3 (128 kbps)	5–8%	+2–4%
MP3 (64 kbps voice)	8–13%	+4–8%
Cellular narrowband (8 kHz)	15–25%	+10–20%
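
For readers new to the metric: WER (word error rate) is the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference, divided by the number of words in the reference. The sketch below shows that calculation. The function name and the lowercase, whitespace-only tokenization are illustrative assumptions, not the scoring pipeline behind the table above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level Levenshtein distance, keeping only one rolling row in memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the transcript
                curr[j - 1] + 1,     # extra word in the transcript
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    transcript = "the cat sit on mat"
    print(f"WER: {word_error_rate(reference, transcript):.1%}")
```

In this example the transcript has one substitution (sat → sit) and one dropped word (the) against a six-word reference, so the WER is 2 / 6, or about 33%. Production scoring usually also normalizes punctuation and number formats before comparing.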

When WAV’s advantage doesn’t matter

The advantage only matters if the source audio was high-quality to begin with. Cases where WAV vs MP3 makes no practical difference:

Noisy source recording (laptop mic in a cafe, phone recording on a busy street). Ambient noise dominates whatever accuracy ceiling the format sets.
Phone-quality source (already 8 kHz narrowband before recording).
Very short clips (under 30 seconds). Random error dominates the small WER differences.
Heavily accented or non-native speech where model uncertainty is the dominant error source.

The practical takeaway: if you have a studio recording, a podcast master, a professional interview, or any deliberate recording at 44.1+ kHz, keep it as WAV when possible for transcription. If you have an MP3 at 128+ kbps, don’t bother re-encoding it to WAV: conversion cannot restore discarded information, and the accuracy delta at that bitrate is small.

File size — the honest math

Why WAV files are big: bytes per second = sample_rate × channels × (bit_depth / 8). Plug in the numbers for common WAV configurations:

Configuration	Bytes/sec	Size per hour	Typical source
8 kHz mono 16-bit	16,000	~55 MB	Phone-quality voice archive
22 kHz mono 16-bit	44,000	~150 MB	Consumer voice recorder
44.1 kHz mono 16-bit	88,200	~300 MB	Podcast raw voice track
44.1 kHz stereo 16-bit	176,400	~635 MB	CD-quality music, most Zoom recordings
48 kHz stereo 16-bit	192,000	~690 MB	Broadcast standard, DSLR camera audio
48 kHz stereo 24-bit	288,000	~1.0 GB	Professional field recording
96 kHz stereo 24-bit	576,000	~1.9 GB	Studio session master
192 kHz stereo 24-bit	1,152,000	~3.9 GB	Audiophile / archival master

Practical implications:

A 1-hour meeting recorded on a Zoom H1n at 44.1 kHz stereo = ~635 MB. Uploads fine on typical broadband but is slow on hotel Wi-Fi.
A 3-hour podcast master = ~1.9 GB. Compress to M4A first or split with FFmpeg.
A 30-min studio session at 96 kHz 24-bit = ~1 GB. Downsample to 22 kHz mono before upload — transcription doesn’t use anything above ~8 kHz anyway.
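
The formula above is easy to check yourself. The sketch below applies it to a few configurations and also estimates how many hours fit under an upload limit. The function names are illustrative, and it assumes the limit is counted in decimal gigabytes. Sizes are printed in both decimal megabytes and binary gibibytes, because rounded per-hour figures shift by a few percent depending on which convention a tool reports.

```python
def wav_bytes_per_second(sample_rate_hz: int, channels: int, bit_depth: int) -> int:
    """bytes per second = sample_rate x channels x (bit_depth / 8)."""
    return sample_rate_hz * channels * bit_depth // 8


def size_per_hour_bytes(sample_rate_hz: int, channels: int, bit_depth: int) -> int:
    # Ignores the WAV header, which is tiny next to an hour of audio.
    return wav_bytes_per_second(sample_rate_hz, channels, bit_depth) * 3600


def hours_that_fit(limit_bytes: int, sample_rate_hz: int, channels: int, bit_depth: int) -> float:
    """How many hours of audio at this configuration fit under limit_bytes."""
    return limit_bytes / size_per_hour_bytes(sample_rate_hz, channels, bit_depth)


if __name__ == "__main__":
    configs = [(8000, 1, 16), (44100, 2, 16), (48000, 2, 24), (96000, 2, 24)]
    for rate, channels, depth in configs:
        size = size_per_hour_bytes(rate, channels, depth)
        print(f"{rate} Hz, {channels} ch, {depth}-bit: "
              f"{size / 1e6:,.0f} MB (decimal) = {size / 2**30:.2f} GiB per hour")

    limit = 5 * 10**9  # assumption: the 5 GB limit is counted in decimal gigabytes
    print(f"44.1 kHz stereo 16-bit: {hours_that_fit(limit, 44100, 2, 16):.1f} hours fit in 5 GB")
```

For 96 kHz stereo 24-bit this works out to about 2,074 MB per hour in decimal units, which is the same amount of data as roughly 1.93 GiB.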

Compress with FFmpeg before upload

Two commands cover 95% of cases. Install FFmpeg first (brew install ffmpeg on Mac, or the Windows FFmpeg downloads page).

Path 1 — Convert to M4A (recommended for most)

ffmpeg -i input.wav -c:a aac -b:a 128k output.m4a

This encodes the audio as AAC at 128 kbps inside an M4A container. Typical result: a file 10–20× smaller than the source WAV, with negligible transcription accuracy loss (both formats are above the quality threshold that AI models care about for speech). M4A is accepted by virtually every cloud transcription service.

Path 2 — Downsample to smaller WAV (if you need WAV format)

ffmpeg -i input.wav -acodec pcm_s16le -ar 22050 -ac 1 output.wav

This keeps the WAV format but downsamples to 22.05 kHz mono 16-bit. Result: a file ~4.4× smaller than a 48 kHz stereo 16-bit source, or ~6.5× smaller than a 48 kHz stereo 24-bit source. Speech transcription accuracy is preserved because AI models don’t use frequencies above ~8 kHz for speech anyway (the Nyquist frequency for 22.05 kHz sampling is about 11 kHz, well above the speech-relevant range).

When each path fits

M4A path: general upload optimization, slow connections, batch processing. Default choice.
Downsampled WAV path: you need to keep WAV format for a downstream tool that requires it (some legacy analysis pipelines).
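
How much each path shrinks a file depends on the source configuration, so it can help to run the numbers before choosing. The sketch below compares source WAV data rates against 128 kbps AAC and against 22.05 kHz mono 16-bit WAV. Function names are illustrative, and the AAC figure uses the nominal bitrate only: real files differ slightly because of container overhead and the encoder's rate control.

```python
def pcm_bytes_per_second(sample_rate_hz: int, channels: int, bit_depth: int) -> float:
    """Data rate of uncompressed PCM audio."""
    return sample_rate_hz * channels * bit_depth / 8


def bitrate_bytes_per_second(kbps: float) -> float:
    """Nominal data rate of a compressed stream (1 kbps = 1000 bits per second)."""
    return kbps * 1000 / 8


def shrink_factor(source_bps: float, target_bps: float) -> float:
    """How many times smaller the target is than the source."""
    return source_bps / target_bps


if __name__ == "__main__":
    sources = {
        "44.1 kHz stereo 16-bit": pcm_bytes_per_second(44100, 2, 16),
        "48 kHz stereo 16-bit": pcm_bytes_per_second(48000, 2, 16),
        "48 kHz stereo 24-bit": pcm_bytes_per_second(48000, 2, 24),
    }
    m4a_128k = bitrate_bytes_per_second(128)
    small_wav = pcm_bytes_per_second(22050, 1, 16)

    for name, rate in sources.items():
        print(f"{name}: M4A path {shrink_factor(rate, m4a_128k):.1f}x smaller, "
              f"downsampled WAV path {shrink_factor(rate, small_wav):.1f}x smaller")
```

The M4A path ranges from about 11× (44.1 kHz stereo 16-bit) to 18× (48 kHz stereo 24-bit). The downsampled WAV path saves less: about 4.4× from a 48 kHz stereo 16-bit source and about 6.5× from a 24-bit one.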

Splitting a very long WAV

If your file is over 5 GB or several hours, split into hour-long segments:

ffmpeg -i input.wav -f segment -segment_time 3600 -c copy part_%03d.wav

Produces part_000.wav, part_001.wav, etc. — one file per hour of audio. Upload each separately and concatenate transcripts.
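
When you concatenate the transcripts, timestamps in every part restart at zero, so each part needs an offset before the pieces line up with the original recording. Here is a minimal sketch of that step. It assumes each part's transcript is available as a list of (start, end, text) cues, a hypothetical data shape chosen for illustration, and that every part except the last is exactly one segment long. If your parts came out slightly longer or shorter, use each part's measured duration as the offset instead.

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start_seconds, end_seconds, text), relative to its own part


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def merge_parts(parts: List[List[Cue]], segment_seconds: float = 3600.0) -> str:
    """Join per-part cues into one SRT document, shifting each part by its offset."""
    blocks = []
    index = 1
    for part_number, cues in enumerate(parts):
        offset = part_number * segment_seconds  # part_000 starts at 0, part_001 at 3600, ...
        for start, end, text in cues:
            blocks.append(
                f"{index}\n"
                f"{srt_timestamp(start + offset)} --> {srt_timestamp(end + offset)}\n"
                f"{text}\n"
            )
            index += 1
    return "\n".join(blocks)


if __name__ == "__main__":
    part_000 = [(0.0, 2.0, "Welcome back to the show.")]
    part_001 = [(1.0, 3.5, "And we're into the second hour.")]
    print(merge_parts([part_000, part_001]))
```

The second part's cue, which starts at 1 second in its own file, comes out at 01:00:01,000 in the merged transcript.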

Free path — self-hosted Whisper

If your content is sensitive (court audio, protected health information, unreleased music masters, confidential business), don’t upload it to a cloud service. Run Whisper locally instead.

Install

pip install openai-whisper

Or on Mac via Homebrew:

brew install openai-whisper

Run

whisper input.wav --model large-v3 --output_format srt

Output formats: txt, srt, vtt, tsv, json, all.

Speed reality

CPU (typical laptop): 10–30× slower than real-time on large-v3. A 1-hour file takes 10–30 hours. Painful.
Apple Silicon (M1/M2/M3/M4): 1–3× slower than real-time on large-v3 using the MLX Whisper port. A 1-hour file takes 1–3 hours.
NVIDIA GPU (RTX 3060+): Real-time to 2× faster. A 1-hour file takes 30–60 minutes.

When self-hosted Whisper fits

Sensitive content that can’t leave your machine
Airgap environments (courtrooms, sensitive research labs, intelligence work)
One-time transcription of a large batch you don’t want to pay per-minute for
Learning / experimentation

When it doesn’t fit

You need speaker labels out of the box (Whisper alone doesn’t diarize — you’d need to add WhisperX)
You need an in-browser editor for cleanup
You need it fast on a machine without a GPU
You don’t want to install Python + dependencies

WAV sub-format edge cases

The .wav file extension doesn’t guarantee plain PCM audio inside. The fmt chunk in a WAV file’s RIFF header specifies which codec the audio uses. Most WAVs in the wild are standard PCM, but you can encounter these:

`fmt` code	Codec	Notes
`0x0001`	PCM (integer)	Standard; universally supported
`0x0003`	IEEE float	Common in DAW exports; most services accept
`0x0006`	A-law	Legacy telephony; some services reject
`0x0007`	µ-law	Legacy telephony; some services reject
`0x0011`	IMA ADPCM	Rare; needs conversion
`0x0055`	MPEG Layer 3 (MP3-in-WAV)	Unusual; parse-error risk
`0xFFFE`	WAVEFORMATEXTENSIBLE	Multi-channel, HD audio; usually supported

If you hit “unsupported codec” or “could not decode” on a WAV file, this is probably why. Convert to standard 16-bit PCM:

ffmpeg -i input.wav -acodec pcm_s16le output.wav

The pcm_s16le codec name means signed 16-bit little-endian PCM, the standard format every parser recognizes. Converting from higher bit depths (24-bit, float) discards only detail below the 16-bit noise floor, roughly 96 dB under full scale, which nobody cares about for transcription.
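
If you'd rather inspect the file before converting, the format code can be read straight from the header. The sketch below walks the RIFF chunks until it reaches fmt and returns the 16-bit format tag. The function name is illustrative. It handles only plain RIFF/WAVE files, not RF64 or Wave64 variants, and it returns 0xFFFE for WAVEFORMATEXTENSIBLE files without decoding the sub-format.

```python
import io
import struct
import wave
from typing import BinaryIO

FORMAT_TAGS = {
    0x0001: "PCM (integer)",
    0x0003: "IEEE float",
    0x0006: "A-law",
    0x0007: "µ-law",
    0x0011: "IMA ADPCM",
    0x0055: "MPEG Layer 3 (MP3-in-WAV)",
    0xFFFE: "WAVEFORMATEXTENSIBLE",
}


def read_format_tag(stream: BinaryIO) -> int:
    """Return the format tag from the fmt chunk of a RIFF/WAVE stream."""
    header = stream.read(12)
    if len(header) < 12 or header[:4] != b"RIFF" or header[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    while True:
        chunk_header = stream.read(8)
        if len(chunk_header) < 8:
            raise ValueError("no fmt chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
        if chunk_id == b"fmt ":
            tag_bytes = stream.read(2)
            if len(tag_bytes) < 2:
                raise ValueError("truncated fmt chunk")
            return struct.unpack("<H", tag_bytes)[0]
        # Chunks are word-aligned: skip the payload plus one pad byte if the size is odd.
        stream.seek(chunk_size + (chunk_size & 1), 1)


if __name__ == "__main__":
    # Demo on an in-memory PCM WAV so the example runs without any input file.
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(16000)
        writer.writeframes(b"\x00\x00" * 160)
    buffer.seek(0)
    tag = read_format_tag(buffer)
    print(f"0x{tag:04X}: {FORMAT_TAGS.get(tag, 'unknown codec')}")
```

To check a real file, open it in binary mode and pass the handle: read_format_tag(open("input.wav", "rb")).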

Opus audio and WAV containers

Opus (RFC 6716, 2012) is a modern speech and audio codec that outperforms MP3 and AAC at low bitrates — particularly for voice. Opus files usually have .opus or .ogg extensions, but Opus can also live inside a WAV container in edge cases (some VoIP systems, some embedded devices).

How to tell if your WAV contains Opus

Check the codec with ffprobe (it ships with FFmpeg) or MediaInfo:

ffprobe input.wav

The output shows the codec. If it says opus instead of pcm_s16le, you have Opus-in-WAV. Some transcription services still handle this transparently; others reject it. Convert to standard PCM first if in doubt:

ffmpeg -i input.wav -acodec pcm_s16le -ar 16000 -ac 1 output.wav
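
For a folder of files, the same check can be scripted. The sketch below asks ffprobe for just the codec name of the first audio stream and flags anything that isn't linear PCM. The helper names are illustrative. The set of "safe" codecs is an assumption based on the PCM configurations listed in the FAQ below (8-bit, 16-bit, 24-bit, 32-bit float), not a published acceptance list, and ffprobe must be on your PATH.

```python
import shutil
import subprocess
import sys
from typing import List

# Assumption: FFmpeg codec names for the linear PCM variants treated as safe to upload.
LINEAR_PCM = {"pcm_u8", "pcm_s16le", "pcm_s24le", "pcm_f32le"}


def ffprobe_command(path: str) -> List[str]:
    """Build an ffprobe call that prints only the first audio stream's codec name."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a:0",
        "-show_entries", "stream=codec_name",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ]


def probe_codec(path: str) -> str:
    """Run ffprobe and return the codec name, e.g. 'pcm_s16le' or 'opus'."""
    result = subprocess.run(ffprobe_command(path), capture_output=True, text=True, check=True)
    return result.stdout.strip()


def needs_pcm_conversion(codec_name: str) -> bool:
    """True for anything that is not linear PCM, including pcm_mulaw and pcm_alaw."""
    return codec_name not in LINEAR_PCM


if __name__ == "__main__":
    if shutil.which("ffprobe") is None:
        print("ffprobe not found; install FFmpeg first")
    else:
        for wav_path in sys.argv[1:]:
            codec = probe_codec(wav_path)
            verdict = "convert to PCM first" if needs_pcm_conversion(codec) else "ok to upload"
            print(f"{wav_path}: {codec} ({verdict})")
```

Note that FFmpeg names µ-law and A-law audio pcm_mulaw and pcm_alaw, so checking for a "pcm_" prefix alone would miss them. That is why the sketch compares against an explicit set.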

Transcribing pure Opus files

If your source is .opus or .ogg (Opus-in-Ogg), DeluxeScribe accepts both directly, with no conversion needed. Speech-optimized Opus at 32 kbps is roughly transcription-equivalent to 128 kbps MP3 (Opus is that much more efficient for voice), so accuracy from .opus lands within a few WER points of WAV.

When another tool fits better

Being honest about the sub-cases where DeluxeScribe isn’t the right pick for your WAV file:

Court reporting / evidentiary work. Use human transcription (Rev human tier, GoTranscript) — AI is adequate for most content but insufficient when a judge may examine the transcript. Human review at $1.50–$3.00 per audio minute is the correct choice here.
Extremely sensitive content (unreleased music masters, PHI, classified). Use self-hosted Whisper to avoid upload entirely; see the self-hosted Whisper section above.
Very large batches (hundreds of files, thousands of hours). At that scale, OpenAI Whisper API directly or a self-hosted batch pipeline is cheaper per minute than any consumer transcription service.
Different source format. If your file is actually MP3, use MP3 to Text — same tool, format-specific notes. If it’s an M4A from iPhone Voice Memos, see M4A to Text or iPhone Voice Memo Transcription.
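Video source. If you have a video file with WAV-quality audio embedded, extract the audio first (ffmpeg -i input.mp4 -vn -acodec copy output.m4a) and upload the audio-only track. The copy option skips re-encoding and works when the embedded track is AAC, as in most MP4 files; if FFmpeg reports that the codec isn’t supported in the M4A container, drop -acodec copy so the audio is re-encoded. See Video to Text for the full workflow.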

How this page was verified

WAV format specification (RIFF container, fmt chunk values) verified against Microsoft’s WAV format documentation and the WAVE format reference. FFmpeg command syntax comes from the official FFmpeg manual. Accuracy comparisons come from our own testing of 8 sample recordings encoded to WAV, MP3 (192/128/64 kbps), and M4A (128 kbps AAC), transcribed through DeluxeScribe against reference human transcripts; see How Accurate Is Whisper for the broader WER-by-condition breakdown. We don’t cite the “99% accuracy” and “98% accurate in 150+ languages” claims common on this SERP because we can’t source them to a published benchmark on WAV audio specifically. Opus codec details come from RFC 6716.

Frequently Asked Questions

What is a WAV file?

WAV (Waveform Audio File Format) is Microsoft and IBM's uncompressed audio container from 1991. The audio inside is usually PCM — pulse-code modulation — which stores every audio sample at full precision with no compression. That's why WAV files are large (~600 MB per hour at CD quality) but produce the highest-fidelity source for AI transcription.

Why is my WAV file so big?

Because it's uncompressed. Math: bytes per second = sample_rate × channels × (bit_depth / 8). A 1-hour recording at 44.1 kHz stereo 16-bit works out to 44,100 × 2 × 2 = 176,400 bytes/sec × 3,600 sec = ~635 MB. Studio recordings at 96 kHz 24-bit go up to ~1.9 GB per hour. This is normal, not a bug.

Does WAV produce more accurate transcription than MP3?

Yes, on identical source audio, but the margin depends on the MP3 bitrate. WAV vs 192 kbps MP3: 1-2% WER difference (usually not worth caring about). WAV vs 128 kbps MP3: 2-4% WER difference. WAV vs 64 kbps voice MP3: 4-8% WER difference — genuinely meaningful. The reason: lossy compression strips high-frequency information above 4-8 kHz where consonant distinctions live (f/v, s/z, p/b). The advantage disappears if the source recording is already low-quality — a laptop mic in a noisy cafe has no accuracy ceiling to preserve.

What's the largest WAV file I can upload to DeluxeScribe?

5 GB, which covers roughly 8 hours at CD quality (44.1 kHz stereo 16-bit) or about 31 hours of 22 kHz mono 16-bit. Beyond that, split the file with FFmpeg (ffmpeg -i input.wav -f segment -segment_time 3600 -c copy part_%03d.wav) and upload the parts separately. For batch workflows we recommend compressing to M4A first: near-identical accuracy, 10-20x smaller uploads.

Should I compress my WAV before uploading?

Yes, if the file is over 500 MB or you're on a slow connection. Compressing to 128 kbps AAC (M4A container) reduces file size by 10-20x with negligible transcription accuracy loss. The command: ffmpeg -i input.wav -c:a aac -b:a 128k output.m4a. Both AAC 128 kbps and WAV are above the quality threshold that AI transcription models care about — the accuracy difference is nearly always <1% WER. Skip compression if the file is small (<200 MB) or if you need to preserve WAV for a downstream tool.

Can I transcribe a WAV file for free?

Yes, two paths. (1) Self-hosted Whisper: pip install openai-whisper, then whisper input.wav --model large-v3 --output_format srt. Free, private, runs on your machine. Expect 10-30x slower than real time on CPU, and roughly real time on Apple Silicon or an NVIDIA GPU. (2) DeluxeScribe free tier: 60 minutes free one-time, no credit card. Choose (1) if the content is sensitive; (2) if you want speaker labels and a browser editor.

How do I convert WAV to text on Windows?

Three options ordered by ease. (1) Any cloud transcription service accepts WAV — sign up, upload, done. DeluxeScribe gives 60 minutes free. (2) Windows built-in dictation (Win+H) transcribes live but doesn't process saved WAV files — not the right tool. (3) Self-hosted Whisper works on Windows via pip install openai-whisper (requires Python) or a compiled Whisper binary. For non-technical users, option 1 is the answer.

How do I convert WAV to text on Mac?

Same three options as Windows, plus one Mac-specific: macOS Sequoia (version 15) or later on Apple Silicon can transcribe played-back audio through Voice Memos (roundabout: you'd play the WAV into a new Voice Memos recording). More practically, use any cloud service that accepts WAV (60 minutes free on DeluxeScribe) or run whisper input.wav locally after brew install openai-whisper. Apple Silicon Macs run Whisper large-v3 at near real-time speeds.

Does DeluxeScribe accept WAV files?

Yes. WAV with any standard PCM configuration (8-bit, 16-bit, 24-bit, 32-bit float) is accepted directly. Non-PCM WAV variants (µ-law, A-law, ADPCM, or MP3-in-WAV) may need conversion first — see the sub-format section above for details. Upload path: drag the .wav into the dashboard, language auto-detects, transcript ready in 1-3 minutes for typical files.

Why did my WAV file fail to upload?

Three common causes. (1) File exceeds the 5 GB upload limit — split it with FFmpeg. (2) The WAV contains non-PCM audio (Opus-in-WAV, µ-law, ADPCM) that some parsers reject — convert to standard PCM with ffmpeg -i input.wav -acodec pcm_s16le output.wav. (3) Slow or interrupted connection timed out mid-upload — compress to M4A first (much smaller upload) or use a wired connection. If none of these apply, the file may be corrupted; try opening in Audacity to verify.