MP4 to Text: How to extract transcripts from video files
Three methods, a free YouTube workaround, and the audio-extraction trick that handles 4 GB Zoom exports.
- 60 minutes free
- No credit card
- 99 languages
- Speaker labels
Last verified June 23, 2026
The 3 best ways to transcribe an MP4
| Method | Cost | Speed | Quality | Privacy | Max size |
|---|---|---|---|---|---|
| AI transcription service (DeluxeScribe et al.) | Free trial → $10/mo | 5–10 min per hour | High | Cloud-encrypted | 5 GB |
| YouTube unlisted upload + caption download | Free | Slow (queue + processing) | Medium | Public-with-URL | 256 GB |
| Self-hosted Whisper + ffmpeg | Free | CPU: 10–30× slower than real time | High | Local | Your disk |
1. AI transcription service (fastest, paid)
Upload the MP4 directly, get a clickable transcript with speaker labels, export as TXT, DOCX, PDF, SRT, VTT, or JSON. Best for: webinar replays, course content, multi-language videos, anything you need fast or in batch. DeluxeScribe’s 60-minute free tier covers a typical webinar or recorded lecture without paying.
2. YouTube unlisted upload + caption download (free, slow)
Covered in more detail below. Short version: upload as unlisted, wait for auto-captions, download the SRT. Free, but slow and not actually private.
3. Self-hosted Whisper + ffmpeg (free, technical)
Whisper accepts MP4 directly (it calls ffmpeg internally, so ffmpeg has to be installed), which means you don’t need a separate extraction step:
pip install openai-whisper
whisper webinar.mp4 --model large-v3 --output_format srt
Same accuracy as commercial services, fully private, no recurring cost. The catch is the same as for MP3: hours on a CPU, minutes on a GPU. For sensitive content where uploading isn’t acceptable, this is the right answer.
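If you would rather call Whisper from Python than from the command line, a minimal sketch might look like this. It assumes the `openai-whisper` package and ffmpeg are installed. `transcribe_to_srt`, `segments_to_srt`, and `format_timestamp` are hypothetical helper names for illustration, not part of Whisper itself.

```python
from typing import Iterable, Mapping, Optional


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Mapping]) -> str:
    """Render Whisper-style segments (dicts with start, end, text) as SRT."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(path: str, model_name: str = "large-v3",
                      language: Optional[str] = None) -> str:
    """Transcribe a media file with Whisper and return SRT text."""
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    # language=None lets Whisper auto-detect; pass e.g. "de" to force German.
    result = model.transcribe(path, language=language)
    return segments_to_srt(result["segments"])
```

Calling `transcribe_to_srt("webinar.mp4")` should give roughly what the command above writes to disk, as a string you can post-process.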
Do you need to extract the audio first?
Short answer: it depends on the tool. Modern cloud services accept MP4 directly. Extraction is worthwhile with some older tools, when you run into a file size limit, or when your upload is slow.
Extraction with ffmpeg is a one-liner — it copies the audio stream out of the container without re-encoding, so there’s no quality loss:
ffmpeg -i video.mp4 -vn -acodec copy audio.m4a
-vn drops the video, -acodec copy keeps the audio bit-for-bit identical. A 4 GB MP4 typically extracts to 50–100 MB of M4A.
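The size drop follows from simple arithmetic: stream size is roughly bitrate times duration. The sketch below assumes a 128 kbps audio track and an 8,000 kbps video stream, both illustrative figures rather than measurements, and ignores container overhead. `estimated_size_mb` is a hypothetical helper.

```python
def estimated_size_mb(bitrate_kbps: float, duration_minutes: float) -> float:
    """Approximate stream size in megabytes (1 MB = 1,000,000 bytes)."""
    bits = bitrate_kbps * 1000 * duration_minutes * 60
    return bits / 8 / 1_000_000


# One hour of audio at an assumed 128 kbps.
audio_mb = estimated_size_mb(128, 60)
# One hour of video at an assumed 8,000 kbps.
video_mb = estimated_size_mb(8000, 60)
print(f"audio ~{audio_mb:.1f} MB, video ~{video_mb:.0f} MB")
# prints: audio ~57.6 MB, video ~3600 MB
```

Under these assumptions the audio track is under 2% of the file, which is why the extracted M4A is so much smaller than the MP4.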
When extraction helps:
- File size limits (you have a 6 GB MP4, service caps at 5 GB)
- Slow upload connection (uploading 50 MB beats uploading 4 GB)
- Privacy-sensitive content (extract locally, only upload audio)
- Batch processing pipelines where uploading the video stream is wasted bandwidth
When extraction doesn’t help: most normal use. If your MP4 is under the service’s upload limit and you have a decent connection, just upload the MP4.
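The decision above can be sketched as a small helper. The function names and the 15-minute patience threshold are assumptions made for illustration; adjust them to your own service limit and connection.

```python
def upload_minutes(size_mb: float, uplink_mbps: float) -> float:
    """Minutes needed to upload size_mb megabytes at uplink_mbps megabits/s."""
    return size_mb * 8 / uplink_mbps / 60


def should_extract_audio(size_mb: float, service_limit_mb: float,
                         uplink_mbps: float,
                         max_wait_minutes: float = 15.0) -> bool:
    """Suggest extraction if the file exceeds the cap or uploads too slowly."""
    if size_mb > service_limit_mb:
        return True
    return upload_minutes(size_mb, uplink_mbps) > max_wait_minutes


# A 4 GB file on a 10 Mbps uplink takes about 53 minutes to upload;
# the 50 MB audio track takes well under a minute.
print(round(upload_minutes(4000, 10)), round(upload_minutes(50, 10), 2))
```

Privacy is deliberately left out of the helper: if the video itself is sensitive, extract locally regardless of what the numbers say.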
The free YouTube workaround (and its tradeoffs)
Upload your MP4 to YouTube as “unlisted”, wait for auto-captions to finish processing (5–60 minutes, depending on length and queue), then download the SRT from Subtitles → three-dot menu → Download.
When this works well:
- 5–15 minute clear English videos
- Screen recordings with system audio (no background noise)
- Public-facing content you’d post on YouTube anyway
When it doesn’t:
- Sensitive content (see privacy caveat below)
- Non-English audio (caption quality drops significantly)
- Multi-speaker recordings (no speaker labels)
- You need a transcript in the next 10 minutes
The unspoken catch: “unlisted” on YouTube means the video doesn’t appear in search or channel listings, but anyone with the URL can watch it. URLs leak. Don’t use this for client recordings, internal company calls, or anything with personal data.
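The downloaded SRT still carries cue numbers and timing lines. If you only want readable text, a minimal sketch like the one below strips them. `srt_to_text` is a hypothetical helper, and it assumes a well-formed SRT file with blank lines between cues.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s+-->\s+\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt: str) -> str:
    """Strip cue numbers and timing lines from SRT, returning plain text."""
    lines = []
    for block in re.split(r"\n\s*\n", srt.replace("\r\n", "\n").strip()):
        block_lines = block.split("\n")
        # Drop everything up to and including the timing line.
        for i, line in enumerate(block_lines):
            if TIMING.match(line.strip()):
                block_lines = block_lines[i + 1:]
                break
        lines.extend(text.strip() for text in block_lines if text.strip())
    return " ".join(lines)
```

Joining cues with spaces gives one running paragraph; swap the separator for a newline if you prefer one caption per line.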
Accuracy by MP4 source
The video container is irrelevant to accuracy — it’s entirely about the audio inside it. Common MP4 sources, with realistic expectations:
| Source | Realistic accuracy | Notes |
|---|---|---|
| Screen recording, system audio only | 95–98% | No room noise; near-perfect signal. |
| Zoom export (downloaded recording) | 92–97% | Separate audio tracks help diarization. |
| Webcam recording, decent mic | 90–95% | Room reverb adds 2–5% error. |
| Smartphone video, indoor | 85–92% | Phone mic is omnidirectional; picks up everything. |
| Phone video, outdoor / handheld | 70–85% | Wind and crowd noise dominate. |
| Action cam / motorcycle / wind | 50–75% | Often faster to add subtitles by hand. |
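The table does not say how accuracy is measured. A common convention, assumed here, is one minus the word error rate (WER): the number of word substitutions, insertions, and deletions divided by the number of words in a reference transcript. If you want to check a tool against your own recordings, a rough sketch (the function name is illustrative):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Classic dynamic-programming edit distance, one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / len(ref)
```

Hand-correct a minute or two of transcript to use as the reference, then compare it with the raw output. A WER of 0.05 corresponds to roughly 95% accuracy under this definition.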
Handling large MP4 files
Most services cap individual uploads at 1–5 GB. A 60-minute 1080p MP4 is typically 1.5–3 GB; a 4K screen recording is 5–10 GB. Two options:
Extract audio first
Cuts file size by 95% with no transcription quality loss. One ffmpeg command (above). Best for almost all video-with-talking content.
Split the long video
For multi-hour content, split into chunks so each fits the size limit:
ffmpeg -i long.mp4 -c copy -map 0 -segment_time 1800 -f segment chunk_%03d.mp4
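Produces chunks of roughly 30 minutes each (with -c copy, ffmpeg can only cut on keyframes, so lengths vary slightly). Upload sequentially; the transcripts can be concatenated afterward, as long as each chunk’s timestamps are shifted by the combined length of the chunks before it.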
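Each chunk's subtitle file starts at 00:00:00, so stitching them together means shifting timestamps by the total length of the preceding chunks. A minimal sketch, assuming SRT transcripts and chunk durations you have measured yourself (for example with ffprobe); `merge_srt_chunks` and `shift_timestamp` are hypothetical helpers:

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_timestamp(match: re.Match, offset_ms: int) -> str:
    """Return the matched SRT timestamp moved forward by offset_ms."""
    h, m, s, ms = (int(g) for g in match.groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
    h, rest = divmod(total, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def merge_srt_chunks(chunks) -> str:
    """Merge (srt_text, chunk_duration_seconds) pairs into one SRT string."""
    merged, offset_ms, cue_number = [], 0, 1
    for srt_text, duration_seconds in chunks:
        normalized = srt_text.replace("\r\n", "\n").strip()
        for block in re.split(r"\n\s*\n", normalized):
            lines = block.split("\n")
            if len(lines) < 2 or "-->" not in lines[1]:
                continue  # skip empty or malformed cues
            timing = TIMESTAMP.sub(
                lambda mt: shift_timestamp(mt, offset_ms), lines[1])
            merged.append("\n".join([str(cue_number), timing, *lines[2:]]))
            cue_number += 1
        # The next chunk starts where this one ended.
        offset_ms += int(round(duration_seconds * 1000))
    return "\n\n".join(merged) + "\n"
```

Use the real duration of each chunk rather than a flat 1800 seconds, since stream-copy cuts land on keyframes and drift would otherwise accumulate.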
Common MP4 transcription failures
“No audio detected”
The MP4 has no audio track, the audio track is muted at the container level, or the codec is unsupported. Check with ffprobe video.mp4 — look for an audio stream in the output. If missing, the recording itself dropped audio.
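To automate that check across many files, you can ask ffprobe for JSON and look for streams whose type is audio. The sketch below assumes ffprobe is on your PATH; `probe` and `audio_streams` are hypothetical helper names.

```python
import json
import subprocess


def audio_streams(ffprobe_json: str) -> list:
    """Return the codec names of all audio streams in ffprobe JSON output."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return [s.get("codec_name", "unknown")
            for s in streams if s.get("codec_type") == "audio"]


def probe(path: str) -> str:
    """Run ffprobe and return its JSON description of the file's streams."""
    cmd = ["ffprobe", "-v", "error", "-show_entries",
           "stream=index,codec_type,codec_name", "-of", "json", path]
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout
```

An empty list from `audio_streams(probe("video.mp4"))` means the file has no audio track at all; a non-empty list tells you which codec the service would need to support.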
Wrong language detected
Auto-detection runs on the first ~30 seconds of audio. If the video opens with English narration before the actual non-English content, set the language manually.
Missing dialogue (background music too loud)
Heavy music underscoring drowns out speech for ASR the same way it does for humans. Two fixes: source the original audio before mixing, or use a vocal-isolation tool (e.g. Ultimate Vocal Remover) to strip music first.
When DeluxeScribe is the right tool
The right answer is us when you have webinar recordings, course videos, multi-language Zoom calls, or anything where you’d otherwise upload to YouTube just to get the captions. We accept MP4 directly up to 5 GB, produce speaker labels, and export every common format in one run.
The right answer isn’t us when it’s a single 30-second screen recording: extract audio and run Whisper locally if you do this often, or just type it. And for sensitive recordings, even an encrypted upload is one cloud too many, so run Whisper offline instead.
Related guides
- MP3 to Text: Convert audio recordings to text. Accuracy by recording type and bitrate truth.
- M4A to Text: iPhone Voice Memos and other M4A files. Covers iOS 18's built-in transcription.
- SRT Generator: Generate subtitle files with the timing rules Netflix and BBC use.
- Pricing: 60 minutes free, then $10/month for 600 minutes.