Video to Text: the four real paths (and the workflow that just works)
Upload, YouTube URL, FFmpeg extraction, or built-in tool: the right path depends on the video, not the vendor, including whether you need .srt and how private the content is. For most owned files under 5 hours, uploading to DeluxeScribe gives you 99-language transcription with speaker labels, word-level timestamps, and .srt / .vtt / .docx export in 1–3 minutes. 60 minutes free, no credit card. All four paths are below with the exact commands, honest accuracy expectations by video type, and the free options that actually work.
- 60 minutes free
- No credit card
- 99 languages
- Speaker labels
Last verified July 2, 2026
TL;DR — which path is yours
| Your situation | Best path | Cost |
|---|---|---|
| You own the video, need .srt or speaker labels | Upload to a transcription service | Free tier → ~$10/mo |
| Video is a YouTube link (yours or someone else’s) | YouTube built-in transcript, or paste-URL tool | Free |
| Video is 4K or over 500 MB | Extract audio with FFmpeg → upload the audio | Free (FFmpeg) + service tier |
| Video is a Zoom cloud recording | Zoom cloud transcript (built-in on paid plans) | Included with Zoom paid plan |
| Video is a Teams meeting recording | Teams live transcript / meeting recording transcript | Included with Microsoft 365 |
| Video is an iPhone screen recording with your own voice | Extract audio → iOS 18 Voice Memos transcription | Free, on-device |
| Sensitive content (medical, legal, financial) | Self-hosted Whisper — no upload | Free after setup |
| Video is 15–60 minutes and English, casual use | DeluxeScribe free tier (60 min one-time) | Free |
| Non-English video (Spanish, Japanese, Arabic, etc.) | Upload to a service with broad language support | Free tier → ~$10/mo |
The four paths, honestly
There isn’t one right way to turn video into text. Here’s each path stated plainly, and the situation each actually fits:
- Upload to a cloud service. Drop the video in a browser, wait a few minutes, get speaker-labeled text you can export as .srt, .docx, or .txt. Highest quality, widest language coverage, integrated editor. Downside: your audio touches a server. Fits: most owned files where you need speaker labels or subtitles.
- Paste a URL (YouTube, Vimeo, etc.). Skip the download. Either use YouTube’s own built-in transcript (free, English is decent), or a paste-URL tool that fetches the caption track or auto-transcribes. Downside: quality depends on YouTube’s auto-captions, which lag a good AI transcript. Fits: YouTube videos where “good enough” is good enough.
- Extract audio locally with FFmpeg, then transcribe. One command strips the video track and gives you an .m4a that’s 5–20× smaller. Faster upload, cheaper if you’re rate-limited, and lets you feed either a cloud service or a local Whisper install. Fits: large files, bandwidth-constrained situations, batch processing.
- Use a built-in tool. Zoom cloud transcript, Teams live transcript, YouTube Studio auto-captions (owned videos), iOS 18 dictation on a played-back audio track. All free, all limited in export flexibility. Fits: specific platforms where you’re already paying for the platform.
The rest of this page walks each path in detail. Skim the TL;DR above, jump to your section, come back to accuracy or gotchas when you need them.
Path 1 — Upload to a service
This is the “drag a file, get a transcript” path. Highest quality overall, widest language support, best export options. Every cloud transcription service works this way — they differ on pricing, language count, and editor ergonomics.
How the DeluxeScribe workflow goes
- Sign up (60 minutes free, no credit card).
- Drag your video file into the upload area, or click to browse. MP4, MOV, WebM, AVI, MKV, FLV, WMV, M4V all accepted, plus any audio format if you’ve already extracted.
- Language auto-detects. Leave it unless you want to force a specific dialect (say, Mexican Spanish vs Castilian). Speaker labels default on.
- Wait for processing. For typical video, this is roughly 1 minute of processing per 5 minutes of video length.
- Review the transcript in the browser editor. Fix any mis-heard proper nouns, technical terms, or numbers (this takes 2–5 minutes on a 30-minute video). Export as TXT, DOCX, PDF, SRT, VTT, or JSON with word-level timestamps.
Other services worth knowing about
The upload-based cloud category is crowded. Honest sub-picks by criterion:
- Widest language coverage: DeluxeScribe (99), HappyScribe (~120 as of their public docs — verify at their site).
- Cheapest per-minute: DeluxeScribe subscription tier or Rev’s AI-only tier.
- Best editor for long-form editing: Descript, which is an audio/video editor with transcription built in, not a transcription tool with an editor bolted on.
- Best if you already live in Otter’s ecosystem: Otter — meeting-focused with calendar hooks.
- Best for creators who publish to social: Sonix or HappyScribe — both have built-in caption burn-in and share features.
Path 2 — YouTube URL (or any URL)
If your video is on YouTube, you don’t always need to download it. Three sub-paths depending on how much accuracy you need and whether you own the video.
Sub-path A — YouTube’s built-in transcript panel
- Open the video on youtube.com (works on desktop, mobile web, and mobile app — location of the button varies).
- Click the three-dot menu (or “More”) under the video → choose Show transcript.
- A panel opens with timestamped auto-generated captions. Toggle timestamps off with the three-dot menu inside the panel if you want clean text.
- Select all → copy → paste into a document.
This uses YouTube’s auto-caption model. Quality: good for clear English speech, mediocre for accented English, weaker for non-English languages. Free for any public video.
Sub-path B — Paste-URL browser tools
Tools like NoteGPT, YouTubeTranscript.io, and Kome.ai accept a YouTube URL and return the transcript formatted. Most of them fetch the auto-caption track (same source as YouTube) and format it — they’re not re-transcribing the audio. Convenient, but the quality ceiling is YouTube’s auto-captions.
Sub-path C — Download the video, upload to a real service
If you need a higher-quality transcript than YouTube’s auto-captions, download the video (or its audio) and upload to a cloud transcription service that re-runs speech recognition. Legally: a transcript made for personal use of a YouTube video you don’t own is often treated as fair use or a comparable exception, but the rules vary by jurisdiction; republishing the transcript generally is not covered. Tools like yt-dlp handle the download; check YouTube’s Terms of Service for your use case.
The fastest download path if you only want text: yt-dlp -x --audio-format m4a <URL> — extracts audio directly, skipping the video download.
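If you want to script that download, a minimal Python sketch might look like the following. It assumes yt-dlp and FFmpeg are installed and on your PATH; the function names and the output filename template are illustrative choices, not part of yt-dlp itself.

```python
import subprocess


def build_audio_download_cmd(url: str, out_dir: str = ".") -> list[str]:
    """Build a yt-dlp command that saves only the audio track as .m4a."""
    return [
        "yt-dlp",
        "-x",                                   # extract audio only
        "--audio-format", "m4a",                # convert to m4a if needed
        "-o", f"{out_dir}/%(title)s.%(ext)s",   # yt-dlp output template (assumed naming)
        url,
    ]


def download_audio(url: str, out_dir: str = ".") -> None:
    """Run the download; raises CalledProcessError if yt-dlp exits with an error."""
    subprocess.run(build_audio_download_cmd(url, out_dir), check=True)
```

Calling download_audio with a video URL and a target folder leaves an .m4a in that folder, ready to upload. Building the command as a list rather than a single string avoids shell-quoting problems with URLs that contain & or ?.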
Path 3 — Extract audio with FFmpeg first
If you have the video file locally, extracting audio before upload cuts file size 5–20× and speeds up processing. This matters for 4K video, long files, or bandwidth-limited connections.
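To see where a saving of that order comes from, here is a rough back-of-the-envelope sketch. The bitrates are assumptions picked for illustration (about 2,000 kbps for the whole video file, 128 kbps for its audio track); real files vary widely, and 4K footage usually sits far above this.

```python
def stream_size_mb(bitrate_kbps: float, duration_min: float) -> float:
    """Approximate size in megabytes of a stream at a constant bitrate."""
    kilobits = bitrate_kbps * duration_min * 60
    return kilobits / 8 / 1000  # kilobits -> kilobytes -> megabytes


# Assumed example: a 60-minute video at ~2,000 kbps total with 128 kbps audio.
video_mb = stream_size_mb(2000, 60)
audio_mb = stream_size_mb(128, 60)
print(f"video ~{video_mb:.0f} MB, audio ~{audio_mb:.1f} MB, "
      f"ratio ~{video_mb / audio_mb:.1f}x")
```

Under these assumed numbers the audio track is about 58 MB against roughly 900 MB for the full file.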
The one command
Install FFmpeg (Homebrew: brew install ffmpeg, or the FFmpeg downloads page for Windows). Then, in a terminal:
ffmpeg -i input.mp4 -vn -acodec copy output.m4a
What each flag does:
- -i input.mp4: your source video file.
- -vn: skip the video stream (no video output).
- -acodec copy: copy the audio track as-is, without re-encoding. Zero quality loss.
- output.m4a: output filename. The container matches the audio codec (usually AAC in MP4).
When to re-encode instead of copy
If the video’s audio codec is unusual (some MKV files have Vorbis or Opus), and your transcription service doesn’t accept it, re-encode to a safe format:
ffmpeg -i input.mkv -vn -acodec libmp3lame -q:a 4 output.mp3
Here, -q:a 4 is a good balance of quality and file size for speech.
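The copy-or-re-encode decision can also be scripted by asking ffprobe (which ships with FFmpeg) for the audio codec first. The sketch below is one possible approach, not a prescribed workflow: the function names are hypothetical, and treating only AAC as safe to copy into .m4a is a deliberately conservative assumption.

```python
import subprocess
from pathlib import Path


def probe_audio_codec(video: str) -> str:
    """Return the codec name of the first audio stream, e.g. 'aac' or 'opus'."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "a:0",
            "-show_entries", "stream=codec_name",
            "-of", "default=noprint_wrappers=1:nokey=1",
            video,
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def build_extract_cmd(video: str, codec: str) -> list[str]:
    """Copy AAC audio into .m4a as-is; re-encode anything else to MP3."""
    stem = str(Path(video).with_suffix(""))
    if codec == "aac":  # assumption: only AAC is copied without re-encoding
        return ["ffmpeg", "-i", video, "-vn", "-acodec", "copy", f"{stem}.m4a"]
    return ["ffmpeg", "-i", video, "-vn",
            "-acodec", "libmp3lame", "-q:a", "4", f"{stem}.mp3"]
```

A typical use would pass the result of probe_audio_codec(path) into build_extract_cmd(path, codec) and hand the returned list to subprocess.run with check=True.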
When this path is worth the effort
- Video file is > 500 MB or > 1 hour.
- You’re on a slow or capped connection.
- You’re batching 10+ videos and want to script the upload.
- You’re feeding a local Whisper install (which only needs the audio track anyway).
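For the local Whisper case in the list above, a minimal sketch using the open-source openai-whisper Python package could look like this. It assumes the package and FFmpeg are installed (pip install openai-whisper); the function names are illustrative, and large-v3 is just one model choice, with smaller models trading accuracy for speed.

```python
from typing import Optional


def transcribe_locally(audio_path: str, language: Optional[str] = None) -> list[dict]:
    """Transcribe on this machine with openai-whisper; nothing is uploaded."""
    import whisper  # third-party: pip install openai-whisper

    model = whisper.load_model("large-v3")  # smaller models run faster
    # language=None lets Whisper auto-detect; pass e.g. "es" to force Spanish.
    result = model.transcribe(audio_path, language=language)
    return result["segments"]  # each segment has "start", "end", and "text"


def segments_to_text(segments: list[dict]) -> str:
    """Join segment texts into one plain-text transcript."""
    return " ".join(seg["text"].strip() for seg in segments)
```

Passing an explicit language code is the scripted equivalent of setting a language hint, which matters most on short or non-English clips.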
Path 4 — Built-in tools (Zoom, Teams, iOS, macOS)
If your video came from a specific platform, that platform’s built-in transcription is often the fastest free path.
Zoom cloud recording transcript
Zoom’s Cloud Recording transcript is included with any paid Zoom plan. Turn it on: Settings → Recording → Advanced → Create audio transcript. After a recorded meeting, the transcript appears alongside the recording in your Zoom portal. Export as VTT. Full workflow in our Zoom transcription guide.
Microsoft Teams meeting recording transcript
Teams live-transcribes meetings when the organizer enables transcription, and the transcript persists with the recording in OneDrive/SharePoint. Requires Microsoft 365. The transcript is exportable as a .docx from the meeting recording playback page.
YouTube Studio auto-captions (owned videos)
For videos you own, YouTube Studio generates auto-captions within a few hours of upload. Studio → Content → click the video → Subtitles → download as .sbv or .vtt. Free, tied to your YouTube account.
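If you download .sbv but your editor wants .srt, the conversion is mostly a matter of renumbering and reformatting timestamps. The sketch below assumes the usual .sbv layout (a start,end line such as 0:00:00.599,0:00:04.160, followed by the caption text and a blank line); the function name is hypothetical and edge cases in real files may need more care.

```python
import re

# One timing line: H:MM:SS.mmm,H:MM:SS.mmm
_SBV_TIME = re.compile(
    r"^(\d+):(\d{2}):(\d{2})\.(\d{3}),(\d+):(\d{2}):(\d{2})\.(\d{3})$"
)


def sbv_to_srt(sbv_text: str) -> str:
    """Convert .sbv caption text to .srt text."""
    cues = []
    for block in re.split(r"\n\s*\n", sbv_text.strip()):
        lines = block.splitlines()
        if not lines:
            continue
        match = _SBV_TIME.match(lines[0].strip())
        if not match:
            continue  # skip anything that is not a timed caption block
        h1, m1, s1, ms1, h2, m2, s2, ms2 = match.groups()
        timing = (f"{int(h1):02d}:{m1}:{s1},{ms1} --> "
                  f"{int(h2):02d}:{m2}:{s2},{ms2}")
        cues.append((timing, "\n".join(lines[1:])))
    # .srt numbers each cue from 1 and uses a comma before the milliseconds.
    return "\n\n".join(
        f"{i}\n{timing}\n{text}" for i, (timing, text) in enumerate(cues, 1)
    ) + "\n"
```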
iOS 18 Voice Memos (for played-back audio)
You can’t transcribe a video file directly on iPhone, but you can play the video into a Voice Memos recording on the same device and get iOS 18 transcription of the played audio. Rough workaround; see our iPhone Voice Memo transcription guide for the iOS 18 requirements.
macOS Sequoia dictation
Sequoia (macOS 15) has on-device dictation that can transcribe played audio in real time, and Voice Memos on a Mac with Apple Silicon transcribes recordings the same way iOS does. Neither takes a video file as input, though; you’d play the video into an audio recording first.
Accuracy by video type
AI video transcription accuracy depends more on the audio conditions than on the vendor. In our own testing, we ran 15 sample videos through DeluxeScribe (which runs Whisper large-v3 with commercial post-processing) and scored the output against reference human transcripts:
| Video type | Word accuracy | Speaker attribution |
|---|---|---|
| Talking-head marketing video (clean mic, quiet room) | 96–98% | N/A (single speaker) |
| Screen recording with voiceover | 94–97% | N/A (single speaker) |
| Zoom or Teams meeting (2–4 speakers, laptop mics) | 88–94% | 75–85% |
| Interview (two speakers, quiet room, decent mics) | 90–95% | 85–92% |
| Lecture with slides (single speaker, moderate room) | 88–95% | N/A |
| Field video (wind, street noise, one speaker) | 70–85% | N/A |
| Music video / video with heavy soundtrack | 40–70% | Unreliable |
| Non-English video (no language hint) | 70–92% (highly variable) | 70–85% |
| Non-English video (with language hint) | 82–96% | 78–88% |
Two takeaways: (1) audio condition matters more than vendor choice for typical videos, (2) the biggest lift for non-English videos is telling the model the language in advance. For deeper WER breakdowns by language and audio condition, see How Accurate Is Whisper.
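If you want to score your own videos the same way, one common definition of word accuracy is 1 minus the word error rate (WER): the word-level edit distance between the AI transcript and a human reference, divided by the number of reference words. Whether the table above was computed with exactly this formula is an assumption; the sketch only lowercases the text and does no punctuation normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy as 1 - WER (can go negative on very bad transcripts)."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

For example, one wrong word in a five-word reference gives a WER of 0.2, which is 80% word accuracy.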
Formats and languages
Video formats accepted
Most cloud services (DeluxeScribe included) accept the common video container formats:
- MP4 (H.264 or H.265 video + AAC audio) — the default nowadays; see our MP4-specific guide for container details.
- MOV (Apple / QuickTime) — same underlying format as MP4 in most cases.
- WebM (VP8/VP9/AV1 + Vorbis/Opus) — the format YouTube emits.
- AVI, MKV, FLV, WMV, M4V — accepted by most services but may need audio re-encoding first if the audio codec is unusual.
- Audio-only: MP3, M4A, WAV, OGG, FLAC — upload directly if you’ve extracted.
Language support
DeluxeScribe supports 99 languages with automatic detection. For short clips (under 30 seconds), auto-detection sometimes picks the wrong language — set a hint explicitly. For heavily code-switched content (Spanish + English in the same video), pick the dominant language and expect some errors on the switched segments.
Getting subtitles (.srt / .vtt)
The #1 downstream use of video transcription is subtitle files. If you’re headed to a video editor (Premiere, Final Cut, DaVinci Resolve, CapCut) or a captioning workflow (YouTube, Vimeo, LinkedIn), you need .srt or .vtt.
Cloud services generate these on export — check the export dropdown. YouTube’s auto-captions can be downloaded as .sbv (Studio) or converted from the transcript panel via third-party tools.
Subtitle files have timing rules (max characters per line, max lines on screen, minimum display time) that a raw transcript doesn’t. Our SRT generator guide walks through the format spec, common validator errors, and how to hand a video editor a caption file that actually loads cleanly.
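As a concrete picture of the format, here is a minimal sketch that renders timed segments as .srt text. The input shape (a list of dicts with start and end in seconds plus text) is an assumption chosen for illustration, and the sketch enforces none of the timing rules mentioned above.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp that .srt uses."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render [{'start': s, 'end': s, 'text': str}, ...] as .srt text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}"
        )
    # Cues are separated by one blank line; the file ends with a newline.
    return "\n\n".join(blocks) + "\n"
```

Each cue is a sequence number, a start --> end line with a comma before the milliseconds, and the caption text.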
Common gotchas
- Uploading 4K video wastes bandwidth and time. Extract audio first (Path 3). The transcript is identical; the audio file is 5–20× smaller, so it uploads far faster.
- Language auto-detection fails on very short clips. Anything under 30 seconds is unreliable — set the language explicitly.
- Music-heavy videos produce garbage. This is a known model failure mode, not a vendor bug. Models routinely transcribe song lyrics as speech and vice versa. If your video has a music track under the voice, expect 40–70% accuracy.
- YouTube auto-caption ≠ YouTube manual caption. If a creator uploaded their own caption file, that’s the human version and will be near-perfect. If not, the auto-caption is what YouTube generated, and quality varies.
- Word-level timestamps require JSON export. TXT export is plain text with no timing. SRT/VTT export has caption-level timing (each block spans several seconds). Only the JSON export has per-word timestamps.
- iPhone screen recordings sometimes have no audio. iOS screen recording defaults to system-audio-off. Toggle it on in the Control Center long-press before recording. A silent MP4 will transcribe as an empty file.
- Zoom local recordings (not cloud) don’t auto-transcribe. Only cloud-recorded Zoom sessions get the built-in transcript. For local recordings, you’re back to Path 1 or Path 3.
- “Free forever” tools that rate-limit. Several search results advertise “free video transcription” and cap at 10–15 minutes per upload. Read the fine print before starting a long transcription.
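On the word-level timestamps point above: if you have per-word timing, you can build your own caption blocks instead of relying on the exported ones. The sketch below assumes a hypothetical word list where each entry has word, start, and end keys (check your actual JSON export, whose field names may differ), and uses 42 characters per cue as an assumed limit.

```python
def group_words_into_cues(words: list[dict], max_chars: int = 42) -> list[dict]:
    """Pack consecutive words into cues of at most max_chars characters."""
    cues: list[dict] = []
    current: list[dict] = []
    for word in words:
        candidate = " ".join(w["word"] for w in current + [word])
        if current and len(candidate) > max_chars:
            cues.append(_make_cue(current))  # close the cue before it overflows
            current = []
        current.append(word)
    if current:
        cues.append(_make_cue(current))
    return cues


def _make_cue(words: list[dict]) -> dict:
    """A cue spans from its first word's start to its last word's end."""
    return {
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "text": " ".join(w["word"] for w in words),
    }
```

A single word longer than the limit still becomes its own cue rather than being dropped.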
Related guides
- MP4 to Text: Format-specific companion. MP4 codec details, container quirks, and when to strip the video stream.
- SRT Generator: The subtitle output format. Timing rules, .srt vs .vtt, and how to hand a video editor a caption file that actually validates.
- How Accurate Is Whisper: WER benchmarks by audio condition and language for the model behind most cloud video transcription, including the failure modes.
- How to Transcribe Audio (pillar): The broader pillar, covering every path across sources and formats and how to pick.