Whisper vs Parakeet: Speed, Accuracy, and Language Support
Parakeet TDT 0.6B v2: 1.69% WER, 3,000x real-time speed. Whisper Large V3: 99 languages, 2.7% WER. Full benchmark comparison with practical guidance.
NVIDIA Parakeet TDT 0.6B v2 achieves 1.69% word error rate on LibriSpeech and processes audio at over 3,000 times real-time speed. OpenAI Whisper Large V3 achieves 2.7% WER on the same benchmark and covers 99 languages. These are the two speech recognition engines built into Hearsy, and the choice between them affects latency, accuracy, and language coverage in concrete ways.
This post covers the architecture, benchmark numbers, language support, and memory requirements for each model — and when to use which.
What each model is#
OpenAI Whisper is an encoder-decoder transformer released in September 2022. It was trained on 680,000 hours of multilingual audio and released under the MIT license. The model family runs from Tiny (39M parameters) to Large V3 (1.55 billion parameters). Whisper processes audio in 30-second windows, which creates the processing pause between when you stop speaking and when text appears.
NVIDIA Parakeet TDT uses a different architecture: FastConformer XL encoder with a Token-and-Duration Transducer (TDT) decoder. The TDT decoder predicts both the token identity and its duration simultaneously, which enables streaming output — text appears while you are still speaking rather than after a complete audio chunk is processed. Parakeet TDT 0.6B v2 has 600 million parameters and was released in May 2025 under the Apache 2.0 license.
The architectural difference is significant for real-time dictation. Whisper was designed to transcribe finished audio. Parakeet was designed for streaming — it outputs text incrementally during continuous speech.
Accuracy benchmarks#
| Benchmark | Parakeet TDT 0.6B v2 | Whisper Large V3 |
|---|---|---|
| LibriSpeech test-clean (WER) | 1.69% | 2.7% |
| LibriSpeech test-other (WER) | 3.19% | 5.2% |
| HF Open ASR Leaderboard avg WER | 6.05% | 7.4% |
| Parameters | 600M | 1.55B |
| License | Apache 2.0 | MIT |
Sources: NVIDIA technical blog (May 2025); OpenAI Whisper paper (2022); Hugging Face Open ASR Leaderboard.
On English-language benchmarks, Parakeet TDT 0.6B v2 outperforms Whisper Large V3 with less than half the parameters. The LibriSpeech test-clean set measures accuracy on clear studio-quality speech — the conditions under which most dictation happens. Both models score well below the human error threshold on this benchmark.
The test-other set measures harder conditions: spontaneous speech, accented speakers, background noise. Parakeet's 3.19% versus Whisper's 5.2% holds the accuracy advantage under these conditions, though the absolute gap narrows.
Where Whisper has the edge: any language outside Parakeet's supported list. Parakeet v2 is English-only. Parakeet v3 (released August 2025) adds 24 European languages. Whisper Large V3 covers 99 languages including Japanese, Mandarin, Arabic, and Hindi. If your dictation isn't primarily in a European language, Whisper is the only option.
Speed comparison#
| Metric | Parakeet TDT 0.6B v2 | Whisper Large V3 Turbo | Whisper Large V3 |
|---|---|---|---|
| Real-time factor | ~3,380x | ~216x | Significantly slower |
| 1-hour audio (server hardware) | ~1 second | ~17 seconds | ~100 seconds |
| Streaming output during speech | Yes | No | No |
| Dictation latency in Hearsy | ~0.2 seconds | — | 1–2 seconds |
Parakeet transcribes an hour of audio in roughly one second on benchmark hardware (NVIDIA, May 2025). On Apple Silicon Macs, this translates to response latency around 0.2 seconds in Hearsy — the gap between when you stop speaking and when text appears in your document.
Whisper Large V3 processes audio in 30-second chunks using a 32-layer decoder. OpenAI's Whisper Large V3 Turbo reduced that decoder to 4 layers, achieving around 216x real-time speed — roughly 17 seconds per hour of audio. Standard Whisper Large V3 is slower than Turbo. In Hearsy's dictation context, Whisper takes 1–2 seconds to respond after each dictation segment.
For batch transcription of audio files, the speed difference is relevant but not critical — both produce results in seconds or minutes, not hours. For real-time typing by voice, the difference between 0.2 seconds and 1–2 seconds is noticeable in daily use.
Continue reading
AI Transcription That Stays on Your Mac
Run Whisper and Parakeet locally with a native Mac app. No Python setup, no command line.
Language support#
| Model | Languages supported |
|---|---|
| Whisper Large V3 | 99 languages |
| Parakeet TDT 0.6B v3 | 25 European languages |
| Parakeet TDT 0.6B v2 | English only |
Parakeet v3, released in August 2025, covers Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Russian, and Ukrainian.
Whisper was trained on 680,000 hours of audio across 99 languages. Accuracy is highest for English and high-resource European languages; lower for languages with limited training data. But Whisper at least attempts transcription in all 99 — Parakeet does not.
If you dictate in Japanese, Arabic, Hindi, Korean, or any language outside Parakeet's 25, Whisper is your only option regardless of speed preference.
How they process audio#
The core architectural difference explains why these models behave so differently in practice.
Whisper uses an encoder-decoder design with cross-attention. The encoder converts the full audio chunk into a representation; the decoder generates text tokens by attending to that representation. This requires the complete audio chunk before generating any output — you cannot see partial results mid-sentence.
Parakeet TDT uses a FastConformer encoder that processes audio frames incrementally. The TDT decoder predicts token-duration pairs at each frame rather than waiting for the full audio chunk. This enables the model to emit tokens mid-utterance, which is why Hearsy's Parakeet mode feels more like a human transcriptionist following along in real time.
Neither design is objectively superior. Encoder-decoder models with full attention often handle ambiguous audio better because the decoder can look at more context before committing to a transcription. Streaming transducers sacrifice some of that global context for lower latency. For standard dictation conditions (clear speech, quiet room), the practical accuracy difference on benchmarks favors Parakeet. Under difficult conditions, Whisper's broader context can help.
Memory requirements on Apple Silicon#
| Model | RAM footprint | Runs on 8GB Mac |
|---|---|---|
| Parakeet TDT 0.6B v2 | ~1.2 GB | Yes |
| Whisper Base | ~0.5 GB | Yes |
| Whisper Small | ~0.8 GB | Yes |
| Whisper Medium | ~1.5 GB | Yes |
| Whisper Large V3 | ~3.1 GB | Yes, limited headroom |
Parakeet TDT 0.6B v2 fits in about 1.2 GB of unified memory, which leaves plenty of headroom on 8 GB MacBook Air and MacBook Pro models. Whisper Large V3 needs roughly 3.1 GB — workable on 8 GB machines, but you will notice reduced headroom when running other memory-intensive apps alongside Hearsy.
Hearsy loads the model once at startup and keeps it resident in memory. Switching engines in Settings requires a model reload — it happens once and takes a few seconds, then subsequent dictations start immediately.
Which to use#
Choose Parakeet if:
- You dictate primarily in English or a European language
- Low latency matters — 0.2s response beats 1–2s in daily use
- You're on an 8 GB Mac and want better accuracy with lighter RAM usage
- You process mostly standard speech in a quiet environment
Choose Whisper if:
- You need a language not in Parakeet's 25 — Japanese, Mandarin, Arabic, Hindi, and 74 others
- You frequently dictate medical, legal, or highly technical terminology where Whisper's broader multilingual training corpus may handle domain-specific terms better
- You're transcribing audio files rather than real-time dictation, where 1–2 seconds of delay doesn't affect your workflow
- You're processing audio with heavy accents or background noise, where Whisper's larger context window can help
Hearsy ships with Parakeet as the default because it's faster and more accurate for the most common scenario: English real-time dictation. You can switch engines in Settings → Speech Engine at any time, and switch back without any data loss.
Summary#
Parakeet TDT 0.6B v2 outperforms Whisper Large V3 on accuracy (1.69% vs 2.7% WER on LibriSpeech), uses less than half the parameters (600M vs 1.55B), and processes audio at over 3,000 times real-time speed. The trade-off is language coverage: 25 European languages versus Whisper's 99.
Whisper Large V3 remains the right choice whenever you need a language outside Parakeet's list, or when you're transcribing audio where Whisper's broader multilingual training helps — non-European languages, heavy accents, or highly specialized vocabulary.
For most Mac users dictating in English or European languages, Parakeet is the better engine. For multilingual work or specialized transcription, Whisper is the right choice.
For how both engines compare to cloud transcription services like GPT-4o-transcribe, see AI transcription: local vs cloud. For an overview of the best dictation apps that run these models locally, see the best dictation software for Mac guide.
Frequently asked questions#
What is the difference between Whisper and Parakeet?#
OpenAI Whisper is an encoder-decoder transformer trained on 680,000 hours of multilingual audio across 99 languages. NVIDIA Parakeet TDT uses a FastConformer encoder with a streaming transducer decoder that outputs tokens mid-speech rather than waiting for a complete audio chunk. Practical differences: Parakeet is more accurate on English (1.69% vs 2.7% WER on LibriSpeech), roughly 6x faster for batch processing, and designed for real-time streaming. Whisper covers 99 languages; Parakeet v3 covers 25 European languages.
Is Parakeet more accurate than Whisper?#
On English benchmarks, yes. Parakeet TDT 0.6B v2 achieves 1.69% word error rate on LibriSpeech test-clean; Whisper Large V3 achieves 2.7% WER on the same benchmark. Parakeet also scores 6.05% average WER on the Hugging Face Open ASR Leaderboard versus 7.4% for Whisper Large V3 (NVIDIA, May 2025). For languages outside Parakeet's 25 supported European languages, Whisper is the only viable option.
How fast is Parakeet compared to Whisper?#
Parakeet TDT 0.6B v2 processes audio at roughly 3,380 times real-time speed — transcribing an hour of audio in about one second on benchmark hardware. Whisper Large V3 Turbo reaches about 216x real-time (around 17 seconds per hour). In Hearsy's real-time dictation context, this translates to approximately 0.2 seconds response latency for Parakeet versus 1–2 seconds for Whisper Large V3.
Does Parakeet require an NVIDIA GPU to run on Mac?#
No. NVIDIA Parakeet uses Core ML for inference on Apple Silicon. The "NVIDIA" in the name refers to the organization that developed the model, not a hardware requirement. Hearsy ships Parakeet through FluidAudio's optimized runtime for M1, M2, M3, and M4 Macs. No NVIDIA hardware needed.
Which engine does Hearsy use by default?#
Hearsy defaults to Parakeet TDT for its combination of speed and accuracy on English and European languages. Switch to Whisper Large V3 in Settings → Speech Engine for 99-language support. Both engines run entirely locally — no audio is sent to any server regardless of which engine you choose.
Ready to Try Voice Dictation?
Hearsy is free to download. No signup, no credit card. Just install and start dictating.
Download Hearsy for MacmacOS 14+ · Apple Silicon · Free tier available