How to Convert Audio to Text on Mac: 5 Methods Compared
How to convert audio to text on Mac: macOS Sequoia, Whisper CLI, MacWhisper, and Aiko compared. Covers file transcription and real-time dictation options.
There are two different things people mean when they search "convert audio to text on Mac." The first: you have an existing audio file — a voice memo, a recorded interview, a Zoom export, a podcast episode — and you want a text transcript. The second: you want to speak and have text appear live, in whatever app you're working in.
The tools for each are different. This post covers both, starting with file transcription — the more common case — and finishing with real-time dictation.
For file transcription, there are five practical options on Mac in 2026. Three are free.
Five ways to convert audio to text on Mac
Here's the short version before the detail:
| Method | Cost | Setup | Privacy | Best for |
|---|---|---|---|---|
| macOS Sequoia (Voice Memos / Notes) | Free | None | On-device | Quick memos, English, Sequoia only |
| Whisper CLI | Free | 15 min (terminal) | On-device | All formats, all languages, batch |
| MacWhisper | Free / $69 Pro | 5 min | On-device | GUI, drag-and-drop, everyday use |
| Aiko | Free | 2 min | On-device | Occasional use, max accuracy free |
| Cloud tools (Otter, Whisper API) | Free tier / per-min | None | Cloud | Quick one-off, no installation |
Method 1: macOS Sequoia built-in transcription (free)
Best for: Transcribing recordings you already have in Voice Memos or Notes, on macOS 15+
macOS Sequoia added on-device audio transcription to two built-in apps.
Voice Memos transcribes any recording automatically when you finish. Open the recording, click the three-dot menu, and select "View Transcription." The text appears alongside the waveform. You can search all transcriptions, copy text, and share the transcript as a file.
Notes gained the ability to record audio directly in a note and generate a transcript in real time. After recording, a searchable text block appears below the audio. With Apple Intelligence enabled, Notes can also summarize the recording.
Both process audio using Apple's Neural Engine — nothing leaves your Mac. No internet connection, no data sent to Apple's servers.
Importing existing files: Voice Memos accepts M4A file imports. For MP3, WAV, or other formats, convert to M4A first with FFmpeg (ffmpeg -i audio.mp3 audio.m4a) and then import into Voice Memos via File → Import. Notes requires recording live and doesn't support arbitrary file imports.
Language support: English only, in qualifying countries (US, UK, Australia, Canada, Ireland, New Zealand). Not available in other languages.
Accuracy: Noticeably lower than Whisper-based apps. Adequate for searching personal notes; not reliable enough for polished transcripts or verbatim records of technical content.
This is the right option if you're on macOS Sequoia, your audio is in English, and your requirements are modest. Zero setup — it's already installed.
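If you have a whole folder of MP3 or WAV files to bring into Voice Memos, the FFmpeg conversion described under "Importing existing files" can be scripted. The following is a minimal sketch, assuming ffmpeg is on your PATH; the function names and the list of extensions are illustrative choices, not part of any Apple or FFmpeg tooling.

```python
from pathlib import Path
import shutil
import subprocess

# Extensions worth converting; M4A files can be imported as they are.
CONVERTIBLE = {".mp3", ".wav", ".flac", ".ogg", ".opus"}


def m4a_command(src: Path) -> list[str]:
    """Build the ffmpeg call that writes an M4A copy next to the original."""
    return ["ffmpeg", "-i", str(src), str(src.with_suffix(".m4a"))]


def convert_folder(folder: Path) -> list[Path]:
    """Convert every supported file in `folder` and return the new M4A paths."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found; install it with: brew install ffmpeg")
    converted = []
    for src in sorted(folder.iterdir()):
        target = src.with_suffix(".m4a")
        # Skip unsupported files and anything that was already converted.
        if src.suffix.lower() in CONVERTIBLE and not target.exists():
            subprocess.run(m4a_command(src), check=True)
            converted.append(target)
    return converted
```

Calling convert_folder(Path("recordings")) leaves the originals untouched and produces M4A copies you can then import into Voice Memos.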
Method 2: Whisper CLI (free, most flexible)
Best for: Developers and terminal users who want full control — any format, any language, batch processing
OpenAI released Whisper as an open-source model in September 2022 under the MIT license. The Python package runs on any Mac and accepts audio files in any format FFmpeg supports: MP3, WAV, M4A, MP4, FLAC, OGG, OPUS, and more.
Installation
pip install openai-whisper
brew install ffmpeg
Model files download on first use. Turbo (1.5 GB) and Large V3 (3 GB) are the most useful for transcription work.
Basic transcription
whisper audio.mp3 --model turbo
By default this writes .txt, .srt, .vtt, .tsv, and .json files to the current working directory. Add --output_format txt to keep only the plain-text transcript, or --output_dir to write somewhere else. The Turbo model (Whisper Large V3 Turbo) is the best default: close to Large V3 accuracy, roughly 2x faster on Apple Silicon, and about half the RAM.
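The same transcription can be driven from Python, since the CLI is a thin wrapper around the openai-whisper package. This is a rough sketch rather than a recommended workflow: load_model and transcribe are the package's own functions, while transcript_path and transcribe_to_txt are hypothetical helper names chosen for this example.

```python
from pathlib import Path


def transcript_path(audio_path: str) -> Path:
    """Return the .txt path that sits next to the audio file."""
    return Path(audio_path).with_suffix(".txt")


def transcribe_to_txt(audio_path: str, model_name: str = "turbo") -> Path:
    """Transcribe one file with openai-whisper and save the text beside it."""
    # Imported lazily so transcript_path() works without the package installed.
    import whisper

    model = whisper.load_model(model_name)  # downloads the model on first use
    result = model.transcribe(audio_path)   # dict with "text" and "segments"
    out = transcript_path(audio_path)
    out.write_text(result["text"].strip() + "\n", encoding="utf-8")
    return out
```

The "segments" list in the result carries start and end times for each chunk of text, which is what the subtitle formats are built from.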
Model comparison
| Model | Parameters | Approx. RAM | Speed on Apple Silicon |
|---|---|---|---|
| tiny | 39M | ~1 GB | Very fast |
| base | 74M | ~1 GB | Fast |
| small | 244M | ~2 GB | Fast |
| turbo | 809M | ~1.5–2 GB | Moderate |
| large-v3 | 1.55B | ~3.1 GB | Slower |
For standard speech — clear voice recording, quiet environment, English or European languages — --model turbo is the right choice. Upgrade to large-v3 for difficult audio: heavy accents, background noise, or specialized vocabulary. See the Whisper Large V3 vs V3 Turbo comparison for the full benchmark breakdown.
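The selection rule above can be written down as a small lookup. This sketch simply encodes the RAM column of the table (taking the upper bound where a range is given) and the "turbo by default, large-v3 for difficult audio" advice; pick_model and its parameters are illustrative names, and the thresholds are only as good as the approximate figures in the table.

```python
# Approximate RAM needs in GB, copied from the comparison table
# (upper bound used where the table gives a range).
MODEL_RAM_GB = {
    "tiny": 1.0,
    "base": 1.0,
    "small": 2.0,
    "turbo": 2.0,
    "large-v3": 3.1,
}


def pick_model(free_ram_gb: float, difficult_audio: bool = False) -> str:
    """Pick the most capable model that fits in the available memory."""
    preferred = ["turbo", "small", "base", "tiny"]
    if difficult_audio:
        # Heavy accents, noise, or jargon: try large-v3 first.
        preferred = ["large-v3"] + preferred
    for name in preferred:
        if MODEL_RAM_GB[name] <= free_ram_gb:
            return name
    raise ValueError("not enough free memory for any Whisper model")
```

For example, pick_model(8) returns "turbo", while pick_model(8, difficult_audio=True) returns "large-v3".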
Specifying language
whisper audio.mp3 --model turbo --language French
Whisper supports 99 languages. Specifying the language explicitly (rather than relying on auto-detection) improves accuracy slightly on non-English audio.
Batch transcription
for f in *.mp3; do whisper "$f" --model turbo --output_dir ./transcripts; done
This loops through every MP3 in the current directory and writes transcripts to a transcripts/ folder. Useful for processing a folder of meeting recordings or interview audio files.
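One thing to be aware of with the loop above: each iteration starts a fresh whisper process, so the model is loaded again for every file. The whisper command also accepts several audio paths in one call and loads the model once for all of them. A minimal sketch of building that single call from Python follows; batch_command is a hypothetical helper name, and only the flags already shown in this article are used.

```python
def batch_command(
    files: list[str],
    model: str = "turbo",
    output_dir: str = "./transcripts",
) -> list[str]:
    """Build one whisper invocation that covers every file in `files`."""
    if not files:
        raise ValueError("no audio files given")
    # All audio paths go first, followed by the shared options.
    return ["whisper", *files, "--model", model, "--output_dir", output_dir]
```

To run it, collect the paths with something like sorted(str(p) for p in Path(".").glob("*.mp3")) and pass the resulting command to subprocess.run(..., check=True).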
whisper.cpp as an alternative
If you prefer not to use Python, whisper.cpp is a C++ port of Whisper that installs via Homebrew:
brew install whisper-cpp
whisper.cpp runs faster than the Python package on Apple Silicon (it's more tightly optimized for Metal) and doesn't require a Python environment. The trade-off: the command syntax is different and output format options are slightly more limited. For most users, the Python package is simpler to get started with.
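To give a feel for how the whisper.cpp syntax differs, here is a hedged sketch of the two commands a typical run involves. It rests on several assumptions you should check against your installed version: that the Homebrew formula provides a binary named whisper-cli, that you have downloaded a ggml model file separately (the model path in the usage note is purely hypothetical), and that your build expects 16 kHz mono 16-bit WAV input, as whisper.cpp has historically required.

```python
from pathlib import Path


def wav_command(src: str) -> list[str]:
    """ffmpeg call that resamples any input to 16 kHz mono 16-bit WAV."""
    dst = str(Path(src).with_name(Path(src).stem + "-16k.wav"))
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst]


def whisper_cpp_command(wav: str, model_path: str) -> list[str]:
    """whisper.cpp call: -m is the model file, -f the input, -otxt asks for a text file."""
    # Assumption: the binary is called whisper-cli on your system.
    return ["whisper-cli", "-m", model_path, "-f", wav, "-otxt"]
```

For instance, wav_command("talk.mp3") followed by whisper_cpp_command("talk-16k.wav", "models/ggml-base.en.bin") yields the two commands to run in order.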
Pros of the CLI: Free, any audio format, any language, full output format control (TXT, SRT, VTT, JSON), batch processing, no GUI overhead.
Cons: Terminal familiarity required. First run downloads model files (1.5 GB for Turbo). Setup takes 15–20 minutes for someone new to pip or Homebrew.
Method 3: MacWhisper (free tier, drag-and-drop GUI)
Best for: Regular file transcription without touching the terminal
MacWhisper is a native macOS app built on Whisper (and Parakeet) with a GUI. Drag in a file, select a model, click Transcribe.
Free tier includes Whisper Tiny, Base, and Small models. These handle clear speech well — good enough for voice memos, meeting notes, and podcast episodes recorded with a decent microphone. No time limit, no file count cap, no account required.
Pro tier ($69 one-time, as of early 2026) unlocks Large V2, Large V3, and Large V3 Turbo — the models you'd want for difficult audio, strong accents, or technical vocabulary. Pro also adds batch transcription (queue multiple files), speaker grouping (identify who spoke when), and system audio recording.
Supported formats: MP3, WAV, M4A, MP4, MOV, OGG, OPUS — anything macOS can read.
Export formats: TXT, CSV, PDF, SRT. The SRT export is particularly useful if you're adding captions to a video: transcribe the audio, export as SRT, load into your video editor.
Privacy: All processing is on-device. No audio leaves your Mac. MacWhisper has no network activity during transcription — you can verify this with Little Snitch or a network monitor.
Language support: the same 99 languages as Whisper itself.
The workflow is straightforward: drag a file onto the MacWhisper window, choose a model from the dropdown, and get a transcript in under a minute for most files on an M-series Mac. For a 30-minute recording using Large V3 Turbo, expect roughly 30–60 seconds of processing on an M2.
For audio file transcription, MacWhisper is the most fully-featured native app on Mac. The free tier is genuinely usable — you're not walled off after a few files. The $69 Pro upgrade makes sense if you transcribe regularly and need the larger, more accurate models or batch processing.
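The SRT export mentioned above is a simple plain-text format: a running index, a start and end timestamp, and the caption text, separated by blank lines. As an illustration of what the exported file contains, this sketch turns a list of timed segments into SRT. The segment shape (dicts with start, end, and text) mirrors what the open-source Whisper package returns; it is not a description of how MacWhisper implements its export.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style SRT uses."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render timed segments as numbered SRT caption blocks."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive caption blocks.
    return "\n".join(blocks)
```

Most video editors accept a file in this shape directly as a caption track.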
Method 4: Aiko (free, App Store)
Best for: One-off transcription with maximum accuracy, no setup beyond the App Store
Aiko is a free app from Sindre Sorhus, available on the Mac App Store. Unlike MacWhisper's free tier, which uses Whisper's smaller models, Aiko uses Whisper Large V3 — the largest, most accurate Whisper model — at no charge.
Drag in an audio or video file, click Transcribe. No account, no subscription, no command line.
Model: Whisper Large V3 on Mac. Whisper Large V3 achieves approximately 2.7% word error rate on the LibriSpeech test-clean benchmark — the same accuracy you'd get from MacWhisper Pro or the CLI with --model large-v3. You get Large V3 accuracy without paying.
RAM: Whisper Large V3 needs roughly 3.1 GB of unified memory. Aiko's documentation recommends 16 GB of RAM. On 8 GB Macs, model loading is slow and memory pressure is high during transcription — you'll see other apps slowing down.
Supported formats: Any audio or video format macOS supports: M4A, WAV, MP3, MP4, MOV.
Language support: the same 99 languages as Whisper itself.
Export: Copy transcript to clipboard or save as text.
Privacy: On-device only. The transcription runs on your Mac; nothing leaves.
Limitations: No model selection (you get Large V3 and only Large V3). No batch processing. No SRT export. No speaker grouping. The app does one thing.
Aiko is the right choice if you want free access to Large V3 accuracy, transcribe only occasionally and don't want to pay for a Pro upgrade, and prefer a simple interface with no configuration options. If you transcribe files frequently or need batch processing, MacWhisper Pro is more practical.
Method 5: Cloud tools
Best for: One-off transcription with no installation, or when local hardware is limited
If you don't need privacy and want to transcribe quickly without installing anything:
Otter.ai offers 600 free minutes per month. Upload an audio file or record live. The free tier handles most casual use; paid plans add custom vocabulary and longer recordings. Otter processes audio on their servers — your audio leaves your Mac.
OpenAI Whisper API accepts file uploads and returns a transcript. If you have an OpenAI account, you can use the Playground or the API. The same Whisper model runs on OpenAI's infrastructure. Pricing is per minute of audio; cost is low for occasional use.
AssemblyAI is an API-first service with high accuracy, speaker diarization, and 99+ language support. Designed for developers building transcription into applications. Not a point-and-click tool, but worth knowing if you're building something.
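For the OpenAI option, a call through the official openai Python package looks roughly like the sketch below. Treat it as an outline under stated assumptions: it expects an API key in the OPENAI_API_KEY environment variable, it uses the "whisper-1" model name, and the 25 MB upload limit it checks is the figure documented for the API at the time of writing and may change. The helper names are illustrative.

```python
import os

# Assumption: upload size limit documented for the transcription endpoint.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def fits_upload_limit(size_bytes: int) -> bool:
    """Return True if a file of this size can be uploaded in one request."""
    return size_bytes <= MAX_UPLOAD_BYTES


def transcribe_in_cloud(audio_path: str, model: str = "whisper-1") -> str:
    """Upload one audio file and return the transcript text."""
    # Imported lazily so the size check works without the package installed.
    from openai import OpenAI

    if not fits_upload_limit(os.path.getsize(audio_path)):
        raise ValueError("file is too large to upload; split or compress it first")
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text
```

Keep in mind that this uploads the audio to OpenAI's servers, so the privacy considerations for cloud tools apply.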
All cloud tools send your audio to external servers. The calculus is simple: if the content is confidential — medical notes, legal discussions, financial calls, anything you wouldn't post publicly — use a local option. If you're transcribing something you don't mind existing on someone else's server, cloud tools remove all installation friction and work on any Mac regardless of age or RAM.
One cloud advantage that matters: services using OpenAI's GPT-4o-transcribe model have better accuracy on difficult audio — heavy accents, background noise, specialized terminology — than Whisper Large V3. For a podcast recorded in a noisy room or an interview with a thick accent, the accuracy gap can be meaningful. See AI transcription: local vs cloud for the benchmark comparison.
Real-time dictation: converting live speech to text
File transcription and real-time dictation solve different problems. Real-time dictation captures live speech and types directly into whatever Mac app is in focus — email, Slack, a code editor, a browser text field, anything.
macOS built-in dictation (System Settings → Keyboard → Dictation) is free and requires no additional app. On Apple Silicon Macs running macOS Sequoia, dictation processes on-device using Apple's Neural Engine. On older hardware or without Apple Intelligence enabled, audio may be sent to Apple's servers. Accuracy on technical vocabulary is limited; custom vocabulary isn't supported.
For real-time dictation that's entirely local and handles technical content reliably, dedicated apps are better:
Hearsy uses NVIDIA Parakeet TDT for English dictation. Parakeet TDT 0.6B v2 achieves 1.69% word error rate on LibriSpeech clean benchmarks (NVIDIA, 2025) — better than Whisper Large V3's approximately 2.7% WER for standard English. More importantly, Parakeet is a streaming transducer model, meaning text appears while you're still speaking rather than after you pause. Latency is under 50ms on Apple Silicon. Whisper mode is available for 99-language support. All processing is local; nothing leaves your Mac. One-time purchase.
SuperWhisper uses Whisper and Parakeet for real-time dictation, with AI post-processing options. It also supports file transcription — importing an audio or video file and getting a transcript — which makes it the only app in this list that handles both use cases. Subscription pricing.
The key architectural difference between file apps and real-time apps: file apps run the model once on a completed audio file, processing all the context in a single pass. Real-time apps must generate text while audio is still coming in. Parakeet's streaming transducer architecture is designed for this; Whisper processes audio in fixed 30-second chunks and emits text at the end of each chunk. When you dictate a sentence, Parakeet feels instantaneous, while Whisper introduces a brief delay at the end of each phrase.
If you're dictating into text fields — writing emails, taking notes, drafting documents — the latency difference matters more than the benchmark difference. This is why Hearsy defaults to Parakeet for English rather than using Whisper for everything.
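For reference, the benchmark figure being compared here, word error rate (WER), is the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance. It is a simplified illustration: published benchmarks also normalize punctuation, casing, and number formatting before scoring, which this version only approximates by lowercasing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a four-word sentence gives a WER of 0.25, so a 1.69% WER corresponds to roughly one error every 59 words and 2.7% to about one every 37.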
Which audio-to-text method to use
On macOS Sequoia, English, no special requirements: Voice Memos or Notes. Already installed, on-device, free. Accurate enough for personal notes and memos.
Best free option for file transcription with maximum accuracy: Aiko. Free from the App Store, uses Whisper Large V3. The trade-off is speed and memory pressure on 8 GB Macs.
Regular file transcription with a proper GUI: MacWhisper. The free tier handles most files. The $69 Pro upgrade adds Large V3 Turbo, batch transcription, and speaker grouping. The most complete dedicated transcription app for Mac.
Terminal user who wants full control: Whisper CLI (pip install openai-whisper). Free, any format, 99 languages, all output formats. --model turbo is the right default; upgrade to large-v3 for difficult audio.
Privacy-critical content: Any local option. File transcription: MacWhisper, Aiko, or Whisper CLI. Real-time dictation: Hearsy or built-in macOS dictation on Apple Silicon. None of these send audio to external servers.
Quick one-off on any machine, no installation: Otter.ai free tier or OpenAI's Whisper Playground. Fast, no setup, but audio leaves your device.
Real-time voice typing in any Mac app: macOS built-in dictation (free, less accurate), or Hearsy (Parakeet engine, under 50ms, more accurate on standard English). SuperWhisper handles both real-time dictation and file transcription if you want both in one app.
Frequently asked questions
What is the best free way to convert audio to text on Mac?
MacWhisper's free tier uses Whisper Tiny, Base, and Small models for file transcription with no time limits — no account or internet connection required. On macOS Sequoia and later, Voice Memos and Notes transcribe recordings on-device for free. Aiko (App Store, free) uses Whisper Large V3, the most accurate free option, but requires 16 GB RAM for comfortable performance.
Can macOS convert audio files to text natively?
On macOS Sequoia (15) and later, yes. Voice Memos transcribes recordings automatically and accepts M4A file imports. Notes records and transcribes audio directly in a note. Both use Apple's Neural Engine for on-device processing. These features work for English in qualifying countries: US, UK, Australia, Canada, Ireland, and New Zealand. Other languages are not supported.
How do I convert an MP3 to text on Mac?
Use MacWhisper (drag-and-drop GUI, free tier), Aiko (free App Store app), or the Whisper CLI (whisper audio.mp3 --model turbo). All three run locally on Apple Silicon without uploading audio to any server. For the CLI, install with pip install openai-whisper and brew install ffmpeg. MacWhisper and Aiko handle MP3 natively, so no conversion is needed.
Does Whisper run locally on Mac?
Yes. Both the Python Whisper package (pip install openai-whisper) and whisper.cpp run entirely on-device. No audio is transmitted to any server during transcription. On Apple Silicon, Whisper Large V3 Turbo processes audio at roughly 60x real-time speed — a 10-minute file takes about 10 seconds to transcribe on an M2 MacBook Pro.
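The real-time-speed figure translates into wall-clock time with simple arithmetic: audio duration divided by the speed factor. A tiny sketch, using the article's roughly 60x figure as the default (your own factor will depend on the chip, the model, and the audio):

```python
def estimated_seconds(audio_minutes: float, realtime_factor: float = 60.0) -> float:
    """Estimate processing time in seconds for a recording of the given length."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_minutes * 60 / realtime_factor
```

At 60x, a 10-minute file comes out at about 10 seconds and a 30-minute file at about 30 seconds; at 30x the same 30-minute file takes about a minute.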
What is the difference between audio file transcription and real-time dictation?
File transcription converts an existing audio recording into text. You provide a file (MP3, WAV, M4A, MP4); you get a text transcript. Real-time dictation captures live speech and types it directly into any app in focus — email, Slack, a browser text field. MacWhisper and Aiko are designed for file transcription. Hearsy and SuperWhisper are built for real-time dictation; Hearsy's Parakeet engine streams text with under 50ms latency on Apple Silicon.
For a comparison of Mac apps that use Whisper for real-time dictation, see best dictation software for Mac. For how local Whisper accuracy compares to cloud services, see AI transcription: local vs cloud. For a full guide to OpenAI's Whisper model — architectures, benchmarks, and how to run it — see the Whisper transcription guide. For setting up Whisper without the terminal, see how to run Whisper locally on Mac.