Local Speech Recognition
Why Build This
I wanted voice-to-text without:
- Cloud API dependencies
- Monthly subscription fees
- Privacy concerns about audio being sent elsewhere
- Internet connectivity requirements
The goal: completely offline, always available, auto-paste anywhere.
The Solution
OpenAI’s Whisper model running locally via faster-whisper, a CTranslate2-based reimplementation that is fast enough for CPU-only inference.
Implementation
Push-to-talk system in Python:
- Hold Alt key → recording starts
- Speak → audio captured locally
- Release Alt → transcription runs, auto-pastes via clipboard
Works across all Windows applications. No configuration needed.
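The hold-to-record flow can be sketched as a small state machine. This is an illustration, not the project's actual code: the `PushToTalk` class, the `run` helper, and the 16 kHz mono stream settings are assumptions. The `keyboard` and `sounddevice` imports sit inside `run` so the state machine itself can be exercised without either library installed.

```python
import threading


class PushToTalk:
    """Collects audio chunks while a key is held, then hands them off on release."""

    def __init__(self, on_utterance):
        self._on_utterance = on_utterance  # called with the list of captured chunks
        self._chunks = []
        self._recording = False
        self._lock = threading.Lock()

    def press(self):
        # Key repeat fires many "down" events while a key is held; ignore repeats.
        with self._lock:
            if self._recording:
                return
            self._chunks = []
            self._recording = True

    def feed(self, chunk):
        # Called from the audio callback; chunks outside a hold are dropped.
        with self._lock:
            if self._recording:
                self._chunks.append(chunk)

    def release(self):
        with self._lock:
            if not self._recording:
                return
            self._recording = False
            chunks, self._chunks = self._chunks, []
        # Run the slow handler outside the lock so the audio callback never blocks.
        self._on_utterance(chunks)


def run(ptt, key="alt", samplerate=16000):
    """Wire the state machine to real devices (needs keyboard and sounddevice)."""
    import keyboard
    import sounddevice as sd

    def audio_callback(indata, frames, time, status):
        ptt.feed(indata.copy())  # copy, because sounddevice reuses the buffer

    keyboard.on_press_key(key, lambda event: ptt.press())
    keyboard.on_release_key(key, lambda event: ptt.release())
    with sd.InputStream(samplerate=samplerate, channels=1, dtype="float32",
                        callback=audio_callback):
        keyboard.wait()  # block forever; hooks run on keyboard's listener thread
```

In use, `on_utterance` would concatenate the chunks, transcribe them, and paste the result; `run(PushToTalk(handler))` starts the loop.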
Model Choice
Using Whisper small (484MB):
- Good accuracy for general use
- Fast transcription (~2-3 seconds for 10 seconds of audio)
- Runs comfortably on CPU (no GPU required)
Larger models (medium, large-v3) are available if accuracy matters more than speed.
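For reference, loading a model and transcribing with faster-whisper looks roughly like the sketch below. The helper names `load_model` and `transcribe` are hypothetical, and `compute_type="int8"` is an assumption (a common CPU setting); the post does not say which settings it uses. Swapping `"small"` for `"medium"` or `"large-v3"` is the only change needed to try a larger model.

```python
def load_model(size="small"):
    """Load a Whisper model for CPU inference (needs faster-whisper installed)."""
    from faster_whisper import WhisperModel

    # int8 quantization is an assumed setting, chosen here to keep CPU use low.
    return WhisperModel(size, device="cpu", compute_type="int8")


def transcribe(model, audio):
    """Return the transcript for a file path or a 16 kHz mono float32 array."""
    # transcribe() returns a lazy generator of segments plus an info object;
    # decoding only happens as the generator is consumed.
    segments, _info = model.transcribe(audio, beam_size=5)
    return " ".join(segment.text.strip() for segment in segments)
```

Typical use would be `text = transcribe(load_model("small"), "clip.wav")`, where `clip.wav` stands in for any recorded audio.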
Technical Details
Components:
- faster-whisper: Optimized Whisper inference (CTranslate2-based)
- sounddevice: Audio capture
- keyboard: Hotkey detection (Alt key)
- pyperclip: Clipboard integration for auto-paste
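The auto-paste step is the least obvious of these, so here is a minimal sketch of one way to do it: copy the transcript to the clipboard, then send Ctrl+V to the focused window. The function names and the short delay are assumptions, not the project's code. The clipboard and key-sending functions are passed in so the flow can be tried without touching the real clipboard.

```python
import time


def paste_text(text, copy, send_keys, delay=0.05):
    """Put text on the clipboard, then trigger a paste in the focused window."""
    if not text:
        return False  # nothing recognized; leave the clipboard alone
    copy(text)
    time.sleep(delay)  # give the clipboard a moment to update before pasting
    send_keys("ctrl+v")
    return True


def paste_with_real_devices(text):
    """Same flow using pyperclip and keyboard (both must be installed)."""
    import keyboard
    import pyperclip

    return paste_text(text, pyperclip.copy, keyboard.send)
```

Note that this overwrites whatever was on the clipboard; saving it with `pyperclip.paste()` first and restoring it afterwards is a possible refinement.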
Windows encoding fix:
- Added a UTF-8 wrapper for console output (the Windows console uses cp1252 by default and can’t handle emojis in output)
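A wrapper along these lines would do the job. This is a sketch using only the standard library; the `force_utf8` name is hypothetical and the project's actual fix may differ. It relies on `TextIOWrapper.reconfigure`, available from Python 3.7, and falls back to re-wrapping the underlying byte buffer.

```python
import io
import sys


def force_utf8(stream):
    """Return a text stream that writes UTF-8 regardless of the console code page."""
    if stream is None:  # e.g. running under pythonw with no console attached
        return stream
    if hasattr(stream, "reconfigure"):  # io.TextIOWrapper, Python 3.7+
        stream.reconfigure(encoding="utf-8", errors="replace")
        return stream
    # Fallback for replaced streams that still expose a byte buffer.
    return io.TextIOWrapper(stream.buffer, encoding="utf-8",
                            errors="replace", line_buffering=True)


if sys.platform == "win32":
    sys.stdout = force_utf8(sys.stdout)
    sys.stderr = force_utf8(sys.stderr)
```

Setting the `PYTHONIOENCODING=utf-8` environment variable before launch is an alternative that needs no code change.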
Desktop integration:
- Created .lnk shortcut for taskbar pinning
- Runs in background, minimal resource usage when idle
What I Learned
Local AI is practical: No need to reach for cloud APIs for everything. Whisper models are small enough to run locally, accurate enough for daily use.
Python ecosystem is mature: Finding the right libraries (faster-whisper vs openai-whisper) makes a huge difference in performance.
Windows quirks: Console encoding, hotkey detection, clipboard access—each has its own edge cases. Testing reveals them quickly.
Current Status
Fully functional. Fixed UTF-8 encoding bug. Ready for daily use.
Possible enhancements:
- Voice activity detection (auto-start recording on speech)
- Integration with LM Studio (voice → LLM → voice response)
- Custom wake word detection
But the core use case works: speak, get text, move on.
Technologies: Python, Whisper (OpenAI), faster-whisper, sounddevice, keyboard
Model: Whisper small (484MB)
Status: Operational
Privacy: 100% local, zero cloud calls