Local Whisper is a small web app that turns audio files into text using OpenAI's Whisper speech-recognition model. Everything runs locally in your browser tab: your recording or file is never sent to our servers for transcription.
The app loads an English Whisper checkpoint (Tiny, Base, or Small) through Transformers.js. ONNX weights are fetched from Hugging Face the first time you pick a model, then stored in the browser's cache. Inference runs in a background worker, either on the CPU via WebAssembly or on the GPU via WebGPU when your browser supports it. Long files are handled in overlapping time windows so Whisper can stream partial text while it works through the clip.
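As an illustration of what that pipeline looks like with Transformers.js (the checkpoint id, chunk and stride sizes, and the decodeToPcm helper below are assumptions for the sketch, not necessarily what the app uses):

```ts
import { pipeline } from "@huggingface/transformers";

// The first call downloads the ONNX weights; later calls reuse the browser cache.
// "Xenova/whisper-tiny.en" is an illustrative checkpoint id.
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-tiny.en",
  { device: "wasm" } // or "webgpu" where the browser supports it
);

// Hypothetical helper: decodes the user's file into 16 kHz mono
// Float32Array samples (e.g. via the Web Audio API).
declare function decodeToPcm(file: File): Promise<Float32Array>;
const audio = await decodeToPcm(file);

// Long clips are split into overlapping windows; Whisper transcribes each
// window and the overlap lets the chunks be stitched back together.
// Partial-text streaming hooks also exist, but exact option names vary by version.
const result = await transcriber(audio, {
  chunk_length_s: 30, // window length in seconds
  stride_length_s: 5, // overlap on each side of a window
});
console.log(result.text);
```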
Does my audio get uploaded?
Transcription happens entirely in your browser. Audio you select stays on your device for processing. Model files are downloaded from Hugging Face's CDN into your browser cache (much like a heavy static asset on any website); your audio is never uploaded for cloud transcription.
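You can inspect those cached files yourself via the browser's Cache Storage API. The cache name below is assumed to be the Transformers.js default; the app could configure a different one:

```ts
// Run in the DevTools console after a model has been downloaded.
// "transformers-cache" is an assumed cache name, not confirmed by the app.
const cache = await caches.open("transformers-cache");
for (const req of await cache.keys()) {
  console.log(req.url); // ONNX weights, tokenizer, and config files from the CDN
}
```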
Why is the first run slow?
The first time you use a given model size, the app downloads ONNX weights (hundreds of MB for larger
checkpoints). Later visits reuse the cached files, so startup is much quicker.
What does WebAssembly do here?
WebAssembly (Wasm) is portable bytecode that runs in a browser sandbox. The ONNX runtime uses it so inference can execute on the CPU without a plug-in when you choose the WASM option.
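A quick feature check, for completeness; every modern browser ships WebAssembly, so the CPU path is effectively always available:

```ts
// WebAssembly has been supported by all major browsers since 2017,
// so this check is mostly a formality.
const hasWasm =
  typeof WebAssembly === "object" &&
  typeof WebAssembly.instantiate === "function";
console.log(hasWasm ? "CPU (WASM) inference available" : "No WebAssembly support");
```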
What is WebGPU?
WebGPU is a browser API that gives pages access to the GPU through a unified interface. Chromium-based browsers can use it to accelerate ONNX inference on the GPU where supported.
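Pages can probe for it before opting in: `navigator.gpu` is only defined where WebGPU exists, and `requestAdapter()` can still resolve to null on unsupported hardware:

```ts
// Requires @webgpu/types for the navigator.gpu declarations in TypeScript.
const adapter = "gpu" in navigator ? await navigator.gpu.requestAdapter() : null;
console.log(adapter ? "WebGPU usable" : "WebGPU unavailable; use the WASM path");
```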
WebGPU vs CPU — what should I choose?
Try GPU (WebGPU) in Chromium-based browsers if it's stable on your machine; it often reduces wall-clock time once the model is cached. Fall back to CPU (WebAssembly) if WebGPU fails, isn't supported, or is competing for GPU memory with other applications.
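A minimal sketch of that fallback order, assuming the Transformers.js device option from the earlier example:

```ts
import { pipeline } from "@huggingface/transformers";

// Illustrative checkpoint id; substitute the size you picked in the app.
const MODEL = "Xenova/whisper-base.en";

async function loadTranscriber() {
  // Prefer WebGPU when the API exists and an adapter is actually granted.
  if ("gpu" in navigator && (await navigator.gpu.requestAdapter())) {
    try {
      return await pipeline("automatic-speech-recognition", MODEL, { device: "webgpu" });
    } catch {
      // WebGPU can be present yet still fail (driver quirks, GPU memory pressure).
    }
  }
  // WebAssembly on the CPU is the universal fallback.
  return pipeline("automatic-speech-recognition", MODEL, { device: "wasm" });
}
```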
Which languages are supported?
This build uses English-tuned Whisper checkpoints (the .en variants), so transcription is English-only.