Whisper AI: Local Speech-to-Text with Python — No API Key Needed (2026)

OpenAI released Whisper as open-source software, which means you can run it entirely on your own machine. No API key, no per-minute billing, no audio sent to any server. This guide covers everything from a first transcription to batch processing, subtitle export, GPU acceleration, and wrapping Whisper in a Flask API.

What is Whisper?

OpenAI Whisper is a free, open-source speech recognition model that runs entirely on your machine. No API key, no cost, no data sent to the cloud.

Key features:

Transcribes 99 languages
Word-level timestamps
Automatic language detection
Works on CPU (slow) or GPU (fast)

Whisper was trained on 680,000 hours of multilingual audio from the internet, which gives it strong robustness to accents, background noise, and technical vocabulary — qualities that rule-based systems and earlier neural models struggle with.

Installation

Whisper requires ffmpeg for audio decoding, which must be installed separately from the Python package.

# Whisper requires ffmpeg
sudo apt install ffmpeg  # Ubuntu/Debian
brew install ffmpeg      # macOS

pip install openai-whisper

For GPU acceleration, install faster-whisper instead:

pip install faster-whisper

faster-whisper is a reimplementation of Whisper using CTranslate2, a fast inference engine. It is 4x faster than the original and uses less VRAM.

Transcribe Your First File

import whisper

model = whisper.load_model("base")  # tiny, base, small, medium, large
result = model.transcribe("audio.mp3")
print(result["text"])

load_model downloads the model weights on first run (stored in ~/.cache/whisper) and loads them into memory. The returned result dict contains text (the full transcript), segments (timed chunks), and language (detected language code).

Model Sizes

Choosing a model is a trade-off between speed, accuracy, and memory usage. Start with base for quick experiments and move up when you need higher accuracy.

Model	Size	Speed	Accuracy	VRAM
tiny	39M	Fastest	Basic	1GB
base	74M	Fast	Good	1GB
small	244M	Medium	Better	2GB
medium	769M	Slow	Great	5GB
large-v3	1.5GB	Slowest	Best	10GB

On CPU, base transcribes roughly 5–10x real-time (a 10-minute file takes 1–2 minutes). On a modern NVIDIA GPU, large-v3 transcribes near real-time.

Word-Level Timestamps

Segment-level timestamps tell you when each sentence starts and ends. Word-level timestamps go further and mark each individual word.

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3", word_timestamps=True)

for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
    for word in segment.get("words", []):
        print(f"  Word: '{word['word']}' at {word['start']:.2f}s")

Word timestamps are useful for karaoke-style subtitle generation, forced alignment, and finding exact moments in long recordings.

Export to SRT Subtitle Format

SRT is the most widely supported subtitle format. Each entry has an index, a time range, and the text. This function converts a Whisper result directly to SRT:

def format_timestamp(seconds: float) -> str:
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def to_srt(result) -> str:
    srt = []
    for i, segment in enumerate(result["segments"], 1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        srt.append(f"{i}\n{start} --> {end}\n{text}\n")
    return "\n".join(srt)

model = whisper.load_model("small")
result = model.transcribe("video.mp4")

with open("subtitles.srt", "w") as f:
    f.write(to_srt(result))

The resulting .srt file can be loaded into VLC, imported into YouTube Studio, or used with ffmpeg to burn subtitles into a video.

Batch Process a Folder

When you have many files to transcribe, iterate over them and write each transcript to a .txt file alongside the original:

import whisper
from pathlib import Path

model = whisper.load_model("base")
audio_dir = Path("./recordings")

for audio_file in audio_dir.glob("*.mp3"):
    print(f"Processing {audio_file.name}...")
    result = model.transcribe(str(audio_file))

    output = audio_dir / f"{audio_file.stem}.txt"
    output.write_text(result["text"])
    print(f"  Saved to {output}")

To extend this for multiple formats, replace "*.mp3" with a loop over extensions:

extensions = ["*.mp3", "*.wav", "*.m4a", "*.ogg", "*.flac"]
audio_files = [f for ext in extensions for f in audio_dir.glob(ext)]

Loading the model once before the loop is important — each load_model call reads weights from disk and allocates GPU memory.

ffmpeg Pipeline: Video to Text

Whisper accepts audio files directly, so you can point it at an .mp4 and it will extract the audio internally. However, pre-extracting the audio with ffmpeg gives you more control over sample rate and channel count, which can improve accuracy.

# Extract audio from video first
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

The flags -ar 16000 -ac 1 convert the audio to 16kHz mono, the format Whisper expects internally — skipping this conversion reduces processing time slightly.

In Python, you can extract audio and transcribe in one function:

import subprocess
import whisper
import tempfile
import os

def transcribe_video(video_path: str) -> str:
    model = whisper.load_model("base")

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp_path = tmp.name

    try:
        subprocess.run([
            "ffmpeg", "-i", video_path,
            "-vn", "-acodec", "pcm_s16le",
            "-ar", "16000", "-ac", "1",
            tmp_path, "-y"
        ], check=True, capture_output=True)

        result = model.transcribe(tmp_path)
        return result["text"]
    finally:
        os.unlink(tmp_path)

text = transcribe_video("lecture.mp4")
print(text)

The finally block ensures the temporary WAV file is deleted even if transcription raises an exception.

faster-whisper: GPU-Accelerated Transcription

faster-whisper uses the same model weights as the original but runs inference through CTranslate2, which applies quantization and other optimizations at runtime.

from faster_whisper import WhisperModel

# Use GPU if available
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# CPU: model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Detected language: {info.language} ({info.language_probability:.0%})")

for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")

faster-whisper is 4x faster than original Whisper and uses less VRAM. On a GPU, large-v3 with float16 is the recommended configuration — it matches the full model accuracy at roughly half the memory cost of float32.

Note that model.transcribe returns a generator. The segments are produced lazily as inference runs, so you can start processing output before the full file is done.

Language Detection

Whisper detects the language automatically, but you can also query it explicitly without running a full transcription. This is useful when you want to route audio to different processing pipelines based on language.

import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

_, probs = model.detect_language(mel)
detected = max(probs, key=probs.get)
print(f"Detected language: {detected}")

detect_language only processes the first 30 seconds of audio, which is usually enough for reliable detection. The probs dict maps language codes (e.g. "en", "fr", "zh") to probability scores.

Translate to English

Whisper can transcribe and translate in a single pass. Set task="translate" and it outputs an English translation regardless of the source language:

result = model.transcribe("spanish_audio.mp3", task="translate")
print(result["text"])  # English translation

Translation quality is generally good for European languages and strong for languages well-represented in Whisper's training data. For high-stakes translation, review the output.

Run as a Flask API

Wrapping Whisper in a Flask endpoint lets any application on your network submit audio files for transcription over HTTP.

from flask import Flask, request, jsonify
import whisper
import tempfile
import os

app = Flask(__name__)
model = whisper.load_model("base")

@app.route("/transcribe", methods=["POST"])
def transcribe():
    if "audio" not in request.files:
        return jsonify({"error": "No audio file"}), 400

    audio_file = request.files["audio"]
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
        audio_file.save(tmp.name)
        result = model.transcribe(tmp.name)
        os.unlink(tmp.name)

    return jsonify({
        "text": result["text"],
        "language": result["language"]
    })

if __name__ == "__main__":
    app.run(port=5000)

Send a file with curl:

curl -X POST http://localhost:5000/transcribe \
  -F "[email protected]"

The model is loaded once at startup so each request reuses it. For production, run this with gunicorn and a single worker (Whisper is not thread-safe with shared model state across multiple workers — use one worker per GPU or use a task queue like Celery).