Yes. MIT licensed open source. Groq and Gemini both have free tiers that are enough for daily use. Ollama is completely free with no limits. There is no pro plan, no premium tier, no credit card needed.

Can I use it without internet?

Yes. Install Ollama, download a Whisper model (small is 480MB), and Rota works 100% offline. No API keys, no accounts, no internet. Your voice data never leaves your machine.

Is Rota AI a free alternative to Wispr Flow?

Yes, Rota AI is a free, open source alternative to Wispr Flow. It offers similar features like AI-powered text cleanup, context-aware dictation, and works across multiple apps. Unlike Wispr Flow, Rota AI is completely free with no subscriptions, offers offline mode, has zero telemetry, and the full source code is available under MIT license.

Rota AI has zero telemetry. The desktop app never phones home. Your voice data goes only to the transcription service you choose (Groq, Gemini, or local Ollama). API keys are encrypted at rest using your OS keychain.

Does Rota AI work with VS Code?

Yes. Rota detects VS Code and preserves camelCase, snake_case, and code syntax. Your code comments come out clean without extra punctuation. It also works in terminals and IDEs.

What are the system requirements?

Windows 10/11, macOS 13+, or Linux (Ubuntu 20.04+, Fedora 36+, Arch). 4GB RAM minimum, 8GB recommended. For local GPU transcription: NVIDIA GPU with 4GB+ VRAM. CPU-only works on any modern quad-core.

13. How to Transcribe Audio Files with Whisper on Windows

TL;DR: Install faster-whisper with pip, run one command, and get a transcript from any audio file. Works on Windows, runs on CPU or GPU, outputs to text or SRT. Here is the full walkthrough.

So here is the thing. Sometimes you do not want real-time dictation. Sometimes you already have a recorded audio file and you just need the transcript. Maybe it is a meeting recording. Maybe it is a voice memo. Maybe it is a podcast.

That is what this is about. Taking an audio file you already have and turning it into text. Locally on Windows. No cloud. No subscriptions. No uploading your audio to anyone.

Why not just use a Cloud API?

Fair question. Services like OpenAI Whisper API exist. They work well. But:

You are uploading audio to someone else's server.
API costs money at scale.
You need internet for every file.
Some audio is sensitive and you just do not want it leaving your machine.

Local transcription fixes all of that. Once it is set up, you can transcribe as many files as you want. Free. Forever. Offline.

What is faster-whisper?

OpenAI released Whisper. It is a speech recognition model. It is open source. It is really good.

But the default Python package (openai-whisper) can be annoying to install on Windows. Dependencies, CUDA issues, all that.

faster-whisper is a reimplementation that is:

Easier to install
Faster at transcription (up to about 4x faster than openai-whisper)
Uses less memory
Fully compatible with Whisper models

It is what I recommend for Windows users. It just works.

Installation

Open a terminal. PowerShell or CMD, either works.

First, make sure you have Python 3.8+:

python --version

If you do not have Python, grab it from python.org. The Windows installer is straightforward.

Install faster-whisper:

pip install faster-whisper

No CUDA toolkit needed for CPU. If you have an NVIDIA GPU, you will want CUDA installed for GPU acceleration. I will cover that below.

Basic Transcription

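Say you have a file called lecture.mp3. You want a transcript. One catch: faster-whisper itself is a Python library, and installing it does not add a whisper command. The commands in this guide assume a Whisper-style command-line tool is installed on top of it, such as whisper-ctranslate2 (pip install whisper-ctranslate2), in which case type whisper-ctranslate2 wherever this guide says whisper. Run: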

whisper lecture.mp3 --model medium

The model downloads automatically the first time (about 1.5GB for medium). Then it transcribes. When it finishes, you get a .txt file, an .srt file, and a .json file next to your audio file.

The .txt is the plain transcript. The .srt is a subtitle file with timestamps. The .json has word-level timestamps and metadata.
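
Because faster-whisper is at heart a Python library, the same transcription can also be driven from a short script. This is a minimal sketch, assuming the package's WhisperModel class and its transcribe method. The function names transcribe_file and join_segments and the int8-on-CPU settings are my own illustrative choices, not something the commands in this guide depend on.

```python
from typing import Iterable


def join_segments(texts: Iterable[str]) -> str:
    """Join segment texts into one transcript, one segment per line."""
    return "\n".join(t.strip() for t in texts if t.strip())


def transcribe_file(path: str, model_size: str = "medium") -> str:
    """Transcribe one audio file on CPU and return the plain-text transcript."""
    # Imported lazily so join_segments works without faster-whisper installed.
    from faster_whisper import WhisperModel

    # int8 on CPU keeps memory use low; the model downloads on first use.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus detection info.
    segments, info = model.transcribe(path, beam_size=5)
    print(f"Detected language: {info.language}")
    return join_segments(segment.text for segment in segments)
```

Calling transcribe_file("lecture.mp3") and saving the returned string gives roughly what the .txt output contains.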

Choosing a Model

faster-whisper supports all the standard Whisper models. Here is the breakdown:

Model	Size	RAM Needed	GPU VRAM	Speed	Accuracy
tiny	75MB	4GB	1GB	Very Fast	Low
base	140MB	4GB	1GB	Fast	Decent
small	480MB	6GB	2GB	Good	Okay
medium	1.5GB	8GB	5GB	Moderate	Good
large-v3	3GB	12GB	10GB	Slow	Best

I use medium on my machine (i5 12th gen, 16GB RAM, RTX 3050). It gives the best balance of speed and accuracy for English.

If you have a weaker machine, use small. It is honestly not that bad. Only use large if accuracy is critical and you have a good GPU.
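
If you would rather not eyeball the table, the choice can be sketched as a small lookup. The RAM and VRAM numbers below are copied from the table in this guide, not measured independently, and the function name largest_model_that_fits is made up for this example. Treat the VRAM column as a conservative figure: with int8 a model can fit in less, which is how medium runs on a 4GB card.

```python
from typing import Optional

# (model name, RAM needed in GB, GPU VRAM needed in GB), smallest first.
# Values come from the table in this guide.
MODELS = [
    ("tiny", 4, 1),
    ("base", 4, 1),
    ("small", 6, 2),
    ("medium", 8, 5),
    ("large-v3", 12, 10),
]


def largest_model_that_fits(ram_gb: float, vram_gb: float = 0.0) -> Optional[str]:
    """Return the largest model whose table requirements fit, or None."""
    best = None
    for name, ram_needed, vram_needed in MODELS:
        fits_ram = ram_gb >= ram_needed
        # vram_gb == 0 means "CPU only", so the VRAM column is ignored.
        fits_vram = vram_gb == 0 or vram_gb >= vram_needed
        if fits_ram and fits_vram:
            best = name  # the list is ordered, so later matches are larger
    return best


print(largest_model_that_fits(ram_gb=8))
```

Pass only ram_gb for a CPU-only machine, or both values if you plan to run on the GPU.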

Output Formats

faster-whisper supports multiple output formats. Use the --output_format flag:

whisper lecture.mp3 --model medium --output_format txt

Available formats:

txt - Plain text transcript. Just the words, no timestamps.
srt - Subtitle format with timestamps. Works with video players.
vtt - WebVTT subtitles. Same idea as SRT, slightly different format.
json - Full data including word-level timestamps, confidence scores, and segments.
tsv - Tab-separated. Good for loading into a spreadsheet.
all - Generates every format. This is the default.

For most people, txt is what you want. If you need timestamps, go with srt or vtt.
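
To make the srt format concrete: each subtitle is a running number, a line with start and end times written as HH:MM:SS,mmm, the text, and then a blank line. The sketch below builds that layout from plain (start, end, text) tuples. The function names are hypothetical and the code does not call Whisper at all; it only shows what ends up in the file.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # Joining with a newline leaves one blank line between subtitles.
    return "\n".join(blocks)


print(to_srt([(0.0, 2.5, "Welcome to the lecture."), (2.5, 6.0, "Today we cover graphs.")]))
```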

Real Numbers on My Machine

I tested this on my Dell G15. i5 12th gen, 16GB RAM, RTX 3050 4GB.

Audio file: 30 minute college lecture. MP3 format, 128kbps.

Model	CPU Time	GPU Time
small	8 min	1.5 min
medium	18 min	3 min
large	Not tested (4GB VRAM not enough)

GPU is with the --device cuda --compute_type int8 flags. CPU is default.

So on GPU, medium model transcribes a 30 minute file in about 3 minutes. That is pretty fast tbh.
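
If you want to compare these numbers with your own machine, it helps to turn them into a real-time factor, meaning minutes of audio transcribed per minute of processing. This is just arithmetic on the table above; the timings dictionary repeats my measurements and realtime_factor is a name I made up for the sketch.

```python
AUDIO_MINUTES = 30  # length of the test lecture

# Processing times in minutes, copied from the table above.
timings = {
    ("small", "cpu"): 8,
    ("small", "gpu"): 1.5,
    ("medium", "cpu"): 18,
    ("medium", "gpu"): 3,
}


def realtime_factor(audio_minutes: float, processing_minutes: float) -> float:
    """Minutes of audio transcribed per minute of processing."""
    return audio_minutes / processing_minutes


for (model, device), minutes in timings.items():
    factor = realtime_factor(AUDIO_MINUTES, minutes)
    print(f"{model:6} {device}: {factor:.1f}x real time")
```

Medium on GPU works out to 10x real time and small on GPU to 20x, so an hour of audio should take very roughly 6 and 3 minutes on similar hardware.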

GPU Setup (Optional but Recommended)

If you have an NVIDIA GPU, using it makes transcription way faster. Here is what you need:

Install CUDA Toolkit 11.8 or 12.x from NVIDIA's website.
Install faster-whisper with CUDA support:

pip install faster-whisper

The standard pip install already includes GPU support. You do not need a separate package, but the NVIDIA cuBLAS and cuDNN libraries do need to be present on your system for the GPU to be used.

Run with GPU:

whisper lecture.mp3 --model medium --device cuda --compute_type int8

The int8 compute type uses less VRAM. On my 4GB RTX 3050, medium works with int8 but not with float16. If you have more VRAM, try float16 or float32 for slightly better accuracy.

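If you leave the device on its automatic setting and CUDA is not detected, faster-whisper falls back to CPU. No error, just slower. If you pass --device cuda explicitly on a machine without a working CUDA setup, expect an error instead.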
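
A quick way to check what your machine will actually use is to ask CTranslate2, the inference engine faster-whisper runs on, how many CUDA devices it can see. This is a small sketch, assuming ctranslate2 is importable (it is installed as a dependency of faster-whisper). The function name pick_device is hypothetical, and any failure is simply treated as CPU only.

```python
def pick_device() -> str:
    """Return "cuda" if CTranslate2 can see a CUDA GPU, otherwise "cpu"."""
    try:
        # Imported inside the function so a missing package is not fatal.
        import ctranslate2

        return "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"
    except Exception:
        # Package missing or CUDA libraries broken: treat it as CPU only.
        return "cpu"


device = pick_device()
# int8 is the low-memory compute type this guide uses on a 4GB GPU.
print(f"--device {device} --compute_type int8")
```

If this prints cpu on a machine with an NVIDIA card, the CUDA libraries are the first thing to check.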

Batch Processing Meeting Recordings

Here is where this actually gets useful in practice. I had a folder with like 20 meeting recordings from a college project. All MP3 files. I needed transcripts of all of them.

Doing them one by one would be annoying. So I made a batch script.

On Windows, save this as batch_transcribe.bat:

@echo off
set MODEL=medium
set FORMAT=srt

for %%f in (*.mp3 *.wav *.m4a *.flac) do (
    echo Transcribing %%f...
    whisper "%%f" --model %MODEL% --output_format %FORMAT% --language en
    echo Done: %%f
)

echo All files processed.
pause

Drop that script in your folder with audio files. Double click it. It loops through every audio file and transcribes it. Just let it run.

For Linux or WSL users, here is the bash version:

#!/bin/bash
MODEL="medium"
FORMAT="srt"

for f in *.mp3 *.wav *.m4a *.flac; do
  [ -f "$f" ] || continue
  echo "Transcribing $f..."
  whisper "$f" --model "$MODEL" --output_format "$FORMAT" --language en
done

echo "All files processed."
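
If you prefer one script that runs on both Windows and Linux, the same loop can be sketched in Python. It assumes the same whisper command and flags as the two scripts above; the function names are my own, and unlike the scripts it reports a failed file and carries on.

```python
import subprocess
from pathlib import Path
from typing import List

AUDIO_SUFFIXES = {".mp3", ".wav", ".m4a", ".flac"}


def find_audio_files(folder: Path) -> List[Path]:
    """Return audio files in folder (not subfolders), sorted by name."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


def build_command(audio: Path, model: str = "medium", fmt: str = "srt") -> List[str]:
    """Build the same command line the batch and bash scripts run."""
    return ["whisper", str(audio), "--model", model,
            "--output_format", fmt, "--language", "en"]


def transcribe_folder(folder: Path) -> None:
    """Transcribe every audio file in folder, one after another."""
    for audio in find_audio_files(folder):
        print(f"Transcribing {audio.name}...")
        # check=False so one failed file does not stop the whole batch.
        result = subprocess.run(build_command(audio), check=False)
        print("Done:" if result.returncode == 0 else "FAILED:", audio.name)
```

Call transcribe_folder(Path(".")) from the folder that holds your recordings.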

I had about 15 files totaling roughly 8 hours of audio. Using the medium model on GPU, the whole batch took about 45 minutes. Went and made a sandwich, came back, all done. Fr, that is the dream workflow.

Specifying Language

Whisper auto-detects language. But if you know the language, specifying it speeds things up and improves accuracy:

whisper lecture.mp3 --model medium --language en

Supported languages include en, es, fr, de, hi, ta, zh, ja, ko, and a bunch more. If the language code does not work, try the full name like "English" or "Tamil."

My Actual Use Case: College Lectures

This is why I started doing this.

I take a lot of notes during lectures. But sometimes I miss things. The professor says something important while I am still writing the previous point. Or my hand hurts. Or I zone out. We all zone out sometimes.

So I started recording lectures on my phone. Just audio. Then every weekend I batch transcribe them all. Now I have a text file for every lecture. I can search through them. Copy quotes. Review before exams without scrambling through messy handwritten notes.

The accuracy is surprisingly good. Even with professors who have accents. The medium model handles Indian English dialects pretty well. Your mileage may vary depending on accent, background noise, and recording quality.

Pro tip: place your recording device closer to the speaker. Even a cheap phone recording from 3 feet away is way better than from 15 feet away. The cleaner the audio, the better the transcript.

Dealing with Imperfect Transcripts

Let me be real. Local Whisper is not perfect. You will get some errors:

Technical terms might get garbled. "Kubernetes" might come out as "coober netease." Fun.
Names and proper nouns are hit or miss.
If there is background noise, accuracy drops.
Overlapping speech is hard. If two people talk at once, Whisper gets confused.

For my lecture transcripts, I estimate about 95% accuracy with clean audio on medium model. That is enough to be useful. I just skim through and fix the obvious mistakes. Way faster than typing everything from scratch.
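
The skim-and-fix step can be partly automated once you notice the same terms getting garbled again and again. This is a rough sketch of that idea, not a Whisper feature: the CORRECTIONS dictionary is a hypothetical list you would build from your own transcripts, and the only entry here is the example from above.

```python
import re

# Hypothetical correction list: misheard phrase -> intended term.
CORRECTIONS = {
    "coober netease": "Kubernetes",
}


def apply_corrections(text: str, corrections: dict = CORRECTIONS) -> str:
    """Replace known mis-transcriptions, ignoring case, on word boundaries."""
    for wrong, right in corrections.items():
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement avoids backslash surprises in the new text.
        text = pattern.sub(lambda _match, fixed=right: fixed, text)
    return text


print(apply_corrections("We deploy on Coober Netease today."))
```

Run it over each .txt file after transcription and add new entries as you spot them.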

If you need higher accuracy, use the large model. But you will need the GPU VRAM for it.

Other Flags Worth Knowing

Here are some flags I use regularly:

--task translate

Transcribes and translates to English. Useful if your audio is in another language.

--beam_size 5

Higher beam size = slightly better accuracy but slower. Default is 5. You can try 1 for faster results.

--vad_filter True

Voice Activity Detection. Removes silence from processing. Speeds things up on files with lots of pauses.

--output_dir ./transcripts/

Saves all output files to a specific folder instead of next to the audio file.

FAQ

Which Whisper model should I use on Windows? Medium for the best balance of speed and accuracy if your machine can handle it. Small if you have limited RAM (8GB). Tiny only if you need speed and accuracy does not matter much.

Does faster-whisper work on Windows 10? Yes. Works on Windows 10 and 11. As long as you have Python 3.8+, you are good.

Can I transcribe video files? Whisper works on audio, not video. But you can extract the audio first with ffmpeg:

ffmpeg -i video.mp4 -vn -acodec mp3 audio.mp3

Then transcribe the audio file.

Does it work for languages other than English? Yes. Whisper supports 90+ languages. Specify with --language Tamil, --language Hindi, etc. Accuracy varies by language. English is the best. Some lower-resource languages are worse.

My GPU only has 4GB VRAM. Can I use it? Yes. Use --compute_type int8. It is designed for lower VRAM. Medium model works. Large does not.

How is this different from Rota AI? Rota AI is for real-time voice dictation. You speak, text appears instantly. This guide is for when you already have recorded audio files and need transcripts. They solve different problems. I use both.

Is faster-whisper free? Completely free. It is open source. No API keys, no accounts, no limits.

What audio formats are supported? MP3, WAV, M4A, FLAC, OGG, and most common audio formats. Anything ffmpeg can decode, Whisper can transcribe.

Written by Karthik Krishnan. I transcribe all my college lectures with faster-whisper so I can stop pretending I focus for 90 minutes straight. It works weirdly well.
