How AI Voice Dictation Actually Works (The Full Pipeline) — Rota AI
How AI Voice Dictation Actually Works (The Full Pipeline)
TL;DR: Voice dictation is not one thing. It is a six step pipeline: audio capture, voice activity detection, transcription, AI cleanup, context awareness, and text injection. Here is how each step works and how Rota AI handles it.
I get asked this a lot. "How does Rota AI actually work?" People expect a simple answer. The truth is it is a pipeline with six distinct stages and each one matters.
Let me walk you through all of them. From the moment you speak to the moment text appears on your screen.
Step 1: Audio Capture
Everything starts with sound. Your microphone converts air pressure into electrical signals. Your computer converts those signals into digital audio. That is the easy part.
The tricky part is doing it right.
Most voice apps capture audio at 16 kilohertz, 16 bit, mono. That is the sweet spot for speech. Higher quality does not help much because human speech does not use that much frequency range. Lower quality and transcription accuracy drops fast.
Rota AI uses PyAudio under the hood. It opens a continuous audio stream from your default microphone and chunks the data into small buffers. Each buffer is pushed into a queue for processing.
The key detail here is latency. The audio buffer needs to be small enough that the system feels responsive but large enough that the transcription model gets usable data. I settled on 30 millisecond buffers after testing a bunch of sizes. Smaller than that and the model gets confused because it does not have enough context. Larger and there is a noticeable delay between speaking and seeing text.
I learned this the hard way. My first version used huge one second buffers. The transcription was more accurate but it felt like talking to a walkie talkie. Nobody wants that.
Your microphone quality matters more than people think. The built in mic on my Dell G15 is passable but a $30 USB condenser mic made accuracy jump by like 10 percent. Not because the model is different but because the input signal is cleaner.
Step 2: Voice Activity Detection (VAD)
Here is something most users never think about. The system is constantly hearing sound. Your fan. Your neighbor's dog. You clearing your throat. If it tried to transcribe everything, the output would be garbage.
This is where voice activity detection comes in. VAD decides which chunks of audio contain actual speech and which ones are silence or noise.
Rota AI uses Silero VAD. It is an open source model that runs locally. Super fast. It processes each audio chunk and assigns a probability that it contains human speech. If the probability goes above a threshold, Rota AI starts sending audio to the transcription model. When it drops below the threshold, it knows you stopped talking.
The threshold is configurable in settings. The default is 0.5. Lower it and the system picks up more of your speech but also more background noise. Raise it and it gets picky about what counts as talking.
I had a user email me saying Rota AI was cutting off the first word of every sentence. Classic VAD issue. The system was not fast enough to catch the onset of speech. I fixed it by adding a 200 millisecond pre buffer. Basically Rota AI keeps a small chunk of recent audio in memory and prepends it to the start of each speech segment. Now it catches the first word most of the time.
your mileage may vary depends on your mic placement and speaking style. People who start sentences softly tend to get cut off more. A headset mic helps because it is closer to your mouth.
Step 3: Transcription (Whisper)
This is the big one. The part everyone thinks of when they hear "voice dictation."
Rota AI supports multiple transcription backends. The default is a cloud API. You can also use a local Whisper model through Ollama or use Groq for free fast inference. The pipeline is the same regardless of backend.
The audio chunks that passed VAD get concatenated into a single segment. When you finish speaking, Rota AI sends the full audio segment to the transcription model. That is usually OpenAI Whisper or a compatible local alternative.
Whisper was trained on 680,000 hours of multilingual audio. It is absurdly good at what it does. But it is not perfect. It struggles with technical terms, ndian accents (sorry), proper nouns, and rare words.
I tested it with my own voice for weeks. It got my regular vocabulary fine about 95 percent of the time. But the moment I said something like "voice activity detection" or "Silero," it would hallucinate something random. "Victory detection." "Sylero." Stuff like that.
For local mode, I recommend the Whisper small model for most people. It runs on 8GB RAM machines and the accuracy is good enough for daily use. The medium model is noticeably better but needs 16GB. Large is overkill unless you have a GPU.
The transcription step takes anywhere from half a second to three seconds depending on your hardware and which backend you use. That is the longest single step in the pipeline.
Step 4: AI Cleanup
Raw transcription is messy. Whisper does not add punctuation. It does not capitalize properly. It outputs everything in lowercase with no formatting. If you talked for three minutes straight, you get a wall of lowercase text.
This is where the cleanup model comes in. After transcription, Rota AI passes the raw text to a smaller language model that fixes punctuation, capitalization, and basic grammar.
So "hey can you check the config file i think the vad threshold is too high and its cutting off my sentences" becomes "Hey, can you check the config file? I think the VAD threshold is too high and it is cutting off my sentences."
Big difference.
The cleanup step uses a lightweight model. Nothing huge. I tried using a big model for cleanup and it was overkill. It also slowed everything down by like two seconds. A small model handles punctuation and capitalization just fine and takes under a second.
One thing that trips people up: the cleanup model can hallucinate too. TBH this happens maybe 2 percent of the time. It might change "I saw" and "isaw" because the capitalization confused it. Or it adds a period where you wanted a comma. Most of the time it is correct but it is not something you should use for anything where exact wording matters, like legal transcription or medical notes. For emails, notes, chat messages, and coding? It is great.
I actually wrote a whole blog post about improving accuracy from 70 to 95 percent. A lot of that improvement comes from tuning this cleanup step.
Step 5: Context awareness
This is the step that separates good voice dictation from great voice dictation. And honestly, it is the step I am most proud of in Rota AI.
Context awareness means Rota AI pays attention to what app you are using and what kind of text you are dictating. It is not just dumping text. It is trying to understand the context.
For example. When you are dictating into VS Code or Cursor, Rota AI changes its behavior. It tries to preserve code formatting. It keeps camelCase and snake_case intact. It respects indentation. It does not randomly punctuate code like the cleanup model might normally do.
How does this work? Rota AI checks which application has focus. When you activate dictation, it detects the foreground app and adjusts the cleanup model's behavior. For code editors, it uses a specialized prompt that tells the cleanup model to preserve programming syntax. For email or chat apps, it uses normal punctuation rules.
The detection is not perfect. I will be honest about that. It works for the top 20 or so most common apps. If you are using some obscure enterprise tool, Rota AI might not recognize it and you will get default behavior. You can manually set per app behavior in settings though.
This is also where custom voice snippets come in. If you have snippets defined for a specific app, Rota AI will use those expansions during this step. Like if you have a snippet that expands "sign" into your email signature when you are in Outlook. That is context awareness too.
FR this feature was a pain to build. Managing app detection, switching prompts, and keeping it all fast enough that you do not notice a delay. But it makes a real difference in daily use.
Step 6: Text Injection
The final step. You have clean, context appropriate text. Now it needs to land in the right app.
Rota AI supports three injection methods.
SendInput is the default. It simulates keyboard input at the OS level. Most apps handle this fine. It is the fastest method and works for probably 80 percent of use cases.
Clipboard paste copies the text to your clipboard and sends Ctrl+V. More reliable for apps that do not process keyboard events well. The downside is it overwrites your clipboard contents.
pyautogui is the fallback of last resort. It moves your mouse, clicks the text field, and types. Slowest but it works when nothing else does.
In Rota AI settings, you can set a large text threshold. By default it is 200 characters. Short dictations use SendInput. Long ones automatically switch to clipboard paste because it handles big blocks of text better. Sending 500 characters one keystroke at a time is asking for trouble. I have seen it cause garbled text in Slack more times than I can count.
There is also an injection delay setting. If your target app is slow to process input, adding a small delay between characters can prevent text corruption. For most people the default works fine.
Putting It All Together
Here is what happens from start to finish when you hit your dictation hotkey and say "hey send a message to karthik that the build is ready":
- Audio capture picks up your voice through the microphone buffer
- VAD detects speech starting at "hey" and stops when you finish
- The full audio segment goes to Whisper and comes back as "hey send a message to karthik that the build is ready"
- The cleanup model adds punctuation: "Hey. Send a message to Karthik that the build is ready."
- Context awareness checks you are in Slack and adjusts formatting
- Text injection pastes the result into the active conversation
All of that happens in about two to four seconds. Sometimes faster.
Is it perfect? No. Every step has tradeoffs. VAD can miss the first word. Whisper can hallucinate on uncommon words. Cleanup can add wrong punctuation. Context detection can misidentify your app. Injection can garble text in weird apps.
But here is the thing. It works well enough that I use it every single day. And I am picky about this stuff. I built it. I know all the ways it can fail. and I still use it instead of typing most of the time.
That tells you something.
Frequently Asked Questions
Does Rota AI work offline? Yes. If you use a local Whisper model through Ollama, the entire pipeline runs on your machine. Audio capture, VAD, transcription, cleanup (with a local model), context detection, and injection all happen locally. The only step that needs cloud by default is transcription and you can swap that for local.
Why does my first word get cut off? That is a VAD issue. Enable the pre buffer in settings. It keeps a small chunk of pre speech audio and prepends it. Also check your VAD threshold. If it is too high, soft spoken words get ignored.
Is voice dictation accurate enough for professional use? For emails, notes, drafts, and casual communication, yes. For legal, medical, or any context where exact wording matters, do not rely on it without review. The accuracy is around 95 percent for normal speech and drops with accents, technical terms, or background noise.
Can I use voice dictation for coding? You can. When Rota AI detects VS Code or Cursor, it switches to a code aware mode that preserves syntax. It is great for comments, docstrings, and boilerplate. Not great for complex logic or precise variable names.
What microphone do I need? Your laptop mic works to get started. A $20 to $30 USB condenser mic makes a noticeable difference. A headset mic is the best for accuracy because it stays at a consistent distance from your mouth.
Does it work on Mac or Linux? Rota AI is Windows only for now. Mac support is on the roadmap but I do not have a timeline. and no I will not say "soon." I learned my lesson on that one.
Final Thoughts
Most people think voice dictation is just speech to text. Press a button, talk, text appears. The reality is way more complicated. There are at least six distinct stages between your mouth and the text on your screen and each one is a real engineering problem.
I think understanding the pipeline helps you use these tools better. When you know that VAD is what decides when you are speaking, you can adjust your mic placement. When you know cleanup is a separate step from transcription, you can understand why punctuation sometimes goes weird. When you know context detection looks at your foreground app, you can set per app preferences.
Rota AI is open source and free. If you want to see exactly how any of these steps work, the code is on GitHub. Fork it. Improve it. Send me a pull request.
Or just use it and enjoy talking to your computer. That works too.
