For the past several years, we Dragon users have had to endure increasingly poor native support for text manipulation in third-party software such as Firefox and Chrome. For a while, Select-and-Say only worked on the most recent utterance; then it stopped working entirely. As someone who writes a lot of emails, I found it painful to lose this functionality, and the workaround of transferring to a text editor is slow and messes up the formatting when composing an inline reply to someone’s email. Nuance offers the Dragon Web Extension, which supposedly fixes these issues, but in practice it has earned its 2 out of 5 star rating by slowing down command recognition, occasionally hanging the browser, and not working in key web apps such as Google Docs. Over the past few months, I’ve been working to integrate Dragonfly with the accessibility APIs that Chrome and Firefox natively support, which brings this functionality back, and much more. As of today, Windows support is available, and I’m here to tell you how to leverage it and what’s under the hood. Continue reading Enhanced text manipulation using accessibility APIs
For years, I’ve been approaching speech recognition like a backend engineer: I have a flexible coding style for managing my grammars, I’ve implemented a lot of functionality, and I’ve added some helpful integrations. But embarrassingly, until recently, I hadn’t put much thought into the User Experience. This all changed after I received an email from Kim Patch, the author of Utter Command, a set of extensions to Dragon that has been around for decades. Continue reading Utter Command: Why I Rewrote My Entire Grammar
Thanks to the work of several volunteers and an anonymous helper at Nuance, the latest version of Dragon NaturallySpeaking (DPI 15) now works with NatLink and Dragonfly! I’ve been testing it for the last couple weeks and it works well with only a few minor issues to work around.
Continue reading Dragon 15 now works with NatLink and Dragonfly
Last week, Mozilla announced the first official releases of DeepSpeech and Common Voice, their open source speech recognition system and speech dataset! They seem to have made a lot of progress on DeepSpeech in little time: they had a target of <10% word error rate (WER) and achieved 6.5%! This is a very strong result: for comparison, Google boasts a 4.9% WER (albeit on different datasets). See their technical post for more details on how they pulled it off.
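If you haven’t worked with word error rate before, here is a minimal sketch of how it is usually defined: the word-level edit distance (substitutions, insertions, and deletions) between the recognizer’s output and a reference transcript, divided by the number of words in the reference. The function name `word_error_rate` is my own for illustration, and real evaluations may normalize case and punctuation differently, so don’t read this as the exact scoring behind the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration: it splits on whitespace only
    and does no case or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("the quick brown fox", "the quick brown dog"))  # 0.25
```

Note that WER can exceed 100% when the hypothesis contains many extra words, since insertions count as errors but the denominator is the reference length.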
For this post, I’ll cover the basic information you’ll need to get it up and running on a Linux guest VM running on VirtualBox on a Windows host, since that’s my home setup. Note that the engine has not yet been integrated into any sort of real-time system, so what you’ll have at the end of this is a developer’s sandbox to play with, not something you can start using day-to-day. I do hope to eventually get it integrated into my daily workflow, but that’s going to take much more time.
UPDATE(12/25): If you are using Windows 10, consider running DeepSpeech natively on WSL (Windows Subsystem for Linux) instead of in a VM. With WSL, you don’t have to compile from source and you’ll get faster recognition speed. Instructions here: https://fotidim.com/deepspeech-on-windows-wsl-287cb27557d4. If you run into problems with processor limitations, see the info below on how to adjust CPU optimizations when compiling from source.
Continue reading Mozilla DeepSpeech: Initial Release!
Firefox has gained a lot of exciting updates recently that make it very competitive with Chrome. Try it out if you haven’t already (I use the developer edition). Because both browsers now use the same extension API, I’ve just published my hands-free browsing extensions to both Firefox and Chrome repositories.
The second is a fork of Vimium that I’m calling Modeless Keyboard Navigation (get for Firefox or Chrome) to avoid confusion with Vimium. Unlike Vimium, the keyboard shortcuts can be used at any time, and the default bindings use modifier keys (think Emacs, not Vim). I find this much faster for voice control, where mode switching means a round-trip to Dragon.
Hope you find them useful! If you’ve discovered or created any useful browser extensions that help with voice control, please post them in the comments.
I learned about a couple very exciting new developments this week in open source speech recognition, both coming from Mozilla. The first is that a year and a half ago, Mozilla quietly started working on an open source, TensorFlow-based DeepSpeech implementation. DeepSpeech is a state-of-the-art deep-learning-based speech recognition system designed by Baidu and described in detail in their research paper. Currently, Mozilla’s implementation requires that users train their own speech models, which is a resource-intensive process that requires expensive closed-source speech data to get a good model. But that brings me to Mozilla’s more recent announcement: Project Common Voice. Their goal is to crowd-source collection of 10,000 hours of speech data and open source it all. Once this is done, DeepSpeech can be used to train a high-quality open source recognition engine which can easily be distributed and used by anyone!
This is a Big Deal for hands-free coding. For years I have increasingly felt that the bottleneck in my hands-free system is that I can’t do anything beneath the limited API that Dragon offers. I can’t hook into the pure dictation and editing system, I can’t improve the built-in UIs for text editing or training words/phrases, I’m limited to getting results from complete utterances after a pause, and I can’t improve Dragon’s OS-level integration or port it to Linux. If an open source speech recognition engine becomes available that can compete with Dragon in latency and quality, all of this becomes possible.
To accelerate progress towards this new world of end-to-end open source hands-free coding, I encourage everyone to contribute their voice to Project Common Voice, and share Mozilla’s blog post through social media.
Tobii has released a new consumer eye tracker, the Tobii Eye Tracker 4C, for $150. Although I haven’t found eye tracking to be nearly as helpful as speech recognition, it is handy for those occasional situations where you just want to click a button or change context and you don’t have any command to do so (see my earlier post for details). I have been pretty happy with the Tobii EyeX, but it isn’t perfect, so I was excited to try out this new device. Continue reading Tobii Eye Tracker 4C Review
Not related to coding, but hands-free coders need to have some fun too. 🙂
I discovered recently that Hearthstone can be easily played with eye/head tracking and minimal voice controls (move pointer and click), thanks to the turn-based interface, large click targets, and a high thinking-to-clicking ratio. I don’t even use a custom grammar and it works very well. If you have a good experience, you can thank Blizzard on this thread I started. Hopefully I didn’t just set the voice coding community back by a few months!
If you know of other games that play well with hands-free control, please post in the comments.
Like many an Emacs user, I am enamored with Org-Mode. Every great coding session begins with organizing your thoughts, and Org-Mode is an excellent tool for the job. If you’re tracking New Year’s resolutions, it’s great for that too. Since Org-Mode already has an excellent compact guide, I’ll focus on my voice bindings and finish with a bonus section on how I like to structure my personal to-do lists. Continue reading Getting organized with Org mode
When I find myself writing or editing something sufficiently long, I like to have full support for Select-and-Say. I used to use “open dictation box”, since that’s the obvious choice, until I discovered that using Notepad is much faster. Continue reading Avoid the dictation box