Gaze OCR: Talon support and 10 new features!

I’m excited to announce a major new release of my gaze-ocr system, chock-full of new features for both Dragonfly and Talon users! Thanks to a speedy new Talon/Cursorless/VSCode setup, the months since my last post have been some of my most productive ever, and a lot of this time was spent improving my integration with text recognition (OCR). As described and demoed in my earlier post, this package enables you to click, select, or position the caret adjacent to any text visible onscreen. It works best when combined with eye tracking, but one of the new features in Talon is that it will now work even without an eye tracker! In this post, I’ll give you a behind-the-scenes look at all the new features available. If you haven’t yet tried it out, I encourage you to install talon-gaze-ocr for Talon or gaze-ocr for Dragonfly so you can follow along.
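For Dragonfly users, wiring things up in Python looks roughly like the sketch below. This is only my recollection of the gaze-ocr and screen-ocr APIs, so treat the class and method names as assumptions and check each package’s README for the current signatures.

```python
# Rough sketch of a Dragonfly-side gaze-ocr setup. Class and method names
# are assumptions based on my reading of the gaze-ocr/screen-ocr READMEs
# and may not match the current release exactly.
import gaze_ocr
import gaze_ocr.eye_tracking
import screen_ocr

# OCR reader from the companion screen-ocr package.
ocr_reader = screen_ocr.Reader.create_fast_reader(radius=200)

# Eye tracker connection (optional; the Talon integration can now fall back
# to OCRing a larger region when no tracker is available).
tracker = gaze_ocr.eye_tracking.EyeTracker.get_connected_instance()

controller = gaze_ocr.Controller(ocr_reader, tracker)

# Click on the visible word "submit", using gaze to disambiguate duplicates.
controller.move_cursor_to_words("submit")
```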

Continue reading Gaze OCR: Talon support and 10 new features!

Talon: In-Depth Review

Voice control systems come and go. Most fail to attract a significant following and eventually fizzle away. Every once in a while, though, one persists and shapes the community forever: Vocola, Utter Command, Dragonfly, and others before my time. In the last couple of years, Talon has gained momentum as a new system targeted at power users. About a year ago, it added Windows and Linux to the list of supported OSes (after Mac), and that’s when I started paying closer attention. For a while, I watched it improve from the sidelines. As someone who spent years developing my own grammar for Dragonfly, I wasn’t eager to rewrite it all or learn a new community grammar. On top of that, I had concerns that much of Talon was not open source. Despite this hesitance, I saw enough promise in my early experiments that I decided to take the plunge and test it out in earnest. The effort has undoubtedly been worthwhile. I’ve been impressed both by what Talon can do today and by the velocity with which it is improving. I can confidently say that I’m more productive than ever and have never been more excited about the future of voice computing. In this post, I’ll examine Talon’s capabilities and conclude with a detailed discussion of the implications of its partially closed-source model. Although the focus will be on Talon, I’ll occasionally compare it to Dragonfly as a helpful reference point, since it is a system that I (and many of my readers) have used for years.

Continue reading Talon: In-Depth Review

Gaze OCR: New release with Windows Runtime OCR support

In my previous post, I introduced my new gaze-ocr package for easy clicking or text editing in any app or website (demo video). If you haven’t upgraded the screen-ocr and gaze-ocr packages recently, go do that now. On Windows, I’ve added support for the built-in Windows Runtime OCR, which is incredibly fast (~40X faster than Tesseract!) and also very accurate. Be sure to follow the instructions to install the necessary dependencies, which include Python 3.7 or 3.8 (3.9 isn’t quite ready yet). NatLink now supports Python 3 (32-bit only), but you need to follow special installation instructions while it is in beta. Upgrading is worth your time: WinRT is so fast that it opens up the possibility of processing the entire screen instead of just the area near the gaze point, although in practice I still find it helpful to restrict it somewhat.
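If you want to poke at the WinRT backend directly from Python, the screen-ocr package wraps it behind a simple Reader interface. The snippet below is a minimal sketch based on my understanding of that interface; the factory and method names are assumptions, so double-check them against the screen-ocr README.

```python
# Minimal sketch of the screen-ocr Reader interface (names are assumptions
# based on the package README and may differ from the released version).
import screen_ocr

# The "fast" reader should pick the WinRT engine on Windows when available;
# Tesseract remains as the slower, cross-platform fallback.
reader = screen_ocr.Reader.create_fast_reader(radius=200)

# OCR a region of the screen centered on a point (e.g. near the gaze point).
contents = reader.read_nearby((960, 540))
print(contents.as_string())
```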

I learned about WinRT OCR thanks to a comment from Ivan on my previous post. This is why I love open source software — I always learn from others once I share my work!

In a later post, I’ll share more details on all the experiments and tweaks that have gone into making this package as robust as it is today.

Say what you see: Efficient UI interaction with OCR and gaze tracking

User interfaces revolve around on-screen text: descriptive links, named buttons, or editable text that can be selected and moved around. As a power user of voice control, I often bypass this with commands that simulate keyboard shortcuts or operate APIs directly. But this is only feasible for a small number of heavily used apps and websites: it takes too long to add custom commands for everything. I’ve seen several ways to handle this long tail, but they all have issues. Speaking the on-screen text directly requires disambiguation if the same text occurs in multiple places. Numbering the clickable elements adds clutter and takes time to read. Implementations of both of these methods tend to work in only one app or another, leading to an inconsistent experience. Head and eye tracking can control the cursor anywhere, but both are tiring, and neither is accurate enough for precise text selection. As it turns out, however, the pieces for an effective system do exist; they just need to be put together.
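To make the combination concrete, here is a deliberately simplified sketch of the data flow: OCR the region around the gaze point, match the spoken word against the recognized words, and click the closest match. The helpers here are hypothetical placeholders for illustration, not the actual gaze-ocr implementation.

```python
# Hypothetical sketch of combining gaze tracking with OCR to click spoken
# text. The helper callables are placeholders, not real APIs.
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: float
    top: float
    width: float
    height: float

    @property
    def center(self):
        return (self.left + self.width / 2, self.top + self.height / 2)


def click_spoken_text(spoken, gaze_point, ocr_nearby, click):
    """OCR near the gaze point and click the best-matching word.

    ocr_nearby(point) -> list[Word] and click(x, y) are injected so the
    sketch stays independent of any particular eye tracker or OS.
    """
    words = ocr_nearby(gaze_point)
    # Gaze narrows the search, so an exact (case-insensitive) match near the
    # gaze point is usually unambiguous even if the text repeats elsewhere.
    matches = [w for w in words if w.text.lower() == spoken.lower()]
    if not matches:
        return False
    # Prefer the match closest to where the user is looking.
    gx, gy = gaze_point
    best = min(matches,
               key=lambda w: (w.center[0] - gx) ** 2 + (w.center[1] - gy) ** 2)
    click(*best.center)
    return True
```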

Continue reading Say what you see: Efficient UI interaction with OCR and gaze tracking

New getting started guide

A lot has changed in the open-source speech control world in just the last year, let alone over the 5+ years since I started writing this blog. My own involvement has shifted towards longer-term projects and engaging the community through chat rooms on Gitter (I’m @wolfmanstout). Since a lot of people still discover hands-free computing through my blog, I want to help them get oriented in this new world. To that end, I’ve turned the handsfreecoding.org homepage into a structured introduction to my blog entries, along with some key updates and information about alternative approaches. Even if you’re a long-time reader, I encourage you to take a look and see if you learn something new! I plan to keep this new page up-to-date, although I’ll continue to complement it with (occasional) new blog posts. Please let me know in the comments what you think and whether you have any suggestions!

Enhanced text manipulation using accessibility APIs

For the past several years, we Dragon users have had to endure increasingly poor native support for text manipulation in third-party software such as Firefox and Chrome. For a while, Select-and-Say only worked on the most recent utterance; then it stopped working entirely. As someone who writes a lot of emails, I found it painful to lose this functionality, and the workaround of transferring the text to a separate editor is slow and messes up the formatting when composing an inline reply to someone’s email. Nuance offers the Dragon Web Extension, which supposedly fixes these issues, but in practice it has earned its 2-out-of-5-star rating by slowing down command recognition, occasionally hanging the browser, and not working in key web apps such as Google Docs. Over the past few months, I’ve been working to integrate Dragonfly with the accessibility APIs that Chrome and Firefox natively support, which brings this functionality back, and much more. As of today, Windows support is available, and I’m here to tell you how to leverage it and what’s under the hood.
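As a taste of what the integration enables, the Dragonfly side ends up looking roughly like the snippet below; I’m writing these names from memory, so treat them as assumptions and consult the Dragonfly accessibility documentation for the exact signatures.

```python
# Rough sketch of Dragonfly's accessibility API for editing text in Chrome
# or Firefox. Names are from memory and may differ from the documented API.
from dragonfly import CursorPosition, TextQuery, get_accessibility_controller

controller = get_accessibility_controller()

# Move the caret to just after the phrase "kind regards" in the focused
# editable field (e.g. a Gmail compose window).
controller.move_cursor(TextQuery(end_phrase="kind regards"),
                       CursorPosition.AFTER)

# Select a phrase so it can be corrected or reformatted by a later command.
controller.select_text(TextQuery(end_phrase="talk to you soon"))
```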

Continue reading Enhanced text manipulation using accessibility APIs

Utter Command: Why I Rewrote My Entire Grammar

For years, I’ve been approaching speech recognition like a backend engineer: I have a flexible coding style for managing my grammars, I’ve implemented a lot of functionality, and I’ve added some helpful integrations. But embarrassingly, until recently, I hadn’t put much thought into the user experience. This all changed after I received an email from Kim Patch, the author of Utter Command, a set of extensions to Dragon that has been around for decades.

Continue reading Utter Command: Why I Rewrote My Entire Grammar

Mozilla DeepSpeech: Initial Release!

Last week, Mozilla announced the first official releases of DeepSpeech and Common Voice, their open source speech recognition system and speech dataset! They seem to have made a lot of progress on DeepSpeech in little time: they had a target of <10% word error rate and achieved 6.5%! This is a very strong result; for comparison, Google boasts a 4.9% WER (albeit on different datasets). See their technical post for more details on how they pulled it off.

For this post, I’ll cover the basic information you’ll need to get it up and running in a Linux guest VM under VirtualBox on a Windows host, since that’s my home setup. Note that the engine has not yet been integrated into any sort of real-time system, so what you’ll have at the end of this is a developer’s sandbox to play with, not something you can start using day-to-day. I do hope to eventually get it integrated into my daily workflow, but that’s going to take much more time.
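Once the build works, transcribing a WAV file from Python looks roughly like the snippet below, adapted from the client example that shipped with the 0.1 release; the constants and signatures here are from memory and may differ in later versions.

```python
# Adapted from the DeepSpeech 0.1 Python client example; constants and
# signatures are my recollection and may differ in later releases.
import scipy.io.wavfile as wav
from deepspeech.model import Model

# Feature/decoder settings used by the shipped client at the time.
N_FEATURES = 26
N_CONTEXT = 9
BEAM_WIDTH = 500
LM_WEIGHT = 1.75
WORD_COUNT_WEIGHT = 1.00
VALID_WORD_COUNT_WEIGHT = 1.00

ds = Model('models/output_graph.pb', N_FEATURES, N_CONTEXT,
           'models/alphabet.txt', BEAM_WIDTH)
# Optional: enable the KenLM language model for better accuracy.
ds.enableDecoderWithLM('models/alphabet.txt', 'models/lm.binary',
                       'models/trie', LM_WEIGHT, WORD_COUNT_WEIGHT,
                       VALID_WORD_COUNT_WEIGHT)

fs, audio = wav.read('my_recording.wav')  # expects 16 kHz mono 16-bit PCM
print(ds.stt(audio, fs))
```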

UPDATE(12/25): If you are using Windows 10, consider running DeepSpeech natively on WSL (Windows Subsystem for Linux) instead of a VM, where you don’t have to compile from source and you’ll have faster recognition speed. Instructions here: https://fotidim.com/deepspeech-on-windows-wsl-287cb27557d4. If you run into problems with processor limitations, see info below on how to adjust CPU optimizations when compiling from source.

Continue reading Mozilla DeepSpeech: Initial Release!