I’m excited to announce a major new release of my gaze-ocr system, chock-full of new features for both Dragonfly and Talon users! Thanks to a speedy new Talon/Cursorless/VSCode setup, the months since my last post have been some of my most productive ever, and a lot of this time was spent improving my integration with text recognition (OCR). As described and demoed in my earlier post, this package enables you to click, select, or position the caret adjacent to any text visible onscreen. It works best when combined with eye tracking, but one of the new features on Talon is that it will now work even without an eye tracker! In this post, I’ll give you a behind-the-scenes look into all the new features available. If you haven’t yet tried it out, I encourage you to install talon-gaze-ocr for Talon or gaze-ocr for Dragonfly so you can follow along.
For both Dragonfly and Talon
Two new features are available for both Dragonfly and Talon users: fine-grained positioning and phrase matching. To get these new features for Dragonfly, you’ll need to run “pip install --upgrade gaze-ocr screen-ocr”. I have made breaking changes to the API in order to support Talon, so look at the updated sample grammar in the readme and adjust your code as needed.
Fine-grained positioning: subwords and punctuation
You can now select and manipulate text smaller than a “word” as OCR recognizes it. For example, “MyVariable.” is recognized by both the Windows and Mac OCR APIs as a single word, and screen coordinate information is only reported at the edges of the entire character sequence. To get past that, my system uses a hybrid of keyboard and mouse simulation (click and press left/right) to get the selection or caret in the right place. Hence, you can separately say “select my”, “select variable”, and “select period”.
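To make the mechanics concrete, here’s a rough sketch of how a subword click might be planned. This is not the actual gaze-ocr implementation — the function names are hypothetical, and it assumes roughly uniform character widths within the OCR bounding box: click near the estimated center of the subword, then refine the caret position with arrow key presses.

```python
import re

def subword_spans(text):
    """Split an OCR 'word' like 'MyVariable.' into subwords and punctuation,
    returning (subword, start_index, end_index) tuples."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+|[^\w\s]", text)]

def click_plan(text, left_px, right_px, target):
    """Plan a click near the target subword, then arrow keys to refine.

    Returns (click_x, presses): presses > 0 means press right that many times
    after clicking; presses < 0 means press left."""
    spans = subword_spans(text)
    sub, start, end = next(s for s in spans if s[0] == target)
    # Assume uniform character width across the OCR bounding box.
    px_per_char = (right_px - left_px) / len(text)
    center = (start + end) / 2
    click_x = left_px + center * px_per_char
    # The click lands the caret at the nearest character boundary;
    # nudge it back to the subword's start with arrow keys.
    nearest_char = round(center)
    presses = start - nearest_char
    return click_x, presses
```

For example, with “MyVariable.” spanning pixels 100–210, planning a click on “Variable” clicks at x=160 and presses left four times to reach the subword boundary.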
Phrase matching
In addition to (fuzzy) matching individual words on screen, you can now match sequences of words (or subwords). This is especially useful to reduce ambiguity, e.g. when multiple words or punctuation are visually nearby. Suppose you have the phrase “the birds, the bees, and the trees”, and you want to position the cursor after the second comma. You can achieve this with “go after the bees comma”.
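As an illustration — the real matcher is fuzzier than this — a minimal phrase matcher can slide a window over the OCR’d words, ignoring case and attached punctuation:

```python
import string

def normalize(word):
    """Strip attached punctuation and lowercase for comparison."""
    return word.strip(string.punctuation).lower()

def find_phrase(onscreen_words, spoken):
    """Return start indices where the spoken word sequence matches a
    run of onscreen OCR words."""
    target = [normalize(w) for w in spoken]
    hits = []
    for i in range(len(onscreen_words) - len(target) + 1):
        window = [normalize(w) for w in onscreen_words[i:i + len(target)]]
        if window == target:
            hits.append(i)
    return hits
```

With “the birds, the bees, and the trees” onscreen, the single word “the” matches in three places, but the phrase “the bees” matches in exactly one — which is why phrases cut down on ambiguity.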
For Talon (and ambitious Dragonfly users)
Several more new features are available today only on Talon. Under the covers, I continue to write the vast majority of functionality in a platform-agnostic way, but in practice the Talon ecosystem provides a ton of functionality that’s widely used and easy to integrate with, including speech recognition timestamps, an overlay drawing library, a standard system for declaring chained grammars, and a near-universally adopted community grammar which provides homophones and contextual dictation features. If Dragonfly users are interested in integrating with Dragonfly (or Caster) analogues of these features, I would be happy to link to examples.
Also, a note on stability: the talon-gaze-ocr repo is as stable as anything I have released on this blog, but less stable than core Talon repositories like knausj_talon or Cursorless. It relies on some private/unsupported APIs, so it could break at any time. I also regularly add and remove features as I try them out. Historically, most breakages have been caused by knausj_talon refactoring. If you are running into problems, I suggest reverting to the last working Talon, knausj_talon, and/or talon-gaze-ocr version and waiting for a fix.
Disambiguation
If multiple words or phrases onscreen near your eye gaze match your query, you’ll see numbers pop up allowing you to choose which one you are referring to (try “choose one” or “numbers hide”). This is in contrast to the old behavior of simply choosing the nearest instance, a risky prospect when you have “Don’t Send” and “Send” buttons adjacent to each other and you say “click send”, for example.
Although it’s an important feature for safety, disambiguation is undeniably annoying: it breaks your chain of thought and slows you down. To minimize disambiguation, I recommend using phrases instead of single words when working with dense text (e.g. when editing prose).
Eye tracker now optional
If you don’t have an eye tracker, you can now use the system just the same. Since we won’t be able to filter matches based on your gaze, you will see more disambiguation. I find this feature really helpful when I need to use my laptop without my eye tracker connected – all the same commands continue to work. I hope this will encourage some people who don’t have an eye tracker to try the system out … and then buy an eye tracker, because it’s a way better experience with that included!
Homophone matching
We will now automatically match onscreen words that sound the same as whatever came out of speech recognition. For example, if Talon recognizes “click here”, that’ll also match “hear”. Behind the scenes, this is using homophones.csv in knausj_talon. This is a really critical feature because Talon doesn’t (yet) consider what text is onscreen when recognizing your speech. And even if it did, we would still want this in case there are homophones onscreen near your gaze that we need to disambiguate.
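Here’s a minimal sketch of the idea, with a tiny inline stand-in for homophones.csv (the real file lives in knausj_talon and is much larger):

```python
import csv
import io

# Hypothetical two-row stand-in for knausj_talon's homophones.csv.
HOMOPHONES_CSV = "here,hear\nto,too,two\n"

def load_homophones(text):
    """Map each word to the frozenset of words it sounds like."""
    groups = {}
    for row in csv.reader(io.StringIO(text)):
        group = frozenset(w.lower() for w in row)
        for word in group:
            groups[word] = group
    return groups

def sounds_like(spoken, onscreen, groups):
    """True if the onscreen word matches the spoken word exactly or as
    a homophone."""
    s, o = spoken.lower(), onscreen.lower()
    return o == s or o in groups.get(s, frozenset())
```

So a recognized “click here” also matches an onscreen “hear”, and we can still disambiguate if both “here” and “hear” appear near your gaze.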
Cursorless-inspired actions and modifiers
My system is optimized for prose, but I wanted to reuse the same mental “muscle memory” that I use in Cursorless. To that end, I added a set of command variants that mirror Cursorless style. For example, I can copy the contents of a form field containing the word “hello” with “copy all seen hello”. “Copy” is the action, “all” is the modifier which selects all, and “seen hello” is the mark (think: the word “hello” that was just seen).
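To illustrate the shape of these commands (action, optional modifier, mark), here’s a toy parser. The specific action and modifier names are just examples, not the full command set:

```python
def parse_command(words):
    """Parse commands like 'copy all seen hello' into
    (action, modifier, mark_word). The 'seen <word>' part is the mark;
    the modifier is optional."""
    actions = {"copy", "take", "chuck"}      # illustrative action names
    modifiers = {"all", "word", "line"}      # illustrative modifier names
    action = words[0]
    if action not in actions:
        raise ValueError(f"unknown action: {action}")
    rest = words[1:]
    modifier = rest[0] if rest[0] in modifiers else None
    if modifier:
        rest = rest[1:]
    if rest[0] != "seen":
        raise ValueError("expected 'seen' before the mark")
    return action, modifier, " ".join(rest[1:])
```

So “copy all seen hello” parses to the action “copy”, the modifier “all”, and the mark “hello”, mirroring how Cursorless commands decompose.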
Smart delete, replace, and insertion
If you enable the user.context_sensitive_dictation setting in knausj_talon, you’ll get automatic cleanup of spacing and capitalization with the talon-gaze-ocr commands. For example, if you delete a word in the middle of a sentence, we’ll clean up the adjacent spacing. Or perhaps Talon has typed “to be coma or not to be” and you say “replace coma with comma” – you’ll have “to be, or not to be” as a result. This is done through what’s affectionately known as “the cursor dance” – adjacent text is selected and examined to determine the context. This aids OCR by distinguishing wrapped text from a newline, among other edge cases.
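As a rough illustration of the spacing cleanup — the real implementation inspects the live document via the cursor dance rather than operating on a string — consider these toy helpers:

```python
import re

def delete_word(text, start, end):
    """Delete text[start:end] and tidy the surrounding spacing, roughly
    what the cursor dance achieves in a real editor."""
    merged = text[:start] + text[end:]
    # Collapse the doubled space left behind and close up any gap
    # before punctuation.
    merged = re.sub(r"  +", " ", merged)
    merged = re.sub(r" ([,.;:!?])", r"\1", merged)
    return merged.strip()

def replace_word(text, old, new):
    """Replace a word, letting replacement punctuation hug the
    preceding word."""
    out = text.replace(old, new, 1)
    out = re.sub(r"\s+([,.;:!?])", r"\1", out)
    return out
```

For example, replacing “coma” with “,” in “to be coma or not to be” yields “to be, or not to be”, matching the behavior described above.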
Per-word gaze tracking
We’ll track your gaze as you speak each word in order to provide more accurate filtering. For example, if you speak a command like “select top through bottom”, where the words “top” and “bottom” are on the top and bottom of the page, we will note your eye movements as you speak the words and filter the screenshot separately for each word. In principle, this could also be used to enable effective chaining of multiple gaze-ocr commands, but currently all commands are anchored with “$” so they can’t be chained — feel free to adjust the grammar to your liking, though!
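Conceptually, per-word filtering pairs each word’s speech timestamp with the gaze sample nearest it in time. A simplified sketch, with made-up data shapes (timestamped (t, x, y) gaze samples and a word-to-timestamp mapping):

```python
from bisect import bisect_left

def gaze_at(samples, t):
    """Return the gaze sample (t, x, y) closest in time to t.
    samples must be sorted by timestamp."""
    times = [s[0] for s in samples]
    i = bisect_left(times, t)
    candidates = samples[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda s: abs(s[0] - t))

def gaze_per_word(samples, word_times):
    """Anchor each spoken word's OCR search at where the eyes were
    when that word was spoken."""
    return {word: gaze_at(samples, t)[1:]
            for word, t in word_times.items()}
```

In the “select top through bottom” example, the gaze samples recorded while saying “top” and “bottom” land at opposite ends of the page, so each word’s screenshot search is filtered around a different point.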
Dynamic search radius
Building on the per-word gaze tracking above, we’ll shrink/expand the search radius of onscreen text based on how much your eyes are moving around. This helps a lot with reducing disambiguation when you don’t need it (i.e. you’re staring at a word), but keeping it around when you do. Currently, the parameters behind these heuristics aren’t exposed through settings, because I continue to iterate regularly on the algorithm, but you’re welcome to hack on it yourself if it’s not working to your liking.
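One simple way to implement such a heuristic — illustrative only; the real algorithm and tuning differ — is to scale the search radius with recent gaze dispersion:

```python
import math

def search_radius(recent_points, base=100.0, scale=2.0, max_radius=800.0):
    """Pick an OCR search radius from recent gaze dispersion: a steady
    gaze keeps the radius tight; wandering eyes widen it. All constants
    here are made up, not the real tuning."""
    cx = sum(x for x, _ in recent_points) / len(recent_points)
    cy = sum(y for _, y in recent_points) / len(recent_points)
    spread = max(math.hypot(x - cx, y - cy) for x, y in recent_points)
    return min(base + scale * spread, max_radius)
```

Staring at one spot yields the tight base radius (and hence less disambiguation); scanning across the screen widens the search up to a cap.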
Debugging tools
With all the above features, the system “just works” most of the time. But when it doesn’t, you may wonder why not. To address your curiosity, we will automatically pop up an OCR visualization any time a command fails (or when you request it with “OCR show”). This will show all of the recognized text near your gaze point, and a circle indicating the search space (see if you can reproduce the dynamic growing/shrinking described above!). If you are seeing problems with cursor/caret placement, you can also try “OCR show boxes” to show all the word bounding boxes.
You can view all the available commands with “help context gaze OCR”. Additionally, there are some settings you can adjust (e.g. to show the debugging overlay longer), declared near the top of gaze_ocr_talon.py.
There are still plenty more features I would like to add, including ways to refer to words that are difficult to pronounce. I’m also aware that misbehavior caused by various sources (e.g. speech recognition, eye tracking, OCR error) is currently the biggest pain point, and is more noticeable when editing text vs. just clicking around. I’d love to hear more feedback and ideas from you all in the comments! You can also say hi on the Talon Slack channel #ext-gaze-ocr.