Gaze OCR: Talon support and 10 new features!

I’m excited to announce a major new release of my gaze-ocr system, chock-full of new features for both Dragonfly and Talon users! Thanks to a speedy new Talon/Cursorless/VSCode setup, the months since my last post have been some of my most productive ever, and a lot of this time was spent improving my integration with text recognition (OCR). As described and demoed in my earlier post, this package enables you to click, select, or position the caret adjacent to any text visible onscreen. It works best when combined with eye tracking, but one of the new features on Talon is that it will now work even without an eye tracker! In this post, I’ll give you a behind-the-scenes look into all the new features available. If you haven’t yet tried it out, I encourage you to install talon-gaze-ocr for Talon or gaze-ocr for Dragonfly so you can follow along.

For both Dragonfly and Talon

Two new features are available for both Dragonfly and Talon users: fine-grained positioning and phrase matching. To get these new features for Dragonfly, you’ll need to run “pip install --upgrade gaze-ocr screen-ocr”. I have made breaking changes to the API in order to support Talon, so look at the updated sample grammar in the readme and adjust your code as needed.
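For orientation, here’s the rough shape of a Dragonfly command module that hands the spoken words to the gaze-ocr controller. The controller call is a placeholder; copy the real construction and method names from the readme’s sample grammar rather than from this sketch.

```python
from dragonfly import Dictation, Function, Grammar, MappingRule


def move_cursor_to_text(text):
    # Placeholder: call the gaze-ocr controller here, e.g. something like
    # gaze_ocr_controller.move_cursor_to_words(str(text)). The controller
    # object and method name are hypothetical; copy them from the readme.
    print(f"would move the cursor to {text!r}")


class GazeOcrExampleRule(MappingRule):
    mapping = {
        "go before <text>": Function(move_cursor_to_text),
    }
    extras = [Dictation("text")]


grammar = Grammar("gaze_ocr_example")
grammar.add_rule(GazeOcrExampleRule())
grammar.load()
```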

Fine-grained positioning: subwords and punctuation

You can now select and manipulate text smaller than a “word” as OCR recognizes it. For example, “MyVariable.” is recognized by both the Windows and Mac OCR APIs as a single word, and screen coordinate information is only reported at the edges of the entire character sequence. To get past that, my system uses a hybrid of keyboard and mouse simulation (click and press left/right) to get the selection or caret in the right place. Hence, you can separately say “select my”, “select variable”, and “select period”.
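If you’re curious how the splitting works conceptually, here’s a toy sketch (not the actual implementation) of breaking an OCR “word” into speakable pieces:

```python
import re

# Split an OCR-recognized "word" such as "MyVariable." into speakable pieces:
# camel-case subwords, digit runs, and individual punctuation characters.
SUBWORD_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+|[^\sA-Za-z0-9]")


def split_subwords(ocr_word):
    return SUBWORD_RE.findall(ocr_word)


print(split_subwords("MyVariable."))  # ['My', 'Variable', '.']
```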

Phrase matching

In addition to (fuzzy) matching individual words on screen, you can now match sequences of words (or subwords). This is especially useful to reduce ambiguity, e.g. when multiple words or punctuation are visually nearby. Suppose you have the phrase “the birds, the bees, and the trees”, and you want to position the cursor after the second comma. You can achieve this with “go after the bees comma”.
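Conceptually, phrase matching slides a window across the recognized words and requires each word in the window to fuzzily match the corresponding spoken word. Here’s a toy sketch of the idea (not the actual matcher in screen-ocr):

```python
from difflib import SequenceMatcher


def word_score(spoken, seen):
    return SequenceMatcher(None, spoken.lower(), seen.lower()).ratio()


def find_phrase(spoken_words, ocr_words, threshold=0.8):
    """Return (start, end) index pairs of contiguous OCR words that fuzzily
    match the spoken phrase."""
    n = len(spoken_words)
    matches = []
    for start in range(len(ocr_words) - n + 1):
        window = ocr_words[start:start + n]
        if all(word_score(s, w) >= threshold for s, w in zip(spoken_words, window)):
            matches.append((start, start + n))
    return matches


# "go after the bees comma": search for the phrase ["the", "bees", ","]
print(find_phrase(["the", "bees", ","],
                  ["the", "birds", ",", "the", "bees", ",", "and", "the", "trees"]))
# [(3, 6)]
```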

For Talon (and ambitious Dragonfly users)

Several more new features are available today only on Talon. Under the covers, I continue to write the vast majority of functionality in a platform-agnostic way, but in practice the Talon ecosystem provides a ton of functionality that’s widely used and easy to integrate with, including speech recognition timestamps, an overlay drawing library, a standard system for declaring chained grammars, and a near-universally adopted community grammar which provides homophones and contextual dictation features. If Dragonfly users are interested in integrating with Dragonfly (or Caster) analogues of these features, I would be happy to link to examples.

Also, a note on stability: the talon-gaze-ocr repo is as stable as anything I have released on this blog, but less stable than core Talon repositories like knausj_talon or Cursorless. It relies on some private/unsupported APIs, so it could break at any time. I also regularly add and remove features as I try them out. Historically, most breakages have been caused by knausj_talon refactoring. If you are running into problems, I suggest reverting to the last working Talon, knausj_talon, and/or talon-gaze-ocr version and waiting for a fix.

Disambiguation

If multiple words or phrases onscreen near your eye gaze match your query, you’ll see numbers pop up allowing you to choose which one you are referring to (try “choose one” or “numbers hide”). This is in contrast to the old behavior of simply choosing the nearest instance, a risky prospect when you have “Don’t Send” and “Send” buttons adjacent to each other and you say “click send”, for example.

Although it’s an important feature for safety, disambiguation is undeniably annoying: it breaks your chain of thought and slows you down. To minimize disambiguation, I recommend using phrases instead of single words when working with dense text (e.g. when editing prose).

Eye tracker now optional

If you don’t have an eye tracker, you can now use the system just the same. Since we won’t be able to filter matches based on your gaze, you will see more disambiguation. I find this feature really helpful when I need to use my laptop without my eye tracker connected – all the same commands continue to work. I hope this will encourage some people who don’t have an eye tracker to try the system out … and then buy an eye tracker, because it’s a way better experience with that included!

Homophones

We will now automatically match onscreen words that sound the same as whatever came out of speech recognition. For example, if Talon recognizes “click here”, that’ll also match “hear”. Behind the scenes, this is using homophones.csv in knausj_talon. This is a really critical feature because Talon doesn’t (yet) consider what text is onscreen when recognizing your speech. And even if it did, we would still want this in case there are homophones onscreen near your gaze that we need to disambiguate.
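Conceptually, each recognized word is expanded into its homophone set before searching the screen. Here’s a sketch of that idea, assuming the knausj_talon-style format of one comma-separated homophone group per line (this isn’t the actual integration code):

```python
import csv
from pathlib import Path


def load_homophone_sets(csv_path):
    """Map each word to the full set of words it sounds like.

    Assumes one comma-separated group of homophones per line, e.g. "hear,here".
    """
    sets = {}
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            group = {word.strip().lower() for word in row if word.strip()}
            for word in group:
                sets[word] = group
    return sets


homophones = load_homophone_sets(Path("homophones.csv"))
# A recognized "here" should also match an onscreen "hear":
candidates = homophones.get("here", {"here"})
```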

Cursorless-inspired actions and modifiers

My system is optimized for prose, but I wanted to reuse the same mental “muscle memory” that I use in Cursorless. To that end, I added a set of command variants that mirror Cursorless style. For example, I can copy the contents of a form field containing the word “hello” with “copy all seen hello”. “Copy” is the action, “all” is the modifier which selects all, and “seen hello” is the mark (think: the word “hello” that was just seen).

Smart delete, replace, and insertion

If you enable the user.context_sensitive_dictation setting in knausj_talon, you’ll get automatic cleanup of spacing and capitalization with the talon-gaze-ocr commands. For example, if you delete a word in the middle of a sentence, we’ll clean up the adjacent spacing. Or perhaps Talon has typed “to be coma or not to be” and you say “replace coma with comma” – you’ll have “to be, or not to be” as a result. This is done through what’s affectionately known as “the cursor dance”: adjacent text is selected and examined to determine the context. This supplements OCR by distinguishing wrapped text from a true newline, among other edge cases.
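To give a flavor of the kind of decision that context feeds into, here’s a toy spacing check (the real cleanup is knausj_talon’s context-sensitive dictation logic, which handles many more cases):

```python
def needs_space_between(left, right):
    """Return whether a space belongs between two adjacent context characters.

    A toy stand-in for the kind of check the context-sensitive cleanup makes;
    the real logic is in knausj_talon and handles many more cases.
    """
    if not left or not right:
        return False
    if left.isspace() or right.isspace():
        return False
    if right in ".,;:!?)":  # no space before closing punctuation
        return False
    if left in "([":  # no space after opening punctuation
        return False
    return True


# Replacing "coma" with ",": no space is wanted before the comma, so the
# preceding space is absorbed, giving "to be, or not to be".
print(needs_space_between("e", ","))  # False
print(needs_space_between(",", "o"))  # True
```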

Per-word gaze tracking

We’ll track your gaze as you speak each word in order to provide more accurate filtering. For example, if you speak a command like “select top through bottom”, where the words “top” and “bottom” are on the top and bottom of the page, we will note your eye movements as you speak the words and filter the screenshot separately for each word. In principle, this could also be used to enable effective chaining of multiple gaze-ocr commands, but currently all commands are anchored with “$” so they can’t be chained — feel free to adjust the grammar to your liking, though!
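Conceptually, we keep a short history of gaze samples and look up the sample nearest to each spoken word’s timestamp. Here’s a simplified sketch of that lookup (illustrative only; the names and data layout are made up for the example):

```python
from bisect import bisect_left


def gaze_at(timestamp, gaze_samples):
    """Return the (x, y) gaze point recorded nearest to a spoken word's timestamp.

    gaze_samples is a time-ordered list of (t, x, y) tuples. Illustrative only;
    the real code buffers gaze history and uses Talon's speech timestamps.
    """
    times = [t for t, _, _ in gaze_samples]
    i = bisect_left(times, timestamp)
    if i == 0:
        return gaze_samples[0][1:]
    if i == len(gaze_samples):
        return gaze_samples[-1][1:]
    before, after = gaze_samples[i - 1], gaze_samples[i]
    nearest = before if timestamp - before[0] <= after[0] - timestamp else after
    return nearest[1:]


# "select top through bottom": look up the gaze separately for each word.
samples = [(0.0, 800, 100), (0.5, 805, 120), (1.2, 790, 900)]
print(gaze_at(0.1, samples))  # near the top of the page
print(gaze_at(1.1, samples))  # near the bottom of the page
```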

Dynamic search radius

Building on the per-word gaze tracking above, we’ll shrink/expand the search radius of onscreen text based on how much your eyes are moving around. This helps a lot with reducing disambiguation when you don’t need it (i.e. you’re staring at a word), but keeping it around when you do. Currently, the parameters behind these heuristics aren’t exposed through settings, because I continue to iterate regularly on the algorithm, but you’re welcome to hack on it yourself if it’s not working to your liking.
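The heuristic is roughly “the more your gaze has been moving, the larger the search radius”. Here’s a toy version (the actual parameters and algorithm are internal and still changing, as noted above):

```python
import math


def search_radius(recent_gaze_points, base=100.0, scale=1.5, cap=600.0):
    """Grow the search radius with recent gaze dispersion.

    Illustrative heuristic only; the parameter names and values are made up.
    """
    if not recent_gaze_points:
        return cap  # no gaze data at all: search the whole area
    cx = sum(x for x, _ in recent_gaze_points) / len(recent_gaze_points)
    cy = sum(y for _, y in recent_gaze_points) / len(recent_gaze_points)
    spread = max(math.hypot(x - cx, y - cy) for x, y in recent_gaze_points)
    return min(cap, base + scale * spread)


# Staring at one spot keeps the radius near `base`; scanning around grows it.
print(search_radius([(500, 400), (505, 398), (498, 402)]))
print(search_radius([(200, 300), (900, 250), (850, 700)]))
```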

Debugging overlay

With all the above features, the system “just works” most of the time. But when it doesn’t, you may wonder why not. To address your curiosity, we will automatically pop up an OCR visualization any time a command fails (or when you request it with “OCR show”). This will show all of the recognized text near your gaze point, and a circle indicating the search space (see if you can reproduce the dynamic growing/shrinking described above!). If you are seeing problems with cursor/caret placement, you can also try “OCR show boxes” to show all the word bounding boxes.

More features

You can view all the available commands with “help context gaze OCR”. Additionally, there are some settings you can adjust (e.g. to show the debugging overlay longer), declared near the top of gaze_ocr_talon.py.

What’s Next

There are still plenty more features I would like to add, including ways to refer to words that are difficult to pronounce. I’m also aware that misbehavior caused by errors from various sources (speech recognition, eye tracking, OCR) is currently the biggest pain point, and that it’s more noticeable when editing text than when just clicking around. I’d love to hear more feedback and ideas from you all in the comments! You can also say hi on the Talon Slack channel #ext-gaze-ocr.

6 thoughts on “Gaze OCR: Talon support and 10 new features!”

  1. Wow this looks really cool!

    Would it be possible to have a radius based around the current mouse position? That would help folks who don’t have a gaze tracker, but are still able to get their pointer in the general vicinity…

    1. The eye tracker is an injected dependency that already has multiple implementations, so one could easily add an implementation that is backed by the mouse position instead. I’m skeptical as to whether this would be a good experience, though … are you planning to use a conventional mouse and just count on needing less precision as the advantage?

      1. Well I guess my question in response is what exactly does the gaze position do? Is it a matter of both disambiguation and performance?

        If so, moving your cursor into the general vicinity of where you’re looking would provide both benefits as well, right? Also, I wonder if there’s a mode to limit it to the active window? I know you did a fair amount of work to do the OCR in the background to help performance.

        1. Gaze is indeed used for both disambiguation and performance but is most important for disambiguation. What you say would definitely work; you have the right mental model. It’s just that it adds more effort, since now you have to both position the mouse and say the command. The eye tracking is very forgiving and designed to be basically effortless. I think you might find it’s easier to just use the system without eye or mouse tracking altogether and handle the occasional disambiguation. There are folks using this regularly without eye tracking. But if you want to give this mouse tracking idea a try, it wouldn’t be hard to add given the way the interfaces are designed!

          1. Okay, so I finally followed the directions on the Talon side and tried it out. Very cool so far. That said, there have certainly been a reasonable number of “not found” failures, where the screen shows a bunch of random-looking text for a second and then it goes away. I think it tends to happen more when I say things like “select word through word”. Or maybe it’s because I use dark mode most of the time? Or perhaps it’s the smaller text?

            In any case this is very cool. Another question: would it be possible to limit searches to the active window? Do you apply the search from the cursor position outwards, especially if there’s no eye tracker?

            1. Great to hear!

              You can adjust the setting user.ocr_debug_display_seconds if you want that text to persist longer — that’s a debugging display that shows both what it’s searching for (at the top of the screen) and what it sees (that’s all the random-looking text you speak of). Usually either the spoken word was misrecognized, the onscreen text was misrecognized, or if you’re using an eye tracker then maybe your gaze wasn’t close enough. If you issue a new command while that debugging display is active, it’ll immediately be hidden — you don’t have to worry about it interfering with OCR results.

              In general this should work well with small text and dark mode, so I’d be surprised if that’s it. There are kinda weird random failure cases for OCR, though (I’m using the OS APIs).

              Could definitely limit it to the active window. I already support filtering the search (for eye tracking), so I would just need to adjust this to handle the active window. In general I find it useful to target inactive windows, although on Mac this is broken because you have to click once to focus and then again to select/click. That’s on my backlog of things I’d like to fix at some point (do the first click automatically). This also crops up in other places that aren’t as easy to detect (e.g. in some places such as Gmail you have to click once into a form field before selecting).
