Say what you see: Efficient UI interaction with OCR and gaze tracking

User interfaces revolve around clicking on-screen text: descriptive links, named buttons, or editable text that can be selected and moved around. As a power user of voice control, I often bypass this with commands that simulate keyboard shortcuts or operate APIs directly. But this is only feasible for a small number of heavily used apps and websites: it takes too long to add custom commands for everything. I’ve seen several ways to handle this long tail, but they all have issues. Speaking the on-screen text directly requires disambiguation if the same text occurs in multiple places. Numbering the clickable elements adds clutter and takes time to read. Implementations of both of these methods tend to work only in one app or another, leading to an inconsistent experience. Head and eye tracking can control the cursor anywhere, but both are tiring, and their accuracy isn’t good enough for precise text selection. As it turns out, however, the pieces for an effective system do exist; they just need to be put together.

The solution is conceptually simple. OCR extracts on-screen text from any application so it can be referenced directly in commands. If multiple matches are found, gaze tracking picks the one nearest to where you’re looking. The result is a powerful and intuitive experience.
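
Conceptually, the disambiguation step boils down to choosing the OCR match whose location is closest to the gaze point. Here’s a minimal illustrative sketch (not the actual library code; the match objects and helper name are just for illustration):

# Illustrative only: pick the OCR match closest to the gaze point.
# "matches" is assumed to be a list of objects with center_x/center_y attributes.
def nearest_match(matches, gaze_point):
    def distance_squared(match):
        dx = match.center_x - gaze_point[0]
        dy = match.center_y - gaze_point[1]
        return dx * dx + dy * dy
    return min(matches, key=distance_squared)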

I’ve created a couple of Python packages on PyPI to make this easy to integrate into your own setup. For Dragonfly and Tobii eye tracker users, I recommend integrating with gaze-ocr, which provides ready-to-use Dragonfly Actions for cursor movement and text selection. It has been tested with Tobii models 4C and 5. Here’s a full working grammar demonstrating this:

import gaze_ocr
import screen_ocr  # dependency of gaze-ocr

from dragonfly import (
    Dictation,
    Grammar,
    Key,
    MappingRule,
    Mouse,
    Text
)

# See installation instructions:
# https://github.com/wolfmanstout/gaze-ocr
DLL_DIRECTORY = "c:/Users/james/Downloads/tobii.interaction.0.7.3/"

# Initialize eye tracking and OCR.
tracker = gaze_ocr.eye_tracking.EyeTracker.get_connected_instance(DLL_DIRECTORY)
ocr_reader = screen_ocr.Reader.create_fast_reader()
gaze_ocr_controller = gaze_ocr.Controller(ocr_reader, tracker)


class CommandRule(MappingRule):
    mapping = {
        # Click on text.
        "<text> click": gaze_ocr_controller.move_cursor_to_word_action("%(text)s") + Mouse("left"),

        # Move the cursor for text editing.
        "go before <text>": gaze_ocr_controller.move_cursor_to_word_action("%(text)s", "before") + Mouse("left"),
        "go after <text>": gaze_ocr_controller.move_cursor_to_word_action("%(text)s", "after") + Mouse("left"),

        # Select text starting from the current position.
        "words before <text>": gaze_ocr_controller.move_cursor_to_word_action("%(text)s", "before") + Key("shift:down") + Mouse("left") + Key("shift:up"),
        "words after <text>": gaze_ocr_controller.move_cursor_to_word_action("%(text)s", "after") + Key("shift:down") + Mouse("left") + Key("shift:up"),

        # Select a phrase or range of text.
        "words <text> [through <text2>]": gaze_ocr_controller.select_text_action("%(text)s", "%(text2)s"),

        # Select and replace text.
        "replace <text> with <replacement>": gaze_ocr_controller.select_text_action("%(text)s") + Text("%(replacement)s"),
    }

    extras = [
        Dictation("text"),
        Dictation("text2"),
        Dictation("replacement"),
    ]

    def _process_begin(self):
        # Start OCR now so that results are ready when the command completes.
        gaze_ocr_controller.start_reading_nearby()


grammar = Grammar("ocr_test")
grammar.add_rule(CommandRule())
grammar.load()


# Unload function which will be called by natlink at unload time.
def unload():
    global grammar
    if grammar: grammar.unload()
    grammar = None

You can also try ocr_reader = screen_ocr.Reader.create_quality_reader() if you have a fast CPU (this trades off speed for higher OCR accuracy).
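
For example, using the setup from the grammar above, only the reader line changes:

# Trade speed for higher OCR accuracy by swapping in the quality reader.
ocr_reader = screen_ocr.Reader.create_quality_reader()
gaze_ocr_controller = gaze_ocr.Controller(ocr_reader, tracker)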

Why override _process_begin()? OCR is not instantaneous, so if we started processing the screen contents only after the command is recognized, there would be a noticeable delay before it finishes. Instead, we start OCR on a background thread as soon as the utterance begins, so that it’s done by the time the utterance is complete. It’s a bit wasteful since not every command requires OCR, but it’s well worth it. Another advantage is that the screenshot is taken at the beginning of the command, so you don’t have to hold your gaze while you speak. If you’d rather wait, though, you can use a Function action to call start_reading_nearby() right before the gaze_ocr_controller action.
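
Here’s a sketch of that deferred variant (assuming the same objects defined in the grammar above; Function is the standard Dragonfly action, and the single mapping entry is just an example):

from dragonfly import Function, Mouse

deferred_mapping = {
    # Run OCR only after the command is recognized (simpler, but adds latency).
    "<text> click": Function(lambda: gaze_ocr_controller.start_reading_nearby())
                    + gaze_ocr_controller.move_cursor_to_word_action("%(text)s")
                    + Mouse("left"),
}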

If you are using Dragonfly with NatLink, you may find that the eye tracking lags behind. This should be solved with recent versions of Dragonfly (see FAQ), so try upgrading.

If you aren’t using Dragonfly or Tobii, I’ve factored the OCR smarts into a separate screen-ocr package (which gaze-ocr depends on). You can have a look at my gaze-ocr implementation and adapt it to your needs.
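
Standalone usage looks something like the following rough sketch (see the screen-ocr README for the exact, current API; the method names and coordinates here are just for illustration):

import screen_ocr

# Sketch only: check the screen-ocr README for the exact API.
reader = screen_ocr.Reader.create_fast_reader()
# Run OCR on a region of the screen centered on a point, e.g. the gaze location.
contents = reader.read_nearby((800, 600))
# Look up the on-screen coordinates of the nearest occurrence of a word.
coordinates = contents.find_nearest_word_coordinates("hello", cursor_position="middle")
if coordinates:
    print("Found word at", coordinates)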

Overall, the system works pretty darn well, but not perfectly. The OCR is accurate about 90% of the time: enough to be generally pleasant to use, but annoying at times. Also, to achieve low latency, I have to crop the screenshot near the gaze point, even though your eyes may want to jump around. In practice it’s pretty natural to look at the word you are reading, but if you are selecting a range of text you may find that you need to dwell on the start word of the range a bit longer for it to work.
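
To make the cropping concrete, here’s a purely illustrative sketch (not the library’s code) of grabbing a small patch around the gaze point with Pillow:

from PIL import ImageGrab

# Illustrative only: capture a small screenshot patch (e.g. 200x200 pixels)
# centered on the gaze point, to keep OCR latency low.
def grab_near_gaze(gaze_x, gaze_y, radius=100):
    box = (gaze_x - radius, gaze_y - radius, gaze_x + radius, gaze_y + radius)
    return ImageGrab.grab(bbox=box)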

All things considered, this system works so well for me that I didn’t want to keep it from you all any longer. In a future blog post I’ll detail the interesting journey that led to the current implementation and share some ideas for future improvements. In the meantime, please try it out and tell me what you think!

17 thoughts on “Say what you see: Efficient UI interaction with OCR and gaze tracking”

  1. Once again amazing work! Thank you for all you’ve contributed to the voice coding community and pushing the limits of open source technologies.

        1. It seems about the same. For what it’s worth, they didn’t really advertise any improvement in accuracy. It also comes with a software “upgrade” but annoyingly it makes it more difficult to toggle eye tracking on and off (which I do for computer games). Apparently the head tracking is improved, which I plan to explore for future enhancements …

          https://help.tobii.com/hc/en-us/articles/360008539058

  2. Had you thought of combining this with a text reader for blind or partially sighted users?

    Very flexible system, though the command set seems as “easy” as using emacs 🙂

    1. Hah yes I have a somewhat unusual grammar style but there is a method to the madness:
      https://handsfreecoding.org/2018/09/04/utter-command-why-i-rewrote-my-entire-grammar/

      The Python library itself doesn’t assume any particular command bindings, so you are free to use whatever you want.

      Yes I think that OCR could definitely be helpful for a screenreader. Too many worthwhile projects, not enough time! The screen-ocr library is easily reused by others, though, in case someone is interested.

  3. Hi! This is very interesting, as I’ve implemented a very similar thing myself recently. Which OCR library do you use? I tried Tesseract, but found it too slow and inaccurate (the newer LSTM backend was better but still not good enough); it was okay-ish for websites, but couldn’t parse Windows Explorer fonts, for example. I then switched to the Windows.Media.Ocr .NET library (unfortunately that ties me to the OS), which turned out to be two orders of magnitude faster. I also don’t capture the entire screen, only a portion of it around the gaze location, which makes the Windows OCR return in just a few milliseconds.

    1. I was hoping to get suggestions like this, thank you!

      I am in fact using Tesseract LSTM, the unreleased version 5.0 linked from my readme. I do my own image binarization, which is crucial for quality (I’ll get into that in a later blog post). I don’t have any trouble with Windows Explorer fonts. The only common case where I still see failures is the taskbar, because of its inverted color scheme.

      I agree that Tesseract is slow but I’m able to hide the processing time for the most part by starting when the utterance begins, not when it is parsed.

      I had no idea that Windows had a built-in OCR library! I can definitely believe that it’s much faster than Tesseract. Unfortunately, closed source has been moving pretty quickly in this space and open source hasn’t fully caught up. I was impressed by the quality of the recent EasyOCR library, though:
      https://github.com/JaidedAI/EasyOCR

      The latency is variable but can be pretty slow even when running on GPU (and I have an RTX 2070 Super…).

      I will be curious to try out the Windows OCR. I’m wondering whether I can call it directly from pythonnet or whether I would have to compile a UWP app to act as intermediary.

        1. Interesting, I have not played much with image preprocessing before recognition. It might be that sub-pixel font alignment is impacting it negatively or something? Windows OCR can be called from a regular Windows desktop application, as long as you reference the Microsoft.Windows.SDK.Contracts NuGet package. I built an intermediate C++/CLI library with C extern wrapper functions, so that I can load it with Python ctypes. You can see my code here: https://github.com/ileben/EyeTrackingHooks

          1. I got this working directly through Python, using the Microsoft-built winrt PyPI package. That package requires Python 3.7 because it uses its asyncio library, but otherwise it makes WinRT functionality seem like it’s native Python. Here’s a gist demonstrating this in action:
            https://gist.github.com/wolfmanstout/5e8a286176f432d006640e3c1c4b45c1

            Looks like this library also makes it possible to use Microsoft’s standard Gaze API instead of depending on the Tobii DLLs. I haven’t fully tested this but the imports appear to work without throwing any errors.

            I also tested this OCR implementation against an evaluation data set that I’ve been collecting over the course of several months. The quality (with no preprocessing) is comparable to what I get from Tesseract after applying my custom preprocessing, but the latency is indeed way better (>40X!!). That’s a big deal because it means I don’t have to crop anywhere near as close to the gaze point as I currently do. In fact I don’t think I would have to crop at all, thanks to my trick where I start processing as soon as the utterance begins. I look forward to fully integrating this when I have time (it’ll require me to upgrade my NatLink Python to 3.7, which currently requires cloning the NatLink repo, which I haven’t yet tried).

            Of course, like you said, it’s unfortunate that this will introduce a dependency on Windows, so I’m glad that I have a working fully-open-source alternative.

  4. James, great work here. A couple of questions:
    – Have you tested the OCR in dark mode? (And, by extension, in the different color schemes that Sublime Text, VS Code, and various terminals use, for example?)
    – Does this require a gaze tracker, or does the tracker just help make the selection more accurate?

    1. I’ve been using this for months so I’ve tested it in many different contexts. I haven’t specifically used dark mode, but plenty of webpages have dark backgrounds (including this one in several places). My preprocessing is specifically designed to handle both light-on-dark and dark-on-light backgrounds, including multiple combined within a single screenshot. I also use Flux and that doesn’t seem to cause any significant problems despite reducing contrast.

      It does currently require a gaze tracker. Cropping the image down to a smaller size is essential because Tesseract is fairly slow; I default to 200×200 patches near the gaze point. I’m investigating other OCR backends, some of which are faster and might allow this to be used without a gaze tracker. Still, gaze is extremely valuable for disambiguating between multiple instances of the same text, so I think it will always be a fairly essential part of the setup. Without it, you won’t have the same level of confidence that it will click what you want (that’s an area where the current implementation really shines: you don’t have to worry about misclicks).

      1. Thanks James, yes I was digging around looking at pytesseract, which looks rather cool. I remember that Caster’s Legion grid implemented something similar, but without the OCR: just bounding box detection for text. That worked fairly well (although certainly not as seamless as what you’ve implemented). That said, Tesseract does seem to have several different modes, for boundary detection only, character- versus word-level recognition, and others. The Tobii 4C seems rather hard to come by these days.

        1. Yes, (py)tesseract is pretty powerful in terms of the information it gives you.

          You can get the new Tobii 5; it works just fine with my gaze-ocr library.

    2. I could see this even being helpful with a mouse cursor, as an alternative to eye tracking hardware.

      This could also be helpful as a hybrid with those eye tracking applications that don’t have a robust API but do move the mouse cursor. From the cursor position, the gaze can be extrapolated within a radius.

      1. Sure, that could also work! The way I wrote the code, it shouldn’t be difficult to provide an alternative implementation of the eye tracker (without even making any changes to the library).
