Say what you see: Efficient UI interaction with OCR and gaze tracking

User interfaces revolve around clicking on-screen text: descriptive links, named buttons, or editable text that can be selected and moved around. As a power user of voice control, I often bypass this with commands that simulate keyboard shortcuts or operate APIs directly. But this is only feasible for a small number of heavily-used apps and websites: it takes too long to add custom commands for everything. I’ve seen several ways to handle this long tail, but they all have issues. Speaking the on-screen text directly requires disambiguation if the same text occurs in multiple places. Numbering the clickable elements adds clutter and takes time to read. Implementations of both of these methods tend to work in only one app or another, leading to an inconsistent experience. Head and eye tracking can control the cursor anywhere, but both are tiring, and their accuracy isn’t good enough for precise text selection. As it turns out, however, the pieces for an effective system do exist — they just need to be put together.

The solution is conceptually simple. OCR is used to extract on-screen text from any application for direct reference in commands. If multiple matches are found, gaze tracking is used to find the nearest one. The result is a powerful and intuitive experience.
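To make the disambiguation step concrete, here is a tiny, library-free sketch of the idea (the names are illustrative, not the actual gaze-ocr API): OCR produces a set of word boxes, and the gaze point breaks ties between identical matches.

import math
from dataclasses import dataclass

@dataclass
class WordBox:
    text: str
    center_x: float
    center_y: float

def nearest_match(boxes, spoken_word, gaze_x, gaze_y):
    """Return the on-screen instance of spoken_word closest to the gaze point."""
    matches = [box for box in boxes if box.text.lower() == spoken_word.lower()]
    if not matches:
        return None
    return min(matches,
               key=lambda box: math.hypot(box.center_x - gaze_x, box.center_y - gaze_y))

# Two "Save" buttons on screen; the one nearest the gaze point wins.
boxes = [WordBox("Save", 100, 50), WordBox("Save", 900, 700)]
print(nearest_match(boxes, "save", 880, 690))  # WordBox(text='Save', center_x=900, center_y=700)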

I’ve created a couple Python packages on PyPI to make this easy to integrate into your own setup. For Dragonfly and Tobii eye tracker users, I recommend integrating with gaze-ocr, which provides ready-to-use Dragonfly Actions for cursor movement and text selection. This has been tested with Tobii models 4C and 5. Here’s a full working grammar demonstrating this:

import gaze_ocr
import screen_ocr  # dependency of gaze-ocr

from dragonfly import (
    Dictation,
    Grammar,
    Key,
    MappingRule,
    Mouse,
    Text
)

# See installation instructions:
# https://github.com/wolfmanstout/gaze-ocr
DLL_DIRECTORY = "c:/Users/james/Downloads/tobii.interaction.0.7.3/"

# Initialize eye tracking and OCR.
tracker = gaze_ocr.eye_tracking.EyeTracker.get_connected_instance(DLL_DIRECTORY)
ocr_reader = screen_ocr.Reader.create_fast_reader()
gaze_ocr_controller = gaze_ocr.Controller(ocr_reader, tracker)


class CommandRule(MappingRule):
    mapping = {
        # Click on text.
        "<text> click": gaze_ocr_controller.move_cursor_to_word_action("%(text)s") + Mouse("left"),

        # Move the cursor for text editing.
        "go before <text>": gaze_ocr_controller.move_cursor_to_word_action("%(text)s", "before") + Mouse("left"),
        "go after <text>": gaze_ocr_controller.move_cursor_to_word_action("%(text)s", "after") + Mouse("left"),

        # Select text starting from the current position.
        "words before <text>": gaze_ocr_controller.move_cursor_to_word_action("%(text)s", "before") + Key("shift:down") + Mouse("left") + Key("shift:up"),
        "words after <text>": gaze_ocr_controller.move_cursor_to_word_action("%(text)s", "after") + Key("shift:down") + Mouse("left") + Key("shift:up"),

        # Select a phrase or range of text.
        "words <text> [through <text2>]": gaze_ocr_controller.select_text_action("%(text)s", "%(text2)s"),

        # Select and replace text.
        "replace <text> with <replacement>": gaze_ocr_controller.select_text_action("%(text)s") + Text("%(replacement)s"),
    }

    extras = [
        Dictation("text"),
        Dictation("text2"),
        Dictation("replacement"),
    ]

    def _process_begin(self):
        # Start OCR now so that results are ready when the command completes.
        gaze_ocr_controller.start_reading_nearby()


grammar = Grammar("ocr_test")
grammar.add_rule(CommandRule())
grammar.load()


# Unload function which will be called by natlink at unload time.
def unload():
    global grammar
    if grammar: grammar.unload()
    grammar = None

You can also try ocr_reader = screen_ocr.Reader.create_quality_reader() if you have a fast CPU (this trades off speed for higher OCR accuracy).

Why override _process_begin()? OCR is not instantaneous, so if we start processing the screen contents after the command is recognized, there will be a noticeable delay before it finishes. Instead, we start OCR on a background thread as soon as the utterance begins, so that it’s done by the time the utterance is complete. It’s a bit wasteful, since not every command requires OCR, but it’s well worth it. Another advantage is that the screenshot is taken at the beginning of the command, so you don’t have to hold your gaze while you speak. If you’d rather wait, though, you can use a Function action to call start_reading_nearby() right before the gaze_ocr_controller action.
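For example, a rule along these lines should work (a sketch that reuses the gaze_ocr_controller defined above; the "lazy <text> click" phrase is just an illustrative name):

from dragonfly import Dictation, Function, MappingRule, Mouse

class LazyOcrRule(MappingRule):
    mapping = {
        # Start OCR only when this command fires, trading a bit of extra latency
        # for not running OCR on every utterance.
        "lazy <text> click":
            Function(gaze_ocr_controller.start_reading_nearby)
            + gaze_ocr_controller.move_cursor_to_word_action("%(text)s")
            + Mouse("left"),
    }
    extras = [Dictation("text")]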

If you are using Dragonfly with NatLink, you may find that the eye tracking lags behind. This should be solved with recent versions of Dragonfly (see FAQ), so try upgrading.

If you aren’t using Dragonfly or Tobii, I factored the OCR smarts into a separate screen-ocr package (which gaze-ocr depends on). You can have a look at my gaze-ocr implementation and adapt it to your needs.
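As a rough sketch of standalone usage (double-check the method names and signatures against the screen-ocr source, since I may tweak the API over time):

import screen_ocr

ocr_reader = screen_ocr.Reader.create_fast_reader()
# Run OCR around an (x, y) screen coordinate from whatever pointing source you
# have (gaze, head tracking, or even the mouse cursor).
contents = ocr_reader.read_nearby((800, 600))
# Look up a spoken word in the results and get screen coordinates near which to
# click or place the caret ("before", "middle", or "after" the word).
# NOTE: exact method name and return value may differ; see the screen-ocr source.
coordinates = contents.find_nearest_word_coordinates("hello", "middle")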

Overall, the system works pretty darn well, but not perfectly. The OCR is accurate about 90% of the time: enough to be generally pleasant to use, but annoying at times. Also, to achieve low latency, I have to crop the screenshot near the gaze point, even though your eyes may want to jump around. In practice it’s pretty natural to look at the word that you are reading, but if you are selecting a range of text, you may find that you need to dwell on the start word of the range a bit longer for it to work.
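The cropping itself is conceptually simple: grab a square patch of the screen centered on the gaze point (a simplified illustration, not the library’s exact code):

import PIL.ImageGrab

def grab_near_gaze(gaze_x, gaze_y, radius=100):
    """Screenshot a (2 * radius) x (2 * radius) patch centered on the gaze point."""
    box = (gaze_x - radius, gaze_y - radius, gaze_x + radius, gaze_y + radius)
    return PIL.ImageGrab.grab(bbox=box)

patch = grab_near_gaze(960, 540)  # e.g. gaze near the center of a 1920x1080 screen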

All things considered, this system works so well for me that I didn’t want to keep it from you all any longer. In a future blog post I’ll detail the interesting journey that led to the current implementation and share some ideas for future improvements. In the meantime, please try it out and tell me what you think!

38 thoughts on “Say what you see: Efficient UI interaction with OCR and gaze tracking”

  1. Once again amazing work! Thank you for all you’ve contributed to the voice coding community and pushing the limits of open source technologies.

      1. In your testing, have you found a substantial difference in accuracy between the Tobii 4C and 5?

        1. It seems about the same. For what it’s worth, they didn’t really advertise any improvement in accuracy. It also comes with a software “upgrade” but annoyingly it makes it more difficult to toggle eye tracking on and off (which I do for computer games). Apparently the head tracking is improved, which I plan to explore for future enhancements …

          https://help.tobii.com/hc/en-us/articles/360008539058

  2. Had you thought of combining this with a text reader for blind or partially sighted users?

    Very flexible system, though the command set seems as “easy” as using emacs 🙂

    1. Hah yes I have a somewhat unusual grammar style but there is a method to the madness:
      https://handsfreecoding.org/2018/09/04/utter-command-why-i-rewrote-my-entire-grammar/

      The Python library itself doesn’t assume any particular command bindings, so you are free to use whatever you want.

      Yes I think that OCR could definitely be helpful for a screenreader. Too many worthwhile projects, not enough time! The screen-ocr library is easily reused by others, though, in case someone is interested.

  3. Hi! This is very interesting, as I’ve implemented a very similar thing myself recently. Which OCR library do you use? I’ve tried Tesseract, but found it too slow and inaccurate (the newer LSTM backend was better but still not good enough) – it was okay-ish for websites, but couldn’t parse Windows Explorer fonts, for example. I then switched to the Windows.Media.Ocr .NET library (unfortunately that ties me to the OS), which turned out to be two orders of magnitude faster. I also don’t capture the entire screen, only a portion of it around the gaze location, which makes the Windows OCR return in just a few milliseconds.

    1. I was hoping to get suggestions like this, thank you!

      I am in fact using Tesseract LSTM, the unreleased version 5.0 as linked from my readme. I do my own image binarization, which is crucial for quality (I’ll get into that in a later blog post). I don’t have any trouble with Windows Explorer fonts. The only common case where I still see failures is the taskbar, because of the inverted color scheme.
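      As a rough illustration of the kind of preprocessing I mean (a deliberately simplified sketch, not my actual pipeline, which I’ll cover in that later post):

      import numpy as np
      from PIL import Image

      def binarize_for_ocr(image):
          """Force black text on a white background, whichever polarity the input has."""
          gray = np.asarray(image.convert("L"), dtype=np.uint8)
          binary = gray > 128  # fixed threshold; a real pipeline adapts per region
          if binary.mean() < 0.5:
              # Mostly dark pixels: assume light-on-dark text (e.g. the taskbar) and invert.
              binary = ~binary
          return Image.fromarray((binary * 255).astype(np.uint8))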

      I agree that Tesseract is slow but I’m able to hide the processing time for the most part by starting when the utterance begins, not when it is parsed.

      I had no idea that Windows had a built-in OCR library! I can definitely believe that it’s much faster than Tesseract. Unfortunately, closed source has been moving pretty quickly in this space and open source hasn’t fully caught up. I was impressed by the quality of the recent EasyOCR library, though:
      https://github.com/JaidedAI/EasyOCR

      The latency is variable but can be pretty slow even when running on GPU (and I have an RTX 2070 Super…).

      I will be curious to try out the Windows OCR. I’m wondering whether I can call it directly from pythonnet or whether I would have to compile a UWP app to act as intermediary.

      1. Interesting, I have not played much with image preprocessing before recognition. It might be that sub-pixel font alignment is impacting it negatively or something? Windows OCR can be called from a regular Windows desktop application, as long as you reference the Microsoft.Windows.SDK.Contracts NuGet package. I built an intermediate C++/CLI library with C extern wrapper functions, so that I can load it with Python ctypes. You can see my code here: https://github.com/ileben/EyeTrackingHooks

          1. I got this working directly through Python, using the Microsoft-built winrt PyPI package. That package requires Python 3.7 because it uses its asyncio library, but otherwise it makes WinRT functionality seem like it’s native Python. Here’s a gist demonstrating this in action:
            https://gist.github.com/wolfmanstout/5e8a286176f432d006640e3c1c4b45c1

            Looks like this library also makes it possible to use Microsoft’s standard Gaze API instead of depending on the Tobii DLLs. I haven’t fully tested this but the imports appear to work without throwing any errors.

            I also tested this OCR implementation against an evaluation data set that I’ve been collecting over the course of several months. The quality (with no preprocessing) is comparable to what I get from Tesseract after applying my custom preprocessing, but the latency is indeed way better (>40X!!). That’s a big deal because it means I don’t have to crop anywhere near as close to the gaze point as I currently do. In fact I don’t think I would have to crop at all, thanks to my trick where I start processing as soon as the utterance begins. I look forward to fully integrating this when I have time (it’ll require me to upgrade my NatLink Python to 3.7, which currently requires cloning the NatLink repo, which I haven’t yet tried).

            Of course, like you said, it’s unfortunate that this will introduce a dependency on Windows, so I’m glad that I have a working fully-open-source alternative.

              1. I think the answer today is “yes”, but it probably wouldn’t be a large change to remove this dependency. You shouldn’t need to install the Tesseract binaries, just the Python package dependencies.

                1. Thanks James. Also curious what performance you’re seeing with Windows native OCR processing the entire screen, and for what screen size? Does about 2 seconds sound like what you’re getting? (I was trying this on a 30-inch monitor, mind you :-))

                  1. That sounds plausible for the entire screen, but I always do a smaller patch (e.g. “radius” of 150 which means 300×300).

              2. I just updated screen-ocr so that it no longer requires Tesseract, and makes it easy to install any of the backends:
                https://pypi.org/project/screen-ocr/

                Note also that WinRT now supports Python 3.8 so this continues to be an excellent and well-maintained option. I’ve been using this a while now as part of my standard setup and it’s fantastic.

  4. James great work here. A couple of questions:
    – Have you tested the OCR in dark mode? (And, by extension, in the different color schemes that Sublime Text, VS Code, and various terminals have, for example?)
    – Does this require a gaze tracker, or does that just help make the selection more accurate?

    1. I’ve been using this for months so I’ve tested it in many different contexts. I haven’t specifically used dark mode, but plenty of webpages have dark backgrounds (including this one in several places). My preprocessing is specifically designed to handle both light-on-dark and dark-on-light backgrounds, including multiple combined within a single screenshot. I also use Flux and that doesn’t seem to cause any significant problems despite reducing contrast.

      It does currently require a gaze tracker. It’s currently essential to crop the image down to a smaller size because Tesseract is fairly slow. I currently default to 200×200 patches near gaze. I’m investigating other OCR backends, some of which are faster and might allow this to be used without a gaze tracker. Still, it’s extremely valuable for disambiguation between multiple instances of the same text so I think it will always be a fairly essential part of the setup. Without this, you won’t have the same level of confidence that it will click what you want (that’s an area where the current implementation really shines — you don’t have to worry about misclicks).

      1. Thanks James, yes I was digging around looking at pytesseract, which looks rather cool. I remember that Caster’s Legion grid implemented something similar, but without the OCR, just bounding box detection for text, and that worked fairly well (although certainly not as seamless as what you’ve implemented). That said, Tesseract does seem to have several different modes for just the boundary detection, and for character versus word detection, among others. The Tobii 4C seems rather hard to come by these days.

        1. Yes, (py)tesseract is pretty powerful in terms of the information it gives you.

          You can get the new Tobii 5; it works just fine with my gaze-ocr library.

    2. I could see this even being helpful with a mouse cursor as an alternative for eye tracking hardware.

      This could also be helpful as a hybrid for those eye tracking applications that don’t have a robust API but do move the mouse cursor. From the cursor position, the gaze could be extrapolated within a radius.

      1. Sure, that could also work! The way I wrote the code, it shouldn’t be difficult to provide an alternative implementation of the eye tracker (without even making any changes to the library).
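        For instance, a mouse-based stand-in might look something like this (the method name below is a placeholder; mirror whatever interface gaze_ocr.eye_tracking.EyeTracker actually exposes, and pass your object in when constructing gaze_ocr.Controller):

        import pyautogui

        class MouseAsGazeTracker:
            """Duck-typed stand-in that reports the mouse position as the "gaze" point."""

            def get_gaze_point(self):  # placeholder name; match the real tracker interface
                x, y = pyautogui.position()
                return (x, y)

        gaze_ocr_controller = gaze_ocr.Controller(ocr_reader, MouseAsGazeTracker())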

  5. Is there any way to remove the eye tracker and just have it find something that’s on the screen? I love the rest of it with voice commands and clicking buttons. Just not the eye tracking for region finding.

  6. Yes, in fact you can do this today:

    tracker = gaze_ocr.eye_tracking.EyeTracker.get_connected_instance("")
    ocr_reader = screen_ocr.Reader.create_fast_reader(radius=10000)
    gaze_ocr_controller = gaze_ocr.Controller(ocr_reader, tracker)
    

    You will see an error “Eye tracking libraries are unavailable” which you can ignore. The larger radius is basically telling the OCR reader to read the entire screen, instead of just around the gaze point.

    Out of curiosity, why do you want to avoid eye tracking? It’s very helpful for automatic disambiguation, since you may not realize that there is another instance of the text onscreen, and without it this will choose arbitrarily (actually, it will choose the instance closest to the screen center). Do you just not have an eye tracker, or do you not want to worry about where your gaze is?

    1. My parents never let me spend money on anything. That’s a pretty big reason as well. However, I would also just like to use it to find something based entirely on the screen. So both is the answer. As for this script, I’m not sure if you mentioned it, but where is it located in the files? I also appreciate the help.

      1. You’d use those lines in place of the three initialization lines (tracker, ocr_reader, and gaze_ocr_controller) in my original post above. That’s a full Dragonfly grammar, so you can just drop it in your NatLink commands directory. Make sure to install Dragonfly first, then pip install gaze-ocr and pip install screen-ocr[winrt] to get my libraries. The WinRT version of screen-ocr is critical here because it’s much faster, and you need that speed if you will run OCR against the entire screen.

        1. Sorry, I don’t know much about software building. Do I just pip install Dragonfly2? Currently, buying the software is not going to be a possibility for me. I believe I currently have Python 2.7.13, wxPython, pywin32, and NatLink 4.1papa installed from trying to follow the YouTube video “Installing Natlink Vocola and Dragonfly”. I am most likely doing this wrong by not knowing about programming, that’s all I know. I’m hoping I can make it work with the requirements I installed while following a probably unrelated tutorial. I just feel dumb.

  7. Hey, do you think this could be adapted to run on Android to help people with mobility issues navigate their phones or tablets?

    Trying to figure out how to use this as an API for controlling some home automation apps for a friend with Muscular dystrophy.

    1. At Google we’ve built Voice Access for Android, which makes it easy to click not only on text on screen, but even on common icons, which it can recognize visually using on-device models. It has a nice tutorial which should help walk your friend through getting started. Hope that helps!

      1. Hey James, thanks for the reply! We have looked into Voice Access, but it’s a bit unreliable because people with MD tend to slur their words, so using NLP only solves part of the problem sometimes. It’d be awesome if there were a way to use Voice Access’s underlying screen-image object detection and, for the instructions, use data fusion techniques to merge its NLP module with pupil tracking.

        1. This is helpful feedback that I can forward along. Totally understand that voice input can be difficult in this scenario, and that the underlying “smarts” could be helpful in other contexts. Can you elaborate a bit more on what kind of input interface would be ideal here? Is the idea that voice would still be the primary interface, but eye tracking would help make the system smarter about handling ambiguity when speech is slurred?

          1. That’s actually pretty close to what I’m hoping to do, but the other way around, with vision being the primary input and voice being supplementary. More specifically, the direction I’m trying to take would be to localize a quadrant of the screen for object detection using eye-tracking, then short-list several actions that could be taken in that area. The correct action could then be taken by scrolling through pop-up prompts with eye motions, blinking, and simple words (up, down, left, right, yes, back, etc.). Or, without prompts, the NLP module could be used to associate spoken colloquial phrases with the closest action in the shortlist, similar to how GPT-3 does things but simpler, or much like how your gaze_ocr program works. I’m thinking of using the smartphone or a Raspberry Pi camera, so it’s not going to be as accurate as, say, the Tobii glasses. Hopefully, the gaze tracking and NLP combination would be more accurate. I don’t know if this makes sense or could work; I’m trying to implement this as part of a home automation project for an AI course, so my understanding of the subject and knowledge of what’s feasible is still a little shaky. I really appreciate you taking the time to understand what I’m trying to do and pointing me the right way!

            1. Got it — makes sense! My library assumes that eye tracking functionality is available (it’s a lightweight wrapper over Tobii’s library) and most of the smarts are integrating that with OCR. So I think the challenge in your scenario would be getting the eye tracking working via camera — you’ll have to look into other solutions to help with that, but if you can pull that off it shouldn’t be too hard to hook that in. You should be able to run my library on Raspberry Pi without too much trouble, but Android would be more complicated (it’s a Python library). Good luck!

              1. Sweet thanks, that’s super helpful! Yeah, I’m using this [GitHub repository](https://github.com/ritko/GazeTracking) to do the eye-tracking with a webcam right now but it shouldn’t be too difficult to switch to the pi-camera. But to hook that GazeTracking library into yours, should I make a wrapper for it by editing your eye_tracking.py file?

                1. Yep, that would work. Or you could leave that as is, create your own separate class with the same interface, and just instantiate yours when constructing gaze_ocr.Controller.

  8. Oooh never thought of doing that, definitely sounds a bit easier haha. Thanks a lot again! I’ll let you know if I can get it working for my project!
