Gaze OCR: Under the Hood

In my earlier post, I introduced my new gaze-ocr package for easy clicking and text editing in any app or website (demo video). It took months of experiments and tweaks to make the system as robust as it is today. In this post, I’ll take you under the hood to see how it was built and where I’d like to take it next.

The idea of combining OCR with voice commands occurred to me long ago, soon after I got started with voice computing. We all spend so much time designing grammars that operate on functionality which is readily accessible onscreen. OCR held the promise of working directly with this content, bypassing the need to bind keyboard shortcuts or integrate with multiple APIs. Several years ago, I ran some rough experiments with Tesseract and decided it was too slow and inaccurate for this purpose. I moved on to other approaches, including browser control via WebDriver and Accessibility API integration. Despite my best efforts, though, I kept bumping up against various problems: unreliable browser behavior, slowness, and plenty of websites that didn’t work properly. Additionally, due to uneven support for Windows accessibility APIs, none of these integrations worked well outside the browser.

Aware of major improvements in machine learning over the past several years, I decided to give OCR another try. I ran Tesseract on a clean screenshot, but the output was still garbage. I couldn’t believe that this was the best it could do, so I decided to dig in. Tesseract’s own docs had several suggestions for improving output quality. It turns out that OCR models are generally not scale-invariant, which means that they have expectations about the size of individual letters. Since these systems are primarily designed for scanned physical documents at 300 dpi or more, they perform very poorly on screenshots, which are typically at 96 dpi. Hence, simply resizing the image by 3X produced significant improvements. There were still plenty of cases where it failed, but at least a solution seemed within striking distance.
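
If you want to try the resize trick yourself, here’s a minimal sketch using Pillow and pytesseract. The function and constant names are just for illustration, not the actual gaze-ocr code:

```python
from PIL import Image
import pytesseract

RESIZE_FACTOR = 3  # screenshots are ~96 dpi; Tesseract is tuned for ~300 dpi scans

def ocr_screenshot(path):
    image = Image.open(path)
    # Upscale so letter sizes roughly match what the OCR model expects.
    resized = image.resize(
        (image.width * RESIZE_FACTOR, image.height * RESIZE_FACTOR),
        Image.LANCZOS,
    )
    return pytesseract.image_to_string(resized)
```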

Performance was still slow for a full screenshot, but I thought of a couple ways to mitigate this. First, I could crop the image near my eye gaze point. This would also serve the purpose of disambiguating between multiple copies of the same text onscreen. Second, I could overlap OCR processing time with the user’s speech, by triggering the screenshot and OCR as soon as an utterance began, so that the results were ready by the time the user had stopped talking.
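
A rough sketch of both ideas, assuming Pillow’s ImageGrab and pytesseract (the real package wires this into the speech engine’s begin-utterance callback rather than being called directly like this):

```python
import concurrent.futures
from PIL import ImageGrab
import pytesseract

RADIUS = 200  # pixels to keep around the gaze point
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def start_ocr_near_gaze(gaze_x, gaze_y):
    """Kick off OCR on a crop around the gaze point; call when the utterance begins."""
    screenshot = ImageGrab.grab()
    left = max(0, gaze_x - RADIUS)
    top = max(0, gaze_y - RADIUS)
    crop = screenshot.crop((left, top, gaze_x + RADIUS, gaze_y + RADIUS))
    # Run OCR in the background so the results are ready by the time the
    # utterance has been parsed.
    return _executor.submit(
        pytesseract.image_to_data, crop, output_type=pytesseract.Output.DICT)
```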

My prototype integration worked for the simplest of cases, but still made too many mistakes to be viable for regular usage, so I kept looking into ways to improve quality. I learned that Tesseract starts with a “binarization” pass which turns the entire image into zeros and ones. The latest neural network-based model specifically expects dark text (0s) on a light background (1s). Tesseract’s built-in binarization method simply finds a single threshold for the entire image, which doesn’t work well with images that have a variety of foreground and background colors, as is common on webpages and apps. Some further research led me to sliding-window-based binarization methods that could infer different thresholds for different parts of the image. I combined this with a sliding window that flipped foreground and background polarity, on the assumption that background pixels outnumber foreground pixels. More improvements resulted!
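
Here’s a simplified version of that preprocessing, assuming scikit-image’s Sauvola thresholding and using a single global polarity check rather than the per-window flip described above; the actual code in the package differs in its details:

```python
import numpy as np
from skimage.filters import threshold_sauvola

def binarize_for_ocr(gray, window_size=41):
    """Binarize a grayscale image (2D array) to dark text on a light background."""
    thresholds = threshold_sauvola(gray, window_size=window_size)
    binary = gray > thresholds  # True = light pixels
    # Assume background pixels outnumber foreground pixels: if most of the
    # image came out dark, the text must be light-on-dark, so flip polarity.
    if np.count_nonzero(binary) < binary.size / 2:
        binary = ~binary
    return (binary * 255).astype(np.uint8)
```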

Still, text often appeared excessively bold after binarization, causing problems. Looking at the source images, I couldn’t tell why this was the case. When I visualized the different color channels of the input images, however, I noticed a surprising pattern: the text appeared to shift slightly as I flipped through the channels. As it turns out, modern operating systems leverage a feature called subpixel rendering to improve font clarity to the human eye. The idea is that the OS knows the visual order of the red, green, and blue subpixels on a user’s display, allowing it to cheat a bit at the edges of letters by lighting the nearest subpixel instead of turning on the entire pixel. This trades slight discoloration at the edges of letters for higher apparent resolution, with a net improvement to appearance. For the purposes of image processing, however, most algorithms assume that the RGB channels are perfectly aligned, which causes the bolding effect I noticed in binarization. Since I was already resizing the images by 3X, there was a simple fix: shift the resized channel images by one pixel to the left or right to eliminate the offset. Sure enough, this fixed the issue I was seeing and produced better OCR results.
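
Roughly, the channel realignment looks like this. It’s purely illustrative and assumes a standard RGB subpixel stripe order (the shift directions flip for BGR displays):

```python
import numpy as np

def align_subpixel_channels(image_3x):
    """image_3x: HxWx3 uint8 array that has already been upscaled by 3X."""
    aligned = image_3x.copy()
    # On an RGB-striped display, red subpixels sit a third of a pixel left of
    # center and blue a third of a pixel right; after a 3X resize that offset
    # becomes a whole pixel, so shift red right and blue left to realign.
    # (np.roll wraps one column at the border, which is negligible here.)
    aligned[:, :, 0] = np.roll(image_3x[:, :, 0], 1, axis=1)
    aligned[:, :, 2] = np.roll(image_3x[:, :, 2], -1, axis=1)
    return aligned
```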

There were lots of other questions that arose as I sought better results: should I process each color channel independently or convert to grayscale upfront? Could I come up with a smarter way to detect foreground versus background polarity than simple statistics? What are the optimal sliding window sizes? What margins should I pad the image with? In combination, this created a huge space of possibilities, and it was not practical to explore it with one-off experiments, especially since improvements to one screenshot could hurt another. Since I already had a system that worked well enough for me to use, I started saving my screenshots and the commands that went with them. These served as a form of partially-labeled evaluation data: it’s reasonable to assume that any time I said a command like “click elephant” that the word “elephant” ought to be somewhere onscreen near my gaze point. Once I had collected several examples across a variety of apps and webpages, I decided to perform a “grid search” of all the combinations of parameters and features I had added, to see what combination worked best. The Scikit Learn grid search library made this easy, and even allowed me to automatically record processing time and find multiple useful operating points (e.g. high-quality but a bit slow or very fast with adequate quality).
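
The sweep itself is simple. Here’s a sketch using scikit-learn’s ParameterGrid, where the parameter names are illustrative and evaluate() is a hypothetical function that scores one configuration against the saved screenshot/command pairs:

```python
import time
from sklearn.model_selection import ParameterGrid

param_grid = {
    "resize_factor": [2, 3, 4],
    "grayscale_first": [True, False],
    "window_size": [21, 41, 61],
    "margin": [0, 10, 20],
}

results = []
for params in ParameterGrid(param_grid):
    start = time.time()
    accuracy = evaluate(params)    # hypothetical: fraction of commands whose
    elapsed = time.time() - start  # word was found near the recorded gaze point
    results.append((accuracy, elapsed, params))

# Sort to surface useful operating points: highest quality first, ties broken by speed.
results.sort(key=lambda r: (-r[0], r[1]))
```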

As I improved OCR quality, I began noticing problems with other aspects of the system. Since Dragon doesn’t know what words are onscreen, it would often use an alternative spelling or miss a small detail like a plural “s”. Fortunately, there are fuzzy matching libraries for Python which make it easy to handle small corrections. I settled on RapidFuzz (thanks Derrick!), which let me search for a partial fuzzy match within a word, so that I could simultaneously handle minor spelling variations and subword matching (e.g. “click binarize” can click “binarise” and “click rapid” can click “RapidFuzz”).
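
Here’s roughly what that matching step looks like with RapidFuzz; the scorer and cutoff are illustrative rather than the package’s actual settings:

```python
from rapidfuzz import fuzz, process, utils

def best_match(spoken_word, ocr_words, cutoff=80):
    """Return (word, score, index) for the closest onscreen word, or None."""
    return process.extractOne(
        spoken_word,
        ocr_words,
        scorer=fuzz.partial_ratio,        # allows subword matches ("rapid" in "RapidFuzz")
        processor=utils.default_process,  # case-folds and strips punctuation
        score_cutoff=cutoff,
    )

# best_match("binarize", ["settings", "binarise", "RapidFuzz"]) matches "binarise".
```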

After releasing this publicly in my earlier blog post, I learned about more OCR backends, namely Windows Runtime OCR and EasyOCR. Fortunately, with my evaluation system in place, I was able to easily include these in my grid search to see how they compared and whether they could benefit from any of my preprocessing ideas. For the most part, they did not need much preprocessing, although a 2X resize did noticeably improve WinRT OCR quality. As described in my previous post, WinRT OCR is now my recommended OCR engine for Windows users, with Tesseract as an excellent option for other platforms or as an open source alternative. EasyOCR is also very promising, with accuracy that surpasses both of these. The last time I tested it, however, it ran more slowly than the other options (even when accelerated by my fast NVIDIA RTX 2700S GPU). It also depends on 64-bit Python (because of its use of PyTorch), making it incompatible with NatLink if called directly.

With all these improvements in place, this system has become an indispensable part of my setup. That said, there are still plenty more improvements worth making — more than I will have time for in the near future. If you are interested in contributing, please reach out in the comments section. Here are some features I’d love to add:

  • Fine-grained cursor placement using keypresses. Neither Tesseract nor WinRT OCR provides the individual locations of characters, so I don’t know where to click when the user refers to a word that is adjacent to punctuation or another word. For example, I’d like to say “go after Rapid” to place the cursor in the middle of “RapidFuzz”, or “go before period” to place the cursor within “end of the sentence.” This can be solved by combining the mouse click with arrow key presses (see the sketch after this list). Making this work perfectly with my fuzzy matching will be somewhat tricky, because the library I’m using doesn’t return where it finds matches. It shouldn’t be too difficult to handle the common cases where fuzzy matching isn’t needed, however.
  • Text cursor position awareness. This would make it possible to return the cursor to its previous position after an edit. It would also open up the possibility of cursor-relative movement (e.g. moving before/after the current word). We typically achieve this with keybindings (e.g. ctrl-left/right), but applications and websites are often inconsistent about where they position the cursor. I think that solving this within my system would require some additional visual processing, including handling the fact that the cursor blinks, but I’m open to other ways to solve this.
  • Homonyms (e.g. “write” and “right”). This could be achieved by trying multiple candidates from a dictionary lookup, or by normalizing the onscreen and query text. Theoretically, it could also be solved by adding OCR results to a dynamic list, but this would require additional OCR runs because it needs to be done before the start of the utterance.
  • Number formatting variations. This is similar to homonyms, where the numeric format (e.g. “12”) and written format (e.g. “twelve”) should be interchangeable.
  • Handle query phrases instead of only single words. This is conceptually simple but a bit technically tricky to get working properly with the OCR data structure and fuzzy matching.
  • Better ways of referring to words that are difficult to pronounce (or are misspelled beyond fuzzy matching thresholds). For example, one could say “the word after Arnold” to operate on the text “Arnold Schwarzenegger”. This is more of a grammar feature than a change to underlying library functionality, but still useful.
  • Improve OCR when red squigglies are present on misspelled words in a text editor. This is currently the leading cause of OCR mistakes, and yet it’s often text that I want to edit.
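
For the cursor-placement item at the top of this list, one possible approach looks something like the following. It’s sketched with pyautogui purely for illustration (not the package’s actual input mechanism), and it only handles the simple case where fuzzy matching isn’t needed:

```python
import pyautogui

def place_cursor_after(prefix, word_left_x, word_center_y):
    """Put the text cursor immediately after `prefix` within the word at this location."""
    # Click at the left edge of the OCR'd word so the cursor lands before its
    # first character, then arrow right past the matched prefix.
    pyautogui.click(word_left_x, word_center_y)
    pyautogui.press("right", presses=len(prefix))

# e.g. place_cursor_after("Rapid", left_x, center_y) would put the cursor
# between "Rapid" and "Fuzz" in "RapidFuzz".
```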

If you have other ideas, please share them in the comments!

6 thoughts on “Gaze OCR: Under the Hood”

  1. Great work! Looking forward to playing with it and adapting to my setup.

    Regarding homonyms, have you tried stuffing all of the recognised words into a dynamic List? This would allow you to constrain the words that can be recognised by Dragon to those that are on the screen.

    Only potential issue I can see is performance: loading large lists into Dragon can be slow, particularly with Natlink which IIRC only allows you to add one item to a list at a time.

    1. Thanks Mike! Would love to get your feedback.

      I have not yet tried that approach. There is another challenge (also performance-related): currently I run OCR when the user starts speaking, so I only have the results ready after the utterance has been parsed. In order to add those words to a dynamic list, I would need to eagerly run the OCR and update the lists before the start of the utterance. That said, it could work to just do both and reap the benefits when they happen to be aligned. For example, run OCR against the entire screenshot every few seconds and feed in all the words. It would probably need some tuning to work well, but I think it’s definitely worth trying (and measuring CPU impact).
