In my previous post, I introduced my new gaze-ocr package for easy clicking or text editing in any app or website (demo video). If you haven’t upgraded the screen-ocr and gaze-ocr packages recently, go do that now. On Windows, I’ve added support for the built-in Windows Runtime OCR, which is incredibly fast (~40X faster than Tesseract!) and also very accurate. Be sure to follow the instructions to install the necessary dependencies, which include Python 3.7 or 3.8 (3.9 isn’t quite ready yet). NatLink now supports Python 3 (32-bit only), but you need to follow special installation instructions while it is in beta. Upgrading is worth your time: WinRT is so fast that it opens up the possibility of processing the entire screen instead of just the area near the gaze point, although in practice I still find it helpful to restrict it somewhat.
I learned about WinRT OCR thanks to a comment from Ivan on my previous post. This is why I love open source software — I always learn from others once I share my work!
In a later post, I’ll share more details on all the experiments and tweaks that have gone into making this package as robust as it is today.
2 thoughts on “Gaze OCR: New release with Windows Runtime OCR support”
A few thoughts:
One topic that would be educational is how pre-processing and post-processing affect OCR results.
I’d be curious to know how WinRT scales with different resolutions up to 4K.
What do you think the next step should be for improving code navigation, not just general navigation?
My next blog post will be all about pre/postprocessing! I’ve already started the draft …
I haven’t explored 4K resolution, but my experiments indicated that processing time roughly scaled linearly with the number of pixels. For my 1080p display, full-screen processing is already pushing the limits of what’s reasonable to compute while a user completes an utterance, so 4K is likely to be problematic. That said, in practice I found that the context of eye gaze is so valuable for disambiguation that it’s really not problematic to crop the image fairly close to my gaze (I currently do a 300×300 patch).
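To make the cropping and scaling argument concrete, here is a minimal sketch in plain Python. It is not the gaze-ocr implementation: the function names, and the choice to shift the patch inward at screen edges rather than shrink it, are assumptions made for illustration. The cost estimate simply takes the roughly linear scaling described above at face value.

```python
def gaze_crop_box(gaze_x, gaze_y, screen_width, screen_height, patch_size=300):
    """Return (left, top, right, bottom) for a square patch around the gaze point.

    Hypothetical helper: near a screen edge the patch is shifted inward so it
    keeps its full size whenever the screen is at least patch_size wide/tall.
    """
    half = patch_size // 2
    # Clamp the top-left corner so the patch never leaves the screen.
    left = min(max(gaze_x - half, 0), max(screen_width - patch_size, 0))
    top = min(max(gaze_y - half, 0), max(screen_height - patch_size, 0))
    right = min(left + patch_size, screen_width)
    bottom = min(top + patch_size, screen_height)
    return left, top, right, bottom


def relative_ocr_cost(width, height, base_width=1920, base_height=1080):
    """Estimate OCR cost relative to a full 1080p frame.

    Assumes processing time is proportional to pixel count, which is only
    an approximation.
    """
    return (width * height) / (base_width * base_height)


if __name__ == "__main__":
    print(gaze_crop_box(960, 540, 1920, 1080))   # patch around screen center
    print(gaze_crop_box(10, 10, 1920, 1080))     # patch pushed in from the corner
    print(relative_ocr_cost(3840, 2160))         # 4K vs. full 1080p
    print(relative_ocr_cost(300, 300))           # gaze patch vs. full 1080p
```

Under that linear assumption, a full 4K frame costs about four times as much as a full 1080p frame, while a 300×300 patch is only about 4% of one, so the patch cost stays the same regardless of display resolution. The returned box uses the (left, top, right, bottom) ordering that most image libraries accept for cropping.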
The system already works somewhat well for code, depending on the style. There are a couple of key features missing that would make it much better: 1) support for positioning the cursor within CamelCased words and 2) better support for handling punctuation that’s adjacent to words. These are closely related and will both require adding keyboard-driven cursor movement (because the OCR library does not generally provide per-character position information). These are on my roadmap, but I haven’t gotten to them yet. Making this work perfectly with my fuzzy matching will be somewhat tricky, because the library I’m using is opaque about where it finds matches. It shouldn’t be too difficult to handle the common cases where fuzzy matching isn’t used, however.
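As a rough illustration of the CamelCase idea, the sketch below splits a recognized word into subwords and computes how many Right-arrow presses would move the cursor from the start of the word to a chosen subword. This is only a sketch under stated assumptions, not part of gaze-ocr: the names `camel_subwords` and `right_presses_to` are hypothetical, it assumes the cursor has already been placed at the start of the word (for example by clicking the left edge of the word’s bounding box), it assumes one key press moves one character, and it only handles exact, case-insensitive matches rather than fuzzy ones.

```python
import re

# Matches, in order of preference: an acronym run ("HTTP" in "HTTPResponse"),
# a capitalized or lowercase subword ("User", "get"), or a run of digits.
_SUBWORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def camel_subwords(word):
    """Return (subword, start_offset) pairs for a CamelCased word."""
    return [(m.group(), m.start()) for m in _SUBWORD.finditer(word)]


def right_presses_to(word, target, after=False):
    """Number of Right-arrow presses from the start of `word` to `target`.

    Hypothetical helper: matches `target` against subwords case-insensitively
    and returns None when there is no exact match. With after=True the cursor
    lands just past the subword instead of just before it.
    """
    for subword, start in camel_subwords(word):
        if subword.lower() == target.lower():
            return start + len(subword) if after else start
    return None


if __name__ == "__main__":
    print(camel_subwords("getUserName"))
    print(camel_subwords("parseHTTPResponse"))
    print(right_presses_to("getUserName", "name"))
    print(right_presses_to("getUserName", "user", after=True))
```

Because the offsets are character counts rather than pixel positions, this approach sidesteps the lack of per-character position information from the OCR library, at the cost of depending on the application’s arrow-key behavior. Adjacent punctuation is skipped by the pattern but still counts toward the offsets, so the same bookkeeping could in principle be extended to the punctuation case.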
The other feature I’d like to add is awareness of the cursor position. This would open up the possibility of cursor-relative movement. Often we simulate this with keyboard shortcuts, but as I’m sure you’ve noticed, applications can be inconsistent with the positioning of the cursor. I think that adding it to this system would require some additional visual processing, including handling the fact that the cursor is blinking. I’m open to ideas on this!