Making writing and editing with your voice feel natural

Do you struggle to write and edit prose quickly with your voice? You’d think this would be one of the easiest things to do (just talk naturally!), and yet after years of experience I still found it mentally taxing, even after other complex tasks like coding had become highly efficient (thanks to Cursorless). The difficulty was combining the fluidity of dictation with the surgical edits needed to make corrections and improvements, which tended to break my flow. About five months ago, I set out to make prose editing feel natural. After many false starts and tweaks, I have a system that works well for me, and I’ve made it available in the beta branch of talon-gaze-ocr (which requires Talon Beta).

Let me start with an example to illustrate the most powerful command, revise. Suppose you want to revise the previous sentence to say “the two most powerful commands”. If you use the old (but still useful) “replace” command, you either need to speak two consecutive commands (“replace most with two most” and “replace command with commands”), or you have to speak one very long command (“replace most powerful command with two most powerful commands”). With the new “revise” command, you simply say “revise with the two most powerful commands comma”. This is both shorter than the earlier alternatives and requires less mental effort: unlike with “replace”, you only have to think about the words that you want to see, not the words that you want to replace. Note that you do need to start and end the revision with words or punctuation that are already onscreen — in this case “the” and “comma” — so that the command knows where to apply the edits. I call this anchoring.
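
To make anchoring concrete, here is a minimal Python sketch of the idea on a list of words. It is not the talon-gaze-ocr implementation: the function name `revise` and the word-list representation are assumptions for illustration, it only uses single-word anchors, and it takes the first match instead of asking for disambiguation.

```python
from typing import List, Optional


def revise(onscreen: List[str], spoken: List[str]) -> Optional[List[str]]:
    """Replace the onscreen span bounded by the first and last spoken words.

    The first spoken word is the left anchor and the last spoken word is the
    right anchor. Both must already be onscreen, with the right anchor
    somewhere after the left one. Returns None if no such span exists.
    """
    if len(spoken) < 2:
        raise ValueError("need a left anchor and a right anchor")
    left_anchor, right_anchor = spoken[0], spoken[-1]
    for start, word in enumerate(onscreen):
        if word != left_anchor:
            continue
        for end in range(start + 1, len(onscreen)):
            if onscreen[end] == right_anchor:
                # Everything from the left anchor through the right anchor
                # is "painted over" with the spoken words.
                return onscreen[:start] + spoken + onscreen[end + 1:]
    return None


onscreen = ["to", "illustrate", "the", "most", "powerful", "command", ",", "revise"]
spoken = ["the", "two", "most", "powerful", "commands", ","]
revised = revise(onscreen, spoken)
# revised == ["to", "illustrate", "the", "two", "most", "powerful",
#             "commands", ",", "revise"]
```

Note that the caller never states which words are being removed: the old span is implied entirely by where the two anchors land.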

There are a couple of handy variants of “revise”. If the text you want to revise is adjacent to the text cursor, then you can use “revise from” and “revise through”, which only require anchoring on one side. For example, if you want to revise text that you just dictated, instead of saying “scratch that” and dictating the entire utterance again, you can simply say “revise from” followed by a word or phrase that you want to keep (the anchor), and then continue dictating. Similarly, you can use “revise through” to replace text to the right of the cursor, anchoring when you reach text you no longer want to edit.

The other new commands, “insert with”, “append with”, and “prepend with”, all share an important property with “revise”: you only ever say the words that you want to see, in the order you want them. The most general of these is “insert with”. Unlike “revise”, “insert with” cannot be used to remove or modify existing text; it can only insert new text. Because of this constraint, you only have to anchor the text on one end. For example, to add the adjective “brand” to “new” at the start of this paragraph, simply say “insert with brand new”, and the system will both position the cursor and type “brand”. You can also anchor the text on both ends (e.g. “insert with other brand new”). This reduces the need for disambiguation prompts compared to the equivalent “revise with” command, because the system ignores any pair of anchors with text between them that would have to be removed or changed. The commands “append with” and “prepend with” work the same as “insert with” except that they only allow anchoring on the left and right, respectively. These further reduce the need for disambiguation in exchange for requiring a bit more precision (and hence mental effort). Try them all and see what comes most naturally to you. I tend to stick with “insert with”.
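
The insert-only constraint can be sketched in Python as follows. This is an illustration under assumptions, not the package's code: `insert_with` is a hypothetical name, text is modeled as a list of words, and ties between equally good positions are resolved by taking the first one, where the real system would prompt for disambiguation. The spoken phrase is split into a left anchor, new words, and a right anchor, and a position only qualifies if the anchors are already adjacent onscreen, so nothing is ever removed.

```python
from typing import List, Optional


def insert_with(onscreen: List[str], spoken: List[str]) -> Optional[List[str]]:
    """Insert the new words from `spoken` without removing onscreen text."""
    best = None  # (number of anchor words, position, new words)
    n = len(spoken)
    for pos in range(len(onscreen) + 1):
        # Longest spoken prefix that ends exactly at `pos` onscreen,
        # leaving at least one spoken word to act as new text.
        left = max(
            (k for k in range(n) if k <= pos and spoken[:k] == onscreen[pos - k:pos]),
            default=0,
        )
        # Longest spoken suffix that starts exactly at `pos`, again
        # leaving at least one new word between the two anchors.
        right = max(
            (k for k in range(n - left) if spoken[n - k:] == onscreen[pos:pos + k]),
            default=0,
        )
        anchors = left + right
        if anchors and (best is None or anchors > best[0]):
            best = (anchors, pos, spoken[left:n - right])
    if best is None:
        return None  # no anchor found anywhere
    _, pos, new_words = best
    return onscreen[:pos] + new_words + onscreen[pos:]


onscreen = ["The", "other", "new", "commands"]
one_anchor = insert_with(onscreen, ["brand", "new"])
two_anchors = insert_with(onscreen, ["other", "brand", "new"])
# Both calls produce ["The", "other", "brand", "new", "commands"].
```

Because the anchors must sit directly next to each other onscreen, a two-sided phrase rules out many candidate positions that a two-sided “revise with” would have to consider.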

Having used all these commands for a while, I now find prose editing far easier. Occasionally “replace” still feels like the most natural option, but often I reach for “revise” or “insert”. These new commands have a surprising number of advantages. Even for a simple one-word typo fix, “revise with” is often more reliable for a couple of technical reasons: (1) it leverages the surrounding context for improved speech recognition (single-word recognition is notoriously bad, and can’t resolve homophones) and (2) OCR on the mistyped word may be degraded by a blue squiggly line suggesting a grammar fix, so it works better to refer to the text by its surroundings. But the biggest advantage is cognitive: these commands all let you focus on the text that you want to see, not the text that you want to remove. This means that you can start speaking a command before you’ve even figured out exactly what text you want to change or what you want to say. You simply say “revise with” followed by the onscreen text that looks good, and keep talking until you reach more text that looks good, and then it’ll “paint over” the old text with your new text. To support this kind of ad hoc editing, I made sure that if you pause too long before speaking the right-side anchor, it’ll simply fall back to “insert” behavior, and you can finish your thought with a follow-up “revise through”. I also added a variant of “revise from” that lets you revise up to the cursor even if you hadn’t originally planned it: just say “revise with <prose> cursor”. Under no circumstances will the text you dictated be discarded: even if no anchor can be found on either side, the text will be inserted at the current cursor position.

A lot of work was required under the hood to make these new commands work reliably:

  • I implemented a longest matching prefix and suffix algorithm to identify the start and end anchors, allowing you to avoid disambiguation prompts simply by speaking a longer anchor phrase.
  • Ryan implemented a new Talon Beta feature to provide timestamps for any existing capture, including <user.prose>. This is essential because these new commands interleave new prose with existing onscreen text, and I need timestamps for any onscreen text spoken so I can align them with the user’s eye gaze at that time in order to crop the screenshot. This is why Talon Beta and the beta branch of my repository are required for these new commands.
  • I improved the screenshot cropping for OCR to use the entire bounding box of the user’s eye gaze during the spoken phrases (with some padding), instead of just padding a single sample point, as I had done previously. This is essential for these new commands because the onscreen words could occur anywhere within the spoken phrase, and it has made other commands more robust as well.
  • I fixed a few other text editing hiccups and OS-specific edge cases that improve existing editing commands too.
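
The gaze-based cropping described above can be sketched in Python roughly as follows. This is a hedged illustration, not the package's code: the function name `gaze_crop_box`, the `(time, x, y)` sample format, and the padding value are all assumptions. The idea is to keep only the gaze samples recorded while the phrase was spoken (which is where the capture timestamps come in), take their bounding box, pad it, and clamp it to the screen.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float, float]  # (timestamp in seconds, x, y), assumed format
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def gaze_crop_box(
    samples: List[Sample],
    start: float,
    end: float,
    padding: float,
    screen_width: float,
    screen_height: float,
) -> Optional[Box]:
    """Padded bounding box of all gaze samples within [start, end]."""
    in_window = [(x, y) for t, x, y in samples if start <= t <= end]
    if not in_window:
        return None  # caller decides what to do, e.g. use the full screen
    xs = [x for x, _ in in_window]
    ys = [y for _, y in in_window]
    # Pad the box, then clamp it so it never extends past the screen.
    left = max(0, min(xs) - padding)
    top = max(0, min(ys) - padding)
    right = min(screen_width, max(xs) + padding)
    bottom = min(screen_height, max(ys) + padding)
    return (left, top, right, bottom)


samples = [
    (0.0, 900, 900),   # before the phrase: ignored
    (1.0, 100, 200),
    (1.5, 400, 220),
    (2.0, 250, 180),
    (3.0, 1500, 50),   # after the phrase: ignored
]
box = gaze_crop_box(samples, start=1.0, end=2.0, padding=50,
                    screen_width=1920, screen_height=1080)
# box == (50, 130, 450, 270)
```

Compared with padding a single sample point, the box grows with however far the eyes travelled during the phrase, so an onscreen word spoken early or late in the utterance still falls inside the crop.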

Although this system works well the vast majority of the time, there are still some known issues to be aware of:

  • The most common source of problems is OCR errors. These are particularly bad if you are using Windows or you don’t have an eye tracker. The Windows OCR API just isn’t nearly as accurate as the one on Mac: it tends to miss entire words or phrases, especially when underlining or squiggles are present. And on both operating systems, accuracy is much better if the screenshot is cropped smaller, near your gaze (presumably due to less variance in font, foreground, and background). As always, you can check what OCR sees by saying “OCR show” while looking at the relevant text (but note that this can be sensitive to small changes in cropping).
  • I don’t yet have support for anchoring at the beginning or end of a line, in cases where you want to change every word up to that point. The workaround is to first use a command to move the cursor (e.g. “go before/after”) followed by “revise from/through” so that the cursor is used as the anchor.
  • As noted earlier, I look for the longest matching anchor phrase, and I only request disambiguation if there are multiple anchors with that same length. This is occasionally wrong: for example, you may have intended a single-word anchor but it happens that there is a two-word match elsewhere, so it chooses that instead of triggering disambiguation. The workaround to fix the bad edit is to undo and try again with a longer anchor.
  • Because the OCR system doesn’t know where the cursor is, “revise from” might foolishly try to find the left anchor after the current cursor location (“revise through” has the inverse problem). This generally just leads to unnecessary disambiguation prompts, but one time I saw it combined with the above issue of matching a longer anchor so it skipped disambiguation, and I was very perplexed by the edit it made.
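
The longest-match rule, including the case where it silently picks the wrong place, can be illustrated with a short Python sketch. The function name `longest_prefix_matches` and the word-list model are assumptions for illustration only. The right-side anchor would work symmetrically with the longest matching suffix.

```python
from typing import List, Tuple


def longest_prefix_matches(
    onscreen: List[str], spoken: List[str]
) -> Tuple[int, List[int]]:
    """Find the longest prefix of `spoken` that appears onscreen.

    Returns the matched length and every onscreen start index where a match
    of that length occurs. More than one index means the anchor is ambiguous
    and a disambiguation prompt would be needed.
    """
    best_len = 0
    positions: List[int] = []
    for i in range(len(onscreen)):
        n = 0
        while (n < len(spoken) and i + n < len(onscreen)
               and onscreen[i + n] == spoken[n]):
            n += 1
        if n > best_len:
            best_len, positions = n, [i]
        elif n == best_len and n > 0:
            positions.append(i)
    return best_len, positions


onscreen = "the cat sat on the mat near the cat flap".split()

# "the cat" matches at two places with the same length: ambiguous.
ambiguous = longest_prefix_matches(onscreen, ["the", "cat", "door"])
# ambiguous == (2, [0, 7])

# Intending to change the first "the cat" into "the mat": the single-word
# anchor "the" at index 0 loses to the two-word match "the mat" at index 4,
# so no prompt appears and the edit lands in the wrong place.
surprising = longest_prefix_matches(onscreen, ["the", "mat", "sat"])
# surprising == (2, [4])
```

Speaking a longer anchor makes the intended position the unique longest match, which is why it avoids both the prompt and the misplaced edit.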

I suspect my talon-gaze-ocr package is mostly used today simply to click on links and buttons, but it is now also a great option for text editing. There are still occasional rough edges, but I highly recommend trying these new commands while you dictate, especially if you use a Mac with an eye tracker. Let me know how it goes in the comments or on Slack in the #ext-gaze-ocr channel!
