Enhanced text manipulation using accessibility APIs

For the past several years, we Dragon users have had to endure increasingly poor native support for text manipulation in third-party software such as Firefox and Chrome. For a while, Select-and-Say only worked on the most recent utterance; then it stopped working entirely. As someone who writes a lot of email, I found it painful to lose this functionality, and the workaround of transferring to a text editor is slow and mangles the formatting when composing an inline reply to someone’s email. Nuance offers the Dragon Web Extension, which supposedly fixes these issues, but in practice it has earned its 2-out-of-5-star rating by slowing down command recognition, occasionally hanging the browser, and not working in key web apps such as Google Docs. Over the past few months, I’ve been working to integrate Dragonfly with the accessibility APIs that Chrome and Firefox natively support, which brings this functionality back — and much more. Windows support is available as of today, and I’m here to tell you how to leverage it and what’s under the hood.

Installation

This functionality is available in beta (the API may change at any time) in dictation-toolbox/dragonfly (the original Dragonfly repository is no longer maintained, so development has migrated there). As documented in the README, the easiest way to install this fork is via pip: pip install dragonfly2. On Windows (currently the only supported OS), the accessibility functionality depends on pyia2, which you can get by following its installation instructions. I have a pull request to get pyia2 to install properly via pip, but for now I recommend adding the pyia2 subdirectory of your repository clone to the PYTHONPATH environment variable.
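
If you prefer not to modify PYTHONPATH globally, one alternative is to add that directory to sys.path at the top of your grammar module, before anything imports pyia2. This is just a sketch; the path below is a placeholder for wherever you cloned the repository:

import sys

# Placeholder path: point this at the pyia2 subdirectory of your clone.
sys.path.insert(0, r"C:\path\to\pyia2\pyia2")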

Next, you need to integrate the new functionality with your grammar. There are a few concepts to understand before we dive into the code. All accessibility interactions are mediated through a controller object, obtained via get_accessibility_controller(), which is exported directly by the dragonfly module. The controller currently supports just a handful of methods for moving the cursor and selecting and replacing text. This is intentional: I want to keep the surface area of this integration small so I have freedom to change the underlying implementation as I evolve it. These methods are surprisingly powerful, though. They leverage a class called TextQuery, which provides several ways to specify a range of text in the active text edit control, making it easy to pinpoint exactly which text you want to select or replace. You can read the full Python API documentation here.
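
To give a feel for the API before we get to the grammar, here is a minimal sketch of driving the controller directly from Python. The method and keyword names mirror the grammar example below, but treat the exact signatures as illustrative rather than authoritative:

from dragonfly import CursorPosition, TextQuery, get_accessibility_controller

accessibility = get_accessibility_controller()

# Select the word "elephant" in the focused text control.
accessibility.select_text(TextQuery(end_phrase="elephant"))

# Move the cursor to just before "elephant".
accessibility.move_cursor(TextQuery(end_phrase="elephant"),
                          CursorPosition.BEFORE)

# Replace everything from "elephant" through "giraffe" with "rhino".
accessibility.replace_text(
    TextQuery(start_phrase="elephant", through=True, end_phrase="giraffe"),
    "rhino")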

Like the rest of Dragonfly, none of this functionality presupposes any particular grammar, so you will need to define your own grammar and how it maps to TextQuery objects and the methods on the accessibility controller. Here’s an example, inspired by the Utter Command grammar style (see my previous post for more on UC):

from dragonfly import (Alternative, Compound, CursorPosition, Dictation,
                       Function, Grammar, Literal, MappingRule, TextQuery,
                       get_accessibility_controller)

# Singleton controller that mediates all accessibility interactions.
accessibility = get_accessibility_controller()

class AccessibilityRule(MappingRule):
    mapping = {
        "go before <text_position_query>": Function(
            lambda text_position_query: accessibility.move_cursor(
                text_position_query, CursorPosition.BEFORE)),
        "go after <text_position_query>": Function(
            lambda text_position_query: accessibility.move_cursor(
                text_position_query, CursorPosition.AFTER)),
        "words <text_query>": Function(accessibility.select_text),
        "words <text_query> delete": Function(
            lambda text_query: accessibility.replace_text(text_query, "")),
        "replace <text_query> with <replacement>": Function(
            accessibility.replace_text),
    }
    # These Compound extras are mini-grammars that parse the spoken query
    # into TextQuery objects and can be reused across multiple commands.
    extras = [
        Dictation("replacement"),
        Compound(
            name="text_query",
            spec=("[[([<start_phrase>] <start_relative_position> <start_relative_phrase>|<start_phrase>)] <through>] "
                  "([<end_phrase>] <end_relative_position> <end_relative_phrase>|<end_phrase>)"),
            extras=[Dictation("start_phrase", default=""),
                    Alternative([Literal("before"), Literal("after")],
                                name="start_relative_position"),
                    Dictation("start_relative_phrase", default=""),
                    Literal("through", "through", value=True, default=False),
                    Dictation("end_phrase", default=""),
                    Alternative([Literal("before"), Literal("after")],
                                name="end_relative_position"),
                    Dictation("end_relative_phrase", default="")],
            value_func=lambda node, extras: TextQuery(
                start_phrase=str(extras["start_phrase"]),
                start_relative_position=(CursorPosition[extras["start_relative_position"].upper()]
                                         if "start_relative_position" in extras else None),
                start_relative_phrase=str(extras["start_relative_phrase"]),
                through=extras["through"],
                end_phrase=str(extras["end_phrase"]),
                end_relative_position=(CursorPosition[extras["end_relative_position"].upper()]
                                       if "end_relative_position" in extras else None),
                end_relative_phrase=str(extras["end_relative_phrase"]))),
        Compound(
            name="text_position_query",
            spec="<phrase> [<relative_position> <relative_phrase>]",
            extras=[Dictation("phrase", default=""),
                    Alternative([Literal("before"), Literal("after")],
                                name="relative_position"),
                    Dictation("relative_phrase", default="")],
            value_func=lambda node, extras: TextQuery(
                end_phrase=str(extras["phrase"]),
                end_relative_position=(CursorPosition[extras["relative_position"].upper()]
                                       if "relative_position" in extras else None),
                end_relative_phrase=str(extras["relative_phrase"])))]

This is probably a more complicated extras section than most folks are used to. The advantage of doing it this way is that the mini-grammars for constructing a TextQuery can be reused across multiple commands, as in the example above.
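
To actually activate these commands, the rule is hooked up like any other Dragonfly rule. A typical module footer, using the Grammar class imported above, looks something like this:

grammar = Grammar("accessibility commands")
grammar.add_rule(AccessibilityRule())
grammar.load()

def unload():
    # Called when the command module is unloaded by the engine.
    global grammar
    if grammar:
        grammar.unload()
    grammar = None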

Accessibility support can slow down Chrome, and the slowdown is particularly noticeable in certain applications (e.g. Google Sheets). If possible, I recommend using Mozilla Firefox instead (and so do Google’s own docs). If you do want to use Chrome, you should also register an additional 64-bit IAccessible2 DLL, which can be obtained here. Chrome’s detection of assistive technology is spotty, so I also recommend forcing it to enable accessibility features by adding the --force-renderer-accessibility flag to your Chrome shortcut: right-click the shortcut, choose Properties, and append the flag to the end of the Target field (outside any quotes).

Examples

Here are some commands that can be spoken with this grammar:

Move the cursor before a word: “go before elephant”
Move the cursor after a specific instance of a word: “go after tall before giraffe”
Select a single word: “words elephant”
Select a range of words: “words elephant through giraffe”
Select from the current cursor position up to and including a phrase: “words through large elephant”
Same as above, but stopping right before the phrase (after the preceding space): “words through before large elephant”
Delete a range of words without moving the cursor: “words elephant through giraffe delete”
Replace (or correct) a word: “replace elephant with rhino”
Replace a sequence of words: “replace elephant through giraffe with rhino”
Replace a very specific sequence of words: “replace large before elephant through rhino after angry”

The grammar also works well with punctuation:

Remove extraneous whitespace: “words space bar before elephant delete”
Select a full sentence: “words elephant through period”

These examples illustrate several features, and they can be recombined just as you might expect. The end result is a system for manipulating text that is much more powerful and predictable than what is built into Dragon.
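
To make the relative-position examples concrete, here is roughly what the grammar above builds for two of the trickier utterances, reconstructed from the value_func logic (illustrative, not a transcript of the actual calls):

# "go after tall before giraffe": put the cursor after the instance of
# "tall" that appears just before "giraffe".
accessibility.move_cursor(
    TextQuery(end_phrase="tall",
              end_relative_position=CursorPosition.BEFORE,
              end_relative_phrase="giraffe"),
    CursorPosition.AFTER)

# "words through before large elephant": select from the current cursor
# position up to, but not including, the phrase "large elephant".
accessibility.select_text(
    TextQuery(through=True,
              end_relative_position=CursorPosition.BEFORE,
              end_relative_phrase="large elephant"))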

Note: if you want to use these features within Google Docs, you need to enable accessibility support. You can enable accessibility account-wide here by turning on Screen reader. You can also turn it on and off within a specific document using Ctrl-Alt-Z. Finally, you need to enable Braille support, which you can do through the Settings item in the Accessibility menu once it appears in Docs.

Under the hood

Windows accessibility support has a complicated history. In 1997, Microsoft released the influential Microsoft Active Accessibility (MSAA) API for Windows 95. This made Microsoft the leader in mainstream operating system support for accessibility, which arguably remains true today. Since then, however, it became clear that the API was missing several critical features. For example, it offered very limited support for navigating and manipulating a DOM of text nodes. To fill this gap, IBM built an open-standard accessibility API called IAccessible2, which works as a set of extensions to MSAA. Linux then developed AT-SPI, based heavily on IAccessible2. Meanwhile, Microsoft developed UI Automation (UIA), a complete overhaul of MSAA. By the time it was released, however, developers had embraced IAccessible2 and weren’t eager to deal with some of the complications of implementing UIA, such as depending on the .NET framework. Today, both Chrome and Firefox support IAccessible2, not UIA.

It appears that Nuance witnessed this fragmentation and threw up their hands when it came to integrating these APIs with Dragon. Dragon’s native support for Select-and-Say doesn’t use standard accessibility APIs; instead, it is limited to a small list of supported text edit controls, as documented in their whitepaper. The problem with this, of course, is that many applications don’t use these controls, so Select-and-Say doesn’t work out of the box in them.

In developing the accessibility integration for Dragonfly, I decided to create a higher-level, API-agnostic Python interface that wraps OS-specific accessibility APIs, focusing on exposing text-related functionality. So far, I’ve written an IAccessible2 implementation, which should be easy to port to the closely related Linux AT-SPI. The interface is built around the idea that it is more convenient to search text with regular expressions than to walk a DOM tree or write a Visitor class. To that end, although it exposes a node hierarchy like most accessibility APIs, it follows the Composite pattern and also provides methods for viewing a node as a single flattened string and for manipulating the node tree using indices relative to that flattened string. Arbitrary functionality can then be built atop that interface in separate OS-independent code, and ultimately exposed in the very high-level Controller interface.
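
To make that concrete, here is a rough sketch of the kind of interface I have in mind; the class and method names are invented for illustration and do not match the actual Dragonfly internals:

import re

class TextNode:
    """Hypothetical OS-independent wrapper around an accessibility node."""

    def __init__(self, text, children=()):
        self.text = text                  # Text owned directly by this node.
        self.children = list(children)    # Child nodes (Composite pattern).

    def flattened(self):
        """View the subtree as a single flattened string."""
        return self.text + "".join(child.flattened() for child in self.children)

    def find(self, pattern):
        """Search the flattened text with a regular expression.

        Returns (start, end) indices relative to the flattened string, which
        higher-level code can map back to concrete nodes and offsets.
        """
        match = re.search(pattern, self.flattened(), re.IGNORECASE)
        return match.span() if match else None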

I’m not sure I drew the boundaries in all the right places; for example, I could make the OS-independent interface even more minimal and move the string-flattening functionality into a separate tree that decorates the underlying tree. I plan to revisit this after adding more functionality and implementing Linux support, at which point the interfaces should stabilize.

Conclusion

By integrating Dragonfly with accessibility APIs, we can re-create long-broken Dragon functionality and extend it in new and interesting ways. This integration is still a work in progress, but I didn’t want to wait any longer before allowing others to enjoy it. Please test it out and let me know what you think in the comments!

4 thoughts on “Enhanced text manipulation using accessibility APIs”

  1. Wow, this looks pretty cool. I was just using Google Docs the other day and getting frustrated that it could not recognize when it was in the middle of a sentence, creating extra capitalization and incorrect spacing if I didn’t dictate an entire sentence at a time. Plus, automating keyboard shortcuts for finding and moving the cursor is a bit awkward in the web browser.

    I’m not sure if what you wrote will help with the capitalization and spacing? Also, above you implicitly mention Google Docs, Google Sheets, and Gmail. Have you tested it with anything else?

    Thanks!

    1. This integration should definitely help with capitalization and spacing in general. For example, the text replacement functionality automatically preserves case (including upper case). Unfortunately, Dragon does not make it possible to intercept arbitrary dictation, so this is only going to be useful when processing commands. For engines other than Dragon, though, this functionality can be used directly by the dictation engine. For example, I’m working on a Google Speech API integration, and I already have an option so that it won’t dictate unless I’m in a text field, which avoids problems such as accidentally triggering keyboard shortcuts in Gmail.

      I’ve tested this with plenty of other apps (including WordPress!). As long as the website has proper support for browser accessibility standards (which comes for free if it uses standard editing elements), it will work fine.

  2. I finally got around to implementing this and it is great!

    This should make correcting homophones easier as well – I haven’t implemented it yet, but it seems like it should be relatively straightforward to select the word using a text query, instead of having to move the cursor over the word that needs to be replaced.

    I noticed that in your grammar, you do not seem to be calling accessibility.stop() anywhere, but the documentation that you wrote indicates that process exit may be blocked if you do not call this. Have you actually seen that hanging happen, or is this a theoretical concern?

    Thanks for the trailblazing here 🙂

    1. Glad to hear you are enjoying this! Indeed, correcting simple misrecognitions using “replace with” is a huge timesaver.

      Good sleuthing on accessibility.stop(). When using Dragonfly with Dragon, Dragon will simply kill the Python process when it closes. Also, Dragon is buggy enough that I usually rely on SpeechStart to stop and start it. I could in principle call this within unload(), but that only works because I’m using the controller from one place. Since the grammars don’t really “own” the controller, which is a singleton that could in principle be shared across multiple grammars and other places, it seemed cleaner not to have them stop it. The only place I use stop() is in my Google Speech API integration with Dragonfly, which is not yet released. There, I start speech recognition simply by running a Python script, and it’s nice to have a command that lets it exit gracefully just by returning from the main method. In that case, the clear owner is the starter script itself, which will outlive any grammar.
