For the past several years, we Dragon users have had to endure increasingly poor native support for text manipulation in third-party software such as Firefox and Chrome. For a while, Select-and-Say only worked on the most recent utterance; then it stopped working entirely. As someone who writes a lot of emails, I found it painful to lose this functionality, and the workaround of transferring to a text editor is slow and messes up the formatting when composing an inline reply to someone’s email. Nuance offers the Dragon Web Extension, which supposedly fixes these issues, but in practice it has earned its 2-out-of-5-star rating by slowing down command recognition, occasionally hanging the browser, and not working in key web apps such as Google Docs. Over the past few months, I’ve been working to integrate Dragonfly with the accessibility APIs that Chrome and Firefox natively support, which brings this functionality back, and much more. As of today, Windows support is available, and I’m here to tell you how to leverage it and what’s under the hood.
Installation
This functionality is available in beta (the API may change at any time) in dictation-toolbox/dragonfly (the original Dragonfly repository is no longer maintained, so development has migrated here). As documented in the README, the easiest way to install this fork is via pip: pip install dragonfly2. On Windows (currently the only supported OS), the accessibility functionality depends on pyia2, which you can get by following its installation instructions.
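Concretely, the install looks something like the following, run from an administrator shell (the regsvr32 step must be run from the directory containing the DLL, and exact paths may vary, so defer to the pyia2 instructions):

pip install dragonfly2
git clone https://github.com/wolfmanstout/pyia2
cd pyia2\pyia2
regsvr32 IAccessible2Proxy.dll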
Next, you need to integrate the new functionality with your grammar. There are a few concepts to understand before we dive into the code. All accessibility interactions are mediated through a controller object, which can be obtained via get_accessibility_controller(), exported directly by the dragonfly module. This controller supports just a handful of methods right now for moving the cursor and selecting and replacing text. This is intentional: I want to keep the surface area of this integration small so I have freedom to change the underlying implementation as I evolve it. These methods are surprisingly powerful, though. They leverage a class called TextQuery, which provides several features for specifying a range of text in the active text edit control, making it easy to pinpoint exactly which text you want to select or replace. You can read the full Python API documentation here.
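To give a feel for the API before we get to the grammar, here is a minimal sketch of using the controller directly (the phrases are arbitrary examples; TextQuery is explained further below):

from dragonfly import CursorPosition, TextQuery, get_accessibility_controller

accessibility = get_accessibility_controller()

# Move the cursor to just before the word "elephant".
accessibility.move_cursor(TextQuery(end_phrase="elephant"),
                          CursorPosition.BEFORE)

# Select the phrase "large elephant".
accessibility.select_text(TextQuery(end_phrase="large elephant"))

# Replace "elephant" with "rhino".
accessibility.replace_text(TextQuery(end_phrase="elephant"), "rhino")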
Like the rest of Dragonfly, none of this functionality presupposes any particular grammar, so you will need to define your grammar and how that maps to TextQuery and the methods on the accessibility controller. Here’s an example, inspired by the Utter Command grammar style (see my previous post for more on UC):
from dragonfly import (Alternative, Compound, CursorPosition, Dictation,
                       Function, Literal, MappingRule, TextQuery,
                       get_accessibility_controller)

accessibility = get_accessibility_controller()

class AccessibilityRule(MappingRule):
    mapping = {
        "go before <text_position_query>": Function(
            lambda text_position_query: accessibility.move_cursor(
                text_position_query, CursorPosition.BEFORE)),
        "go after <text_position_query>": Function(
            lambda text_position_query: accessibility.move_cursor(
                text_position_query, CursorPosition.AFTER)),
        "words <text_query>": Function(accessibility.select_text),
        "words <text_query> delete": Function(
            lambda text_query: accessibility.replace_text(text_query, "")),
        "replace <text_query> with <replacement>": Function(
            accessibility.replace_text),
    }
    extras = [
        Dictation("replacement"),
        # Mini-grammar for specifying a range of text.
        Compound(
            name="text_query",
            spec=("[[([<start_phrase>] <start_relative_position> <start_relative_phrase>|<start_phrase>)] <through>] "
                  "([<end_phrase>] <end_relative_position> <end_relative_phrase>|<end_phrase>)"),
            extras=[Dictation("start_phrase", default=""),
                    Alternative([Literal("before"), Literal("after")],
                                name="start_relative_position"),
                    Dictation("start_relative_phrase", default=""),
                    Literal("through", "through", value=True, default=False),
                    Dictation("end_phrase", default=""),
                    Alternative([Literal("before"), Literal("after")],
                                name="end_relative_position"),
                    Dictation("end_relative_phrase", default="")],
            value_func=lambda node, extras: TextQuery(
                start_phrase=str(extras["start_phrase"]),
                start_relative_position=(CursorPosition[extras["start_relative_position"].upper()]
                                         if "start_relative_position" in extras else None),
                start_relative_phrase=str(extras["start_relative_phrase"]),
                through=extras["through"],
                end_phrase=str(extras["end_phrase"]),
                end_relative_position=(CursorPosition[extras["end_relative_position"].upper()]
                                       if "end_relative_position" in extras else None),
                end_relative_phrase=str(extras["end_relative_phrase"]))),
        # Mini-grammar for specifying a position within text.
        Compound(
            name="text_position_query",
            spec="<phrase> [<relative_position> <relative_phrase>]",
            extras=[Dictation("phrase", default=""),
                    Alternative([Literal("before"), Literal("after")],
                                name="relative_position"),
                    Dictation("relative_phrase", default="")],
            value_func=lambda node, extras: TextQuery(
                end_phrase=str(extras["phrase"]),
                end_relative_position=(CursorPosition[extras["relative_position"].upper()]
                                       if "relative_position" in extras else None),
                end_relative_phrase=str(extras["relative_phrase"])))]
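To use this rule, load it into a grammar in the usual Dragonfly way:

from dragonfly import Grammar

grammar = Grammar("accessibility")
grammar.add_rule(AccessibilityRule())
grammar.load()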
This is probably a more complicated extras section than most folks are used to. The advantage of doing it this way is that we can reuse the mini-grammars for constructing TextQuery across multiple commands, as done in the example.
Accessibility functionality can slow down Chrome, and the slowdown is particularly noticeable in certain applications (e.g. Google Sheets). If possible, I recommend using Mozilla Firefox instead (and so do Google’s own docs). If you do want to use Chrome, you should also register an additional 64-bit IAccessible2 DLL, which can be obtained here. Chrome’s detection of assistive technology is spotty, so I also recommend forcing Chrome to enable accessibility features by adding the flag --force-renderer-accessibility to your Chrome shortcut: right-click the shortcut, select Properties, and append the flag to the end of the Target field (outside any quotes), e.g. "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --force-renderer-accessibility.
Examples
Here are some commands that can be spoken with this grammar:
Move the cursor before a word: “go before elephant”
Move the cursor after a specific instance of a word: “go after tall before giraffe”
Select a single word: “words elephant”
Select a range of words: “words elephant through giraffe”
Select from the current cursor position up to and including a phrase: “words through large elephant”
Same as above, but stopping right before the phrase (after the preceding space): “words through before large elephant”
Delete a range of words without moving the cursor: “words elephant through giraffe delete”
Replace (or correct) a word: “replace elephant with rhino”
Replace a sequence of words: “replace elephant through giraffe with rhino”
Replace a very specific sequence of words: “replace large before elephant through rhino after angry”
The grammar also works well with punctuation:
Remove extraneous whitespace: “words space bar before elephant delete”
Select a full sentence: “words elephant through period”
These examples illustrate several features, and they can be recombined just as you might expect. The end result is a system for manipulating text that is much more powerful and predictable than what is built into Dragon.
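To make this concrete, the range-replacement example above (“replace elephant through giraffe with rhino”) ends up invoking the controller roughly like this, with the remaining TextQuery fields left at their defaults:

accessibility.replace_text(
    TextQuery(start_phrase="elephant", through=True, end_phrase="giraffe"),
    "rhino")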
Note: if you want to use these features within Google Docs, you need to enable accessibility support. You can enable accessibility account-wide here by turning on Screen reader. You can also turn it on and off within a specific document using Ctrl-Alt-Z. Finally, you need to enable Braille support, which you can do through the Accessibility > Settings menu once it appears in Docs.
Under the hood
Windows accessibility support has a complicated history. In 1997, Microsoft released the influential Microsoft Active Accessibility (MSAA) API for Windows 95. This made Microsoft the leader in mainstream operating system support for accessibility, which arguably remains true today. Since then, however, it became clear that the API was missing several critical features; for example, it offered very limited support for navigating and manipulating a DOM of text nodes. To fill this gap, IBM built an open-standard accessibility API called IAccessible2, which works as a set of extensions to MSAA. Linux then developed AT-SPI, based heavily on IAccessible2. Meanwhile, Microsoft developed UI Automation (UIA), a complete overhaul of MSAA. By the time it was released, however, developers had already embraced IAccessible2 and weren’t eager to deal with some of the complications of implementing UIA, such as depending on the .NET Framework. Today, both Chrome and Firefox support IAccessible2 but not UIA.
It appears that Nuance witnessed this fragmentation and threw up their hands when it came to integrating these APIs with Dragon. Dragon’s native support for Select-and-Say doesn’t use standard accessibility APIs; instead it is limited to a small list of supported text edit controls, as documented in their whitepaper. The problem with this, of course, is that many applications don’t use these controls, so support doesn’t work out of the box.
In developing the accessibility integration for Dragonfly, I decided to create a higher-level API-agnostic Python interface to wrap OS-specific accessibility APIs, focusing on exposing text-related functionality. So far, I’ve written an IAccessible2 implementation, which should be easy to port to the closely-related Linux AT-SPI. The interface is based around the idea that it is convenient to use regular expressions to search text instead of having to work with a DOM tree or write a Visitor class. To that end, although it exposes a node hierarchy like most accessibility APIs, it follows the Composite pattern and also provides methods for viewing a node as a single flattened string and manipulating the node tree using indices relative to that flattened string. Arbitrary functionality can then be built atop that interface in separate OS-independent code, and ultimately exposed in the very high-level Controller interface.
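To illustrate the flattened-string idea, here is a simplified, hypothetical sketch; the names are illustrative, not the actual Dragonfly internals:

import re

class AccessibleNode:
    """Hypothetical wrapper around an OS accessibility node."""

    def __init__(self, text="", children=None):
        self.text = text
        self.children = children or []

    def flattened_text(self):
        # Composite pattern: a leaf contributes its own text, while an
        # interior node concatenates the flattened text of its children.
        if not self.children:
            return self.text
        return "".join(child.flattened_text() for child in self.children)

    def find(self, pattern):
        # Search the flattened string with a regex instead of walking the
        # tree with a Visitor. The (start, end) offsets are relative to the
        # flattened string; higher layers map them back onto nodes.
        match = re.search(pattern, self.flattened_text())
        return match.span() if match else None

# Example: a paragraph split across two text nodes still matches as one string.
paragraph = AccessibleNode(children=[AccessibleNode("Hello "),
                                     AccessibleNode("world!")])
assert paragraph.find(r"o w") == (4, 7)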
I’m not sure I drew the boundaries at all the right places; for example I could make the OS-independent interface even more minimal and add the string flattening functionality in a separate tree that decorates the underlying tree. I plan to revisit this after adding more functionality and implementing Linux support, at which point the interfaces should stabilize.
Conclusion
By integrating Dragonfly with accessibility APIs, we can re-create long-broken Dragon functionality and extend it in new and interesting ways. This integration is still a work in progress, but I didn’t want to wait any longer before allowing others to enjoy it. Please test it out and let me know what you think in the comments!
Comments
Wow, this looks pretty cool. I was just using Google Docs the other day and getting frustrated that it could not recognize when it was in the middle of a sentence, creating extra capitalization and incorrect spacing if I didn’t dictate an entire sentence at a time. Plus, mapping automatic keyboard shortcuts for finding and moving the cursor is a bit awkward in the web browser.
I’m not sure if what you wrote will help with the capitalization and spacing? Also, above you mention Google Docs, Google Sheets, and Gmail implicitly. Have you tested it with anything else?
Thanks!
This integration should definitely help with capitalization and spacing in general. For example, the text replacement functionality automatically preserves case (including upper-case). Unfortunately, Dragon does not make it possible to intercept arbitrary dictation, so this is only going to be useful when processing commands. For engines other than Dragon, though, this functionality can be directly used by the dictation engine. For example, I’m working on a Google API integration, and already I have an option so that it won’t dictate unless I’m in a text field, avoiding problems with accidentally dictating keyboard shortcuts in Gmail, for example.
I’ve tested this with plenty of other apps (including WordPress!). As long as the website has proper support for browser accessibility standards (which comes for free if it uses standard editing elements) it will work fine.
I finally got around to implementing this and it is great!
This should make correcting homophones easier as well. I haven’t implemented it yet, but it seems like it should be relatively straightforward to select the word using a text query, instead of having to move the cursor over the word that needs to be replaced.
I noticed that in your grammar, you do not seem to be calling accessibility.stop() anywhere, but the documentation that you wrote indicates that process exit may be blocked if you do not call this. Have you actually seen that hanging happen, or is this a theoretical concern?
Thanks for the trailblazing here 🙂
Glad to hear you are enjoying this! Indeed, correcting simple misrecognitions using “replace with” is a huge timesaver.
Good sleuthing on accessibility.stop(). When using Dragonfly with Dragon, it’ll just kill Python when it closes. Also, Dragon is buggy enough that I usually rely on SpeechStart to stop and start it. I could in principle call this within unload(), but that only works because I’m using the controller from one place. Since the grammars don’t really “own” the controller, which is a singleton that could in principle be shared across multiple grammars and other places, it seemed cleaner not to have them stop it. The only place I use stop() is in my Google Speech API integration with Dragonfly, which is not yet released. There, I start speech recognition simply by running a Python script, and it’s nice to have a command that lets it exit gracefully just by returning from the main method. In that case, the clear owner is the starter script itself, which will outlive any grammar.
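For what it’s worth, the starter-script ownership pattern looks roughly like this (run_recognition_loop is a hypothetical stand-in for engine-specific logic):

from dragonfly import get_accessibility_controller

def run_recognition_loop():
    # Hypothetical placeholder: block until the user says an exit command.
    pass

def main():
    accessibility = get_accessibility_controller()
    try:
        run_recognition_loop()
    finally:
        # Stop the controller's background threads so the process can exit
        # cleanly when main() returns.
        accessibility.stop()

if __name__ == "__main__":
    main()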
Did you ever take a look at the Vortex code that Mark was working on?
https://github.com/mdbridge/Vocola-2/compare/vortex
It looks like the meat of that is here:
https://github.com/mdbridge/Vocola-2/compare/vortex#diff-c8c351de09a82ee7d03fa32d080ba156
I remember that that project plugged into the Dragon dictation object somehow to provide Select-and-Say for apps that did not implement that special recipe of Windows messages for text selection.
I’m curious whether this project plugged into the same or similar interfaces?
Good question! I’m not an expert on Vortex (Mark can comment) but based on my memory of playing around with it and looking at the code right now, it appears quite different. It integrates with DictObj, which is a built-in NatLink feature that exposes the dictation system of Dragon to Python, allowing it to provide its own integration with applications. By using this, Vortex is able to regain basic text control in applications that don’t support it, but it doesn’t have full text control such that it works with pre-existing text or manual movement of the cursor. In contrast, my system bypasses Dragon entirely and uses native accessibility APIs, so that it can be used with backends other than Dragon, and allows for a complete rethinking of the dictation and correction interface.
I’m having trouble installing pyia2, which as I understand it is required for setting up this enhanced text manipulation. When I entered regsvr32 IAccessible2Proxy.dll into an administrator PowerShell, I got the following pop-up window titled RegSvr32:
the module “IAccessible2Proxy.dll” failed to load. Make sure the binary is stored at the specified path or debug it to check for problems with the binary or dependent .DLL files. The specified module could not be found.
Someone advised me to clone your repository here https://github.com/wolfmanstout/pyia2 and then try again, but that did not help – I still got the same error message.
Can you please advise how to proceed? Perhaps I need to be in a specific directory for this to work? Thank you very much
Yes, you need to be in the directory containing that DLL for this to work. It’s in the pyia2 subdirectory of my repository once you’ve cloned that.
Thank you! I got it working, and this is great! So far it just seems to work in Firefox and Chrome. I would love to get it working in a LaTeX editor like LyX, as mentioned in the Dragonfly chat, but this is great so far.
I spoke too soon: this does work in Texmaker (but not LyX or TeXstudio, which both use Qt). I would have thought maybe the problem is that LyX is not exactly plain text (though I don’t exactly know what that means either; I’m very much an amateur), but I think TeXstudio is plain text and it does not seem to work there.
Sometimes the accessibility commands stop working in applications where they usually work. When that happens, I see the following message in the Natlink window: “WARNING:accessibility:Nothing is focused.” Is there anything that I can do about this? It seems like if it’s working sometimes in a given application, it should be working always in that application. I have noticed this in Google Chrome and I think in Texmaker. Thank you.
My best guess is that either the accessibility event processing is lagging behind, or this is a bug in Chrome. If you wait, does it resolve? What if you change focus to something that is known to work well, such as the address bar?
The accessibility event processing within Dragonfly isn’t very complicated, so I doubt there’s a bug there. There is some weirdness with the way that threads are managed by NatLink that can occasionally cause deadlock, but it doesn’t sound like that’s what you are seeing.
I could be wrong, but isn’t it a lot easier to use a smart dictation box like the one from Speech Productivity (in Europe somewhere)? Such dictation boxes use rich text edit controls that give Dragon all the usual select and say capabilities. At the end, you say “transfer” to inject the text into your browser (or wherever).
Great question! For the longest time I used Notepad for this purpose and had commands that quickly opened it up and transferred text out (turns out this is actually faster than using the built-in Dragon dictation box, see earlier post for details). There are a few problems with this approach:
* Rich text doesn’t cover everything (I realize that Notepad is not rich text, but I tried other options). Plenty of websites (e.g. Gmail) embed weird things that don’t remain the same when copied to and from a rich text editor.
* It’s very nice to not leave the visual context of whatever text box you are editing. E.g. previous emails in Gmail. Seems like a minor issue but personally I found it was very significant. Starting every text editing session with an empty white page is rather distracting.
* I’ve added functionality that goes beyond Dragon’s Select-and-Say, as detailed in this post.
Certainly this was a lot of work for me to implement, but now that that’s done, this should be relatively easy to integrate once and use from then on.
Hi James, such fantastic work, thank you so much for sharing it, I use it daily and couldn’t be without it.
I have written a method for ensuring the words in the active control are weighted much higher by Dragon’s voice recognition engine, for use with all Windows text/edit controls and certain editors, and it has practically eliminated frustrating misrecognitions. I have also written a method to avoid erroneous capitalisation and missing spaces. With the help of your accessibility API this could help people dictating into browsers etc. too. Could you give an example of how to use the following functionalities, please? I’m afraid Python is not my native language, I am still quite new to it, and I have not had much luck with them:
get_accessibility_controller()
is_editable_focused(controller)
_get_focused_text(context)
class CursorPosition(enum.Enum)
select_text(controller, query)
Thanks again and keep up the good work 🙂
Thank you for your kind words!
The big code block in this post does illustrate each of these. You should also check out the official docs for this library: https://dragonfly2.readthedocs.io/en/latest/accessibility.html Does that answer your question? If not, what are you looking for specifically?
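For anyone else looking for a starting point, here is a rough sketch (the import path for is_editable_focused is my assumption; check the linked docs for the exact location):

from dragonfly import CursorPosition, TextQuery, get_accessibility_controller
# Assumed import path for the helper listed above; see the accessibility docs.
from dragonfly.accessibility import is_editable_focused

controller = get_accessibility_controller()

# Only act when an editable text control has focus.
if is_editable_focused(controller):
    # Select the word "elephant" in the focused control.
    controller.select_text(TextQuery(end_phrase="elephant"))
    # Then move the cursor to just before it.
    controller.move_cursor(TextQuery(end_phrase="elephant"),
                           CursorPosition.BEFORE)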
What is the status of this these days? I know you were looking at integrating this with Talon. I have somebody who was asking on one of the RSI groups about alternatives to Dragon when it comes to writing and editing text. I don’t think they are a programmer, so the simpler the better.
I have this working for my own setup but it breaks lots of Talon best practices (e.g. requires pip installs) so it’s not something I’m advertising widely, especially to nonprogrammers. I’m waiting on a couple Talon APIs (OCR and Levenshtein distance) to share something more widely. I definitely haven’t forgotten and I look forward to releasing this more broadly!
Thanks, James. A clarification on this: this article seems to be specific to the browser accessibility APIs, but I thought last time we chatted you were referring to the general accessibility APIs (Windows UI Automation, the accessibility API on Mac, and the Linux equivalent). Is that still the case? Or is this work only focused on the browser?
This work isn’t specific to browser accessibility; it’s integrating with IAccessible2 which is a Windows accessibility API. These days if you integrate with Microsoft’s UI Automation it will automatically handle translation to IAccessible2. I haven’t integrated directly with that yet, though.