Mozilla DeepSpeech: Initial Release!

Last week, Mozilla announced the first official releases of DeepSpeech and Common Voice, their open source speech recognition system and speech dataset! They seem to have made a lot of progress on DeepSpeech in little time: they had a target of <10% word error rate and achieved 6.5%! This is a very strong result — for comparison, Google boasts a 4.9% WER (albeit on different datasets). See their technical post for more details on how they pulled it off.

For this post, I’ll cover the basic information you’ll need to get it up and running on a Linux guest VM running under VirtualBox on a Windows host, since that’s my home setup. Note that the engine has not yet been integrated into any sort of real-time system, so what you’ll have at the end of this is a developer’s sandbox to play with — not something you can start using day-to-day. I do hope to eventually get it integrated into my daily workflow, but that’s going to take much more time.

UPDATE (12/25): If you are using Windows 10, consider running DeepSpeech natively on WSL (Windows Subsystem for Linux) instead of in a VM; you won’t have to compile from source and recognition will be faster. Instructions here: https://fotidim.com/deepspeech-on-windows-wsl-287cb27557d4. If you run into problems with processor limitations, see the info below on how to adjust CPU optimizations when compiling from source.

Installation

First, a note on requirements. Currently, the only supported operating systems are Mac and Linux (not Windows). Additionally, in order to get real-time recognition speed, you need a high-performance NVIDIA GPU (e.g. a GeForce GTX 1070) on Linux (GPU acceleration is not supported on Mac). Unfortunately, even though I have such a GPU, VirtualBox does not support PCI passthrough from a Windows host to a Linux guest, so I can’t take advantage of it. That’s why I chose to target the CPU on my Linux VM to get started. At some point I’ll also want to set up dual-boot Linux to use my GPU (and possibly start using Aenea). For more details on requirements, see the release notes.

My first attempt to get this working followed the README to install the Python package using pip. Unfortunately, this didn’t work because the pip release of DeepSpeech was built with processor optimizations that are not available on a VirtualBox VM (specifically, FMA). Instead, I had to build from source.
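You can check which instruction-set extensions VirtualBox actually exposes by inspecting /proc/cpuinfo from inside the guest; the flags below are just the ones relevant here (in my VM, fma was missing):

# Inside the Linux guest: list which of the relevant CPU feature flags are exposed
grep -owE 'fma|avx|avx2|sse4_2' /proc/cpuinfo | sort -u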

You should start by following these instructions to install Git LFS, clone the code, and optionally create a Python virtualenv (but stop before it says to run pip install deepspeech). Next, you’ll find the instructions to build from source here. That covers most of the details, so I’ll just describe the steps you’ll need to modify.
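For reference, that stage boiled down to something like this on my machine (a sketch only; defer to the README if it differs):

# Assumes Git LFS is already installed for your distro
git lfs install
git clone https://github.com/mozilla/DeepSpeech.git
cd DeepSpeech

# Optional: keep the Python dependencies isolated in a virtualenv
virtualenv deepspeech-venv
source deepspeech-venv/bin/activate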

First, if you want to squeeze as much out of your CPU as you possibly can, you should compile DeepSpeech with all the CPU extensions available to the VM. To take advantage of AVX2 in VirtualBox (assuming your CPU supports it; Haswell was the first generation that did), you will need to enable it manually by following these instructions (a command-line sketch appears after the build command below). Then, you will need to modify the bazel build command when building TensorFlow and DeepSpeech. In theory, TensorFlow configures bazel to detect your CPU architecture and handle this automatically, but in practice it does not. To fix this, adjust the bazel command like so:

bazel build -c opt --copt=-O3 \
    --copt=-mavx --copt=-mavx2 --copt=-mfpmath=both --copt=-msse4.2 \
    //tensorflow:libtensorflow_cc.so //tensorflow:libtensorflow_framework.so \
    //native_client:deepspeech //native_client:deepspeech_utils \
    //native_client:libctc_decoder_with_kenlm.so //native_client:generate_trie

Note that I did not enable FMA, which, as I mentioned earlier, is not supported by VirtualBox.
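For reference, one way to expose AVX2 to a VirtualBox guest is to set a CPUM extradata key from the host while the VM is powered off. This may not match the linked instructions exactly, and the key name can vary by VirtualBox version, so treat the line below as an illustration ("My Linux VM" is a placeholder for your VM's name):

VBoxManage setextradata "My Linux VM" VBoxInternal/CPUM/IsaExts/AVX2 1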

Testing

After you complete all the instructions for building and installing DeepSpeech, you’ll want to download the models and audio data so you have something to experiment with. You can find these on the release page.
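From the command line, that looks roughly like this; the filenames below are the ones I grabbed from the 0.1.0 release and may differ for later releases, so double-check the release page:

wget https://github.com/mozilla/DeepSpeech/releases/download/v0.1.0/deepspeech-0.1.0-models.tar.gz
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.1.0/audio-0.1.0.tar.gz
tar xvf deepspeech-0.1.0-models.tar.gz
tar xvf audio-0.1.0.tar.gz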

Finally, you can test it all out!

$ python/client.py ../models/output_graph.pb ../audio/2830-3980-0043.wav ../models/alphabet.txt ../models/lm.binary ../models/trie
Loading model from file ../models/output_graph.pb
Loaded model in 14.863s.
Loading language model from files ../models/lm.binary ../models/trie
Loaded language model in 19.973s.
Running inference.
experience proves this
Inference took 12.920s for 1.975s audio file.

If it complains about missing .so files, try adding the following to your library path:

export LD_LIBRARY_PATH="/path/to/tensorflow/bazel-bin/tensorflow:/usr/local/lib:${LD_LIBRARY_PATH}"

The very first run is usually slow, so try running it again; inference should be faster. I also did tests before and after applying the CPU extensions, and found that they reduced recognition time in this example from about 5.2s to 4.9s — modest, but I’ll take what I can get. I haven’t yet tried it with a GPU, but supposedly it is faster than real time, which is exciting.

Next, you’ll certainly want to test out your own audio. I recommend using Audacity to generate the recording; just a few steps will produce the right format (if you’d rather convert an existing file from the command line, see the sox example after these steps):

  1. Change your microphone recording to mono by adjusting the dropdown near the top of the window from “2 (Stereo) Recording” to “1 (Mono) Recording”.
  2. Change the “Project Rate” in the lower left of the window to 16000.
  3. Record the audio.
  4. Export as WAV using “File::Export Audio” with the default settings; the type should be “WAV (Microsoft) signed 16-bit PCM”, which is the default.
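
If you’d rather convert an existing recording from the command line, sox can produce the same format; here’s a sketch (assuming sox is installed, with input.wav and output.wav as placeholder names):

# Resample to 16 kHz, mix down to mono, and write 16-bit PCM WAV
sox input.wav -r 16000 -c 1 -b 16 output.wav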

That’s it for now! Some next steps I’m planning: transcribing my own speech in real-time, and testing it out on my GPU by dual-booting Linux.

16 thoughts on “Mozilla DeepSpeech: Initial Release!”

  1. This is awesome! Thanks for the heads up. I’m excited to see what I’ll be able to do with DeepSpeech.

    Someone got it to work using WSL (Windows Subsystem for Linux), which I found much easier than using VirtualBox. WSL is the Linux compatibility layer Microsoft released a while back.

    https://fotidim.com/deepspeech-on-windows-wsl-287cb27557d4

    (There was only one problem. pip install deepspeech required root privileges, but I fixed that by using sudo – sudo pip install deepspeech)

    The whole installation was very easy and I believe it’s faster than VirtualBox. I still haven’t been able to get the GPU working though. I don’t think it’s possible in WSL at the moment.

    1. Thanks, that’s super helpful! Added a link near the top of this post. Unfortunately I don’t have Windows 10 on my desktop, so I can’t do an apples-to-apples performance comparison. I installed it on my laptop from source without much of a hitch though (had to do source installation because my laptop has an older CPU).

      Initially I was very excited by the possibility of having this work directly with Dragonfly on Windows, but alas I was unable to get pywin32 to install on Python within WSL (problems importing _winreg). Still, this definitely seems like the way to go if you are running Windows 10.

    1. I don’t think it’s quite ready for production use with Dragonfly, but I’m hoping it can get there soon. The biggest hurdle right now is that the DeepSpeech API doesn’t yet support streaming speech recognition, which means choosing between a long delay after an utterance or breaking the audio into smaller segments, which hurts recognition quality. Other than that, I don’t think it’s too far off. I’m currently working on integrating Dragonfly with the Google Speech API, which, like DeepSpeech (and unlike Dragon), does not use grammars. It’s already usable for basic operations. There are some gotchas related to the lack of grammar support (unusual vocabularies or word sequences are not recognized well), but because DeepSpeech is open source, these ought to be more surmountable than with the Google API. I plan to do a blog post on my Google integration once I’ve made a few more improvements (you can check out my repo if you want an early look).

    1. I plan to refactor the Windows dependencies to make this multiplatform before announcing this more widely. They are already pretty minimal — at the moment the only dependency I introduce in the diff is win10toast, and that’s purely to help with debugging and could easily be replaced on Linux with an equivalent. I’m just doing all my development on Windows because that’s what the built-in Dragonfly actions (and my grammar) are built for. I’m sure this could be made to work with Aenea and run entirely in Linux, though.

      Linux compatibility is one of the main reasons I’m doing this — and also to have a backup option in case Dragon compatibility with Natlink never gets revived after the DNS 15 breakage.

    1. Guess I messed that markdown up a little. It should say “I’m on an ibm x230 (included link to specs) and get similar numbers.”

      1. That figure was on the very first run, before warm-up. Typically it gets closer to five seconds. I have blazing fast specs, at least by the standards of two years ago: 4.4 GHz quad-core CPU, 16 GB memory, Samsung 950 Pro SSD, NVIDIA GeForce GTX 980.

        I got slightly faster results using Windows Subsystem for Linux, which let me utilize more processor extensions. By far, though, the most usable results came from setting up dual-boot Ubuntu and using GPU acceleration. Then recognition was faster than real time by a factor of about 2, from what I recall. That’s definitely my plan for using this in production, but at the moment my development environment is Windows (because Dragon), so I’m still doing testing there.

    1. Actually, the latest DeepSpeech version has real-time streaming support on CPU! Here’s a blog post from Mozilla with a code snippet that shows how to use it:
      https://hacks.mozilla.org/2018/09/speech-recognition-deepspeech/

      I don’t think anyone has integrated this into Dragonfly yet. The recognition quality isn’t quite there yet for production usage. If I were going to integrate, I’d use my “google” Dragonfly branch as a starting point because that’s another system that doesn’t support grammars (I plan to get my changes integrated upstream into Danesprite’s Dragonfly fork — just haven’t had time).

      1. Hey James, thanks for the reply! I have read that blog post but haven’t had a chance to try it, since there’s a (possibly newer) example included in the repo anyway.

        So I’ve spent the last week digging, troubleshooting, compiling, and recompiling to get this working. deepspeech/examples has a real-time mic inference example script, but getting pyaudio (which wraps portaudio) to work correctly has been a mission as well. I’m so close. It looks like my Yeti microphone doesn’t support a 16k sample rate, so I’ll need to downsample from 44k somehow.
        Then, fingers crossed, it’ll work.

        I’m not familiar enough with grammars, but wouldn’t it be easy enough to make if/else clauses or have commands grouped by leading/prefix commands?

        I’m kind of intent on writing my own macro voice command system from scratch if I can get deepspeech working. It’s offline and runs on Linux, which I need for my work computer.

        cheers,
        Jordan

      2. Update #2: It’s WORKING!

        Despite my being a Python technical developer in VFX, this still took me a week to get working on Kubuntu 18: I had to recompile portaudio + pyaudio, get the correct tensorflow-gpu + CUDA dependencies (thanks, Anaconda), and finally modify the examples to downsample my Yeti Blue microphone from 44100 to 16000. Success!
        (maybe I should start my own blog!)

        It’s not as accurate as I had hoped — maybe the downsampling is interfering with it so I’ll have to do some experimentation. But it is still at least 80% accurate for me. Enough to hopefully write my own grammar / macro system.

        1. FYI, I’m working on a full-featured, offline Kaldi backend for Dragonfly, and hope to release an alpha version for testing very soon. I’m developing it for Windows initially, but it should be relatively easy to get working on Linux (just the usual compiling/distributing binaries issues).

          (BTW, I wrote the DeepSpeech mic_vad_streaming example; I hope it was helpful despite missing some features.)

  2. Update #3.
    I am in the process of opening a pull request against DeepSpeech for examples/mic_vad_testing.py, so that one can provide a device number and an input rate and have it downsample automatically.
