Last week, Mozilla announced the first official releases of DeepSpeech and Common Voice, their open source speech recognition system and speech dataset! They seem to have made a lot of progress on DeepSpeech in little time: they had a target of <10% word error rate and achieved 6.5%! This is a very strong result; for comparison, Google boasts a 4.9% WER (albeit on different datasets). See their technical post for more details on how they pulled it off.
For this post, I’ll cover the basic information you’ll need to get it up and running on a Linux guest VM running on VirtualBox on a Windows host, since that’s my home setup. Note that the engine has not yet been integrated into any sort of real-time system, so what you’ll have at the end of this is a developer’s sandbox to play with, not something you can start using day-to-day. I do hope to eventually get it integrated into my daily workflow, but that’s going to take much more time.
UPDATE(12/25): If you are using Windows 10, consider running DeepSpeech natively on WSL (Windows Subsystem for Linux) instead of a VM, where you don’t have to compile from source and you’ll have faster recognition speed. Instructions here: https://fotidim.com/deepspeech-on-windows-wsl-287cb27557d4. If you run into problems with processor limitations, see info below on how to adjust CPU optimizations when compiling from source.
First, a note on requirements. Currently, the only supported operating systems are Mac and Linux (not Windows). Additionally, in order to get real-time recognition speed, you need to use a high-performance NVIDIA GPU (e.g. GeForce GTX 1070) on Linux (Mac not supported). Unfortunately, even though I have such a GPU, VirtualBox does not support PCI passthrough from Windows to Linux, so I can’t take advantage of it. That’s why I chose to target CPU on my Linux VM to get started. At some point I’ll also want to set up dual-boot Linux to use my GPU (and possibly start using Aenea). For more details on requirements, see the release notes.
My first attempt to get this working followed the README to install the Python package using pip. Unfortunately, this didn’t work because the pip release of DeepSpeech was built with processor optimizations that are not available on a VirtualBox VM (specifically, FMA). Instead, I had to build from source.
You should start by following these instructions to install Git LFS, clone the code, and optionally create a Python virtualenv (but stop before it says to run pip install deepspeech). Next, you’ll find the instructions to build from source here. That covers most of the details, so I’ll just describe the steps you’ll need to modify.
First, if you want to squeeze as much out of your CPU as you possibly can, you should compile DeepSpeech with all the possible CPU extensions. In order to take advantage of AVX2 in VirtualBox if your CPU supports it (first generation was Haswell), you will need to manually enable it by following these instructions. Then, you will need to modify the bazel build command when building TensorFlow and DeepSpeech. In theory, bazel is configured by TensorFlow such that it should automatically detect your CPU architecture and handle this for you, but in practice it does not. In order to fix this, simply adjust the bazel command like so:
bazel build -c opt --copt=-O3 --copt=-mavx --copt=-mavx2 --copt=-mfpmath=both --copt=-msse4.2 //tensorflow:libtensorflow_cc.so //tensorflow:libtensorflow_framework.so //native_client:deepspeech //native_client:deepspeech_utils //native_client:libctc_decoder_with_kenlm.so //native_client:generate_trie
Note that I did not enable FMA, which as I mentioned earlier is not supported by VirtualBox.
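Before picking the --copt flags, it can help to confirm which instruction sets the guest VM actually exposes. The following is a minimal sketch, assuming an x86 Linux guest where /proc/cpuinfo contains a "flags" line; the helper names cpu_flags and report are my own for illustration and are not part of DeepSpeech or TensorFlow.

```python
from pathlib import Path

# Instruction-set flags relevant to the bazel options above, spelled the way
# they appear in the "flags" line of /proc/cpuinfo on x86 Linux.
WANTED_FLAGS = ("sse4_2", "avx", "avx2", "fma")


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line found (e.g. a non-x86 system)


def report(cpuinfo_text, wanted=WANTED_FLAGS):
    """Map each wanted flag to True or False depending on whether it is listed."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        for name, present in report(cpuinfo.read_text()).items():
            print(f"{name:8s} {'yes' if present else 'no'}")
    else:
        print("/proc/cpuinfo not found; this sketch assumes a Linux guest")
```

If a flag shows up as "no" inside the VM, passing the matching -m option to the compiler would likely produce a binary that crashes with an illegal instruction, which is the same failure mode I hit with the pip package.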
After you complete all the instructions for building and installing DeepSpeech, you’ll want to download the models and audio data so you have something to experiment with. You can find these on the release page.
Finally, you can test it all out!
$ python/client.py ../models/output_graph.pb ../audio/2830-3980-0043.wav ../models/alphabet.txt ../models/lm.binary ../models/trie
Loading model from file ../models/output_graph.pb
Loaded model in 14.863s.
Loading language model from files ../models/lm.binary ../models/trie
Loaded language model in 19.973s.
experience proves this
Inference took 12.920s for 1.975s audio file.
If it complains about missing .so files, try adding the following to your library path:
The very first run is usually slow so try running it again and inference should be faster. I also did tests before and after applying the CPU extensions, and found it reduced recognition time in this example from about 5.2s to 4.9s — modest, but I’ll take what I can get. I haven’t yet tried with GPU but supposedly it is faster than real-time, which is exciting.
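To put those timings in perspective, a common yardstick is the real-time factor: processing time divided by audio length, where anything below 1.0 is faster than real time. Here is a small sketch using only the numbers quoted above; the function name real_time_factor is my own, and I am assuming the 5.2s and 4.9s figures refer to the same 1.975s clip.

```python
def real_time_factor(inference_seconds, audio_seconds):
    """Processing time divided by audio length; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


AUDIO_SECONDS = 1.975  # length of the sample clip reported in the output above

# Timings taken from this post: the cold first run, then the warm runs
# before and after enabling the extra CPU extensions.
TIMINGS = [
    ("first run", 12.920),
    ("before CPU extensions", 5.2),
    ("after CPU extensions", 4.9),
]

for label, seconds in TIMINGS:
    factor = real_time_factor(seconds, AUDIO_SECONDS)
    print(f"{label}: {factor:.2f}x real time")
```

Even the best CPU run here is roughly two and a half times slower than real time, which is why the GPU path is the interesting one for live dictation.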
Next, you’ll certainly want to test out your own audio. I recommend using Audacity to generate the recording. Just a few steps to generate the right format:
- Change your microphone recording to mono by adjusting the dropdown near the top of the window from “2 (Stereo) Recording” to “1 (Mono) Recording”.
- Change the “Project Rate” in the lower left of the window to 16000.
- Record the audio.
- Export as WAV using “File::Export Audio” with type “WAV (Microsoft) signed 16-bit PCM” (the default).
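If you want to double-check an exported file before feeding it to the client, Python's standard wave module can read the header. This is a minimal sketch that checks for the three properties from the steps above (mono, 16000 Hz, 16-bit samples); check_wav is a hypothetical helper of mine, not something shipped with DeepSpeech.

```python
import sys
import wave


def check_wav(path, rate=16000, channels=1, sample_width_bytes=2):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != channels:
            problems.append(
                f"expected {channels} channel(s), found {wav.getnchannels()}"
            )
        if wav.getframerate() != rate:
            problems.append(
                f"expected {rate} Hz, found {wav.getframerate()} Hz"
            )
        # Sample width is reported in bytes, so 16-bit PCM is 2.
        if wav.getsampwidth() != sample_width_bytes:
            problems.append(
                f"expected {sample_width_bytes * 8}-bit samples, "
                f"found {wav.getsampwidth() * 8}-bit"
            )
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    issues = check_wav(sys.argv[1])
    print("looks good" if not issues else "\n".join(issues))
```

Saved under a name of your choosing and run with the path to a recording as its only argument, it prints either "looks good" or one line per mismatch.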
That’s it for now! Some next steps I’m planning: transcribing my own speech in real-time, and testing it out on my GPU by dual-booting Linux.