CMU Pocket Sphinx engine back-end

Dragonfly is able to use the open source CMU Pocket Sphinx speech recognition engine as an alternative to DNS/Natlink, Kaldi and WSR. The engine has a few limitations (documented below) and is generally less accurate than the others, but it works well enough.

The CMU Pocket Sphinx engine may be used on Windows, macOS and Linux. Dragonfly has good support for desktop environments on each of these operating systems. As other sections of the documentation mention, Dragonfly does not support Wayland.

Setup

There are three Pocket Sphinx engine dependencies:

  • sphinxwrapper – alternative Python API for CMU Pocket Sphinx

  • pyjsgf – Java Speech Grammar Format (JSGF) compiler, matcher and parser

  • sounddevice – Python bindings for the PortAudio library

You can install these by running the following command:

pip install 'dragonfly[sphinx]'

If you are installing to develop Dragonfly, use the following instead:

pip install -e '.[sphinx]'

Note for Windows: This engine backend does not currently support Python version 3.10 or higher on Windows.

Note for Linux: You may need the portaudio headers to be installed in order to install/compile the sounddevice Python package. Under apt-based distributions, you can get them by running sudo apt install portaudio19-dev. You may also need to make your user account a member of the audio group to be able to access your microphone. Do this by running sudo usermod -a -G audio <account_name>.

Once the dependencies are installed, you’ll need to copy the dragonfly/examples/sphinx_module_loader.py script into the folder with your grammar modules and run it using:

python sphinx_module_loader.py

This file is the equivalent of the ‘core’ directory that Natlink uses to load grammar modules. When run, it scans the directory it’s in for files beginning with _ and ending with .py, then tries to load them as command modules.

Engine configuration

This engine can be configured via its engine configuration object.

You can make changes to the engine.config object directly in your sphinx_module_loader.py file before connect() is called, or create a config.py module in the same directory.

The LANGUAGE option specifies the engine’s user language. This is English ("en") by default.
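For example, here is a minimal sketch of changing the configuration in the module loader before connecting. The "de" value is purely illustrative; English ("en") is the default.

from dragonfly import get_engine

# Create the Sphinx engine and switch the user language before
# connecting.  "de" is an illustrative value.
engine = get_engine("sphinx")
engine.config.LANGUAGE = "de"
engine.connect()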

Audio configuration

Audio configuration options are used to record from the microphone and to validate input wave files.

These options must match the requirements of the acoustic model being used. The default values match the requirements for the 16kHz CMU US English models. An example config.py overriding these options is shown after the list below.

  • CHANNELS – number of audio input channels (default: 1).

  • SAMPLE_WIDTH – sample width for audio input in bytes (default: 2).

  • FORMAT – should match the sample width (default: int16).

  • RATE – sample rate for audio input in Hz (default: 16000).

  • BUFFER_SIZE – frames per recorded audio buffer (default: 1024).
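As an illustration, a config.py module overriding the audio options might look like the following sketch. The values shown simply restate the documented defaults.

# config.py -- engine configuration module, placed in the same
# directory as the module loader.  These values restate the defaults
# for the 16kHz CMU US English models.
CHANNELS = 1            # one audio input channel
SAMPLE_WIDTH = 2        # two bytes (16 bits) per sample
FORMAT = "int16"        # must match the sample width
RATE = 16000            # 16kHz sample rate
BUFFER_SIZE = 1024      # frames per recorded audio buffer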

Decoder configuration

The DECODER_CONFIG object initialised in the engine config module can be used to set various Pocket Sphinx decoder options.

The following is the default decoder configuration:

import os

from sphinxwrapper import DefaultConfig

# Configuration for the Pocket Sphinx decoder.
DECODER_CONFIG = DefaultConfig()

# Silence the decoder output by default.
DECODER_CONFIG.set_string("-logfn", os.devnull)

# Set voice activity detection configuration options for the decoder.
# You may wish to experiment with these if noise in the background
# triggers speech start and/or false recognitions (e.g. of short words)
# frequently.
# Descriptions for VAD configuration options were retrieved from:
# https://cmusphinx.github.io/doc/sphinxbase/fe_8h_source.html

# Number of silence frames to keep after transitioning from speech to
# silence.
DECODER_CONFIG.set_int("-vad_postspeech", 30)

# Number of speech frames to keep before transitioning from silence to
# speech.
DECODER_CONFIG.set_int("-vad_prespeech", 20)

# Number of speech frames to trigger VAD from silence to speech.
DECODER_CONFIG.set_int("-vad_startspeech", 10)

# Threshold for decision between noise and silence frames.
# Log-ratio between signal level and noise level.
DECODER_CONFIG.set_float("-vad_threshold", 3.0)

There does not appear to be much documentation on these options outside of the Pocket Sphinx config_macro.h header file.

The easiest way of seeing the available decoder options and their default values is to initialise a decoder and read the log output:

from sphinxwrapper import PocketSphinx
PocketSphinx()

Changing Models and Dictionaries

The DECODER_CONFIG object can be used to configure the pronunciation dictionary as well as the acoustic and language models. You can do this with something like:

DECODER_CONFIG.set_string('-hmm', '/path/to/acoustic-model-folder')
DECODER_CONFIG.set_string('-lm', '/path/to/lm-file.lm')
DECODER_CONFIG.set_string('-dict', '/path/to/dictionary-file.dict')

The language model, acoustic model and pronunciation dictionary should all use the same language or language variant. See the CMU Sphinx wiki for a more detailed explanation of these components.

Improving Speech Recognition Accuracy

CMU Pocket Sphinx can sometimes have trouble accurately recognizing what was said. To remedy this, you may need to adapt the acoustic model that Pocket Sphinx uses. This is similar to how Dragon sometimes requires training. The CMU Sphinx adaptation tutorial covers this topic. There is also this YouTube video on model adaptation and the SphinxTrainingHelper bash script.

Adapting your model may not be necessary; there might be other issues with your setup. There is more information on tuning the recognition accuracy in the CMU Sphinx tuning tutorial.

Limitations

This engine has a few limitations, most notably with spoken language support and Dragonfly’s Dictation functionality.

Dictation

Mixing free-form dictation with grammar rules is difficult with the CMU Sphinx decoders. It is either dictation or grammar rules, not both. For this reason, Dragonfly’s CMU Pocket Sphinx SR engine supports speaking free-form dictation, but only on its own.

Rules that require Dictation in combination with other basic Dragonfly elements, such as Literal, RuleRef and ListRef, will not be recognized properly by voice with this SR engine. They can, however, be recognized via the engine.mimic() method, the Mimic action or the Playback action.
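For example, here is a minimal sketch of a rule that mixes a literal with Dictation and is exercised through mimic() instead of speaking. The rule name and phrasing are illustrative.

from dragonfly import CompoundRule, Dictation, Grammar, get_engine

class NoteRule(CompoundRule):
    # "note" is a literal part; <text> is free-form dictation.
    spec = "note <text>"
    extras = [Dictation("text")]

    def _process_recognition(self, node, extras):
        print("Note: %s" % extras["text"])

# Connect the engine and load the grammar.
engine = get_engine("sphinx")
engine.connect()
grammar = Grammar("notes")
grammar.add_rule(NoteRule())
grammar.load()

# Speaking "note buy milk" will not match this rule on this engine,
# but mimicking the same words does:
engine.mimic("note buy milk")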

Unknown words

CMU Pocket Sphinx uses pronunciation dictionaries to look up the phonetic representations of words in grammars and language models in order to recognize them. If you use words in your grammars that are not in the dictionary, a message similar to the following will be printed:

grammar ‘name’ used words not found in the pronunciation dictionary: notaword

If you get a message like this, try changing the words in your grammars, either by splitting them up or by using similar words, e.g. changing “natlink” to “nat link”. You may also try adding words to the pronunciation dictionary.
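You can also check words against the dictionary ahead of time with the engine’s check_valid_word() method (documented under Engine API below). A short sketch; the candidate words are illustrative:

from dragonfly import get_engine

# Connect first so that the pronunciation dictionary is loaded.
engine = get_engine("sphinx")
engine.connect()

# Check some candidate words against the dictionary.
for word in ["natlink", "nat", "link"]:
    if not engine.check_valid_word(word):
        print("%r is not in the pronunciation dictionary" % word)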

Spoken Language Support

There are only a handful of languages with models and dictionaries available from SourceForge, although it is possible to build your own language model using lmtool or your own pronunciation dictionary using lextool. There is also a CMU Sphinx tutorial on building language models.

I have tested Russian and Chinese models and dictionaries with this engine implementation. They both work, though the latter required recompiling CMU Pocket Sphinx with the FSG_PNODE_CTXT_BVSZ constant manually set to 6 or higher. There may be encoding errors, depending on Python version and platform.
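Putting these pieces together, switching the engine to another spoken language means setting the LANGUAGE option and pointing the decoder at a matching acoustic model, language model and dictionary. A hedged config.py sketch for a Russian setup; the paths are illustrative:

import os

from sphinxwrapper import DefaultConfig

# Use Russian as the user language.
LANGUAGE = "ru"

DECODER_CONFIG = DefaultConfig()
DECODER_CONFIG.set_string("-logfn", os.devnull)

# These paths are illustrative; all three files must be for the same
# language or language variant.
DECODER_CONFIG.set_string("-hmm", "/path/to/russian-acoustic-model")
DECODER_CONFIG.set_string("-lm", "/path/to/russian.lm")
DECODER_CONFIG.set_string("-dict", "/path/to/russian.dict")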

Dragonfly Lists and DictLists

Dragonfly Lists and DictLists function as normal, private rules for the Pocket Sphinx engine. On updating a Dragonfly list or dictionary, the grammar they are part of will be reloaded. This is because there is unfortunately no JSGF equivalent for lists.
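For example, here is a brief sketch (all names are illustrative) showing a list update triggering a grammar reload behind the scenes:

from dragonfly import (Function, Grammar, List, ListRef, MappingRule,
                       get_engine)

get_engine("sphinx").connect()

fruit = List("fruit", ["apple", "banana"])

def eat(fruit_ref):
    print("Eating a %s." % fruit_ref)

class FruitRule(MappingRule):
    mapping = {"eat <fruit_ref>": Function(eat)}
    extras = [ListRef("fruit_ref", fruit)]

grammar = Grammar("fruits")
grammar.add_rule(FruitRule())
grammar.load()

# Appending to the list reloads the "fruits" grammar, since there is
# no JSGF equivalent for lists.
fruit.append("cherry")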

Engine API

class SphinxEngine[source]

Speech recognition engine back-end for CMU Pocket Sphinx.

check_valid_word(word)[source]

Check if a word is in the current Sphinx pronunciation dictionary.

Return type:

bool

property config

Python module/object containing engine configuration.

You will need to restart the engine with disconnect() and connect() if the configuration has been changed after connect() has been called.

Returns:

config module/object

connect()[source]

Set up the CMU Pocket Sphinx decoder.

This method does nothing if the engine is already connected.

create_timer(callback, interval, repeating=True)[source]

Create and return a timer using the specified callback and repeat interval.

Note

Timers only run when the engine is processing audio.
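A short usage sketch; the callback is illustrative:

from dragonfly import get_engine

engine = get_engine("sphinx")

def heartbeat():
    # Only runs while the engine is processing audio.
    print("Still listening...")

# Call heartbeat() roughly once per second.
timer = engine.create_timer(heartbeat, 1)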

disconnect()[source]

Deallocate the CMU Sphinx decoder and any other resources used by it. If the engine is currently recognizing, the recognition loop will be terminated first.

This method unloads all loaded grammars.

mimic(words)[source]

Mimic a recognition of the given words.

process_buffer(buf)[source]

Recognize speech from an audio buffer.

This method is meant to be called sequentially with buffers from an audio source, such as a microphone or wave file.

This method will do nothing if connect() has not been called.

Parameters:

buf (str) – audio buffer

process_wave_file(path)[source]

Recognize speech from a wave file and return the recognition results.

This method checks that the wave file is valid. It raises an error if the file doesn’t exist, if it can’t be read or if the relevant WAV header parameters do not match those in the engine configuration.

The wave file must use the same sample width, sample rate and number of channels that the acoustic model uses.

If the file is valid, process_buffer() is then used to process the audio.

Multiple utterances are supported.

Parameters:

path – wave file path

Raises:

IOError | OSError | ValueError

Returns:

recognition results

Return type:

generator
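A brief usage sketch; the file name is illustrative, and each iteration yields the results for one utterance:

from dragonfly import get_engine

engine = get_engine("sphinx")
engine.connect()

# The wave file must match the engine's audio configuration
# (sample rate, sample width and number of channels).
for results in engine.process_wave_file("commands.wav"):
    print(results)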

set_exclusiveness(grammar, exclusive)[source]

Set the exclusiveness of a grammar.

speak(text)[source]

Speak the given text using text-to-speech.

Multiplexing interface for the CMU Pocket Sphinx engine

class SphinxTimerManager(interval, engine)[source]

Timer manager for the CMU Pocket Sphinx engine.

This class allows running timer functions if the engine is currently processing audio via one of three engine processing methods:

  • process_buffer()

  • process_wave_file()

  • do_recognition()

Note

Long-running timers will block Dragonfly from processing what was said, so be careful how you use them!

Audio frames will not normally be dropped because of timers, long-running or otherwise.

Sphinx Recognition Results Class

class Results(hypothesis, type, audio_buffers)[source]

CMU Pocket Sphinx recognition results class.

property audio_buffers

The audio for this recognition, if any.

property grammar

The grammar which processed this recognition, if any.

property hypothesis

The final hypothesis for this recognition.

property recognition_type

The type of this recognition.

property rule

The rule that matched this recognition, if any.

words()[source]

Get the words for this recognition.