CMU Pocket Sphinx engine back-end

This version of dragonfly contains an engine implementation using the open source, cross-platform CMU Pocket Sphinx speech recognition engine. You can read more about the CMU Sphinx speech recognition projects on the CMU Sphinx wiki.


There are three Pocket Sphinx engine dependencies. You can install them by running the following command:

pip install 'dragonfly2[sphinx]'

If you are installing to develop dragonfly, use the following instead:

pip install -e '.[sphinx]'

Once the dependencies are installed, you’ll need to copy the module loader script from the dragonfly/examples/ directory into the folder containing your grammar modules and run it with Python.


This file is the equivalent of the ‘core’ directory that NatLink uses to load grammar modules. When run, it scans its own directory for files beginning with _ and ending with .py, then tries to load them as command modules.

Cross-platform Engine

Pocket Sphinx runs on most platforms, including architectures other than x86, so it only makes sense that the Pocket Sphinx dragonfly engine implementation should also work on non-Windows platforms such as macOS and Linux distributions. To that end, I’ve made an effort to mock Windows-only functionality on non-Windows platforms for the time being, so that the engine components work correctly regardless of the platform.

Using dragonfly with a non-Windows operating system can already be done with Aenea using the existing NatLink engine. Aenea communicates with a separate Windows system running NatLink and DNS over a network connection and has server support for Linux (using X11), macOS, and Windows.

Engine configuration

This engine can be configured by changing the engine configuration.

You can make changes to the engine.config object directly in your module loader file before connect() is called, or create a separate configuration module in the same directory.

The LANGUAGE option specifies the engine’s user language. This is English ("en") by default.

Audio configuration

These audio configuration options are used to record from the microphone, to validate input wave files, and to write wave files if the training data directory is set.

These options must match the requirements for the acoustic model being used. The default values match the requirements for the 16kHz CMU US English models.

  • CHANNELS – number of audio input channels (default: 1).

  • SAMPLE_WIDTH – sample width for audio input in bytes (default: 2).

  • RATE – sample rate for audio input in Hz (default: 16000).

  • FRAMES_PER_BUFFER – frames per recorded audio buffer (default: 2048).
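As an illustrative sketch, overriding these defaults in an engine configuration module might look like the following. The option names come from the lists above; the values shown are the documented defaults, not recommendations:

```python
# Sketch of engine configuration overrides. Option names are those
# documented above; values shown are the defaults.

# The engine's user language (English by default).
LANGUAGE = "en"

# Audio input settings -- these must match the acoustic model being
# used. The values below suit the 16kHz CMU US English models.
CHANNELS = 1              # mono input
SAMPLE_WIDTH = 2          # 16-bit samples (2 bytes)
RATE = 16000              # 16kHz sample rate
FRAMES_PER_BUFFER = 2048  # frames per recorded audio buffer
```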

Keyphrase configuration

The following configuration options control the engine’s built-in keyphrases:

  • WAKE_PHRASE – the keyphrase to listen for when in sleep mode (default: "wake up").

  • WAKE_PHRASE_THRESHOLD – threshold value* for the wake keyphrase (default: 1e-20).

  • SLEEP_PHRASE – the keyphrase to listen for to enter sleep mode (default: "go to sleep").

  • SLEEP_PHRASE_THRESHOLD – threshold value* for the sleep keyphrase (default: 1e-40).

  • START_ASLEEP – boolean value for whether the engine should start in a sleep state (default: True).

  • START_TRAINING_PHRASE – keyphrase to listen for to start a training session where no processing occurs (default: "start training session").

  • START_TRAINING_PHRASE_THRESHOLD – threshold value* for the start training keyphrase (default: 1e-48).

  • END_TRAINING_PHRASE – keyphrase to listen for to end a training session if one is in progress (default: "end training session").

  • END_TRAINING_PHRASE_THRESHOLD – threshold value* for the end training keyphrase (default: 1e-45).

* Threshold values need to be set for each keyphrase. The CMU Sphinx LM tutorial has some advice on keyphrase threshold values.

If your language isn’t set to English, all built-in keyphrases will be disabled by default if they are not specified in your configuration.

Any keyphrase can be disabled by setting the phrase and threshold values to "" and 0 respectively.

Decoder configuration

The DECODER_CONFIG object initialised in the engine config module can be used to set various Pocket Sphinx decoder options.

The following is the default decoder configuration:

import os

from sphinxwrapper import DefaultConfig

# Configuration for the Pocket Sphinx decoder.
DECODER_CONFIG = DefaultConfig()

# Silence the decoder output by default.
DECODER_CONFIG.set_string("-logfn", os.devnull)

# Set voice activity detection configuration options for the decoder.
# You may wish to experiment with these if noise in the background
# triggers speech start and/or false recognitions (e.g. of short words)
# frequently.
# Descriptions for VAD configuration options were retrieved from:

# Number of silence frames to keep after transitioning from speech to silence.
DECODER_CONFIG.set_int("-vad_postspeech", 30)

# Number of speech frames to keep before transitioning from silence to speech.
DECODER_CONFIG.set_int("-vad_prespeech", 20)

# Number of speech frames needed to trigger VAD from silence to speech.
DECODER_CONFIG.set_int("-vad_startspeech", 10)

# Threshold for decision between noise and silence frames.
# Log-ratio between signal level and noise level.
DECODER_CONFIG.set_float("-vad_threshold", 3.0)

There does not appear to be much documentation on these options outside of the pocketsphinx/cmdln_macro.h and sphinxbase/fe.h header files. If this is incorrect or has changed, feel free to suggest an edit.

The easiest way of seeing the available decoder options as well as their default values is to run the pocketsphinx_continuous command with no arguments.

Changing Models and Dictionaries

The DECODER_CONFIG object can be used to configure the pronunciation dictionary as well as the acoustic and language models. You can do this with something like:

DECODER_CONFIG.set_string('-hmm', '/path/to/acoustic-model-folder')
DECODER_CONFIG.set_string('-lm', '/path/to/lm-file.lm')
DECODER_CONFIG.set_string('-dict', '/path/to/dictionary-file.dict')

The language model, acoustic model and pronunciation dictionary should all use the same language or language variant. See the CMU Sphinx wiki for a more detailed explanation of these components.

Training configuration

The engine can save .wav and .txt training files into a directory for later use. The following are the configuration options associated with this functionality:

  • TRAINING_DATA_DIR – directory to save training files into (default: "").

  • TRANSCRIPT_NAME – common name of files saved into the training data directory (default: "training").

Set TRAINING_DATA_DIR to a valid directory path to enable recording of .txt and .wav files. If the path is a relative path, it will be interpreted as relative to the module loader’s directory.

The engine will not attempt to make the directory for you as it did in previous versions of dragonfly.
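Since the engine will not create the directory, create it yourself before loading grammars. A sketch, using an arbitrary relative directory name:

```python
import os

# Illustrative values: a relative path (interpreted relative to the
# module loader's directory) and the default transcript name.
TRAINING_DATA_DIR = "training_data"
TRANSCRIPT_NAME = "training"

# The engine no longer creates this directory for you, so make it here.
os.makedirs(TRAINING_DATA_DIR, exist_ok=True)
```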

Engine API

class SphinxEngine[source]

Speech recognition engine back-end for CMU Pocket Sphinx.


DictationContainer

alias of DictationContainerBase


cancel_recognition()[source]

If a recognition was in progress, cancel it before processing the next audio buffer.


check_valid_word(word)[source]

Check if a word is in the current Sphinx pronunciation dictionary.

Return type: bool


property config

Python module/object containing engine configuration.

You will need to restart the engine with disconnect() and connect() if the configuration has been changed after connect() has been called.


config module/object


connect()[source]

Set up the CMU Pocket Sphinx decoder.

This method does nothing if the engine is already connected.

create_timer(callback, interval, repeating=True)[source]

Create and return a timer using the specified callback and repeat interval.

Note: Timers will not run unless the engine is recognising audio. Normal threads can be used instead with no downsides.
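As the note says, ordinary threads work fine here. A minimal repeating timer built on the standard library might look like the following. This is an illustrative sketch, not dragonfly’s timer implementation:

```python
import threading
import time

class RepeatingTimer:
    """Repeating timer built on threading.Timer -- a sketch of a normal
    thread-based alternative to engine.create_timer()."""

    def __init__(self, interval, callback):
        self.interval = interval
        self.callback = callback
        self._timer = None
        self._stopped = False

    def _run(self):
        if self._stopped:
            return
        self.callback()
        self.start()  # reschedule the next tick

    def start(self):
        self._timer = threading.Timer(self.interval, self._run)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        self._stopped = True
        if self._timer is not None:
            self._timer.cancel()

# Usage: run a callback every 20 ms for a short while.
ticks = []
timer = RepeatingTimer(0.02, lambda: ticks.append(1))
timer.start()
time.sleep(0.3)
timer.stop()
```

Unlike engine timers, this runs regardless of whether the engine is currently recognising audio.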

property default_search_result

The last hypothesis object of the default search.

This does not currently reach recognition observers because it is intended to be used for dictation results, which are currently disabled. Nevertheless this object can be useful sometimes.


Return type: Sphinx Hypothesis object | None


disconnect()[source]

Deallocate the CMU Sphinx decoder and any other resources used by it.

This method effectively unloads all loaded grammars and key phrases.


end_training_session()[source]

End the training session if one is in progress. This will allow recognition processing once again.


mimic(words)[source]

Mimic a recognition of the given words.


mimic_phrases(*phrases)[source]

Mimic a recognition of the given phrases.

This method accepts variable phrases instead of a list of words.


pause_recognition()[source]

Pause recognition and wait for resume_recognition() to be called or for the wake keyphrase to be spoken.


process_buffer(buf)[source]

Recognise speech from an audio buffer.

This method is meant to be called in sequence for multiple audio buffers. It will do nothing if connect() hasn’t been called.


buf (str) – audio buffer
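To sketch the calling sequence, audio from a wave file can be fed to process_buffer() one buffer at a time. The helper name below is made up for illustration, and the engine object is assumed to be connected already:

```python
import wave

def feed_wave_file(engine, path, frames_per_buffer=2048):
    # Read successive buffers from a wave file and hand each one to
    # process_buffer() in sequence -- roughly what process_wave_file()
    # does internally, minus the header validation it also performs.
    with wave.open(path, "rb") as wf:
        while True:
            buf = wf.readframes(frames_per_buffer)
            if not buf:
                break
            engine.process_buffer(buf)
```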


process_wave_file(path)[source]

Recognise speech from a wave file and return the recognition results.

This method checks that the wave file is valid. It raises an error if the file doesn’t exist, if it can’t be read or if the WAV header values do not match those in the engine configuration.

If recognition is paused (sleep mode), this method will call resume_recognition().

The wave file must use the same sample width, sample rate and number of channels that the acoustic model uses.

If the file is valid, process_buffer() is then used to process the audio.

Multiple utterances are supported.


path – wave file path


Raises: IOError | OSError | ValueError


Returns: recognition results

Return type:


property recognising

Whether the engine is currently recognising speech.

To stop recognition, use disconnect().

Return type: bool


property recognition_paused

Whether the engine is waiting for the wake phrase to be heard or for resume_recognition() to be called.

Return type: bool



resume_recognition()[source]

Resume listening for grammar rules and key phrases.

set_exclusiveness(grammar, exclusive)[source]

Set the exclusiveness of a grammar.

set_keyphrase(keyphrase, threshold, func)[source]

Add a keyphrase to listen for.

Key phrases take precedence over grammars as they are processed first. They cannot be set for specific contexts (yet).

  • keyphrase (str) – keyphrase to add.

  • threshold (float) – keyphrase threshold value to use.

  • func (callable) – function or method to call when the keyphrase is heard.
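To illustrate the precedence described above, here is a much-simplified model of keyphrase dispatch. This is not the engine’s actual implementation; real matching goes through Pocket Sphinx keyphrase searches using the given thresholds:

```python
# Simplified model of keyphrase dispatch: registered phrases are checked
# before any grammar processing. The threshold is stored but unused in
# this sketch; the real engine passes it to the decoder.
_keyphrases = {}

def set_keyphrase(keyphrase, threshold, func):
    _keyphrases[keyphrase] = (threshold, func)

def unset_keyphrase(keyphrase):
    _keyphrases.pop(keyphrase, None)

def dispatch(speech):
    # Key phrases take precedence over grammars.
    entry = _keyphrases.get(speech)
    if entry is not None:
        entry[1]()       # call the registered function
        return True      # handled; grammar processing is skipped
    return False         # fall through to grammar processing

heard = []
set_keyphrase("go to sleep", 1e-40, lambda: heard.append("sleep"))
```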




speak(text)[source]

Speak the given text using text-to-speech.


start_training_session()[source]

Start the training session. This will stop recognition processing until either end_training_session() is called or the end training keyphrase is heard.

property training_session_active

Whether a training session is in progress.

Return type: bool



unset_keyphrase(keyphrase)[source]

Remove a set keyphrase so that the engine no longer listens for it.


keyphrase (str) – keyphrase to remove.

write_transcript_files(fileids_path, transcription_path)[source]

Write .fileids and .transcription files for the wave and text files in the training data directory, saving them to the specified file paths.

This method will raise an error if the TRAINING_DATA_DIR configuration option is not set to an existing directory.

  • fileids_path (str) – path to .fileids file to create.

  • transcription_path (str) – path to .transcription file to create.


Raises: IOError | OSError
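For reference, the output files follow the formats that the CMU Sphinx acoustic model adaptation tools expect. The sketch below is illustrative only (the helper name and file paths are made up, and it is not the engine’s implementation):

```python
import os

def write_transcripts(wav_txt_pairs, fileids_path, transcription_path):
    # Produce the .fileids and .transcription files used by the Sphinx
    # adaptation tools: one file ID per line in the .fileids file, and
    # one "<s> words </s> (file_id)" line per recording in the
    # .transcription file.
    with open(fileids_path, "w") as f_ids, \
            open(transcription_path, "w") as f_tr:
        for wav_path, text in wav_txt_pairs:
            file_id = os.path.splitext(os.path.basename(wav_path))[0]
            f_ids.write(file_id + "\n")
            f_tr.write("<s> %s </s> (%s)\n" % (text, file_id))
```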

Multiplexing interface for the CMU Pocket Sphinx engine

class SphinxTimerManager(interval, engine)[source]

Timer manager for the CMU Pocket Sphinx engine.

This class allows running timer functions if the engine is currently processing audio via one of three engine processing methods:

  • process_buffer()

  • process_wave_file()

  • recognise_forever()

Timer functions will run whether or not recognition is paused (i.e. in sleep mode).

Note: long-running timers will block dragonfly from processing what was said, so be careful with how you use them! Audio frames will not normally be dropped because of timers, long-running or otherwise.

Normal threads can be used instead of timers if desirable. This is because the main recognition loop is done in Python rather than in C/C++ code, so there are no unusual multi-threading limitations.

Improving Speech Recognition Accuracy

CMU Pocket Sphinx can have some trouble recognising what was said accurately. To remedy this, you may need to adapt the acoustic model that Pocket Sphinx is using. This is similar to how Dragon sometimes requires training. The CMU Sphinx adaptation tutorial covers this topic. There is also a YouTube video on model adaptation.

Adapting your model may not be necessary; there might be other issues with your setup. There is more information on tuning the recognition accuracy in the CMU Sphinx tuning tutorial.

The engine can record what you say into .wav and .txt files if the TRAINING_DATA_DIR configuration option mentioned above is set to an existing directory. To get files compatible with the Sphinx acoustic model adaptation process, you can use the write_transcript_files() engine method.

Mismatched words may use the engine decoder’s default search, typically a language model search.

There are built-in key phrases for starting and ending training sessions where no grammar rule processing will occur. Key phrases will still be processed. See the START_TRAINING_PHRASE and END_TRAINING_PHRASE engine configuration options. One use case for the training mode is training potentially destructive commands or commands that take a long time to execute their actions.

To use the training files, you will need to correct any incorrect phrases in the .transcription or .txt files. You can then use the SphinxTrainingHelper bash script to adapt your model. This script makes the process considerably easier, although you may still encounter problems. You should be able to play the wave files using most media players (e.g. VLC, Windows Media Player, aplay) if you need to.

You will want to remove the training files after a successful adaptation. For the moment, this must be done manually.


Limitations

This engine has a few limitations, most notably with spoken language support and dragonfly’s Dictation functionality.


Dictation

Mixing free-form dictation with grammar rules is difficult with the CMU Sphinx decoders. It is either dictation or grammar rules, not both. For this reason, Dragonfly’s CMU Pocket Sphinx SR engine supports speaking free-form dictation, but only on its own.

Parts of rules that have required combinations with Dictation and other basic Dragonfly elements such as Literal, RuleRef and ListRef will not be recognised properly using this SR engine via speaking. They can, however, be recognised via the engine.mimic() method, the Mimic action or the Playback action.


This engine’s previous support for Dictation elements using utterance breaks has been removed because it did not work reliably.

Unknown words

CMU Pocket Sphinx uses pronunciation dictionaries to lookup phonetic representations for words in grammars, language models and key phrases in order to recognise them. If you use words in your grammars and/or key phrases that are not in the dictionary, a message similar to the following will be printed:

grammar ‘name’ used words not found in the pronunciation dictionary: notaword

If you get a message like this, try changing the words in your grammars/key phrases by splitting them up or using similar words, e.g. changing “natlink” to “nat link”.

I hope to eventually have words and phoneme strings dynamically added to the current dictionary and language model using the Pocket Sphinx ps_add_word function (from Python of course).

Spoken Language Support

There are only a handful of languages with models and dictionaries available from SourceForge, although it is possible to build your own language model using lmtool or your own pronunciation dictionary using lextool. There is also a CMU Sphinx tutorial on building language models.

If the language you want to use requires non-ascii characters (e.g. a Cyrillic language), you will need to use Python version 3.4 or higher because of Unicode issues.

Dragonfly Lists and DictLists

Dragonfly Lists and DictLists function as normal, private rules for the Pocket Sphinx engine. On updating a dragonfly list or dictionary, the grammar they are part of will be reloaded. This is because there is unfortunately no JSGF equivalent for lists.


Text-to-speech

This isn’t a limitation of CMU Pocket Sphinx itself; text-to-speech is not a goal of that project. However, as the natlink and WSR engines both support text-to-speech, here are some suggestions if this functionality is desired, perhaps utilised by a custom dragonfly action.

The Jasper project contains a number of Python interface classes to popular open source text-to-speech software such as eSpeak, Festival and CMU Flite.