CMU Pocket Sphinx engine back-end

This version of dragonfly contains an engine implementation using the open source, cross-platform CMU Pocket Sphinx speech recognition engine. You can read more about the CMU Sphinx speech recognition projects on the CMU Sphinx wiki.

Setup

There are three Pocket Sphinx engine dependencies. You can install them by running the following command:

pip install 'dragonfly2[sphinx]'

If you are installing to develop dragonfly, use the following instead:

pip install -e '.[sphinx]'

Once the dependencies are installed, you’ll need to copy the dragonfly/examples/sphinx_module_loader.py script into the folder with your grammar modules and run it using:

python sphinx_module_loader.py

This file is the equivalent of the ‘core’ directory that NatLink uses to load grammar modules. When run, it will scan the directory it’s in for files beginning with _ and ending with .py, then try to load them as command-modules.
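
For illustration, the loader’s behaviour can be approximated with something like the following. This is a simplified sketch, not the actual script; the real loader handles errors and module unloading properly.

import glob
import os.path

from dragonfly import get_engine

# Set up the CMU Pocket Sphinx engine.
engine = get_engine("sphinx")
engine.connect()

# Load each file beginning with _ and ending with .py in this directory
# as a command module.
directory = os.path.dirname(os.path.abspath(__file__))
for path in sorted(glob.glob(os.path.join(directory, "_*.py"))):
    with open(path) as f:
        exec(compile(f.read(), path, "exec"))

# Recognise speech from the microphone until interrupted.
engine.recognise_forever()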

Cross-platform Engine

Pocket Sphinx runs on most platforms, including architectures other than x86, so it only makes sense that the Pocket Sphinx dragonfly engine implementation should work on non-Windows platforms such as macOS as well as on Linux distributions. To that end, I’ve made an effort to mock Windows-only functionality on non-Windows platforms for the time being, so that the engine components work correctly regardless of the platform.

Using dragonfly with a non-Windows operating system can already be done with Aenea using the existing NatLink engine. Aenea communicates with a separate Windows system running NatLink and DNS over a network connection and has server support for Linux (using X11), macOS, and Windows.

Engine configuration

This engine can be configured by changing the engine configuration.

You can make changes to the engine.config object directly in your sphinx_module_loader.py file before connect() is called, or create a config.py module in the same directory.

The LANGUAGE option specifies the engine’s user language. This is English ("en") by default.
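
For example, a minimal sketch of setting the language in the loader file before connecting (the value shown is just the default):

from dragonfly import get_engine

engine = get_engine("sphinx")

# Change the engine's user language before connecting.
engine.config.LANGUAGE = "en"

engine.connect()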

Audio configuration

Audio configuration options are used to record from the microphone and to validate input wave files.

These options must match the requirements for the acoustic model being used. The default values match the requirements for the 16kHz CMU US English models.

  • CHANNELS – number of audio input channels (default: 1).
  • SAMPLE_WIDTH – sample width for audio input in bytes (default: 2).
  • RATE – sample rate for audio input in Hz (default: 16000).
  • FRAMES_PER_BUFFER – frames per recorded audio buffer (default: 2048).
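
As an illustrative sketch, these options could be set in a config.py module placed next to the loader; the values below simply restate the defaults for the 16kHz CMU US English models (a full configuration module may also need to define the other options described on this page, such as LANGUAGE and DECODER_CONFIG):

# config.py -- engine configuration module (illustrative sketch).
# These values must match the acoustic model's requirements.
CHANNELS = 1              # number of audio input channels
SAMPLE_WIDTH = 2          # 16-bit samples (2 bytes)
RATE = 16000              # 16kHz sample rate
FRAMES_PER_BUFFER = 2048  # frames per recorded audio buffer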

Decoder configuration

The DECODER_CONFIG object initialised in the engine config module can be used to set various Pocket Sphinx decoder options.

The following is the default decoder configuration:

import os

from sphinxwrapper import DefaultConfig

# Configuration for the Pocket Sphinx decoder.
DECODER_CONFIG = DefaultConfig()

# Silence the decoder output by default.
DECODER_CONFIG.set_string("-logfn", os.devnull)

# Set voice activity detection configuration options for the decoder.
# You may wish to experiment with these if noise in the background
# triggers speech start and/or false recognitions (e.g. of short words)
# frequently.
# Descriptions for VAD configuration options were retrieved from:
# https://cmusphinx.github.io/doc/sphinxbase/fe_8h_source.html

# Number of silence frames to keep after the transition from speech to silence.
DECODER_CONFIG.set_int("-vad_postspeech", 30)

# Number of speech frames to keep before the transition from silence to speech.
DECODER_CONFIG.set_int("-vad_prespeech", 20)

# Number of speech frames required to trigger VAD from silence to speech.
DECODER_CONFIG.set_int("-vad_startspeech", 10)

# Threshold for decision between noise and silence frames.
# Log-ratio between signal level and noise level.
DECODER_CONFIG.set_float("-vad_threshold", 3.0)

There does not appear to be much documentation on these options outside of the pocketsphinx/cmdln_macro.h and sphinxbase/fe.h header files. If this is incorrect or has changed, feel free to suggest an edit.

The easiest way of seeing the available decoder options as well as their default values is to run the pocketsphinx_continuous command with no arguments.

Changing Models and Dictionaries

The DECODER_CONFIG object can be used to configure the pronunciation dictionary as well as the acoustic and language models. You can do this with something like:

DECODER_CONFIG.set_string('-hmm', '/path/to/acoustic-model-folder')
DECODER_CONFIG.set_string('-lm', '/path/to/lm-file.lm')
DECODER_CONFIG.set_string('-dict', '/path/to/dictionary-file.dict')

The language model, acoustic model and pronunciation dictionary should all use the same language or language variant. See the CMU Sphinx wiki for a more detailed explanation of these components.

Engine API

class SphinxEngine[source]

Speech recognition engine back-end for CMU Pocket Sphinx.

DictationContainer

alias of dragonfly.engines.base.dictation.DictationContainerBase

cancel_recognition()[source]

If a recognition was in progress, cancel it before processing the next audio buffer.

check_valid_word(word)[source]

Check if a word is in the current Sphinx pronunciation dictionary.

Return type: bool
config

Python module/object containing engine configuration.

You will need to restart the engine with disconnect() and connect() if the configuration has been changed after connect() has been called.

Returns: config module/object
connect()[source]

Set up the CMU Pocket Sphinx decoder.

This method does nothing if the engine is already connected.

create_timer(callback, interval, repeating=True)[source]

Create and return a timer using the specified callback and repeat interval.

Note: Timers will not run unless the engine is recognising audio. Normal threads can be used instead with no downsides.
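
For instance, a rough sketch of a repeating timer (the callback here is hypothetical):

from dragonfly import get_engine

engine = get_engine("sphinx")

def print_tick():
    # Called roughly every two seconds, but only while the engine is
    # recognising audio (see the note above).
    print("tick")

timer = engine.create_timer(print_tick, 2)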

default_search_result

The last hypothesis object of the default search.

This does not currently reach recognition observers because it is intended to be used for dictation results, which are currently disabled. Nevertheless, this object can sometimes be useful.

Returns: Sphinx Hypothesis object | None
disconnect()[source]

Deallocate the CMU Sphinx decoder and any other resources used by it.

This method effectively unloads all loaded grammars and key phrases.

mimic(words)[source]

Mimic a recognition of the given words.

mimic_phrases(*phrases)[source]

Mimic a recognition of the given phrases.

This method accepts a variable number of phrases instead of a list of words.
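
For example, assuming engine is the SphinxEngine instance returned by get_engine("sphinx") and each phrase is given as a string of words (the phrases themselves are hypothetical):

# Mimic a single recognition from a list of words.
engine.mimic(["say", "hello"])

# Mimic several recognitions, one phrase per argument.
engine.mimic_phrases("say hello", "say goodbye")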

process_buffer(buf)[source]

Recognise speech from an audio buffer.

This method is meant to be called in sequence for multiple audio buffers. It will do nothing if connect() hasn’t been called.

Parameters: buf (str) – audio buffer
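
A rough sketch of feeding microphone audio to the engine with PyAudio; the stream settings here mirror the default audio configuration described above:

import pyaudio

from dragonfly import get_engine

engine = get_engine("sphinx")
engine.connect()

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16,  # 2-byte samples (SAMPLE_WIDTH)
                channels=1,              # CHANNELS
                rate=16000,              # RATE
                frames_per_buffer=2048,  # FRAMES_PER_BUFFER
                input=True)
try:
    while True:
        # Read one buffer from the microphone and process it.
        engine.process_buffer(stream.read(2048))
except KeyboardInterrupt:
    pass
finally:
    stream.close()
    p.terminate()
    engine.disconnect()
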
process_wave_file(path)[source]

Recognise speech from a wave file and return the recognition results.

This method checks that the wave file is valid. It raises an error if the file doesn’t exist, if it can’t be read or if the WAV header values do not match those in the engine configuration.

The wave file must use the same sample width, sample rate and number of channels that the acoustic model uses.

If the file is valid, process_buffer() is then used to process the audio.

Multiple utterances are supported.

Parameters: path – wave file path
Raises: IOError | OSError | ValueError
Returns: recognition results
Return type: generator
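
For example, a minimal sketch of processing a recorded file (the path is hypothetical), assuming engine is a connected SphinxEngine instance:

# Print the recognition result for each utterance in the file.
for result in engine.process_wave_file("commands.wav"):
    print(result)
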
recognising

Whether the engine is currently recognising speech.

To stop recognition, use disconnect().

Return type: bool
set_exclusiveness(grammar, exclusive)[source]

Set the exclusiveness of a grammar.

set_keyphrase(keyphrase, threshold, func)[source]

Add a keyphrase to listen for.

Key phrases take precedence over grammars as they are processed first. They cannot be set for specific contexts (yet).

Parameters:
  • keyphrase (str) – keyphrase to add.
  • threshold (float) – keyphrase threshold value to use.
  • func (callable) – function or method to call when the keyphrase is heard.
Raises: UnknownWordError
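
For example, a rough sketch assuming engine is a connected SphinxEngine instance; the keyphrase, threshold value and handler below are purely illustrative, and the threshold usually needs tuning:

def handle_sleep():
    # Hypothetical handler called whenever the keyphrase is heard.
    print("Going to sleep...")

engine.set_keyphrase("go to sleep", 1e-20, handle_sleep)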

speak(text)[source]

Speak the given text using text-to-speech.

unset_keyphrase(keyphrase)[source]

Remove a set keyphrase so that the engine no longer listens for it.

Parameters: keyphrase (str) – keyphrase to remove.

Multiplexing interface for the CMU Pocket Sphinx engine

class SphinxTimerManager(interval, engine)[source]

Timer manager for the CMU Pocket Sphinx engine.

This class allows running timer functions if the engine is currently processing audio via one of three engine processing methods:

  • process_buffer()
  • process_wave_file()
  • recognise_forever()

Timer functions will run whether or not recognition is paused (i.e. in sleep mode).

Note: long-running timers will block dragonfly from processing what was said, so be careful with how you use them! Audio frames will not normally be dropped because of timers, long-running or otherwise.

Normal threads can be used instead of timers if desirable. This is because the main recognition loop is done in Python rather than in C/C++ code, so there are no unusual multi-threading limitations.

Improving Speech Recognition Accuracy

CMU Pocket Sphinx can have some trouble recognising what was said accurately. To remedy this, you may need to adapt the acoustic model that Pocket Sphinx is using. This is similar to how Dragon sometimes requires training. The CMU Sphinx adaptation tutorial covers this topic. There is also this YouTube video on model adaptation and the SphinxTrainingHelper bash script.

Adapting your model may not be necessary; there might be other issues with your setup. There is more information on tuning the recognition accuracy in the CMU Sphinx tuning tutorial.

Limitations

This engine has a few limitations, most notably with spoken language support and dragonfly’s Dictation functionality.

Dictation

Mixing free-form dictation with grammar rules is difficult with the CMU Sphinx decoders. It is either dictation or grammar rules, not both. For this reason, Dragonfly’s CMU Pocket Sphinx SR engine supports speaking free-form dictation, but only on its own.

Rules that require Dictation in combination with other basic dragonfly elements such as Literal, RuleRef and ListRef will not be recognised properly by this SR engine when spoken. They can, however, be recognised via the engine.mimic() method, the Mimic action or the Playback action.
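
As a rough illustration, a rule with a hypothetical spec such as "insert <text>", combining a Literal part with a Dictation part, will not be recognised by voice with this engine, but mimicking the words still processes the rule:

from dragonfly import Mimic

# Mimic speaking the fixed word "insert" followed by dictated words.
Mimic("insert", "hello", "world").execute()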

Note

This engine’s previous Dictation element support using utterance breaks has been removed because it didn’t really work very well.

Unknown words

CMU Pocket Sphinx uses pronunciation dictionaries to lookup phonetic representations for words in grammars, language models and key phrases in order to recognise them. If you use words in your grammars and/or key phrases that are not in the dictionary, a message similar to the following will be printed:

grammar ‘name’ used words not found in the pronunciation dictionary: notaword

If you get a message like this, try changing the words in your grammars/key phrases by splitting them up or using similar words, e.g. changing “natlink” to “nat link”.
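
You can also check candidate words against the current pronunciation dictionary programmatically, e.g. (assuming engine is the connected SphinxEngine instance):

# Check which spellings are in the pronunciation dictionary.
for word in ("natlink", "nat", "link"):
    print(word, engine.check_valid_word(word))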

I hope to eventually have words and phoneme strings dynamically added to the current dictionary and language model using the Pocket Sphinx ps_add_word function (from Python of course).

Spoken Language Support

There are only a handful of languages with models and dictionaries available from SourceForge, although it is possible to build your own language model using lmtool or your own pronunciation dictionary using lextool. There is also a CMU Sphinx tutorial on building language models.

If the language you want to use requires non-ascii characters (e.g. a Cyrillic language), you will need to use Python version 3.4 or higher because of Unicode issues.

Dragonfly Lists and DictLists

Dragonfly Lists and DictLists function as normal, private rules for the Pocket Sphinx engine. On updating a dragonfly list or dictionary, the grammar they are part of will be reloaded. This is because there is unfortunately no JSGF equivalent for lists.
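
For example, a short sketch of a grammar using a dragonfly List; updating the list afterwards causes this engine to reload the grammar internally (the rule and its action are hypothetical):

from dragonfly import Function, Grammar, List, ListRef, MappingRule

# A dragonfly list referenced by a rule via ListRef.
fruit_list = List("fruit_list")
fruit_list.append("apple")
fruit_list.append("banana")

class FruitRule(MappingRule):
    mapping = {"eat <fruit>": Function(lambda fruit: print(fruit))}
    extras = [ListRef("fruit", fruit_list)]

grammar = Grammar("fruit example")
grammar.add_rule(FruitRule())
grammar.load()

# With the Pocket Sphinx engine, this update reloads the grammar.
fruit_list.append("orange")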

Text-to-speech

This isn’t a limitation of CMU Pocket Sphinx as such; text-to-speech is simply not a goal of that project. However, since the natlink and WSR engines both support text-to-speech, some suggestions are worth giving for those who want this functionality, perhaps utilised by a custom dragonfly action.

The Jasper project contains a number of Python interface classes to popular open source text-to-speech software such as eSpeak, Festival and CMU Flite.
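
As a rough illustration, a custom dragonfly action could shell out to one of these programs. The sketch below assumes the espeak command-line tool is installed and on the PATH; the class name is hypothetical:

import subprocess

from dragonfly import ActionBase

class SpeakAction(ActionBase):
    """Hypothetical action that speaks its text using the eSpeak CLI."""

    def __init__(self, text):
        ActionBase.__init__(self)
        self._text = text

    def _execute(self, data=None):
        # Requires eSpeak to be installed and available on the PATH.
        subprocess.call(["espeak", self._text])

# Usage: SpeakAction("hello world").execute()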