CMU Pocket Sphinx engine back-end¶
This version of dragonfly contains an engine implementation using the open source, cross-platform CMU Pocket Sphinx speech recognition engine. You can read more about the CMU Sphinx speech recognition projects on the CMU Sphinx wiki.
There are three Pocket Sphinx engine dependencies:
You can install these by running the following command:
pip install 'dragonfly2[sphinx]'
If you are installing to develop dragonfly, use the following instead:
pip install -e '.[sphinx]'
Once the dependencies are installed, you’ll need to copy the dragonfly/examples/sphinx_module_loader.py script into the folder with your grammar modules and run it using:
This file is the equivalent of the ‘core’ directory that NatLink uses to load grammar modules. When run, it will scan the directory it’s in for files beginning with _ and ending with .py, then try to load them as grammar modules.
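The file discovery step described above can be sketched with the standard library alone. This is not the loader script's actual code; `find_grammar_modules` is a hypothetical helper that only illustrates the naming rule (leading underscore, `.py` extension).

```python
import os


def find_grammar_modules(directory):
    """Return sorted paths of files in *directory* named like ``_*.py``."""
    names = [
        name for name in os.listdir(directory)
        if name.startswith("_") and name.endswith(".py")
    ]
    return [os.path.join(directory, name) for name in sorted(names)]
```

Files that do not match the rule, such as `helpers.py` or `_notes.txt`, are simply ignored by this sketch.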
Pocket Sphinx runs on most platforms, including architectures other than x86, so the Pocket Sphinx dragonfly engine implementation should also work on non-Windows platforms such as macOS and Linux distributions. To this end, I’ve mocked Windows-only functionality on non-Windows platforms for the time being so that the engine components work correctly regardless of the platform.
Using dragonfly with a non-Windows operating system can already be done with Aenea using the existing NatLink engine. Aenea communicates with a separate Windows system running NatLink and DNS over a network connection and has server support for Linux (using X11), macOS, and Windows.
This engine can be configured by changing the engine configuration.
You can make changes to the engine.config object directly in your sphinx_module_loader.py file before connect() is called, or create a config.py module in the same directory.
The LANGUAGE option specifies the engine’s user language. This is English ("en") by default.
The audio configuration options are used to record from the microphone, to validate input wave files and to write wave files if the training data directory is set.
These options must match the requirements for the acoustic model being used. The default values match the requirements for the 16kHz CMU US English models.
CHANNELS– number of audio input channels (default:
SAMPLE_WIDTH– sample width for audio input in bytes (default:
RATE– sample rate for audio input in Hz (default:
FRAMES_PER_BUFFER– frames per recorded audio buffer (default:
The following configuration options control the engine’s built-in keyphrases:
WAKE_PHRASE– the keyphrase to listen for when in sleep mode (default:
WAKE_PHRASE_THRESHOLD– threshold value* for the wake keyphrase (default:
SLEEP_PHRASE– the keyphrase to listen for to enter sleep mode (default: "go to sleep").
SLEEP_PHRASE_THRESHOLD– threshold value* for the sleep keyphrase (default:
START_ASLEEP– boolean value for whether the engine should start in a sleep state (default:
START_TRAINING_PHRASE– keyphrase to listen for to start a training session where no processing occurs (default:
"start training session").
START_TRAINING_PHRASE_THRESHOLD– threshold value* for the start training keyphrase (default:
END_TRAINING_PHRASE– keyphrase to listen for to end a training session if one is in progress (default:
"end training session").
END_TRAINING_PHRASE_THRESHOLD– threshold value* for the end training keyphrase (default:
* Threshold values need to be set for each keyphrase. The CMU Sphinx LM tutorial has some advice on keyphrase threshold values.
If your language isn’t set to English, any built-in keyphrase that is not specified in your configuration will be disabled by default.
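As a rough illustration of the options above, a config.py module could look like the following. The wake phrase and both threshold values are placeholders chosen for this sketch, not the engine's defaults, and it assumes that options left out of the module fall back to their defaults.

```python
# config.py: hypothetical example, placed next to the module loader.

# Engine user language ("en" is the documented default).
LANGUAGE = "en"

# Start awake and listen for the sleep keyphrase.
START_ASLEEP = False
SLEEP_PHRASE = "go to sleep"      # documented default phrase
SLEEP_PHRASE_THRESHOLD = 1e-40    # illustrative value only

# Keyphrase used to leave sleep mode.
WAKE_PHRASE = "wake up"           # illustrative phrase
WAKE_PHRASE_THRESHOLD = 1e-20     # illustrative value only
```

Thresholds usually need tuning per phrase and per microphone, so treat the numbers here as a starting point for experimentation only.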
Any keyphrase can be disabled by setting the phrase and threshold values to
The DECODER_CONFIG object initialised in the engine config module can be used to set various Pocket Sphinx decoder options.
The following is the default decoder configuration:
import os

from sphinxwrapper import DefaultConfig

# Configuration for the Pocket Sphinx decoder.
DECODER_CONFIG = DefaultConfig()

# Silence the decoder output by default.
DECODER_CONFIG.set_string("-logfn", os.devnull)

# Set voice activity detection configuration options for the decoder.
# You may wish to experiment with these if noise in the background
# triggers speech start and/or false recognitions (e.g. of short words)
# frequently.
# Descriptions for VAD configuration options were retrieved from:
# https://cmusphinx.github.io/doc/sphinxbase/fe_8h_source.html

# Number of silence frames to keep after from speech to silence.
DECODER_CONFIG.set_int("-vad_postspeech", 30)

# Number of speech frames to keep before silence to speech.
DECODER_CONFIG.set_int("-vad_prespeech", 20)

# Number of speech frames to trigger vad from silence to speech.
DECODER_CONFIG.set_int("-vad_startspeech", 10)

# Threshold for decision between noise and silence frames.
# Log-ratio between signal level and noise level.
DECODER_CONFIG.set_float("-vad_threshold", 3.0)
There does not appear to be much documentation on these options outside of the pocketsphinx/cmdln_macro.h and sphinxbase/fe.h header files. If this is incorrect or has changed, feel free to suggest an edit.
The easiest way of seeing the available decoder options as well as their default values is to run the pocketsphinx_continuous command with no arguments.
Changing Models and Dictionaries¶
The DECODER_CONFIG object can be used to configure the pronunciation dictionary as well as the acoustic and language models. You can do this with something like the following:
DECODER_CONFIG.set_string('-hmm', '/path/to/acoustic-model-folder')
DECODER_CONFIG.set_string('-lm', '/path/to/lm-file.lm')
DECODER_CONFIG.set_string('-dict', '/path/to/dictionary-file.dict')
The language model, acoustic model and pronunciation dictionary should all use the same language or language variant. See the CMU Sphinx wiki for a more detailed explanation of these components.
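A mistyped path is an easy mistake to make here, so it can help to check the three paths before handing them to the decoder configuration. The sketch below assumes only that the configuration object has the set_string() method used above; `apply_model_paths` is a hypothetical helper, not part of dragonfly or sphinxwrapper.

```python
import os


def apply_model_paths(decoder_config, hmm_dir, lm_file, dict_file):
    """Check that the model paths exist, then set them on *decoder_config*."""
    # The acoustic model is a folder; the other two are regular files.
    if not os.path.isdir(hmm_dir):
        raise IOError("acoustic model folder not found: %s" % hmm_dir)
    for path in (lm_file, dict_file):
        if not os.path.isfile(path):
            raise IOError("file not found: %s" % path)

    decoder_config.set_string("-hmm", hmm_dir)
    decoder_config.set_string("-lm", lm_file)
    decoder_config.set_string("-dict", dict_file)
```

In a config module this would be called as `apply_model_paths(DECODER_CONFIG, ...)` with your own paths.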
The engine can save .wav and .txt training files into a directory for later use. The following are the configuration options associated with this functionality:
TRAINING_DATA_DIR– directory to save training files into (default:
TRANSCRIPT_NAME– common name of files saved into the training data directory (default:
Set TRAINING_DATA_DIR to a valid directory path to enable recording of .txt and .wav files. If the path is a relative path, it will be interpreted as relative to the module loader’s directory.
The engine will not attempt to make the directory for you as it did in previous versions of dragonfly.
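Because the directory has to exist already, one option is to create it yourself before the engine starts. The helper below is a hypothetical sketch using only the standard library; it mirrors the rule that relative paths are resolved against the module loader's directory.

```python
import os


def ensure_training_dir(path, loader_dir):
    """Create the training data directory if needed and return its full path."""
    # Relative paths are resolved against the module loader's directory.
    if not os.path.isabs(path):
        path = os.path.join(loader_dir, path)
    if not os.path.isdir(path):
        os.makedirs(path)
    return path
```

The returned path could then be assigned to TRAINING_DATA_DIR in your configuration.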
Speech recognition engine back-end for CMU Pocket Sphinx.
If a recognition was in progress, cancel it before processing the next audio buffer.
Check if a word is in the current Sphinx pronunciation dictionary.
Return type: bool
Python module/object containing engine configuration.
Returns: config module/object
Set up the CMU Pocket Sphinx decoder.
This method does nothing if the engine is already connected.
create_timer(callback, interval, repeating=True)¶
Create and return a timer using the specified callback and repeat interval.
Note: Timers will not run unless the engine is recognising audio. Normal threads can be used instead with no downsides.
The last hypothesis object of the default search.
This does not currently reach recognition observers because it is intended to be used for dictation results, which are currently disabled. Nevertheless this object can be useful sometimes.
Returns: Sphinx Hypothesis object | None
Deallocate the CMU Sphinx decoder and any other resources used by it.
This method effectively unloads all loaded grammars and key phrases.
End the training if one is in progress. This will allow recognition processing once again.
Mimic a recognition of the given words.
Mimic a recognition of the given phrases.
This method accepts variable phrases instead of a list of words.
Pause recognition and wait for resume_recognition() to be called or for the wake keyphrase to be spoken.
Recognise speech from an audio buffer.
This method is meant to be called in sequence for multiple audio buffers. It will do nothing if connect() hasn’t been called.
Parameters: buf (str) – audio buffer
Recognise speech from a wave file and return the recognition results.
This method checks that the wave file is valid. It raises an error if the file doesn’t exist, if it can’t be read or if the WAV header values do not match those in the engine configuration.
If recognition is paused (sleep mode), this method will call
The wave file must use the same sample width, sample rate and number of channels that the acoustic model uses.
If the file is valid, process_buffer() is then used to process the audio.
Multiple utterances are supported.
Parameters: path – wave file path
Raises: IOError | OSError | ValueError
Returns: recognition results
Return type: generator
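The header check and buffer-by-buffer processing described above can be approximated with Python's wave module. This is a sketch of the general idea, not the engine's implementation; `read_wave_buffers` is a hypothetical name, and its parameters stand in for the CHANNELS, SAMPLE_WIDTH, RATE and FRAMES_PER_BUFFER configuration options.

```python
import wave


def read_wave_buffers(path, channels, sample_width, rate, frames_per_buffer):
    """Validate a wave file's header, then yield its audio in buffers.

    This is a generator, so the file is opened and checked on the first
    iteration rather than when the function is called.
    """
    wf = wave.open(path, "rb")
    try:
        actual = (wf.getnchannels(), wf.getsampwidth(), wf.getframerate())
        expected = (channels, sample_width, rate)
        if actual != expected:
            raise ValueError(
                "wave header %r does not match configuration %r"
                % (actual, expected))

        # Read fixed-size chunks until the file is exhausted.
        while True:
            buf = wf.readframes(frames_per_buffer)
            if not buf:
                break
            yield buf
    finally:
        wf.close()
```

Each yielded buffer could then be passed to process_buffer() in turn, which is roughly how a wave file ends up being handled as a sequence of audio buffers.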
Whether the engine is currently recognising speech.
To stop recognition, use
Return type: bool
Whether the engine is waiting for the wake phrase to be heard or for resume_recognition() to be called.
Return type: bool
Resume listening for grammar rules and key phrases.
Set the exclusiveness of a grammar.
set_keyphrase(keyphrase, threshold, func)¶
Add a keyphrase to listen for.
Key phrases take precedence over grammars as they are processed first. They cannot be set for specific contexts (yet).
- keyphrase (str) – keyphrase to add.
- threshold (float) – keyphrase threshold value to use.
- func (callable) – function or method to call when the keyphrase is heard.
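A small sketch of registering several keyphrases at once follows. It relies only on the set_keyphrase(keyphrase, threshold, func) signature documented above; the `register_keyphrases` helper, the phrase and the threshold are all hypothetical, and the engine object is assumed to be a connected Pocket Sphinx engine instance.

```python
def register_keyphrases(engine, keyphrases):
    """Call engine.set_keyphrase() for each (phrase, threshold, func) entry."""
    for phrase, threshold, func in keyphrases:
        engine.set_keyphrase(phrase, threshold, func)


def on_hello():
    """Example callback run when the keyphrase is heard."""
    print("hello keyphrase heard")


# Hypothetical phrase and threshold; tune the threshold for your setup.
KEYPHRASES = [("hello computer", 1e-20, on_hello)]
```

Keeping the phrases in one list makes it easier to adjust thresholds in a single place while experimenting.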
Speak the given text using text-to-speech.
Start the training session. This will stop recognition processing until either end_training_session() is called or the end training keyphrase is heard.
Whether a training session is in progress.
Return type: bool
Remove a set keyphrase so that the engine no longer listens for it.
Parameters: keyphrase (str) – keyphrase to remove.
Write .fileids and .transcription files for the files in the training data directory to the specified file paths.
This method will raise an error if the TRAINING_DATA_DIR configuration option is not set to an existing directory.
- fileids_path (str) – path to .fileids file to create.
- transcription_path (str) – path to .transcription file to create.
Raises: IOError | OSError
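For readers who want to see what such files contain, here is a standalone sketch that produces them in the layout used by the CMU Sphinx adaption tools: one file ID per line, and one `<s> text </s> (file_id)` line per utterance. It is not the engine's implementation. `build_transcript_files` is a hypothetical name, and the sketch assumes each recording `NAME.wav` has its transcript in `NAME.txt` in the same directory.

```python
import io
import os


def build_transcript_files(training_dir, fileids_path, transcription_path):
    """Write .fileids and .transcription files for wav/txt pairs."""
    if not os.path.isdir(training_dir):
        raise IOError("not an existing directory: %s" % training_dir)

    file_ids = []
    lines = []
    for name in sorted(os.listdir(training_dir)):
        base, ext = os.path.splitext(name)
        txt_path = os.path.join(training_dir, base + ".txt")
        # Skip anything that is not a .wav file with a matching .txt file.
        if ext.lower() != ".wav" or not os.path.isfile(txt_path):
            continue
        with io.open(txt_path, encoding="utf-8") as f:
            text = " ".join(f.read().split())
        file_ids.append(base)
        lines.append(u"<s> %s </s> (%s)" % (text, base))

    with io.open(fileids_path, "w", encoding="utf-8") as f:
        f.write(u"".join(file_id + u"\n" for file_id in file_ids))
    with io.open(transcription_path, "w", encoding="utf-8") as f:
        f.write(u"".join(line + u"\n" for line in lines))
    return file_ids
```

Recordings without a matching transcript are left out of both files, so the two outputs always describe the same set of utterances.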
Multiplexing interface for the CMU Pocket Sphinx engine¶
Timer manager for the CMU Pocket Sphinx engine.
This class allows running timer functions if the engine is currently processing audio via one of three engine processing methods:
Timer functions will run whether or not recognition is paused (i.e. in sleep mode).
Note: long-running timers will block dragonfly from processing what was said, so be careful with how you use them! Audio frames will not normally be dropped because of timers, long-running or otherwise.
Normal threads can be used instead of timers if desirable. This is because the main recognition loop is done in Python rather than in C/C++ code, so there are no unusual multi-threading limitations.
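As a minimal sketch of the "normal threads" alternative, the class below runs a callback at a fixed interval on its own thread, independently of audio processing. `RepeatingThread` is a hypothetical helper built on the standard threading module; it is not part of dragonfly.

```python
import threading


class RepeatingThread(threading.Thread):
    """Call *callback* every *interval* seconds until stop() is called."""

    def __init__(self, callback, interval):
        super().__init__()
        self.daemon = True  # do not keep the process alive on exit
        self._callback = callback
        self._interval = interval
        self._stopped = threading.Event()

    def run(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._stopped.wait(self._interval):
            self._callback()

    def stop(self):
        self._stopped.set()
```

Unlike engine timers, a thread like this keeps running when the engine is not recognising audio, and a slow callback only delays its own next run.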
Improving Speech Recognition Accuracy¶
CMU Pocket Sphinx can have some trouble recognising what was said accurately. To remedy this, you may need to adapt the acoustic model that Pocket Sphinx is using. This is similar to how Dragon sometimes requires training. The CMU Sphinx adaption tutorial covers this topic. There is also a YouTube video on model adaption.
Adapting your model may not be necessary; there might be other issues with your setup. There is more information on tuning the recognition accuracy in the CMU Sphinx tuning tutorial.
The engine can record what you say into .wav and .txt files if the
TRAINING_DATA_DIR configuration option mentioned above is set to an
existing directory. To get files compatible with the Sphinx acoustic model
adaption process, you can use the
Mismatched words may use the engine decoder’s default search, typically a language model search.
There are built-in key phrases for starting and ending training sessions
where no grammar rule processing will occur. Key phrases will still be
processed. See the
engine configuration options. One use case for the training mode is training
potentially destructive commands or commands that take a long time to
execute their actions.
To use the training files, you will need to correct any incorrect phrases in the .transcription or .txt files. You can then use the SphinxTrainingHelper bash script to adapt your model. This script makes the process considerably easier, although you may still encounter problems. You should be able to play the wave files using most media players (e.g. VLC, Windows Media Player, aplay) if you need to.
You will want to remove the training files after a successful adaption. This must be done manually for the moment.
This engine has a few limitations, most notably with spoken language support
Mixing free-form dictation with grammar rules is difficult with the CMU Sphinx decoders. It is either dictation or grammar rules, not both. For this reason, Dragonfly’s CMU Pocket Sphinx SR engine supports speaking free-form dictation, but only on its own.
Parts of rules that have required combinations with
other basic Dragonfly elements such as
ListRef will not be recognised properly using this SR engine via
speaking. They can, however, be recognised via the
Mimic action or the
This engine’s previous
Dictation element support using utterance
breaks has been removed because it didn’t really work very well.
CMU Pocket Sphinx uses pronunciation dictionaries to look up phonetic representations for words in grammars, language models and key phrases in order to recognise them. If you use words in your grammars and/or key phrases that are not in the dictionary, a message similar to the following will be printed:
grammar ‘name’ used words not found in the pronunciation dictionary: notaword
If you get a message like this, try changing the words in your grammars/key phrases by splitting up the words or using similar words, e.g. changing “natlink” to “nat link”.
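You can also check your words against the dictionary file ahead of time. The sketch below assumes a plain-text CMU-style dictionary with one entry per line, the word first and its phonemes after it, and alternative pronunciations written as `word(2)`. Both function names are hypothetical.

```python
import io
import re


def load_dictionary_words(dict_path):
    """Return the set of words defined in a CMU-style .dict file."""
    words = set()
    with io.open(dict_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            # Skip blank lines and ";;;" comment lines.
            if not parts or line.startswith(";;;"):
                continue
            # Alternative pronunciations are listed as "word(2)", "word(3)".
            words.add(re.sub(r"\(\d+\)$", "", parts[0]).lower())
    return words


def find_missing_words(phrases, dictionary_words):
    """Return words used in *phrases* that are not in the dictionary."""
    missing = []
    for phrase in phrases:
        for word in phrase.lower().split():
            if word not in dictionary_words and word not in missing:
                missing.append(word)
    return missing
```

With a dictionary that contains "nat" and "link" but not "natlink", checking the phrase "natlink" reports it as missing, while "nat link" passes.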
I hope to eventually have words and phoneme strings dynamically added to the current dictionary and language model using the Pocket Sphinx ps_add_word function (from Python of course).
Spoken Language Support¶
There are only a handful of languages with models and dictionaries available from SourceForge, although it is possible to build your own language model using lmtool or pronunciation dictionary using lextool. There is also a CMU Sphinx tutorial on building language models.
If the language you want to use requires non-ASCII characters (e.g. a Cyrillic language), you will need to use Python version 3.4 or higher because of Unicode issues.
Dragonfly Lists and DictLists¶
Dragonfly Lists and DictLists function as normal, private rules for the Pocket Sphinx engine. On updating a dragonfly list or dictionary, the grammar it is part of will be reloaded. This is because there is unfortunately no JSGF equivalent for lists.
This isn’t a limitation of CMU Pocket Sphinx itself; text-to-speech is not a goal of the CMU Sphinx project. However, as the natlink and WSR engines both support text-to-speech, it is worth offering some suggestions for those who want this functionality, perhaps used via a custom dragonfly action.