CMU Pocket Sphinx engine back-end¶
This version of dragonfly contains an engine implementation using the open source, cross-platform CMU Pocket Sphinx speech recognition engine. You can read more about the CMU Sphinx speech recognition projects on the CMU Sphinx wiki.
There are three Pocket Sphinx engine dependencies:
You can install these by running the following command:
pip install 'dragonfly2[sphinx]'
If you are installing to develop dragonfly, use the following instead:
pip install -e '.[sphinx]'
Once the dependencies are installed, you’ll need to copy the dragonfly/examples/sphinx_module_loader.py script into the folder with your grammar modules and run it using:
This file is the equivalent of the ‘core’ directory that NatLink uses to
load grammar modules. When run, it will scan the directory it’s in for files
beginning with _ and ending with .py, then try to load them as grammar modules.
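The directory scan described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the loader script's actual code. The function name find_grammar_modules is hypothetical, and the leading-underscore pattern is an assumption about what the loader matches.

```python
import os


def find_grammar_modules(directory):
    """Return sorted paths of files named like '_*.py' in `directory`.

    Hypothetical helper for illustration; not part of dragonfly.
    """
    paths = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Keep regular files that start with an underscore and end in .py.
        if (os.path.isfile(path) and name.startswith("_")
                and name.endswith(".py")):
            paths.append(path)
    return sorted(paths)
```

A loader would then import each returned path in turn and report any module that fails to load, rather than stopping at the first error.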
Pocket Sphinx runs on most platforms, including architectures other than x86, so it makes sense for the Pocket Sphinx dragonfly engine implementation to work on non-Windows platforms such as macOS and Linux distributions. To this end, I’ve made an effort to mock Windows-only functionality on non-Windows platforms for the time being, so that the engine components work correctly regardless of the platform.
Using dragonfly with a non-Windows operating system can already be done with Aenea using the existing NatLink engine. Aenea communicates with a separate Windows system running NatLink and DNS over a network connection and has server support for Linux (using X11), macOS, and Windows.
This engine can be configured by changing the engine configuration.
You can make changes to the
engine.config object directly in your
sphinx_module_loader.py file before
connect() is called, or create a
config.py module in the same directory.
The LANGUAGE option specifies the engine’s user language. This is
English ("en") by default.
Audio configuration options are used to record from the microphone and to validate input wave files.
These options must match the requirements for the acoustic model being used. The default values match the requirements for the 16kHz CMU US English models.
CHANNELS– number of audio input channels (default:
SAMPLE_WIDTH– sample width for audio input in bytes (default:
RATE– sample rate for audio input in Hz (default:
FRAMES_PER_BUFFER– frames per recorded audio buffer (default:
The DECODER_CONFIG object initialised in the engine config module can be
used to set various Pocket Sphinx decoder options.
The following is the default decoder configuration:
    import os
    from sphinxwrapper import DefaultConfig

    # Configuration for the Pocket Sphinx decoder.
    DECODER_CONFIG = DefaultConfig()

    # Silence the decoder output by default.
    DECODER_CONFIG.set_string("-logfn", os.devnull)

    # Set voice activity detection configuration options for the decoder.
    # You may wish to experiment with these if noise in the background
    # triggers speech start and/or false recognitions (e.g. of short words)
    # frequently.
    # Descriptions for VAD configuration options were retrieved from:
    # https://cmusphinx.github.io/doc/sphinxbase/fe_8h_source.html

    # Number of silence frames to keep after from speech to silence.
    DECODER_CONFIG.set_int("-vad_postspeech", 30)

    # Number of speech frames to keep before silence to speech.
    DECODER_CONFIG.set_int("-vad_prespeech", 20)

    # Number of speech frames to trigger vad from silence to speech.
    DECODER_CONFIG.set_int("-vad_startspeech", 10)

    # Threshold for decision between noise and silence frames.
    # Log-ratio between signal level and noise level.
    DECODER_CONFIG.set_float("-vad_threshold", 3.0)
There does not appear to be much documentation on these options outside of the pocketsphinx/cmdln_macro.h and sphinxbase/fe.h header files. If this is incorrect or has changed, feel free to suggest an edit.
The easiest way to see the available decoder options and their
default values is to run the
pocketsphinx_continuous command with no arguments.
Changing Models and Dictionaries¶
The DECODER_CONFIG object can be used to configure the pronunciation
dictionary as well as the acoustic and language models. You can do this with
the -hmm, -lm and -dict options:
    DECODER_CONFIG.set_string('-hmm', '/path/to/acoustic-model-folder')
    DECODER_CONFIG.set_string('-lm', '/path/to/lm-file.lm')
    DECODER_CONFIG.set_string('-dict', '/path/to/dictionary-file.dict')
The language model, acoustic model and pronunciation dictionary should all use the same language or language variant. See the CMU Sphinx wiki for a more detailed explanation of these components.
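Because a mistyped path may only surface later as a decoder error, it can help to check the paths before setting them. The following is a minimal sketch under that assumption. The helper name apply_model_paths is hypothetical, and it works with any object that has a set_string(name, value) method like the one used on DECODER_CONFIG.

```python
import os


def apply_model_paths(decoder_config, hmm_dir, lm_file, dict_file):
    """Check that the model paths exist, then set them on the config.

    Hypothetical helper: `decoder_config` is any object with a
    set_string(name, value) method.
    """
    # The acoustic model is a folder; the other two are single files.
    if not os.path.isdir(hmm_dir):
        raise ValueError("acoustic model folder not found: %s" % hmm_dir)
    for path in (lm_file, dict_file):
        if not os.path.isfile(path):
            raise ValueError("file not found: %s" % path)

    decoder_config.set_string("-hmm", hmm_dir)
    decoder_config.set_string("-lm", lm_file)
    decoder_config.set_string("-dict", dict_file)
```

This only confirms that the paths exist. It does not verify that the three components use the same language or language variant.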
Speech recognition engine back-end for CMU Pocket Sphinx.
If a recognition was in progress, cancel it before processing the next audio buffer.
Check if a word is in the current Sphinx pronunciation dictionary.
Return type: bool
Python module/object containing engine configuration.
Returns: config module/object
Set up the CMU Pocket Sphinx decoder.
This method does nothing if the engine is already connected.
create_timer(callback, interval, repeating=True)¶
Create and return a timer using the specified callback and repeat interval.
Note: Timers will not run unless the engine is recognising audio. Normal threads can be used instead with no downsides.
The last hypothesis object of the default search.
This does not currently reach recognition observers because it is intended to be used for dictation results, which are currently disabled. Nevertheless this object can be useful sometimes.
Returns: Sphinx Hypothesis object | None
Deallocate the CMU Sphinx decoder and any other resources used by it.
This method effectively unloads all loaded grammars and key phrases.
Mimic a recognition of the given words.
Mimic a recognition of the given phrases.
This method accepts variable phrases instead of a list of words.
Recognise speech from an audio buffer.
This method is meant to be called in sequence for multiple audio buffers. It will do nothing if
connect() hasn’t been called.
Parameters: buf (str) – audio buffer
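Calling this method "in sequence" means feeding it consecutive chunks of audio. As a rough illustration, the sketch below reads a wave file in fixed-size chunks using the standard library's wave module. The generator name iter_audio_buffers is hypothetical, and the chunk size is left as a parameter.

```python
import wave


def iter_audio_buffers(path, frames_per_buffer):
    """Yield successive raw audio buffers read from a wave file.

    Hypothetical helper for illustration; not part of dragonfly.
    """
    wav = wave.open(path, "rb")
    try:
        while True:
            # readframes() returns fewer frames at the end of the file
            # and an empty buffer once the file is exhausted.
            buf = wav.readframes(frames_per_buffer)
            if not buf:
                break
            yield buf
    finally:
        wav.close()


# Hypothetical usage with a connected engine:
# for buf in iter_audio_buffers("speech.wav",
#                               engine.config.FRAMES_PER_BUFFER):
#     engine.process_buffer(buf)
```

Reading from a file keeps the example self-contained. The same loop shape applies to buffers read from a microphone stream.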
Recognise speech from a wave file and return the recognition results.
This method checks that the wave file is valid. It raises an error if the file doesn’t exist, if it can’t be read or if the WAV header values do not match those in the engine configuration.
The wave file must use the same sample width, sample rate and number of channels that the acoustic model uses.
If the file is valid,
process_buffer() is then used to process the audio.
Multiple utterances are supported.
Parameters: path – wave file path
Raises: IOError | OSError | ValueError
Returns: recognition results
Return type: generator
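The header check described above can be approximated with the standard library's wave module. This is a sketch of the idea, not the engine's own validation code. The function name check_wave_header is hypothetical, and the expected values are passed in rather than read from the engine configuration.

```python
import wave


def check_wave_header(path, channels, sample_width, rate):
    """Raise ValueError if the WAV header differs from the expected values.

    Hypothetical helper for illustration; not part of dragonfly.
    """
    # wave.open() itself raises an error if the file is missing or is
    # not a readable WAV file.
    wav = wave.open(path, "rb")
    try:
        actual = (wav.getnchannels(), wav.getsampwidth(),
                  wav.getframerate())
    finally:
        wav.close()

    expected = (channels, sample_width, rate)
    if actual != expected:
        raise ValueError(
            "WAV header (channels, sample width, rate) is %r, expected %r"
            % (actual, expected)
        )
```

In practice the expected values would come from the CHANNELS, SAMPLE_WIDTH and RATE options in the engine configuration, so that a file is checked against what the acoustic model requires.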
Whether the engine is currently recognising speech.
To stop recognition, use
Return type: bool
Set the exclusiveness of a grammar.
set_keyphrase(keyphrase, threshold, func)¶
Add a keyphrase to listen for.
Key phrases take precedence over grammars as they are processed first. They cannot be set for specific contexts (yet).
- keyphrase (str) – keyphrase to add.
- threshold (float) – keyphrase threshold value to use.
- func (callable) – function or method to call when the keyphrase is heard.
Speak the given text using text-to-speech.
Remove a set keyphrase so that the engine no longer listens for it.
Parameters: keyphrase (str) – keyphrase to remove.
Multiplexing interface for the CMU Pocket Sphinx engine¶
Timer manager for the CMU Pocket Sphinx engine.
This class allows running timer functions if the engine is currently processing audio via one of three engine processing methods:
Timer functions will run whether or not recognition is paused (i.e. in sleep mode).
Note: long-running timers will block dragonfly from processing what was said, so be careful with how you use them! Audio frames will not normally be dropped because of timers, long-running or otherwise.
Normal threads can be used instead of timers if desired. This is because the main recognition loop runs in Python rather than in C/C++ code, so there are no unusual multi-threading limitations.
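As an illustration of the thread-based alternative, the sketch below runs a callback at a fixed interval using only the standard library. The class name RepeatingThread is hypothetical and is not part of dragonfly. Unlike engine timers, it runs whether or not the engine is processing audio.

```python
import threading


class RepeatingThread(threading.Thread):
    """Call `callback` every `interval` seconds until stop() is called.

    Hypothetical helper for illustration; not part of dragonfly.
    """

    def __init__(self, callback, interval):
        super(RepeatingThread, self).__init__()
        self.daemon = True  # do not keep the process alive on exit
        self._callback = callback
        self._interval = interval
        self._stopped = threading.Event()

    def run(self):
        # Event.wait() returns False on timeout, i.e. when it is time to
        # call the callback again, and True once stop() has been called.
        while not self._stopped.wait(self._interval):
            self._callback()

    def stop(self):
        self._stopped.set()
```

Waiting on an Event rather than sleeping lets stop() take effect immediately instead of after the current interval. The callback runs on its own thread, so any state it shares with grammar code needs the usual locking.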
Improving Speech Recognition Accuracy¶
CMU Pocket Sphinx can have some trouble accurately recognising what was said. To remedy this, you may need to adapt the acoustic model that Pocket Sphinx is using. This is similar to how Dragon sometimes requires training. The CMU Sphinx adaptation tutorial covers this topic. There is also this YouTube video on model adaptation and the SphinxTrainingHelper bash script.
Adapting your model may not be necessary; there might be other issues with your setup. There is more information on tuning the recognition accuracy in the CMU Sphinx tuning tutorial.
This engine has a few limitations, most notably with spoken language support
Mixing free-form dictation with grammar rules is difficult with the CMU Sphinx decoders. It is either dictation or grammar rules, not both. For this reason, Dragonfly’s CMU Pocket Sphinx SR engine supports speaking free-form dictation, but only on its own.
Parts of rules that have required combinations with
other basic Dragonfly elements such as
ListRef will not be recognised properly using this SR engine via
speaking. They can, however, be recognised via the
Mimic action or the
This engine’s previous
Dictation element support using utterance
breaks has been removed because it didn’t really work very well.
CMU Pocket Sphinx uses pronunciation dictionaries to look up phonetic representations for words in grammars, language models and key phrases in order to recognise them. If you use words in your grammars and/or key phrases that are not in the dictionary, a message similar to the following will be printed:
grammar ‘name’ used words not found in the pronunciation dictionary: notaword
If you get a message like this, try changing the words in your grammars/key phrases by splitting up the words or using similar words, e.g. changing “natlink” to “nat link”.
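To find such words before loading a grammar, you can check them against the dictionary file yourself. The sketch below assumes the common CMU dictionary layout: one entry per line, the word first, then its phonemes, with alternative pronunciations written as word(2). The function names are hypothetical and this is not the engine's own lookup code.

```python
import io
import re


def load_dictionary_words(path):
    """Return the set of words defined in a Sphinx-style .dict file.

    Hypothetical helper; assumes 'word PH PH ...' lines.
    """
    words = set()
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and ';;;' comment lines.
            if not line or line.startswith(";;;"):
                continue
            word = line.split()[0]
            # Alternative pronunciations look like 'word(2)'.
            word = re.sub(r"\(\d+\)$", "", word)
            words.add(word.lower())
    return words


def missing_words(phrase, dictionary_words):
    """Return the words in `phrase` that are not in the dictionary."""
    return [w for w in phrase.lower().split() if w not in dictionary_words]
```

With a dictionary that contains "nat" and "link" but not "natlink", missing_words("natlink", words) reports the unknown word, while missing_words("nat link", words) returns an empty list.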
I hope to eventually have words and phoneme strings dynamically added to the current dictionary and language model using the Pocket Sphinx ps_add_word function (from Python of course).
Spoken Language Support¶
There are only a handful of languages with models and dictionaries available from SourceForge, although it is possible to build your own language model using lmtool or pronunciation dictionary using lextool. There is also a CMU Sphinx tutorial on building language models.
If the language you want to use requires non-ASCII characters (e.g. a Cyrillic language), you will need to use Python version 3.4 or higher because of Unicode issues.
Dragonfly Lists and DictLists¶
Dragonfly Lists and DictLists function as normal, private
rules for the Pocket Sphinx engine. On updating a dragonfly list or
dictionary, the grammar it is part of will be reloaded. This is because
there is unfortunately no JSGF equivalent for lists.
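To see why an update forces a reload, it helps to picture a list as a JSGF rule whose alternatives are fixed text. The sketch below builds such a rule from a Python list. It illustrates the idea and is not the engine's actual grammar compiler. The function name list_to_jsgf_rule is hypothetical, and rejecting empty lists is a choice made here for simplicity.

```python
def list_to_jsgf_rule(name, items):
    """Build a private JSGF rule that matches any one of `items`.

    Hypothetical helper for illustration; not part of dragonfly.
    """
    if not items:
        raise ValueError("a JSGF rule needs at least one alternative")
    alternatives = " | ".join(items)
    # Rules without the 'public' keyword are private in JSGF.
    return "<%s> = (%s);" % (name, alternatives)
```

For example, list_to_jsgf_rule("fruit", ["apple", "banana"]) produces the text `<fruit> = (apple | banana);`. Adding an item changes the rule text itself, so the grammar containing it has to be compiled and loaded again.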
This isn’t a limitation of CMU Pocket Sphinx; text-to-speech is not a project goal for them. However, as the NatLink and WSR engines both support text-to-speech, there might as well be some suggestions if this functionality is desired, perhaps utilised by a custom dragonfly action.