CMU Pocket Sphinx engine back-end
Dragonfly is able to use the open source CMU Pocket Sphinx speech recognition engine as an alternative to DNS/Natlink, Kaldi and WSR. The engine has a few limitations (documented below), and is generally not as accurate as the others, but it does work well enough.
The CMU Pocket Sphinx engine may be used on Windows, macOS and Linux. Dragonfly has good support for desktop environments on each of these operating systems. As other sections of the documentation mention, Dragonfly does not support Wayland.
Setup
There are three Pocket Sphinx engine dependencies: the sphinxwrapper, pyjsgf and sounddevice Python packages.
You can install these by running the following command:
pip install 'dragonfly[sphinx]'
If you are installing to develop Dragonfly, use the following instead:
pip install -e '.[sphinx]'
Note for Windows: This engine backend does not currently support Python version 3.10 or higher on Windows.
Note for Linux: You may need the portaudio headers to be installed in order to be able to install/compile the sounddevice Python package. Under apt-based distributions, you can get them by running sudo apt install portaudio19-dev. You may also need to make your user account a member of the audio group to be able to access your microphone. Do this by running usermod -a -G audio <account_name>.
Once the dependencies are installed, you’ll need to copy the dragonfly/examples/sphinx_module_loader.py script into the folder with your grammar modules and run it using:
python sphinx_module_loader.py
This file is the equivalent of the ‘core’ directory that NatLink uses to load grammar modules. When run, it will scan the directory it’s in for files beginning with _ and ending with .py, then try to load them as command-modules.
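For illustration, a minimal command module that the loader would pick up might look something like the following sketch; the file name _example.py, the rule and the command words are all hypothetical:
# _example.py - hypothetical command module.
from dragonfly import Grammar, MappingRule, Text

class ExampleRule(MappingRule):
    mapping = {
        # Speaking "say hello" types the word "hello".
        "say hello": Text("hello"),
    }

grammar = Grammar("example")
grammar.add_rule(ExampleRule())
grammar.load()

# Command modules conventionally define an unload() function, which the
# loader calls when the module is unloaded.
def unload():
    global grammar
    if grammar:
        grammar.unload()
    grammar = None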
Engine configuration
This engine can be configured by changing the engine configuration. You can make changes to the engine.config object directly in your sphinx_module_loader.py file before connect() is called, or create a config.py module in the same directory.
The LANGUAGE option specifies the engine’s user language. This is English ("en") by default.
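As a rough sketch, changing an option directly in the loader file before connecting might look like this:
from dragonfly import get_engine

engine = get_engine("sphinx")
# Change engine options here, before connect() is called.
engine.config.LANGUAGE = "en"   # the default user language
engine.connect()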
Audio configuration
Audio configuration options are used to record from the microphone and to validate input wave files.
These options must match the requirements for the acoustic model being used. The default values match the requirements for the 16kHz CMU US English models.
CHANNELS – number of audio input channels (default: 1).
SAMPLE_WIDTH – sample width for audio input in bytes (default: 2).
FORMAT – should match the sample width (default: int16).
RATE – sample rate for audio input in Hz (default: 16000).
BUFFER_SIZE – frames per recorded audio buffer (default: 1024).
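For example, a config.py module overriding these options might look like the following sketch. The values shown are the documented defaults; the string form of the FORMAT value is an assumption here:
# config.py - engine configuration module (a sketch).
LANGUAGE = "en"

# Audio input options matching the 16kHz CMU US English models.
CHANNELS = 1
SAMPLE_WIDTH = 2        # bytes per sample
FORMAT = "int16"        # assumed string form; should match the sample width
RATE = 16000            # Hz
BUFFER_SIZE = 1024      # frames per recorded audio buffer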
Decoder configuration
The DECODER_CONFIG object initialised in the engine config module can be used to set various Pocket Sphinx decoder options. The following is the default decoder configuration:
import os
from sphinxwrapper import DefaultConfig
# Configuration for the Pocket Sphinx decoder.
DECODER_CONFIG = DefaultConfig()
# Silence the decoder output by default.
DECODER_CONFIG.set_string("-logfn", os.devnull)
# Set voice activity detection configuration options for the decoder.
# You may wish to experiment with these if noise in the background
# triggers speech start and/or false recognitions (e.g. of short words)
# frequently.
# Descriptions for VAD configuration options were retrieved from:
# https://cmusphinx.github.io/doc/sphinxbase/fe_8h_source.html
# Number of silence frames to keep after transitioning from speech to silence.
DECODER_CONFIG.set_int("-vad_postspeech", 30)
# Number of speech frames to keep before transitioning from silence to speech.
DECODER_CONFIG.set_int("-vad_prespeech", 20)
# Number of speech frames to trigger vad from silence to speech.
DECODER_CONFIG.set_int("-vad_startspeech", 10)
# Threshold for decision between noise and silence frames.
# Log-ratio between signal level and noise level.
DECODER_CONFIG.set_float("-vad_threshold", 3.0)
There does not appear to be much documentation on these options outside of the Pocket Sphinx config_macro.h header file.
The easiest way of seeing the available decoder options and their default values is to initialise a decoder and read the log output:
from sphinxwrapper import PocketSphinx
# Initialising a decoder logs the available options and their values.
PocketSphinx()
Changing Models and Dictionaries
The DECODER_CONFIG object can be used to configure the pronunciation dictionary as well as the acoustic and language models. You can do this with something like:
DECODER_CONFIG.set_string('-hmm', '/path/to/acoustic-model-folder')
DECODER_CONFIG.set_string('-lm', '/path/to/lm-file.lm')
DECODER_CONFIG.set_string('-dict', '/path/to/dictionary-file.dict')
The language model, acoustic model and pronunciation dictionary should all use the same language or language variant. See the CMU Sphinx wiki for a more detailed explanation of these components.
Improving Speech Recognition Accuracy
CMU Pocket Sphinx can have some trouble recognizing what was said accurately. To remedy this, you may need to adapt the acoustic model that Pocket Sphinx is using. This is similar to how Dragon sometimes requires training. The CMU Sphinx adaptation tutorial covers this topic. There is also this YouTube video on model adaptation and the SphinxTrainingHelper bash script.
Adapting your model may not be necessary; there might be other issues with your setup. There is more information on tuning the recognition accuracy in the CMU Sphinx tuning tutorial.
Limitations
This engine has a few limitations, most notably with spoken language support and dragonfly’s Dictation functionality.
Dictation
Mixing free-form dictation with grammar rules is difficult with the CMU Sphinx decoders. It is either dictation or grammar rules, not both. For this reason, Dragonfly’s CMU Pocket Sphinx SR engine supports speaking free-form dictation, but only on its own.
Parts of rules that have required combinations with Dictation and other basic Dragonfly elements such as Literal, RuleRef and ListRef will not be recognized properly by this SR engine when spoken. They can, however, be recognized via the engine.mimic() method, the Mimic action or the Playback action.
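For instance, a rule mixing a literal word with Dictation could still be exercised through mimicry, as in the following sketch (the rule, its spec and the mimicked words are hypothetical):
from dragonfly import Grammar, MappingRule, Dictation, Mimic, Text

class NoteRule(MappingRule):
    # "note <text>" combines a literal word with free-form dictation,
    # which this engine cannot recognize from speech alone.
    mapping = {"note <text>": Text("%(text)s")}
    extras = [Dictation("text")]

grammar = Grammar("notes")
grammar.add_rule(NoteRule())
grammar.load()

# Mimicry works where spoken recognition of the mixed rule does not.
Mimic("note", "hello", "world").execute()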
Unknown words
CMU Pocket Sphinx uses pronunciation dictionaries to look up phonetic representations for the words in grammars and language models in order to recognize them. If you use words in your grammars that are not in the dictionary, a message similar to the following will be printed:
grammar ‘name’ used words not found in the pronunciation dictionary: notaword
If you get a message like this, try changing the words in your grammars, either by splitting them up or by using similar words, e.g. changing “natlink” to “nat link”. You may also try adding words to the pronunciation dictionary.
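The engine’s check_valid_word() method (documented in the Engine API section below) can also be used to test whether a particular word is in the current dictionary. A rough sketch, using an arbitrary example word:
from dragonfly import get_engine

engine = get_engine("sphinx")
engine.connect()  # check_valid_word() must be called after connect().
if not engine.check_valid_word("natlink"):
    print("'natlink' is not in the pronunciation dictionary")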
Spoken Language Support
There are only a handful of languages with models and dictionaries available from SourceForge, although it is possible to build your own language model using lmtool or pronunciation dictionary using lextool. There is also a CMU Sphinx tutorial on building language models.
I have tested Russian and Chinese models and dictionaries with this engine implementation. They both work, though the latter required recompiling CMU Pocket Sphinx with the FSG_PNODE_CTXT_BVSZ constant manually set to 6 or higher. There may be encoding errors, depending on Python version and platform.
Dragonfly Lists and DictLists
Dragonfly Lists and DictLists function as normal, private rules for the Pocket Sphinx engine. On updating a Dragonfly list or dictionary, the grammar they are part of will be reloaded. This is because there is unfortunately no JSGF equivalent for lists.
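To illustrate, the following sketch uses a Dragonfly List in the usual way; the grammar, rule and list contents are hypothetical. With this engine, each update to the list reloads the grammar it belongs to:
from dragonfly import Grammar, MappingRule, List, ListRef, Text

fruits = List("fruits")

class FruitRule(MappingRule):
    mapping = {"eat <fruit>": Text("%(fruit)s")}
    extras = [ListRef("fruit", fruits)]

grammar = Grammar("fruit")
grammar.add_rule(FruitRule())
grammar.load()

# Updating the list after loading causes this engine to reload the
# whole grammar, since there is no JSGF equivalent for lists.
fruits.append("apple")
fruits.extend(["banana", "cherry"])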
Engine API
- class SphinxEngine[source]
Speech recognition engine back-end for CMU Pocket Sphinx.
- check_valid_word(word)[source]
Check if a word is in the current Sphinx pronunciation dictionary.
This method must be called after connect().
- Return type: bool
- property config
Python module/object containing engine configuration.
You will need to restart the engine with disconnect() and connect() if the configuration has been changed after connect() has been called.
- Returns: config module/object
- connect()[source]
Set up the CMU Pocket Sphinx decoder.
This method does nothing if the engine is already connected.
- create_timer(callback, interval, repeating=True)[source]
Create and return a timer using the specified callback and repeat interval.
Note
Timers only run when the engine is processing audio.
- disconnect()[source]
Deallocate the CMU Sphinx decoder and any other resources used by it. If the engine is currently recognizing, the recognition loop will be terminated first.
This method unloads all loaded grammars.
Multiplexing interface for the CMU Pocket Sphinx engine
- class SphinxTimerManager(interval, engine)[source]
Timer manager for the CMU Pocket Sphinx engine.
This class allows running timer functions if the engine is currently processing audio via one of three engine processing methods:
process_buffer()
process_wave_file()
do_recognition()
Note
Long-running timers will block Dragonfly from processing what was said, so be careful how you use them!
Audio frames will not normally be dropped because of timers, long-running or otherwise.
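As a brief sketch, a timer might be created and later stopped as follows; the callback and the 0.5 second interval are arbitrary, and remember that the timer only fires while the engine is processing audio:
from dragonfly import get_engine

def heartbeat():
    print("tick")

engine = get_engine("sphinx")
# Run heartbeat() roughly every 0.5 seconds while audio is processed.
timer = engine.create_timer(heartbeat, 0.5)
# ... later, stop the timer.
timer.stop()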
Sphinx Recognition Results Class
- class Results(hypothesis, type, audio_buffers)[source]
CMU Pocket Sphinx recognition results class.
- property audio_buffers
The audio for this recognition, if any.
- property grammar
The grammar which processed this recognition, if any.
- property hypothesis
The final hypothesis for this recognition.
- property recognition_type
The type of this recognition.
- property rule
The rule that matched this recognition, if any.