Tuesday, August 9, 2016

Using PocketSphinx within Python Code

Here's the source for what I've been working on.

Looks like my installation records will have to be updated to account for a different installation source, and maybe a different version of the source code.

Ok, here's the process so far.  Install sphinxbase and pocketsphinx from GitHub - this means using the bleeding-edge versions, rather than the tried-and-true 5prealpha versions that I talked about in previous posts.  This just seems to work better.  Once this is all figured out, I'll go back and clean those posts up.
cd ~/tools
git clone https://github.com/cmusphinx/sphinxbase.git
cd ./sphinxbase
./autogen.sh
./configure
make
make check
sudo make install

cd ~/tools
git clone https://github.com/cmusphinx/pocketsphinx.git
cd ./pocketsphinx
./autogen.sh
./configure
make clean all
make check
sudo make install
Now look inside the pocketsphinx directory:
cd ~/tools/pocketsphinx/swig/python/test
There's a whole bunch of test scripts that walk you through using pocketsphinx from python.  It's basically done for you.  Check the one called kws_test.py -- that's the one that will wait to hear a keyword, run a command when it does, then resume listening.  Perfect!

I'm going to assume that you've already created your own voice model based on the other posts in this blog, and that you've got a directory dedicated to command and control experiments.

If that's not true, then just mess with the script in place - make a backup first.  The only effective difference is that detection will be less accurate.  In that case, skip the rest of this walkthrough down to where I've pasted my copy of the python script; the only change you need to make is reading from the microphone rather than an audio file, so edit the script to match the snippet here.  Then you're done.  The rest of this tutorial is for those who have already created their own voice model; see my other posts for how to do that.
# Open file to read the data
# stream = open(os.path.join(datadir, "test-file.wav"), "rb")

# Alternatively you can read from microphone
import pyaudio
 
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=1024)
stream.start_stream()
Ok.  For the rest of us, let's get back to messing with this script.  While still in the test directory,
mkdir ~/tools/cc_ex
cp ./kws_test.py ~/tools/cc_ex/kws_test.py
cd ~/tools/cc_ex/
gedit kws_test.py
There are a few changes to make in the python script.  Make sure the model directory has been adjusted to point at your own.  Also, by default the script checks a .raw audio file for the keyword: comment and uncomment the relevant lines so that it uses pyaudio to record from the microphone instead.  The full text of my version of the script is below.

Note that the keyphrase it's looking for is the word 'and'.  Pretty simple, and very likely to have been covered a lot in the voice training.

Note also that there's a weird quirk in the detection - you have to speak quickly.  I tried for a long time making long, sonorous 'aaaannnnddd' noises at my microphone, and it didn't pick up.  Finally I gave a short, staccato 'and' - it detected me right away.  Did it five more times, and it picked me up each time.  I don't see a way to get around that - I think it's built into the buffer, so it won't even hear the whole thing otherwise.  Or maybe I just said 'and' really fast each time in the training, though I don't think that's likely.
#!/usr/bin/python

import sys, os
from pocketsphinx.pocketsphinx import *
from sphinxbase.sphinxbase import *


modeldir = os.path.expanduser("~/tools/train-voice-data-pocketsphinx")  # expanduser so the '~' actually resolves to a real path

# Create a decoder with certain model
config = Decoder.default_config()
config.set_string('-hmm', os.path.join(modeldir, 'neo-en/en-us'))
config.set_string('-dict', os.path.join(modeldir, 'neo-en/cmudict-en-us.dict'))
config.set_string('-keyphrase', 'and')
config.set_float('-kws_threshold', 1e+1)
#config.set_string('-logfn', '/dev/null')


# Open file to read the data
# stream = open(os.path.join(datadir, "test-file.wav"), "rb")

# Alternatively you can read from microphone
import pyaudio
 
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=1024)
stream.start_stream()

# Process audio chunk by chunk. On keyphrase detected perform action and restart search
decoder = Decoder(config)
decoder.start_utt()
while True:
    buf = stream.read(1024)
    if buf:
        # feed the raw audio chunk to the decoder
        decoder.process_raw(buf, False, False)
    else:
        break
    if decoder.hyp() is not None:
        # keyphrase detected: print the segment info, then reset the search
        print([(seg.word, seg.prob, seg.start_frame, seg.end_frame) for seg in decoder.seg()])
        print("Detected keyphrase, restarting search")
        decoder.end_utt()
        decoder.start_utt()
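The script as pasted only prints when it hears the keyphrase.  To actually run a command on detection - the whole point of the exercise - you could swap the detection block for something like this sketch.  The aplay call and the chime.wav file are placeholders of mine, not part of the original script, so substitute whatever command you actually want to trigger.
import subprocess

# ...and inside the while loop, replace the detection block with:
if decoder.hyp() is not None:
    print("Detected keyphrase, running command and restarting search")
    decoder.end_utt()
    subprocess.call(["aplay", "chime.wav"])  # placeholder action - swap in your own command
    decoder.start_utt()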
Anyway, that's all.  If it doesn't work, don't blame me.  That's as dead simple as I know how to make it.

Training a CMU Sphinx Language Model for Command and Control

CMU Sphinx is advanced enough to use its understanding of grammar to help it figure out the likelihood that a particular word was spoken.  To do this, it needs a predefined concept of which words tend to follow each other -- it needs to understand the format of what is spoken to it.  The context of a 'command and control' AI involves a very specific type of grammar, where the format is predominantly commands and statements.

If CMU Sphinx has been made to recognize that, it will be able to filter out words that don't make sense in that context and give more weight to words that do make sense as control words: it will know that 'play music' is more likely than 'pink music', and that 'shutdown' is more likely to be a command than 'showdown'.

For now, here are the primary sources:
http://cmusphinx.sourceforge.net/wiki/tutoriallm
http://www.speech.cs.cmu.edu/tools/lmtool-new.html

Using these, it should be possible to create the grammar language model from a big list of sentences; the only problem is, I don't have a sentence list.  Once that LM has been created, the voice model I've trained should be retrained as well - even that is done based on grammar statistics.
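One way I could get around the missing sentence list: a short script that expands a handful of command templates into faux-sentences, and the resulting corpus.txt could then be uploaded to the lmtool page linked above.  This is only a sketch - the actions and commands here are made-up examples, not my actual use case.
import itertools

# hypothetical command templates for a command-and-control vocabulary
actions = ["play", "stop", "pause"]
things = ["music", "the radio", "the podcast"]
commands = ["shutdown", "lights on", "lights off", "what time is it"]

# write one faux-sentence per line; lmtool expects a plain sentence corpus like this
with open("corpus.txt", "w") as f:
    for action, thing in itertools.product(actions, things):
        f.write("%s %s\n" % (action, thing))
    for command in commands:
        f.write("%s\n" % command)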


Wednesday, August 3, 2016

Improving the Accuracy of CMU Sphinx for a Limited Vocabulary


Update: I finished my tool for creating a customized voice model.  It encapsulates the best of what I described below.  See here: https://github.com/umhau/vmc.

The idea with a limited vocabulary is that the processor can deal with far less information in order to detect the words needed. You don't have to train it on a complete set of words in the English language, and you don't need a supercomputer.  All you have to do is teach it a few words, and how to spell them.  The tutorial is here.  I've created a script to automate the voice recording here, and stashed the needed files there with it.
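By 'how to spell them' I mean phonetic spellings: the pronunciation dictionary maps each word to its phones.  For reference, entries in the CMU dictionary look roughly like this (these particular words are just examples of mine, with phones in ARPAbet):
play P L EY
music M Y UW Z IH K
shutdown SH AH T D AW N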

Preparation

Alright, down to business.  You'll find it handy to keep a folder for these sorts of programs.
mkdir ~/tools
cd ~/tools
Install git, if you don't have it already.
sudo apt-get install git
Download the script I made into your new tools folder.
sudo git clone https://github.com/umhau/train-voice-data-pocketsphinx.git
Install SphinxTrain.  I included it among the files you just downloaded.  Move it up to ~/tools, extract and install it.  It's also here, if you don't want to use the one I provided.
sudo mv ~/tools/train-voice-data-pocketsphinx/extra_files/sphinxtrain-5prealpha.tar.gz ~/tools
sudo tar -xvzf ~/tools/sphinxtrain-5prealpha.tar.gz -C ~/tools
cd sphinxtrain-5prealpha
./configure
make -j 4
sudo make -j 4 install

Record Your Voice

Enter this directory, run the script.  It'll have a basic walkthrough built-in.  This will help you record the data you need.  For experimental purposes, 20 recordings is enough for about 10% relative improvement in accuracy.  Use the name neo-en for your training data, assuming you're working in English.
cd ./train-voice-data-pocketsphinx
python train_voice_model.py
You'll find your recordings in a subfolder with the same name as what you specified.  Go there.
cd ./neo-en
By the way, if you ever change your mind about what you want your model to be named, there's a fantastic program called pyrenamer that can make it easy to rename all the files you created.  Install it with:
sudo apt-get install pyrenamer

Process Your Voice Recordings

Great!  Done with that part.  Now we're going to copy some other directories into the current working directory to 'work on them'.
cp -a /usr/local/share/pocketsphinx/model/en-us/en-us .
cp -a /usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict .
cp -a /usr/local/share/pocketsphinx/model/en-us/en-us.lm.bin .
Based on this source, it looks like we shouldn't be working with .dmp files.  This is a point of deviation from the (outdated) CMU walkthrough: copy the .bin file instead.  The difference is explained below, sourced from the tutorial.
A language model can be stored and loaded in three different formats: text ARPA format, binary BIN format, and binary DMP format. The ARPA format takes more space, but it is possible to edit it; ARPA files have a .lm extension. The binary format takes significantly less space and is faster to load; binary files have a .lm.bin extension. It is also possible to convert between formats. The DMP format is obsolete and not recommended.
Now, while still in this directory, generate some 'acoustic feature files'.
sphinx_fe -argfile en-us/feat.params -samprate 16000 -c neo-en.fileids -di . -do . -ei wav -eo mfc -mswav yes
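For reference, the neo-en.fileids file used here (and the neo-en.transcription file that bw reads later) follow the standard CMU Sphinx layout: fileids holds one utterance id per line,
neo-en_0001
neo-en_0002
while the transcription file pairs each recorded sentence with its id in parentheses,
turn on the lights (neo-en_0001)
play some music (neo-en_0002)
The ids and sentences above are just placeholders - yours depend on what the recording script actually wrote.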

Get the Full-Sized Acoustic Model

Nice.  You have a bunch more files with weird extensions on them.  Next you need the full version of the acoustic model, which was not included with your original installation for size reasons.  I included it in the github repository, or you can download it from here (you want the file named cmusphinx-en-us-ptm-5.2.tar.gz).  Put the extracted files in your neo-en directory.

Assuming you use the one from the github repo and you're still in the neo-en subdirectory,
tar -xvzf ../extra_files/cmusphinx-en-us-ptm-5.2.tar.gz -C .
There's a folder labeled en-us inside the neo-en folder - the one you copied in earlier.  Give it a suffix and set it aside in case of horrible mistakes.
mv ./en-us ./en-us-original
The newly extracted directory is already inside your neo-en folder; now rename it to en-us.
mv ./cmusphinx-en-us-ptm-5.2 ./en-us
This converts the binary mdef file into a text file.
pocketsphinx_mdef_convert -text ./en-us/mdef ./en-us/mdef.txt

Grab Some Tools

Now you need some more tools to work with the data.  These are from SphinxTrain, which you installed earlier.  You should still be in your working directory, neo-en.  Use ls to see what tools are available in the directory.
ls /usr/local/libexec/sphinxtrain
cp /usr/local/libexec/sphinxtrain/bw .
cp /usr/local/libexec/sphinxtrain/map_adapt .
cp /usr/local/libexec/sphinxtrain/mk_s2sendump .
cp /usr/local/libexec/sphinxtrain/mllr_solve .

Run 'bw' Command to Collect Statistics on Your Voice

Now you're going to run a very long command that is designed to collect statistics about your voice.  Those backslashes -- the \ things -- tell bash to treat the character that follows literally: here they escape the newlines, which is how this command stretches over multiple lines.
./bw \
 -hmmdir en-us \
 -moddeffn en-us/mdef.txt \
 -ts2cbfn .ptm. \
 -feat 1s_c_d_dd \
 -svspec 0-12/13-25/26-38 \
 -cmn current \
 -agc none \
 -dictfn cmudict-en-us.dict \
 -ctlfn neo-en.fileids \
 -lsnfn neo-en.transcription \
 -accumdir .
Future note, for using the continuous model instead of the PTM model (from the tutorial):
Make sure the arguments in the bw command match the parameters in the feat.params file inside the acoustic model folder. Please note that not all the parameters from feat.params are supported by bw, only a few of them. bw, for example, doesn't support upperf or other feature extraction params. You only need to use parameters which are accepted; other parameters from feat.params should be skipped.
For example, for a continuous model you don't need to include the svspec option. Instead, you need to use just -ts2cbfn .cont. For semi-continuous models use -ts2cbfn .semi. If the model has a feature_transform file, like the en-us continuous model, you need to add the -lda feature_transform argument to bw, otherwise it will not work properly.
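Putting that note together with the bw command above, my best guess at what the call would look like for the continuous en-us model is below - untested on my end, so treat it as a sketch: drop -svspec, switch -ts2cbfn to .cont., and point -lda at the continuous model's feature_transform file.
./bw \
 -hmmdir en-us \
 -moddeffn en-us/mdef.txt \
 -ts2cbfn .cont. \
 -feat 1s_c_d_dd \
 -lda en-us/feature_transform \
 -cmn current \
 -agc none \
 -dictfn cmudict-en-us.dict \
 -ctlfn neo-en.fileids \
 -lsnfn neo-en.transcription \
 -accumdir .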

More Commands

Now it's time to adapt the model.  Looks like continuous will be better to use in the long run, but first we're just going to get this working.  The tutorial suggests that using MLLR and MAP adaptation methods together is best, but it looks like so far we're just using them sequentially.  Here goes:
./mllr_solve \
 -meanfn en-us/means \
 -varfn en-us/variances \
 -outmllrfn mllr_matrix -accumdir .
It appears this adapted model is now complete!  Nice work.  To use it, add -mllr mllr_matrix to your PocketSphinx command line.  I'll put complete commands at the bottom of this note.

Now we're going to do the MAP adaptation method, which is used on top of the MLLR method.  Copy the model files you were just working on into a new directory, which will hold the MAP-adapted versions:
cp -a en-us en-us-adapt
To run the MAP adaptation:
./map_adapt \
 -moddeffn en-us/mdef.txt \
 -ts2cbfn .ptm. \
 -meanfn en-us/means \
 -varfn en-us/variances \
 -mixwfn en-us/mixture_weights \
 -tmatfn en-us/transition_matrices \
 -accumdir . \
 -mapmeanfn en-us-adapt/means \
 -mapvarfn en-us-adapt/variances \
 -mapmixwfn en-us-adapt/mixture_weights \
 -maptmatfn en-us-adapt/transition_matrices

[Optional; saves some space]

...I think.  Apparently it's now important to recreate the sendump file from the newly updated mixture_weights file; the sendump file is a much more compact representation of the mixture weights, which is where the space savings come from.
./mk_s2sendump \
 -pocketsphinx yes \
 -moddeffn en-us-adapt/mdef.txt \
 -mixwfn en-us-adapt/mixture_weights \
 -sendumpfn en-us-adapt/sendump

Testing the Model

It's also important to test the adaptation quality.  This actually gives you a benchmark - a word error rate (WER).  See here.

Create Test Data

Use another script I made to record test data.  It's almost the same, but the fileids and transcription file formats are different.  The folder with the test data should end up in the neo-en directory. Use the directory name I provide, test-data.
python ../create_test_records.py
Go back into the neo-en folder and run the decoder on the test files.
pocketsphinx_batch \
 -adcin yes \
 -cepdir ./test-data \
 -cepext .wav \
 -ctl ./test-data/test-data.fileids \
 -lm en-us.lm.bin \
 -dict cmudict-en-us.dict \
 -hmm en-us-adapt \
 -hyp ./test-data/test-data.hyp
Use this script to actually test the accuracy of the model.  You'll need a working pocketsphinx installation, since this check is just a word-comparison script layered over the transcription engine: it compares your reference transcription against the hypotheses the decoder produced.  Look at the end of the output; it'll give you some percentages indicating accuracy.
../../pocketsphinx-5prealpha/test/word_align.pl \
 ./test-data/test-data.transcription \
 ./test-data/test-data.hyp

Live Testing

If you just want to try out your newly adapted model, record a file and try to transcribe it with these commands (assuming you're still in the neo-en working directory):
python ../record_test_voice.py
pocketsphinx_continuous -hmm ./en-us-adapt -infile ../test-file.wav
Or, if you'd rather use a microphone and record live, use this command:
pocketsphinx_continuous -hmm ./en-us-adapt -inmic yes
With 110 voice records and using 20 records as testing, I achieved 60% accuracy. At 400 records and a marginal mic, I achieved 77% accuracy.  There are about 1000 records available.

Achieving Optimal Accuracy

You'll want to create your own language model if you're going to be using a specialized language. That's a pain, and you have to know what you're going to use it for ahead of time.  If I do that, I'll collect the words from the tools where I specified them and automagically rebuild the language model.  For now, I think I can get away with using the default lm.

For actual use of the model, everything you need is in en-us-adapt.  That's the directory to point a command at whenever it needs your adapted model.

Use the following command to transcribe a file, if you've created your own lm and language dictionary:
pocketsphinx_continuous \
 -hmm <your_new_model_folder> \
 -lm <your_lm> \
 -dict <your_dict> \
 -infile test.wav
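And here's the complete command I promised earlier, folding in the MLLR matrix as well; the paths assume you're still sitting in the neo-en directory and using the default lm and dictionary copied in earlier.
pocketsphinx_continuous \
 -hmm ./en-us-adapt \
 -mllr ./mllr_matrix \
 -lm en-us.lm.bin \
 -dict cmudict-en-us.dict \
 -infile ../test-file.wav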
Upon testing, it appears that training in a less controlled environment might be useful: transcription was almost perfect when I was able to recreate the atmosphere of the original training recordings, and pretty bad otherwise.

Conclusions

I've made a bunch of scripts in the github repo that automate some of this stuff, assuming standard installs.  Look for check_accuracy.sh and create_model.sh.  Everything should be run inside the neo-en folder, except the original train_voice_model.py script.

TODO next -
  1. set up the more accurate continuous model
  2. create a script that generates words in faux-sentences based on my use case scenario.  
    1. find a phonetic dictionary that covers my needs
    2. figure out what my use case actually is

Tuesday, August 2, 2016

Setting Up an Offline Transcriber Using Kaldi - Part 3: Sphinx, not Kaldi

How to install PocketSphinx 5Prealpha on Mint 17.3.

We're going to install and work with these packages in a folder located at ~/tools.  Make sure it exists.
mkdir ~/tools
Download pocketsphinx and sphinxbase from the CMU Sphinx downloads page, then move the files from your Downloads folder into your project folder and extract them.
tar -xzf ~/Downloads/sphinxbase-5prealpha.tar.gz -C ~/tools/
tar -xzf ~/Downloads/pocketsphinx-5prealpha.tar.gz -C ~/tools/
Make sure dependencies are installed.  You're installing libpulse-dev so that sphinxbase will configure itself to work with PulseAudio, the recommended audio framework on Ubuntu (and, by extension, on Mint).
sudo apt-get install python-dev pulseaudio libpulse-dev gcc automake autoconf libtool bison swig
Note: make sure that swig is at least version 2.0.  You can check with this command:
dpkg -p swig | grep Version
Move into the sphinxbase folder.
cd ~/tools/sphinxbase-5prealpha
Since you downloaded the release version, the configure file has already been generated.  It's time to configure, make and make install!
./configure
make
sudo make install
Sphinxbase is installed in /usr/local/lib; in case Mint 17 doesn't look there for program libraries, you have to manually tell it to use that location.  Here are the commands:
export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
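These exports only last for the current terminal session; if you want them to persist, append them to your ~/.bashrc:
echo 'export LD_LIBRARY_PATH=/usr/local/lib' >> ~/.bashrc
echo 'export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig' >> ~/.bashrc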
Now move into the pocketsphinx folder and do the same installation:
cd ~/tools/pocketsphinx-5prealpha
./configure
make
sudo make install
You can test the installation by running the following; it should recognize what you speak into the microphone.
pocketsphinx_continuous -inmic yes
If you want to transcribe a file, use this command:
pocketsphinx_continuous -infile file.wav
If you run into trouble, this should help.








Final post here

I'm switching over to github pages.  The continuation of this blog (with archives included) is at umhau.github.io.  By the way, the ...