*nixing Around: speech to text


Tuesday, October 11, 2016

Building a Statistical Language Model

Update: I finished my script for creating custom language models. See here: https://github.com/umhau/vmc.

There's a summary at the end with what I figured out. Most of this is me thinking on paper.

The statistical language model tells CMU Sphinx which words exist and what order they tend to appear in (the grammar and syntax structure). The intro website to all this is here.

I'm trying to decide between the SRILM and the MITLM packages [subsequent edit: also the logios package and the quick_lm Perl script - these are referenced in hard-to-find places on the CMU website; see here and here, respectively] [another subsequent edit: looks like I found a link to the official CMU Statistical Language Model toolkit - it was buried in the quick_lm script]. SRILM is easier to use, apparently, and the CMU site provides example commands. MITLM, however, seems more likely to stick around and be accessible on github for the long-term. Plus, I forked it.

[sorry, blogger's formatting broke and I had to convert everything to plaintext and start over...lost the links.]

Only downside is, the main contributor to MITLM stopped work on it about 6 mos ago, and started dealing with Kaldi instead. Guess he figured the newer tech was more worth his time. Still, dinosaurs have their place; just watch Space Cowboys to get the picture.

MITLM

Just to be sure that the software doesn't go anywhere, code is downloaded from my repository.

Update: Thanks to Qi Wang's comment below there's an extra dependency to install:

sudo apt-get install autoconf-archive

Installation of MITLM:

cd ~/tools
git clone https://github.com/umhau/mitlm.git
cd ./mitlm
./autogen.sh
./configure
make
make install

~~So, turns out that there's some weird problems with the installation. Something changed, or something isn't being installed properly. The compilation seems to fail with these errors:~~

./configure: line 19641: AX_CXX_HEADER_TR1_UNORDERED_MAP: command not found
./configure: line 19642: syntax error near unexpected token `noext,'
./configure: line 19642: `AX_CXX_COMPILE_STDCXX_11(noext, optional)'

~~g++ wasn't installed, but even after that was added it still wouldn't work.~~

Update: Unfortunately, I've lost track of other dependencies involved - at some point, I'll make a list of all the stuff I've installed while working on this project. Had to install libtool (or similar?) to get here. Mental note:

libtoolize:   error: Failed to create 'build-aux'

But, that's because I'm trying to do this on a different Mint installation from my usual - on my default workstation, that dependency is installed (no idea what it is, except that it's probably listed somewhere on this blog).

After installing the extra dependency, the installation works! So this is a viable avenue thus far to get the LM working. I've already made it past where I need the MITLM, though, so I'm going to let it be for now. Might have to come back for it.

SRILM

Ok, let's see what SRILM has to offer us. It's more inconvenient to install; ya have to go through a license agreement to download it, so I can't just stick a bash command here.

...unless I put the code on my github. In which case, it's easy to get a copy of. Too bad there's too many files to put up an extracted version, and too bad the compressed version is more than 25mb. Time to split up the tar.gz file again; for my own records, here's how I split it. All I need for getting and using it is the reconstruction bit.

The splitting part, given the archive file:

split -b 24m -d srilm-1.7.1.tar.gz srilm-1.7.1.tar.gz.part-

Alright. Once the file is on github, it's just more copy-pasting.

cd ~/tools
git clone https://github.com/umhau/srilm.git
cd ./srilm
cat srilm-1.7.1.tar.gz.part-* | tar -xz
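Before trusting the split-and-reassemble trick with a real archive, it's easy to sanity-check the idea. Here's a minimal Python sketch of the same round trip on in-memory bytes; the function names and the chunk size are placeholders I made up, not anything taken from split or cat.

```python
import hashlib


def split_bytes(data, chunk_size):
    """Cut data into consecutive pieces of at most chunk_size bytes (like split -b)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def join_parts(parts):
    """Concatenate the pieces in order (like cat part-*)."""
    return b"".join(parts)


def sha256_hex(data):
    """Hex digest used to compare the original with the rebuilt copy."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    original = bytes(range(256)) * 100  # stand-in for the tar.gz contents
    parts = split_bytes(original, chunk_size=7000)
    rebuilt = join_parts(parts)
    print(len(parts), "parts; checksums match:",
          sha256_hex(original) == sha256_hex(rebuilt))
```

The only thing that matters is that the parts get concatenated in the same order they were cut, which is what the numeric suffixes from split -d are for.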

By the way, WOW. The installation process for this software is not straightforward. See the install file for the instructions on installation - read for background, then copy-paste below as usual.

gedit ./INSTALL

Step 2 - swap out the SRILM variable for one delimiting the root directory of the package. Source.

sed -i '7s#.*#SRILM = ~/tools/srilm#' ./Makefile

For now, assuming that the variables are all good. I don't know if I want maximum entropy models, though it sounds useful...I'll see what happens if I don't prep them.

Installing John Ousterhout's TCL toolkit - we're past the required v7.3, and up to 8.6: hope this still works. I'm compiling from source rather than using the available binaries 'cause they come with some kind of non-commercial/education license, which I don't like being tied down by.

cd ~/tools
git clone https://github.com/umhau/tcl-tk.git
cd ./tcl-tk
gunzip < tcl8.6.6-src.tar.gz | tar xvf -
gunzip < tk8.6.6-src.tar.gz | tar xvf -

Install TCL:

cd tcl8.6.6/unix
# chmod +x configure
./configure --enable-threads
make -j 3
make test 
sudo make -j 3 install

Let's try running the rest without the TK stuff...even though John says it's needed. Heh. Leeeroooy Jenkins!

cd ../../../srilm
make World

...aaaaaaaand, Fail.

This is going nowhere fast. We're in dependency hell. Let's try the perl script CMU uses (it's the backend to the online service they officially reference).

The Perl Script

Thankfully, Mint comes with perl installed. So, the question is how to use the script.

cd ~/tools
mkdir ./CMU_LMtool && cd ./CMU_LMtool
wget http://www.speech.cs.cmu.edu/tools/download/quick_lm.pl

The only thing left here is to figure out how to use the script...having never used perl, this could be interesting. Dug this nugget out of the script:

usage: quick_lm -s <sentence_file> [-w <word_file>] [-d discount]

So, the idea with the LMtool is to process sentences that the decoder should recognize - it doesn't need to be an exhaustive list, however, because the decoder will allow fragments to recombine in the detection phase. As a corpus example (from the CMU website), here's the following:

THIS IS AN EXAMPLE SENTENCE
EACH LINE IS SOMETHING THAT YOU'D WANT YOUR SYSTEM TO RECOGNIZE
ACRONYMS PRONOUNCED AS LETTERS ARE BEST ENTERED AS A T_L_A
NUMBERS AND ABBREVIATIONS OUGHT TO BE SPELLED OUT FOR EXAMPLE
TWO HUNDRED SIXTY THREE ET CETERA
YOU CAN UPLOAD A FEW THOUSAND SENTENCES
BUT THERE IS A LIMIT

We'll use this sentence collection to test the perl script:

cd ~/tools/CMU_LMtool
wget https://raw.githubusercontent.com/umhau/misc-LMtools/master/ex-corpus.txt
perl quick_lm.pl -s ex-corpus.txt

Well, it did exactly nothing. No terminal output, no new files created in the directory, and no errors. Time to search the script for other possible output locations. How weird can it be?

...

Ok, solved the problem. Thank goodness for auto highlighting in Gedit. The authors wrapped the header in what look like Perl POD documentation markers, but as far as I can tell a POD block only ends at a =cut line, so the =END here never closed it. That seems to have made the interpreter treat the whole script as documentation:

=POD
/*
[some text wrapped by those comment markers]
*/
[more text, only wrapped by the '=' things]
=END

So, I re-commented all the introductory stuff, and put the fixed version in the github repo.

Summary of the Perl script

So, here's how it works: download the fixed script, give it a sentence list, and run the command. Simple. And, looking at the output, the function it performs is pretty simple too: it makes a list of all the 1-, 2- and 3-word groupings (unigrams, bigrams and trigrams) in the sentence list.

Here's what to do:

mkdir ~/tools/CMU_LMtool && cd ~/tools/CMU_LMtool
wget https://raw.githubusercontent.com/umhau/misc-LMtools/master/ex-corpus.txt
wget https://raw.githubusercontent.com/umhau/misc-LMtools/master/quick_lm.pl
perl quick_lm.pl -s ex-corpus.txt

Still not sure what that does for me, but I have my LM!
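For a rough sense of what the script is tallying, here's a minimal Python sketch that counts the 1-, 2- and 3-word groupings in a sentence list. It is not a port of quick_lm.pl: I'm assuming sentence boundary markers are part of the count, and it skips the discounting and probability math entirely. The function name is my own.

```python
from collections import Counter


def count_ngrams(sentences, max_n=3):
    """Return {n: Counter of n-word tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        # Assumed sentence boundary markers, as seen in ARPA-style models.
        words = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    corpus = ["THIS IS AN EXAMPLE SENTENCE", "BUT THERE IS A LIMIT"]
    counts = count_ngrams(corpus)
    for n in sorted(counts):
        print(n, "-word groupings:", len(counts[n]))
    print("most common single words:", counts[1].most_common(3))
```

The language model file is then, roughly speaking, these counts turned into probabilities.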

Notes: I think the word list option in the command refers to the possibility of a limited vocabulary...not sure how that relates to words outside that list used in the sentence list. The discount in the command, however, is fixed at 0.5. Apparently Greg and Ben did some experiments to discover that's definitely the optimal setting.

Second Note: based on readings from the CMU website, this LM isn't good for much more than command-and-control - it can successfully detect short phrases accurately, but not long, drawn-out sentences. So it'll be good for most of what I want, but anything complex will need to be done with the CMUCLMTK package.

Hold on - the [-w <word_file>] option for a dictionary might be a request for output - not an extra input. And given that I do need an explicit dictionary for transcription, that's probably what it does. That would be wonderful. I can even use that sentence list for voice training - which would be a fabulous way to ensure accuracy.

Unfortunately, that's not the case. Oh, well.

The official CMU Statistical Language Model toolkit

Ok, maybe this'll do it for me. Here's the link to the source. The Perl script doesn't make all the different files I need - especially the pronunciation dictionary.

mkdir -p ~/tools/CMUSLM
cd ~/tools/CMUSLM
wget http://www.speech.cs.cmu.edu/SLM/CMU-Cam_Toolkit_v2.tar.gz
gunzip < CMU-Cam_Toolkit_v2.tar.gz | tar xvf -
cd ./CMU-Cam_Toolkit_v2

Wow, this is old. You have to uncomment a line in the Makefile if your computer isn't running HP-UX, IRIX, SunOS, or Solaris. I'm pretty sure anything built in this decade needs it uncommented, but if you're unsure, the README mentions a script you can run to check for yourself:

bash endian.sh

Ok, uncomment:

sed -i '37s/#//' ./src/Makefile
cd src
make install

Hard to tell if this was successful. I get the impression watching this compile that it was written in the 80s, and updated for compatibility with something advertising a max capacity of 512 MB of random access memory.

Time to dive into the html documentation, and figure out usage. The goal is to create the LM and DIC files - and a nice perk would be the other stuff produced by the online LM generator.

Turns out, there doesn't seem to be any kind of pronunciation dictionary produced by this tool. So it's no good.

The Logios Package

This seems to be the tool CMU claims was actually used in their website - and, indeed, some of their tools within the package are designed for use in a webform. So I might be on the right track. The only problem is, the input is not a list of sentences: it's a grammar file built by the Phoenix tool. No idea what that is or how it works.

CMU, get your act together! The website is nice, but I've got no recourse if it goes down. I want an independent system!

Here goes. Goal: LM and DIC files. Starting point: list of sentences.

Download the package. Even this isn't user-friendly - the folder structure is in html. I used wget recursively to download the webpages. See here for source on the command.

CMUDict

Actually, it seems like I could just use the dictionary directly. The whole problem is one of how to get the entries from this file into a subset file that holds just what I want - so I'll just write a small script to do just that. What a pain.

wget http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/sphinxdict/cmudict_SPHINX_40

I'll post the script soon - it's being added to a larger package that should make the process of getting a personal language model pretty painless. That'd be nice.
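In the meantime, here's a minimal Python sketch of the subsetting idea: pull out only the dictionary entries for words that appear in a sentence list. I'm assuming each dictionary line is a word, whitespace, then its phones, with alternate pronunciations written like WORD(2), and that the words are uppercase. Check the actual file before relying on any of that; the function names are my own.

```python
def load_pronunciations(dict_lines):
    """Map each base word to its list of (entry_word, phones) pairs.

    Assumes one entry per line: the word, whitespace, then the phones.
    """
    entries = {}
    for line in dict_lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comment lines
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        word, phones = parts
        base = word.split("(", 1)[0]  # strip an assumed (2), (3) variant suffix
        entries.setdefault(base, []).append((word, phones))
    return entries


def subset_dictionary(dict_lines, sentences):
    """Return (subset_lines, missing_words) for the words used in sentences."""
    entries = load_pronunciations(dict_lines)
    vocab = sorted({w.upper() for s in sentences for w in s.split()})
    subset, missing = [], []
    for word in vocab:
        if word in entries:
            subset.extend(f"{w} {p}" for w, p in entries[word])
        else:
            missing.append(word)
    return subset, missing


if __name__ == "__main__":
    sample_dict = ["HELLO HH AH L OW", "HELLO(2) HH EH L OW", "WORLD W ER L D"]
    lines, missing = subset_dictionary(sample_dict, ["hello world", "hello there"])
    print("\n".join(lines))
    print("missing:", missing)
```

Anything that lands in the missing list would need a pronunciation from somewhere else, which is exactly the part the online tool handles for you.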

Wednesday, September 7, 2016

[shameless copy] Offline Language Model Creation for PocketSphinx

Normally, I'd be writing these myself. But this time, the explanation was so unusually good that I don't feel the need to simplify it. It's fantastic for my purposes as-is. Source.

The purpose here is to create the statistical language model that pocketsphinx uses to convert phonetics into words. The model is based entirely on what type of sentences it expects to encounter, as defined by the input reference text.

I need this running as a self-contained script in order to make language model generation a seamless part of my project. All the user should have to do is provide a ready-made reference text, and the script should generate the rest.

ARPA model training with CMUCLMTK

You need to download and install cmuclmtk. See CMU Sphinx Downloads for details.

The process for creating a language model is as follows:

1) Prepare a reference text that will be used to generate the language model. The language model toolkit expects its input to be in the form of normalized text files, with utterances delimited by <s> and </s> tags. A number of input filters are available for specific corpora such as Switchboard, ISL and NIST meetings, and HUB5 transcripts. The result should be the set of sentences that are bounded by the start and end sentence markers: <s> and </s>. Here's an example:

<s> generally cloudy today with scattered outbreaks of rain and drizzle persistent and heavy at times </s>
<s> some dry intervals also with hazy sunshine especially in eastern parts in the morning </s>
<s> highest temperatures nine to thirteen Celsius in a light or moderate mainly east south east breeze </s>
<s> cloudy damp and misty today with spells of rain and drizzle in most places much of this rain will be light and patchy but heavier rain may develop in the west later </s>

More data will generate better language models. The weather.txt file from sphinx4 (used to generate the weather language model) contains nearly 100,000 sentences.

2) Generate the vocabulary file. This is a list of all the words in the file:

    text2wfreq < weather.txt | wfreq2vocab > weather.tmp.vocab

3) You may want to edit the vocabulary file to remove words (numbers, misspellings, names). If you find misspellings, it is a good idea to fix them in the input transcript. Save the edited list as weather.vocab, the name used in step 5.

4) If you want a closed vocabulary language model (a language model that has no provisions for unknown words), then you should remove sentences from your input transcript that contain words that are not in your vocabulary file. Save the filtered transcript as weather.closed.txt.

5) Generate the arpa format language model with the commands:

text2idngram -vocab weather.vocab -idngram weather.idngram < weather.closed.txt
idngram2lm -vocab_type 0 -idngram weather.idngram -vocab \
     weather.vocab -arpa weather.lm

6) Generate the CMU binary form (BIN)

sphinx_lm_convert -i weather.lm -o weather.lm.bin

The CMUCLMTK tools and commands are documented at The CMU-Cambridge Language Modeling Toolkit page.
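The text preparation in steps 1 through 4 is the part that has to happen before any of the toolkit commands run, so here's a minimal Python sketch of it: normalize the sentences, build a vocabulary, drop sentences with out-of-vocabulary words, and add the <s> and </s> markers. The function names, the min_count cutoff and the normalization rules are all my own assumptions, not something the toolkit prescribes. In particular, this throws digits away, so numbers should be spelled out beforehand.

```python
import re
from collections import Counter


def normalize(sentence):
    """Lowercase and drop everything except letters, apostrophes and spaces."""
    cleaned = re.sub(r"[^a-z' ]+", " ", sentence.lower())
    return " ".join(cleaned.split())


def build_vocab(sentences, min_count=1):
    """Words seen at least min_count times, sorted alphabetically."""
    freq = Counter(word for s in sentences for word in s.split())
    return sorted(w for w, c in freq.items() if c >= min_count)


def close_vocabulary(sentences, vocab):
    """Keep only sentences whose words are all in vocab (step 4)."""
    allowed = set(vocab)
    return [s for s in sentences if all(w in allowed for w in s.split())]


def tag_utterances(sentences):
    """Wrap each sentence in the <s> ... </s> markers from step 1."""
    return [f"<s> {s} </s>" for s in sentences]


if __name__ == "__main__":
    raw = ["Cloudy, damp and misty today!", "Some dry intervals also.", "Cloudy today"]
    sentences = [normalize(s) for s in raw]
    vocab = build_vocab(sentences, min_count=2)
    closed = close_vocabulary(sentences, vocab)
    print("vocab:", vocab)
    print("\n".join(tag_utterances(closed)))
```

Writing the tagged sentences to a file gives something shaped like weather.closed.txt, ready for text2idngram.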

Tuesday, August 9, 2016

Using PocketSphinx within Python Code

Here's the source for what I've been working on.

Looks like my installation records will have to be updated to account for a different installation source, and maybe a different version of the source code.

Ok, here's the process so far. Install sphinxbase and pocketsphinx from GitHub - this means using the bleeding-edge versions, rather than the tried-and-true 5prealpha releases that I talked about in previous posts. This just seems to work better. Once this is all figured out, I'll go back and clean those up.

cd ~/tools
git clone https://github.com/cmusphinx/sphinxbase.git
cd ./sphinxbase
./autogen.sh
./configure
make
make check
sudo make install

cd ~/tools
git clone https://github.com/cmusphinx/pocketsphinx.git
cd ./pocketsphinx
./autogen.sh
./configure
make clean all
make check
sudo make install

Now look inside the pocketsphinx directory:

cd ~/tools/pocketsphinx/swig/python/test

There's a whole bunch of test scripts that walk you through the implementation of pocketsphinx in python. It's basically done for you. Check the one called kws_test.py -- that's the one that will wait to hear a keyword, run a command when it does, then resume listening. Perfect!

I'm going to assume that you've already created your own voice model based on the other posts in this blog, and that you've got a directory dedicated to command and control experiments.

If that's not true, then just mess with the script without moving it. Just make a backup. The only effective difference is that the detection will be less accurate; for the purposes of this tutorial, ignore the rest of the code down to where I've pasted my copy of the python script. The only thing you should change has to do with reading from the microphone rather than an audio file; change the script to match what I've got here. You're done now. The rest of this tutorial is for those who have already created their own voice model. See others of my posts for how to do that.

# Open file to read the data
# stream = open(os.path.join(datadir, "test-file.wav"), "rb")

# Alternatively you can read from microphone
import pyaudio
 
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=1024)
stream.start_stream()

Ok. For the rest of us, let's get back to messing with this script. While still in the test directory,

mkdir ~/tools/cc_ex
cp ./kws_test.py ~/tools/cc_ex/kws_test.py
cd ~/tools/cc_ex/
gedit kws_test.py

There's a few changes to make in the python script. Make sure the model directory has been adjusted. Also, the script by default is checking in a .raw audio file for the keyword: uncomment and comment the relevant lines so the script uses pyaudio to record from the microphone. The full text of my version of the script is below.

Note that the keyphrase it's looking for is the word 'and'. Pretty simple, and very likely to have been covered a lot in the voice training.

Note also that there's a weird quirk in the detection - you have to speak quickly. I tried for a long time making long, sonorous 'aaaannnnddd' noises at my microphone, and it didn't pick up. Finally gave a short, staccato 'and' - it detected me right away. Did it five more times, and it picked me up each time. I don't see a way to get around that - I think it's built into the buffer, so it won't even hear the whole thing otherwise. Or maybe I just said 'and' in the training really fast each time, though I don't think that's likely.

#!/usr/bin/python

import sys, os
from pocketsphinx.pocketsphinx import *
from sphinxbase.sphinxbase import *


modeldir = os.path.expanduser("~/tools/train-voice-data-pocketsphinx")  # Python won't expand ~ on its own

# Create a decoder with certain model
config = Decoder.default_config()
config.set_string('-hmm', os.path.join(modeldir, 'neo-en/en-us'))
config.set_string('-dict', os.path.join(modeldir, 'neo-en/cmudict-en-us.dict'))
config.set_string('-keyphrase', 'and')
config.set_float('-kws_threshold', 1e+1)
#config.set_string('-logfn', '/dev/null')


# Open file to read the data
# stream = open(os.path.join(datadir, "test-file.wav"), "rb")

# Alternatively you can read from microphone
import pyaudio
 
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=1024)
stream.start_stream()

# Process audio chunk by chunk. On keyphrase detected perform action and restart search
decoder = Decoder(config)
decoder.start_utt()
while True:
    buf = stream.read(1024)
    if buf:
        decoder.process_raw(buf, False, False)
    else:
        break
    if decoder.hyp() is not None:
        print([(seg.word, seg.prob, seg.start_frame, seg.end_frame) for seg in decoder.seg()])
        print("Detected keyphrase, restarting search")
        decoder.end_utt()
        decoder.start_utt()

Anyway, that's all. If it doesn't work, don't blame me. That's as dead simple as I know how to make it.

Training a CMU Sphinx Language Model for Command and Control

CMU Sphinx is advanced enough to use its understanding of grammar to help it figure out the likelihood that a particular word was spoken. To do this, it needs to have a predefined concept of which words tend to follow each other -- it needs to understand the format of what is spoken to it. The context of a 'command and control' AI has a very specific type of grammar involved, where the format is predominately commands and statements.

If CMU Sphinx has been made to recognize that, it will be able to filter words that don't make sense in that context and weight more heavily words that do make sense as control words: it will know that 'play music' is more likely than 'pink music', and 'shutdown' is more likely to be a command than 'showdown'.
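To make the 'play music' versus 'pink music' point concrete, here's a toy Python sketch of a bigram model trained on a handful of commands. It uses add-k smoothing, which is a simplification I picked for brevity and not necessarily what Sphinx language models use. The training sentences and function names are made up for illustration.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count unigrams and bigrams over whitespace-split sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_probability(prev, word, unigrams, bigrams, vocab_size, k=1.0):
    """P(word | prev) with add-k smoothing: unseen pairs are rare, not impossible."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)


def sentence_score(sentence, unigrams, bigrams, k=1.0):
    """Product of bigram probabilities across the sentence."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    vocab_size = len(unigrams)
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= bigram_probability(prev, word, unigrams, bigrams, vocab_size, k)
    return score


if __name__ == "__main__":
    commands = ["play music", "stop music", "play the next song", "shutdown"]
    unigrams, bigrams = train_bigrams(commands)
    for phrase in ("play music", "pink music"):
        print(phrase, sentence_score(phrase, unigrams, bigrams))
```

With this training list, 'play music' scores higher than 'pink music' because 'play' has been seen at the start of a command and before 'music', while 'pink' has never been seen at all.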

For now, here's the primary sources:
http://cmusphinx.sourceforge.net/wiki/tutoriallm
http://www.speech.cs.cmu.edu/tools/lmtool-new.html

Using this, it should be possible to create the grammar language model based on a big list of sentences; only problem is, I don't have a sentence list. Once that LM has been created, the voice data I've created should be retrained - even that is done based on grammar statistics.
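One way around the missing sentence list is to generate it from templates. Here's a minimal Python sketch that fills slots in command patterns with every combination of options; the templates, slot names and function name are hypothetical placeholders, and a real list would come from whatever commands the system actually needs to understand.

```python
from itertools import product


def expand_templates(templates, slots):
    """Fill every {slot} in each template with every combination of its options."""
    sentences = []
    for template in templates:
        # Slot names that actually appear in this template, in a stable order.
        names = [name for name in slots if "{" + name + "}" in template]
        for combo in product(*(slots[name] for name in names)):
            sentences.append(template.format(**dict(zip(names, combo))))
    return sentences


if __name__ == "__main__":
    templates = ["{verb} {thing}", "shutdown"]
    slots = {"verb": ["play", "stop"], "thing": ["music", "the video"]}
    for sentence in expand_templates(templates, slots):
        print(sentence)
```

A template with no slots passes through unchanged, so fixed commands and patterned ones can live in the same list.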

Wednesday, August 3, 2016

Improving the Accuracy of CMU Sphinx for a Limited Vocabulary

Update: I finished my tool for creating a customized voice model. It encapsulates the best of what I described below. See here: https://github.com/umhau/vmc.

The idea with a limited vocabulary is that the processor can deal with far less information in order to detect the words needed. You don't have to train it on a complete set of words in the English language, and you don't need a supercomputer. All you have to do is teach it a few words, and how to spell them. The tutorial is here. I've created a script to automate the voice recording here, and stashed the needed files there with it.

Preparation

Alright, down to business. You'll find it handy to keep a folder for these sorts of programs.

mkdir ~/tools
cd ~/tools

Install git, if you don't have it already.

sudo apt-get install git

Download the script I made into your new tools folder.

sudo git clone https://github.com/umhau/train-voice-data-pocketsphinx.git

Install SphinxTrain. I included it among the files you just downloaded. Move it up to ~/tools, extract and install it. It's also here, if you don't want to use the one I provided.

sudo mv ~/tools/train-voice-data-pocketsphinx/extra_files/sphinxtrain-5prealpha.tar.gz ~/tools
sudo tar -xvzf ~/tools/sphinxtrain-5prealpha.tar.gz -C ~/tools
cd sphinxtrain-5prealpha
./configure
make -j 4
make -j 4 install

Record Your Voice

Enter this directory, run the script. It'll have a basic walkthrough built-in. This will help you record the data you need. For experimental purposes, 20 recordings is enough for about 10% relative improvement in accuracy. Use the name neo-en for your training data, assuming you're working in English.

cd ~/tools/train-voice-data-pocketsphinx
python train_voice_model.py

You'll find your recordings in a subfolder with the same name as what you specified. Go there.

cd ./neo-en

By the way, if you ever change your mind about what you want your model to be named, there's a fantastic program called pyrenamer that can make it easy to rename all the files you created. Install it with:

sudo apt-get install pyrenamer

Process Your Voice Recordings

Great! Done with that part. Now we're going to copy some other directories into the current working directory to 'work on them'.

cp -a /usr/local/share/pocketsphinx/model/en-us/en-us .
cp -a /usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict .
cp -a /usr/local/share/pocketsphinx/model/en-us/en-us.lm.bin .

Based on this source, it looks like we shouldn't be working with .dmp files. This is a point of deviation from the (outdated) CMU walkthrough. Copy the .bin file instead. Difference is explained below, sourced from the tutorial.

Language model can be stored and loaded in three different format - text ARPA format, binary format BIN and binary DMP format. ARPA format takes more space but it is possible to edit it. ARPA files have .lm extension. Binary format takes significantly less space and faster to load. Binary files have .lm.bin extension. It is also possible to convert between formats. DMP format is obsolete and not recommended.
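Since the ARPA format is plain text, it's easy to poke at. Here's a minimal Python sketch of a reader for it, assuming the usual layout: a \data\ header with "ngram N=count" lines, then \N-grams: sections whose entries are a log10 probability, the words, and an optional backoff weight, closed by \end\. The function name and the sample model are my own inventions.

```python
def parse_arpa(lines):
    """Return (declared_counts, ngrams) from the lines of an ARPA-format model.

    declared_counts maps n to the count announced in the header.
    ngrams maps n to a list of (log10_prob, words_tuple, backoff_or_None).
    """
    declared, ngrams = {}, {}
    section = None  # None, "data", or an integer n-gram order
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "\\data\\":
            section = "data"
        elif line == "\\end\\":
            break
        elif line.startswith("\\") and line.endswith("-grams:"):
            section = int(line[1:line.index("-")])
            ngrams[section] = []
        elif section == "data" and line.startswith("ngram "):
            order, count = line[len("ngram "):].split("=")
            declared[int(order)] = int(count)
        elif isinstance(section, int):
            fields = line.split()
            logprob = float(fields[0])
            words = tuple(fields[1:1 + section])
            # A trailing extra field, when present, is the backoff weight.
            backoff = float(fields[1 + section]) if len(fields) > 1 + section else None
            ngrams[section].append((logprob, words, backoff))
    return declared, ngrams


if __name__ == "__main__":
    sample = [
        "\\data\\", "ngram 1=3", "ngram 2=2", "",
        "\\1-grams:", "-0.4771 <s> -0.3010", "-0.4771 </s>", "-0.4771 play -0.3010", "",
        "\\2-grams:", "-0.3010 <s> play", "-0.3010 play </s>", "",
        "\\end\\",
    ]
    declared, ngrams = parse_arpa(sample)
    print("declared:", declared)
    print("bigrams:", ngrams[2])
```

Comparing the declared counts against the number of entries actually read is a cheap way to spot a hand-edited file that has gone out of sync.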

Now, while still in this directory, generate some 'acoustic feature files'.

sphinx_fe -argfile en-us/feat.params -samprate 16000 -c neo-en.fileids -di . -do . -ei wav -eo mfc -mswav yes

Get the Full-Sized Language Model

Nice. You have a bunch more files with weird extensions on them. Now it's time to convert them. You need the full version of the language model, which was not shared with your original installation for size reasons. I included it in the github repository, or you can download it from here (you want the file named cmusphinx-en-us-ptm-5.2.tar.gz). Put the extracted files in your neo-en directory.

Assuming you use the one from the github repo and you're still in the neo-en subdirectory,

tar -xvzf ../extra_files/cmusphinx-en-us-ptm-5.2.tar.gz -C .

There's a folder labeled en-us within the neo-en folder, copied in earlier from the pocketsphinx model directory. Rename it and keep it around in case of horrible mistakes.

mv ./en-us ./en-us-original

Now move the newly extracted directory to your neo-en folder, and rename it to en-us.

mv ./cmusphinx-en-us-ptm-5.2 ./en-us

This converts the binary mdef file into a text file.

pocketsphinx_mdef_convert -text ./en-us/mdef ./en-us/mdef.txt

Grab Some Tools

Now you need some more tools to work with the data. These are from SphinxTrain, which you installed earlier. You should still be in your working directory, neo-en. Use ls to see what tools are available in the directory.

ls /usr/local/libexec/sphinxtrain
cp /usr/local/libexec/sphinxtrain/bw .
cp /usr/local/libexec/sphinxtrain/map_adapt .
cp /usr/local/libexec/sphinxtrain/mk_s2sendump .
cp /usr/local/libexec/sphinxtrain/mllr_solve .

Run 'bw' Command to Collect Statistics on Your Voice

Now you're going to run a very long command that is designed to collect statistics about your voice. Those backslashes -- the \ things -- tell bash to ignore the following character: in this case, newline characters. That's how this command is stretching over multiple lines.

./bw \
 -hmmdir en-us \
 -moddeffn en-us/mdef.txt \
 -ts2cbfn .ptm. \
 -feat 1s_c_d_dd \
 -svspec 0-12/13-25/26-38 \
 -cmn current \
 -agc none \
 -dictfn cmudict-en-us.dict \
 -ctlfn neo-en.fileids \
 -lsnfn neo-en.transcription \
 -accumdir .

Future note, for using the continuous model instead of the PTM model (from the tutorial):

Make sure the arguments in the bw command match the parameters in the feat.params file inside the acoustic model folder. Please note that not all the parameters from feat.params are supported by bw, only a few of them. bw, for example, doesn't support upperf or other feature extraction params. You only need to use parameters which are accepted; other parameters from feat.params should be skipped.

For example, for continuous model you don't need to include the svspec option. Instead, you need to use just -ts2cbfn .cont. For semi-continuous models use -ts2cbfn .semi. If model has `feature_transform` file like en-us continuous model, you need to add -lda feature_transform argument to bw, otherwise it will not work properly.
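Checking the bw arguments against feat.params by eye is error-prone, so here's a minimal Python sketch of an automated comparison. It assumes feat.params is a list of "-flag value" lines, and it only compares the flags you tell it to, since I don't have an authoritative list of what bw accepts. The sample values below are invented for illustration, not copied from a real model.

```python
def parse_feat_params(text):
    """Parse '-flag value' lines into a dict like {'-feat': '1s_c_d_dd'}."""
    params = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("-"):
            params[fields[0]] = fields[1]
    return params


def find_mismatches(feat_params, bw_args, shared_flags):
    """List (flag, feat.params value, bw value) for shared flags that disagree."""
    mismatches = []
    for flag in shared_flags:
        if flag in feat_params and flag in bw_args:
            if feat_params[flag] != bw_args[flag]:
                mismatches.append((flag, feat_params[flag], bw_args[flag]))
    return mismatches


if __name__ == "__main__":
    # Invented sample contents; read the real file from en-us/feat.params instead.
    feat_text = "-lowerf 130\n-upperf 6800\n-feat 1s_c_d_dd\n-agc none\n-cmn batch\n"
    bw_args = {"-feat": "1s_c_d_dd", "-cmn": "current", "-agc": "none"}
    shared = ["-feat", "-svspec", "-cmn", "-agc"]  # hypothetical list of flags to check
    print(find_mismatches(parse_feat_params(feat_text), bw_args, shared))
```

Flags that appear in only one of the two places are skipped rather than reported, which matches the advice above that unsupported parameters should simply be left out.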

More Commands

Now it's time to adapt the model. Looks like continuous will be better to use in the long run, but first we're just going to get this working. The tutorial suggests that using MLLR and MAP adaptation methods together is best, but it looks like so far we're just using them sequentially. Here goes:

./mllr_solve \
 -meanfn en-us/means \
 -varfn en-us/variances \
 -outmllrfn mllr_matrix -accumdir .

It appears this adapted model is now completed! Nice work. To use it, add -mllr mllr_matrix to your PocketSphinx command line. I'll put complete commands at the bottom of this note.

Now we're going to do the MAP adaptation method, which is being used on top of the MLLR method. Back up the files you were just working on:

cp -a en-us en-us-adapt

To run the MAP adaptation:

./map_adapt \
 -moddeffn en-us/mdef.txt \
 -ts2cbfn .ptm. \
 -meanfn en-us/means \
 -varfn en-us/variances \
 -mixwfn en-us/mixture_weights \
 -tmatfn en-us/transition_matrices \
 -accumdir . \
 -mapmeanfn en-us-adapt/means \
 -mapvarfn en-us-adapt/variances \
 -mapmixwfn en-us-adapt/mixture_weights \
 -maptmatfn en-us-adapt/transition_matrices

[Optional; saves some space]

...I think. Apparently it's now important to recreate a sendump file from a newly updated mixture_weights file.

./mk_s2sendump \
 -pocketsphinx yes \
 -moddeffn en-us-adapt/mdef.txt \
 -mixwfn en-us-adapt/mixture_weights \
 -sendumpfn en-us-adapt/sendump

Testing the Model

It's also important to test the adaptation quality. This actually gives you a benchmark - a word error rate (WER). See here.

Create Test Data

Use another script I made to record test data. It's almost the same, but the fileids and transcription file formats are different. The folder with the test data should end up in the neo-en directory. Use the directory name I provide, test-data.

python ../create_test_records.py
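For reference, here's a minimal Python sketch of the two file formats as I understand them from the CMU adaptation tutorial: the fileids file lists one recording name per line without the extension, the training transcription wraps each line in <s> and </s> followed by the file id in parentheses, and the test transcription has the bare text followed by the file id. Treat those details as assumptions to check against the tutorial; the function names and sample ids are made up.

```python
def make_fileids(file_ids):
    """One recording name per line, without the .wav extension."""
    return "\n".join(file_ids) + "\n"


def make_transcription(utterances, training=True):
    """Build transcription text from a list of (file_id, text) pairs.

    training=True wraps the text in <s> ... </s>; training=False leaves it bare.
    """
    lines = []
    for file_id, text in utterances:
        body = f"<s> {text} </s>" if training else text
        lines.append(f"{body} ({file_id})")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    utterances = [("neo-en_0001", "play music"), ("neo-en_0002", "shutdown")]
    print(make_fileids([file_id for file_id, _ in utterances]), end="")
    print(make_transcription(utterances, training=True), end="")
    print(make_transcription(utterances, training=False), end="")
```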

Run the decoder on the test files. Go back into the neo-en folder.

pocketsphinx_batch \
 -adcin yes \
 -cepdir ./test-data \
 -cepext .wav \
 -ctl ./test-data/test-data.fileids \
 -lm en-us.lm.bin \
 -dict cmudict-en-us.dict \
 -hmm en-us-adapt \
 -hyp ./test-data/test-data.hyp

Use this tool to actually test the accuracy of the model. You'll need a working pocketsphinx installation, since it's just a wrapper with a word comparison engine over the transcription engine. Look at the end of the output; it'll give you some percentages indicating accuracy.

../../pocketsphinx-5prealpha/test/word_align.pl \
 ./test-data/test-data.transcription \
 ./test-data/test-data.hyp
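If you're curious what the word error rate actually measures, here's a minimal Python sketch of the standard calculation: the word-level edit distance between the reference transcription and the hypothesis, divided by the number of reference words. This is my own illustration, not a port of word_align.pl, which may normalize text or report things differently.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("play the music", "play music"))
    print(word_error_rate("shutdown", "showdown"))
```

Note that WER can go above 1.0 when the hypothesis contains many extra words, so it isn't quite the same thing as "percent wrong".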

Live Testing

If you just want to try out your newly adapted acoustic model, record a file and try to transcribe it with these commands (assuming you're still in the neo-en working directory):

python ../record_test_voice.py
pocketsphinx_continuous -hmm ./en-us-adapt -infile ../test-file.wav

Or, if you'd rather use a microphone and record live, use this command:

pocketsphinx_continuous -hmm ./en-us-adapt -inmic yes

With 110 voice records and using 20 records as testing, I achieved 60% accuracy. At 400 records and a marginal mic, I achieved 77% accuracy. There's about 1000 records available.

Achieving Optimal Accuracy

You'll want to create your own language model if you're going to be using a specialized language. That's a pain, and you have to know what you're going to use it for ahead of time. If I do that, I'll collect the words from the tools where I specified them and automagically rebuild the language model. For now, I think I can get away with using the default lm.

For actual use of the model, everything you need is in en-us-adapt. That's the folder to pass as -hmm whenever a command asks for your adapted acoustic model.

Use the following command to transcribe a file, if you've created your own lm and language dictionary:

pocketsphinx_continuous \
 -hmm <your_new_model_folder> \
 -lm <your_lm> \
 -dict <your_dict> \
 -infile test.wav

Upon testing, it appears that recording the training data in a less controlled environment might be useful: the transcription was almost perfect when I recreated the atmosphere of the original training records, and pretty bad otherwise.

Conclusions

I've made a bunch of scripts in the github repo that automate some of this stuff, assuming standard installs. Look for check_accuracy.sh and create_model.sh. Everything should be run inside the neo-en folder, except the original train_voice_model.py script.

TODO next -

set up the more accurate continuous model
create a script that generates words in faux-sentences based on my use case scenario.

find a phonetic dictionary that covers my needs
figure out what my use case actually is

Tuesday, August 2, 2016

Setting Up an Offline Transcriber Using Kaldi - Part 3: Sphinx, not Kaldi

How to install PocketSphinx 5Prealpha on Mint 17.3.

We're going to install and work with these packages in a folder located at ~/tools. Make sure this exists.

mkdir ~/tools

Download pocketsphinx and sphinxbase from the downloads page:

Look for the package called sphinxbase-5prealpha.tar.gz. https://sourceforge.net/projects/cmusphinx/files/sphinxbase/5prealpha/
Look for the package called pocketsphinx-5prealpha.tar.gz. https://sourceforge.net/projects/cmusphinx/files/pocketsphinx/5prealpha/

Move the files from your downloads to your project folder and extract them.

tar -xzf ~/Downloads/sphinxbase-5prealpha.tar.gz -C ~/tools/
tar -xzf ~/Downloads/pocketsphinx-5prealpha.tar.gz -C ~/tools/

Make sure dependencies are installed. You're installing libpulse-dev so that sphinxbase will configure itself to work with PulseAudio, the recommended audio framework on Ubuntu (and, by extension, on Mint).

sudo apt-get install python-dev pulseaudio libpulse-dev gcc automake autoconf libtool bison swig

Note: make sure that swig is at least version 2.0. You can check with this command:

dpkg -p swig | grep Version
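
If you'd rather have the comparison done for you, here's a small sketch. It assumes that swig -version prints a line like "SWIG Version 3.0.8" and that your sort supports -V (GNU coreutils does); version_ge and check_swig are names I made up.

```shell
# version_ge A B: succeeds when version A is at least version B.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_swig: compare the installed swig against the 2.0 minimum.
check_swig() {
  # Assumption: the output contains a line such as "SWIG Version 3.0.8".
  installed=$(swig -version 2>/dev/null | awk '/SWIG Version/ {print $3}')
  if [ -z "$installed" ]; then
    echo "swig not found"
    return 1
  fi
  if version_ge "$installed" 2.0; then
    echo "swig $installed is new enough"
  else
    echo "swig $installed is older than 2.0"
    return 1
  fi
}
```

Run check_swig after the apt-get step; a non-zero exit status means swig is missing or too old.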

Move into the sphinxbase folder.

cd ~/tools/sphinxbase-5prealpha

Since you downloaded the release version, the configure file has already been generated. It's time to configure, make and make install!

./configure
make
sudo make install

Sphinxbase is installed in /usr/local/lib; in case Mint 17 doesn't look there for program libraries, you have to manually tell it to use that location. Here are the commands (they only apply to the current terminal session):

export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
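
One thing to watch: plain assignments like these throw away whatever was already in those variables. A slightly more careful sketch, using a made-up helper called prepend_dir, keeps the existing entries and avoids adding the same folder twice:

```shell
# prepend_dir DIR CURRENT: print CURRENT with DIR added at the front,
# unless DIR is already one of its colon-separated entries.
prepend_dir() {
  case ":$2:" in
    *":$1:"*) printf '%s\n' "$2" ;;      # already present, leave as-is
    "::")     printf '%s\n' "$1" ;;      # CURRENT was empty
    *)        printf '%s\n' "$1:$2" ;;
  esac
}

export LD_LIBRARY_PATH="$(prepend_dir /usr/local/lib "${LD_LIBRARY_PATH:-}")"
export PKG_CONFIG_PATH="$(prepend_dir /usr/local/lib/pkgconfig "${PKG_CONFIG_PATH:-}")"
```

Putting these lines in ~/.bashrc would make the setting survive a new terminal. Running sudo ldconfig after make install is the other common way to get the system to notice libraries in /usr/local/lib.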

Now move into the pocketsphinx folder and do the same installation:

cd ~/tools/pocketsphinx-5prealpha
./configure
make
sudo make install

You can test the installation by running the following; it should recognize what you speak into the microphone.

pocketsphinx_continuous -inmic yes

If you want to transcribe a file, use this command:

pocketsphinx_continuous -infile file.wav

If you run into trouble, this should help.

Sunday, July 31, 2016

Speech Recognition Final Verdict

Final strategy: I'm going to use CMU Sphinx with a small vocabulary trained to my voice for most commands. I'll use the kaldi-gstreamer-server, or maybe even an online service, for larger, arbitrary pieces of sound - stuff that I can't predict.

Which means that I'll have two separate, behemoth systems installed on the computer. Ouch. At least I can stream Kaldi from a different computer. Sphinx should be small enough to not be a problem.

Here's what I need to be able to train the command and control language model.

Wednesday, July 27, 2016

Setting Up an Offline Transcriber Using Kaldi - Part 2: EESEN

This is part 2, where I realize that converting an offline transcriber to a different language on my own is a semi-herculean task. In the issue tracker for alumae's GitHub project, there's a conversation revolving around an English conversion (https://github.com/alumae/kaldi-offline-transcriber/issues/6). I used that to find someone else's project that converts the code to work in English.

There are three similar GitHub repos that I'm going to try this time: github.com/srvk/eesen-transcriber, github.com/srvk/eesen, and github.com/srvk/srvk-eesen-offline-transcriber. I think the second is the base package, so I'm going to give that a shot first. The third is a high-level abstraction that makes it easier to transcribe something, and the first appears to be a virtual machine that you can just download and run (more or less). The only issue with the VM is that you need to dedicate 8GB of RAM...and I don't have that much to give away. So I'm going to try the others first. Vagrant is unfamiliar to me, but I looked at the source website and the concept is pretty cool. It solves a lot of portability issues that I was planning on kicking down the road.

Actually, here's an explanation of several of the repos by the author:

We have changed to use other models by brute force; taking out
much of the Estonian and replacing with parts of Kaldi recipes that
do decoding (for example the tedlium recipe). It mostly requires
performing surgery on the Makefile. :)

In particular, for English we do only one pass of decoding, with only
one LM and decoding graph, and skip compounding.

I recently updated a system to use even more different decoding: neural net decoding based on Yajie Miao's EESEN (github.com/yajiemiao/eesen). You could find the resulting code on the SRVK repo here: github.com/srvk/eesen-transcriber.

I think they want people to use the VMs rather than run it straight on their computers. It's certainly more consistent, but also more resource-intensive. I can't do that right now.

Attempt No. 1: Installing eesen

I did run into one hiccup. The make command includes running a script to check for dependencies, which looks for the program libtool. It uses the command

which libtool

to do this. Only problem is, libtool doesn't quite work like that. You actually need to install libtool-bin if you want that dependency check to work. See here for details. Upshot is, install libtool-bin.

sudo apt-get install libtool-bin
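
To see ahead of time which tools a check like that would trip over, something along these lines works. It's only a sketch: missing_tools is a made-up name, and the tool list in the usage line is my guess at what matters, not the list eesen's script actually checks.

```shell
# missing_tools NAME...: print each NAME the shell cannot find as a command.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For example, missing_tools libtool autoconf automake prints nothing when all three are available, and prints libtool if the libtool-bin package hasn't been installed yet.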

Start by downloading eesen into your ~/tools directory. Rename it to eesen-master for clarity's sake. When you compile, don't forget to run make -j 4 if you can.

cd ~/tools/
git clone https://github.com/srvk/eesen.git
mv ./eesen ./eesen-master
cd ./eesen-master/tools
make
./install_atlas.sh
./install_srilm.sh

Great! Now EESEN is installed. I don't know of any checks to perform, aside from whether the make command completed successfully.

Installing srvk-eesen-offline-transcriber

This is the thing that should make using eesen easy(-er). Clone it and build it. Since it's a customized version of alumae's kaldi-offline-transcriber, it should install the same way.

Dependencies

Make sure you have this stuff (I assume, since it's required for kaldi-offline-transcriber).

sudo apt-get install build-essential ffmpeg sox libatlas-dev python-pip

You need the OpenFST library, which Kaldi installs when you compile it. However, since we aren't (necessarily) installing Kaldi, I don't know how to make sure you have OpenFST. Try this, see if it works; if it doesn't, go here for as much information as I am aware of.

pip install pyfst

Next thing to do is cd into the directory where you're going to put the EESEN easy transcriber package, and clone the repository.

cd ~/tools
git clone https://github.com/srvk/srvk-eesen-offline-transcriber.git

cd into the repository you just cloned.

cd ~/tools/srvk-eesen-offline-transcriber

The documentation for the srvk-eesen-offline-transcriber is atrocious. You can tell the author. The next step should be to download acoustic and language models, before adding configuration options to the make file and building the transcriber (this is supposed to be based on alumae's Estonian version). Oh, well. Leeroy Jenkins!

make .init

Well, that did something.

cat > ./makefile.options [enter]
KALDI_ROOT=/home/$USER/tools/kaldi-master [CTRL-D]

Did nothing whatsoever. I think I'm just missing the language models, and I don't see anywhere to download them.

Ok, this looks like a dead end.

Attempt No. 2: Using the EESEN Virtual Machine

I'll try the repo I listed first, that has the Vagrant VM set up. Here goes.

sudo apt-get install virtualbox vagrant

Now clone the repository.

cd ~/tools
git clone http://github.com/srvk/eesen-transcriber

and cd into it.

cd ./eesen-transcriber

This is why the method is so easy - just run

vagrant up

from inside that folder, and everything is downloaded and installed automagically. Of course, it's downloading a whole preinstalled Ubuntu OS (Ubuntu 14.04 x86, by the look of the terminal output). Reminds me of some very hackish python solutions I came up with when I was first learning the language. I'm not a fan, but at least something is working. If I can track down the setup scripts it's running, I'll try and replicate the VM on my computer's installation.

Expect a lot of output. So far, vagrant has claimed 2 of my CPUs and has nearly filled my 8 GB of RAM. This is the only time I've ever seen my computer use swap space. Clever: I'm watching my system resources, and VirtualBox seems to be rotating which CPUs it uses. Probably a temperature thing.

Once that's done, you can run the example transcription with the following command.

vagrant ssh -c "vids2web.sh /vagrant/test2.mp3"

or you can ssh into the VM with this command

vagrant ssh

and then change directories to /home/vagrant/tools/eesen-offline-transcriber where there are readme instructions.

cd /home/vagrant/tools/eesen-offline-transcriber

You can run transcription on an arbitrary audio file (this build is designed to be friendly to a whole bunch of audio formats) with the following command. Note that speech2text.sh is located in the directory you just changed into above (eesen-offline-transcriber).

./speech2text.sh --txt ./build/output/test2.txt /vagrant/test2.mp3

Read speech2text.sh to see how it works; in this example, the output .txt file is located in ./build/output/ and the audio file is in /vagrant, the folder Vagrant shares with the host. Here's the output, so you can get an idea of the quality. This is an excerpt from King Solomon's Mines.

You're warriors much grow where we have resting on their spears introduce.
By law there was one war just after we destroyed the people that came down upon us but it was a civil war dog a dog.
How was that my lord became my half brother had a brother born at the same birth and have the same woman it is not our custom on hard to suffer twins to live the weak are always must died.
But the mother of looking hit away the people child which was born in the last for her heart and over it and that child is to all the king.

In contrast, here's the original:

"Your warriors must grow weary of resting on their spears, Infadoos."
"My lord, there was one war, just after we destroyed the people that came down upon us, but it was a civil war; dog ate dog."
"How was that?"
"My lord the king, my half-brother, had a brother born at the same birth, and of the same woman. It is not our custom, my lord, to suffer twins to live; the weaker must always die. But the mother of the king hid away the feebler child, which was born the last, for her heart yearned over it, and that child is Twala the king.

Unfortunately, the word error rate (WER) is too high to be particularly useful - 19.4%, or roughly one word in five. Try reading anything hair one fifth of the words are smog. It's not even that guessable. The other systems I was trying to make work reached error rates of 9-13%; but that was in Estonian.
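
For reference, WER is the word-level edit distance between the reference text and the transcript (substitutions + insertions + deletions), divided by the number of words in the reference. Here's a minimal sketch of that calculation in awk. The function name wer is mine, it is not how the figure above was produced, and it treats case and punctuation differences as errors, so normalize both texts first if that matters.

```shell
# wer REF HYP: print the word error rate as a percentage of REF's word count.
wer() {
  awk -v ref="$1" -v hyp="$2" '
    function min3(a, b, c,    m) {
      m = a
      if (b < m) m = b
      if (c < m) m = c
      return m
    }
    BEGIN {
      n = split(ref, r, " ")   # reference words
      k = split(hyp, h, " ")   # transcript (hypothesis) words
      if (n == 0) { print "reference is empty" > "/dev/stderr"; exit 1 }
      # d[i,j] = edits needed to turn the first i reference words
      # into the first j hypothesis words.
      for (i = 0; i <= n; i++) d[i, 0] = i
      for (j = 0; j <= k; j++) d[0, j] = j
      for (i = 1; i <= n; i++)
        for (j = 1; j <= k; j++) {
          cost = (r[i] == h[j]) ? 0 : 1
          d[i, j] = min3(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + cost)
        }
      printf "%.1f\n", 100 * d[n, k] / n
    }'
}
```

Usage: wer "$(cat original.txt)" "$(cat transcript.txt)", where both file names are placeholders. One caveat: awk interprets backslash escapes in -v values, so this is only suitable for ordinary prose.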

There's one more thing to try 'easily', which is to add my own language model - whatever that means.

Since the issue with the kaldi-offline-transcriber of part 1 was a lack of an English language model, maybe the next step could be creating / fitting an English model from existing material to work in that context. I have no idea how large that project would be. Another option is to look at what speechkitchen.org is doing about improving accuracy. I do know they took some shortcuts to get eesen up and running.

That's all for now.

Tuesday, July 26, 2016

Setting Up an Offline Transcriber Using Kaldi - Part 1: kaldi-offline-transcriber

This is being recorded as I go. I'll be editing it and changing it to reflect the best way to set it up. My goal is to be able to record a snippet of my voice and have it transcribed by a python script I'll write.

First Attempt: Kaldi-offline-transcriber

The first shot at completing this project is this GitHub project: github.com/alumae/kaldi-offline-transcriber. The only problem is that this transcriber, though excellent in itself, is built for the Estonian language. After I successfully get it working in Estonian, I'll see what I can do about English.

I should note that the instructions in the github readme are excellent. I've rewritten them here so I have easy access to them, and to make them a little better -- just made them cut-and-paste worthy, mostly.

Dependencies Installation

Not sure if this comes with Ubuntu 16.04 or if I'd already installed this for something else, but make sure this is installed.

sudo apt-get install build-essential

Also install these:

sudo apt-get install ffmpeg sox libatlas-dev

Install Kaldi. Don't have to worry about the online extensions, but it won't hurt to have them installed (an extra file compiled in a directory is the only difference).

Make sure Python and pip are installed.

sudo apt-get install python-pip

Install the package pyfst. One of its dependencies, OpenFst, was compiled and installed with Kaldi. To exploit that installation, use these install flags when you install pyfst:

CPPFLAGS="-I/home/$USER/tools/kaldi-master/tools/openfst/include -L/home/$USER/tools/kaldi-master/tools/openfst/lib" pip install pyfst

Turns out you also need Java installed, which isn't mentioned in the readme file.

sudo apt-get install default-jre

Installing the Main Package

Clone the repository.

cd ~/tools
git clone https://github.com/alumae/kaldi-offline-transcriber.git

This is Estonian, remember? Download and unpack the Estonian language models.

cd ~/tools/kaldi-offline-transcriber
curl http://bark.phon.ioc.ee/tanel/kaldi-offline-transcriber-data-2015-12-29.tgz | tar xvz

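Create a file called Makefile.options in the root of the transcriber directory. Inside, set KALDI_ROOT to the root of the Kaldi directory. cat stores what you type verbatim, so the shell never expands $USER here, and make would read it as the variable $(U) followed by SER; type your actual username instead. Use [enter] and [CTRL-D] to complete the command.

cat > ~/tools/kaldi-offline-transcriber/Makefile.options [enter]
KALDI_ROOT=/home/<your_username>/tools/kaldi-master [CTRL-D]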
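
One catch with typing that line into cat: the text is saved exactly as typed, and make then reads $USER as the variable $(U) followed by the letters SER, which is not your home folder. A here-document sidesteps this, because the shell fills in the path before the file is written. A minimal sketch (write_makefile_options is a made-up name, and the default path assumes Kaldi lives at ~/tools/kaldi-master as in this post):

```shell
# write_makefile_options DIR [KALDI_ROOT]
# Writes DIR/Makefile.options. KALDI_ROOT defaults to the location used in
# this post; pass a second argument if your checkout lives elsewhere.
write_makefile_options() {
  kaldi_root=${2:-$HOME/tools/kaldi-master}
  # Unquoted EOF: the shell substitutes $kaldi_root before writing the file.
  cat > "$1/Makefile.options" <<EOF
KALDI_ROOT=$kaldi_root
EOF
}
```

Usage: write_makefile_options ~/tools/kaldi-offline-transcriber, then cat the file to confirm it holds a full path with no dollar signs in it.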

Without this the compiler will throw an error wondering where the files it's trying to compile are located. Next, compile. This should take about 30 minutes, so use the option for multiple cores if possible.

cd ~/tools/kaldi-offline-transcriber/
make -j 4 .init

All compilations are stored under the kaldi-offline-transcriber/build/ directory. If you want to retry the compilation, just delete that directory and try again.

Example Usage

Using the make command directly

Stick a speech file under src-audio, then execute the command to create the transcription file.

cd src-audio
wget http://media.kuku.ee/intervjuu/intervjuu201306211256.mp3
cd ..
make build/output/intervjuu201306211256.txt
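
Going by that example, the make target is just the audio file's name with the extension swapped for .txt under build/output/. Assuming that pattern holds for every file, a sketch for transcribing everything in src-audio at once (both function names are made up):

```shell
# target_for AUDIO: print the make target for one file in src-audio/.
# Assumption: target = build/output/<audio name without extension>.txt
target_for() {
  name=$(basename "$1")
  printf 'build/output/%s.txt\n' "${name%.*}"
}

# transcribe_all: build a transcript for every file in src-audio/.
# Run it from the root of the transcriber directory.
transcribe_all() {
  for audio in src-audio/*; do
    [ -f "$audio" ] || continue
    make "$(target_for "$audio")"
  done
}
```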

To remove the intermediate files that are generated with the build command, run:

make .intervjuu201306211256.clean

Using the speech2text.sh script

There was a wrapper created to more easily transcribe audio files located in any directory. This is accessed with the following example command:

/home/$USER/tools/kaldi-offline-transcriber/speech2text.sh --trs result/test.txt audio/test.ogg

Tweaks

You can speed up transcription by setting another parameter in Makefile.options. Open the file and add the nthreads line:

nano ~/tools/kaldi-offline-transcriber/Makefile.options
nthreads = 4

Installing Kaldi and Kaldi-Gstreamer-server on Ubuntu 16.04

Notes on the process of installing Kaldi and Kaldi-GStreamer-server on Ubuntu 16.04 LTS. These were modified somewhat, since this is retroactively documented for my own benefit.

Kaldi is a state-of-the-art speech transcription engine, geared towards researchers and people who already know what they're doing. I'm just trying to set it up.

Decide where to put Kaldi and make that your new working directory.

mkdir -p ~/tools/
cd ~/tools

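Clone Kaldi from GitHub into a folder named kaldi-master, which is the name every later path in these notes assumes.

git clone https://github.com/kaldi-asr/kaldi.git kaldi-master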

cd into this new location.

cd ./kaldi-master/tools

Check for any dependencies. There were a few things I needed to add to my Ubuntu installation; don't remember what they were. Do whatever this output instructs.

extras/check_dependencies.sh

Now comes the actual installation.

make
cd ../src
./configure --shared
make depend
make

Run this next to install the online extensions.

make ext

Note: if you have more than one core in your machine, you can run make -j 4 to do make in parallel.
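
Rather than hard-coding 4, you can ask the machine how many cores it has. A small sketch, assuming nproc (GNU coreutils) or getconf is available; cores is a made-up name:

```shell
# cores: print the number of CPU cores available on this machine.
cores() {
  nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
}

# Usage: run any of the make steps with one job per core.
# make -j "$(cores)"
```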

Congratulations. Kaldi is installed. Installing Kaldi-GStreamer-server:

Before actually installing the kaldi-gstreamer-server, there's a few more things to do with kaldi itself.
Compile the Gstreamer plugin. First, install dependencies. Note they are older versions of the packages. Make sure you get the right version. On Ubuntu/Debian, run:

sudo apt-get install libgstreamer1.0-dev gstreamer1.0-plugins-good gstreamer1.0-tools gstreamer1.0-pulseaudio

Kaldi-Gstreamer-server requires the gstreamer plugin to be compiled (makes sense).

cd ~/tools/kaldi-master/src/gst-plugin/
make depend
make

This folder (gst-plugin) should now contain the file libgstkaldi.so which contains the Gstreamer plugin.

Now it's time to install the kaldi-gstreamer-server package. First, more dependencies.

sudo apt-get install python-pip python-yaml python-gi
pip install tornado ws4py==0.3.2 pyyaml

Note: You might need to run pip as sudo. e.g. sudo pip install tornado, above.
Note: I couldn't figure out which YAML package to install, so I used both. At least, they're both installed, and I don't remember which I actually needed. If I do this again, I'll try to remember to change this.

Clone kaldi-gstreamer-server from GitHub into your tools folder.

cd ~/tools/
git clone https://github.com/alumae/kaldi-gstreamer-server.git

This completes the installation.

cd into the main folder.

cd ./kaldi-gstreamer-server/

Open the README file, peruse until understood.

gedit ./README.md

Now you'll understand what I mean by server and worker. You can start the server with:

python kaldigstserver/master_server.py --port=8888

Before starting a worker, make sure that the GST plugin path includes the gstreamer plugin you compiled. If you put everything where I recommended, this is all you have to do:

export GST_PLUGIN_PATH=~/tools/kaldi-master/src/gst-plugin

Test to make sure it worked. If it fails, take a look at the README file again. This command should spit out a bunch of information. If it just says something like, 'not found', you did something wrong. I have no idea what.

gst-inspect-1.0 onlinegmmdecodefaster

Now you can start a worker.

python kaldigstserver/worker.py -u ws://localhost:8888/worker/ws/speech -c sample_worker.yaml

Example of how to use the server to transcribe an audio file:

python kaldigstserver/client.py -r 32000 ~/tools/kaldi-gstreamer-server/test/data/english_test.raw

You can also use a Deep Neural Network (DNN) to process the data, but at time of writing the readme walkthrough was giving me errors.

That's it!