Showing posts with label offline-transcription. Show all posts

Tuesday, August 9, 2016

Using PocketSphinx within Python Code

Here's the source for what I've been working on.

Looks like my installation records will have to be updated to account for a different installation source, and maybe a different version of the source code.

Ok, here's the process so far. Install sphinxbase and pocketsphinx from GitHub. This means using the bleeding-edge versions rather than the tried-and-true alpha5 versions I talked about in previous posts, but they just seem to work better. Once this is all figured out, I'll go back and clean those posts up.
cd ~/tools
git clone https://github.com/cmusphinx/sphinxbase.git
cd ./sphinxbase
./autogen.sh
./configure
make
make check
sudo make install

cd ~/tools
git clone https://github.com/cmusphinx/pocketsphinx.git
cd ./pocketsphinx
./autogen.sh
./configure
make clean all
make check
sudo make install
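Before going further, it can help to confirm that Python can actually see the new bindings. The sketch below just tries each import; the module paths are the ones the kws_test.py script uses further down, and `check_modules` is a hypothetical helper name, not part of pocketsphinx.

```python
import importlib


def check_modules(names):
    """Return {module_name: True/False} depending on whether each module imports."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:  # also covers ModuleNotFoundError and missing shared libraries
            results[name] = False
    return results


if __name__ == "__main__":
    # Module paths as imported by the keyword-spotting script in this post.
    status = check_modules(["pocketsphinx.pocketsphinx", "sphinxbase.sphinxbase"])
    for name, ok in sorted(status.items()):
        print("%s: %s" % (name, "found" if ok else "MISSING"))
```

If either shows up as missing, the Python interpreter you're running may not be the one the install step targeted.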
Now look inside the pocketsphinx directory:
cd ~/tools/pocketsphinx/swig/python/test
There's a whole bunch of test scripts that walk you through using pocketsphinx from Python. It's basically done for you. Check the one called kws_test.py: that's the one that waits to hear a keyword, runs a command when it does, then resumes listening. Perfect!

I'm going to assume that you've already created your own voice model based on the other posts in this blog, and that you've got a directory dedicated to command and control experiments.

If that's not true, you can still follow along: edit the script in place instead of copying it, and just make a backup first. The only practical difference is that detection will be less accurate. Skip ahead to where I've pasted my copy of the Python script; the only thing you need to change is reading from the microphone rather than from an audio file, so make your script match the lines below. Then you're done. The rest of this tutorial is for those who have already created their own voice model; see my other posts for how to do that.
# Open file to read the data
# stream = open(os.path.join(datadir, "test-file.wav"), "rb")

# Alternatively you can read from microphone
import pyaudio
 
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=1024)
stream.start_stream()
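It's worth knowing how much audio each read hands to the decoder. With the settings above (16 kHz, mono, 16-bit samples from `paInt16`), `stream.read(1024)` returns 1024 frames. The arithmetic below is a small sketch; `chunk_stats` is just an illustrative helper, and it assumes your acoustic model expects 16 kHz audio like the stream is configured for.

```python
# Stream settings from the p.open() call above.
RATE = 16000            # samples per second
CHANNELS = 1            # mono
SAMPLE_WIDTH = 2        # bytes per sample for pyaudio.paInt16
FRAMES_PER_BUFFER = 1024


def chunk_stats(rate, channels, sample_width, frames):
    """Return (bytes per chunk, chunk duration in milliseconds)."""
    nbytes = frames * channels * sample_width
    millis = 1000.0 * frames / rate
    return nbytes, millis


if __name__ == "__main__":
    nbytes, millis = chunk_stats(RATE, CHANNELS, SAMPLE_WIDTH, FRAMES_PER_BUFFER)
    print("%d bytes per read, %.1f ms of audio" % (nbytes, millis))
```

So each pass through the listening loop feeds the decoder 2048 bytes, about 64 ms of sound.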
Ok.  For the rest of us, let's get back to messing with this script.  While still in the test directory,
mkdir ~/tools/cc_ex
cp ./kws_test.py ~/tools/cc_ex/kws_test.py
cd ~/tools/cc_ex/
gedit kws_test.py
There are a few changes to make in the Python script. Make sure the model directory has been adjusted. Also, by default the script looks for the keyword in a .raw audio file: uncomment and comment the relevant lines so the script uses pyaudio to record from the microphone instead. The full text of my version of the script is below.

Note that the keyphrase it's looking for is the word 'and'.  Pretty simple, and very likely to have been covered a lot in the voice training.

Note also that there's a weird quirk in the detection - you have to speak quickly.  I tried for a long time making long, sonorous 'aaaannnnddd' noises at my microphone, and it didn't pick up.  Finally gave a short, staccato 'and' - it detected me right away.  Did it five more times, and it picked me up each time.  I don't see a way to get around that - I think it's built into the buffer, so it won't even hear the whole thing otherwise.  Or maybe I just said 'and' in the training really fast each time, though I don't think that's likely.
#!/usr/bin/python

import sys, os
from pocketsphinx.pocketsphinx import *
from sphinxbase.sphinxbase import *


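# expanduser turns "~" into your home directory; os.path.join and the decoder won't do it for you
modeldir = os.path.expanduser("~/tools/train-voice-data-pocketsphinx")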

# Create a decoder with certain model
config = Decoder.default_config()
config.set_string('-hmm', os.path.join(modeldir, 'neo-en/en-us'))
config.set_string('-dict', os.path.join(modeldir, 'neo-en/cmudict-en-us.dict'))
config.set_string('-keyphrase', 'and')
config.set_float('-kws_threshold', 1e+1)
#config.set_string('-logfn', '/dev/null')


# Open file to read the data
# stream = open(os.path.join(datadir, "test-file.wav"), "rb")

# Alternatively you can read from microphone
import pyaudio
 
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=1024)
stream.start_stream()

# Process audio chunk by chunk. On keyphrase detected perform action and restart search
decoder = Decoder(config)
decoder.start_utt()
while True:
    buf = stream.read(1024)
    if buf:
        decoder.process_raw(buf, False, False)
    else:
        break
    if decoder.hyp() is not None:
        print ([(seg.word, seg.prob, seg.start_frame, seg.end_frame) for seg in decoder.seg()])
        print ("Detected keyphrase, restarting search")
        decoder.end_utt()
        decoder.start_utt()
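If you want to test the keyword loop without a microphone, you can feed it chunks from a WAV file instead. Unlike reading the file with open(), Python's standard wave module skips the WAV header and hands back just the PCM samples. This is a minimal sketch; `wav_chunks` is a hypothetical helper, and the format check simply mirrors the 16 kHz, 16-bit, mono settings used for the microphone.

```python
import wave


def wav_chunks(path, frames_per_chunk=1024):
    """Yield raw PCM byte chunks from a 16 kHz, 16-bit, mono WAV file."""
    wf = wave.open(path, "rb")
    try:
        # Same format the pyaudio stream in the script records.
        if (wf.getnchannels(), wf.getsampwidth(), wf.getframerate()) != (1, 2, 16000):
            raise ValueError("expected 16 kHz, 16-bit, mono audio")
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:  # end of file
                break
            yield data
    finally:
        wf.close()
```

In the script, the idea would be to replace the `stream.read(1024)` loop with `for buf in wav_chunks("test-file.wav"):` and call `decoder.process_raw(buf, False, False)` inside it, keeping the same keyphrase check.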
Anyway, that's all.  If it doesn't work, don't blame me.  That's as dead simple as I know how to make it.

Wednesday, July 27, 2016

Setting Up an Offline Transcriber Using Kaldi - Part 2: EESEN

This is part 2, where I realize that converting an offline transcriber to a different language on my own is a semi-herculean task. In the issue tracker for alumae's GitHub project, there's a conversation revolving around an English conversion (https://github.com/alumae/kaldi-offline-transcriber/issues/6). I used that to find someone else's project that converts the code to work in English.

There are three similar GitHub repositories I'm going to try this time: github.com/srvk/eesen-transcriber, github.com/srvk/eesen, and github.com/srvk/srvk-eesen-offline-transcriber. I think the second is the base package, so I'm going to give that a shot first. The third is a high-level abstraction that makes it easier to transcribe something, and the first appears to be a virtual machine that you can just download and run (more or less). The only issue with the VM is that you need to dedicate 8GB of RAM, and I don't have that much to give away. So I'm going to try the others first. Vagrant is new to me, but I looked at its website and the concept is pretty cool. It solves a lot of portability issues that I was planning on kicking down the road.

Actually, here's the author's explanation of several of the repos:
We have changed to use other models by brute force; taking out
much of the Estonian and replacing with parts of Kaldi recipes that
do decoding (for example the tedlium recipe). It mostly requires
performing surgery on the Makefile. :)

In particular, for English we do only one pass of decoding, with only
one LM and decoding graph, and skip compounding.
I recently updated a system to use even more different decoding: neural net decoding based on Yajie Miao's EESEN (github.com/yajiemiao/eesen).  You could find the resulting code on the SRVK repo here: github.com/srvk/eesen-transcriber.
I think they want people to use the VMs rather than run it straight on their computers.  It's certainly more consistent, but also more resource-intensive.  I can't do that right now.

Attempt No. 1: Installing eesen

I did run into one hiccup.  The make command includes running a script to check for dependencies, which looks for the program libtool.  It uses the command 
which libtool
to do this. Only problem is, on Ubuntu the libtool package doesn't include a libtool executable, so that check comes up empty. You actually need to install libtool-bin if you want that dependency check to work. See here for details. Upshot is, install libtool-bin.
sudo apt-get install libtool-bin
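The dependency check is essentially a `which` lookup, so you can run the same test yourself before kicking off a long build. This sketch assumes Python 3.3+ for `shutil.which`; `missing_tools` is a hypothetical helper, and every name in the list besides libtool is only an example of a build tool you might check.

```python
import shutil


def missing_tools(names):
    """Return the names that can't be found on PATH (what `which` would fail on)."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # libtool is the one that tripped the EESEN check; the others are illustrative.
    missing = missing_tools(["libtool", "make", "g++", "git"])
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All tools found")
```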
Start by downloading eesen into your ~/tools directory. Rename it to eesen-master for clarity's sake. When you compile, don't forget to run make -j 4 if you can.
cd ~/tools/
git clone https://github.com/srvk/eesen.git
mv ./eesen ./eesen-master
cd ./eesen-master/tools
make
./install_atlas.sh
./install_srilm.sh
Great!  Now EESEN is installed.  I don't know of any checks to perform, aside from whether the make command completed successfully.

Installing srvk-eesen-offline-transcriber

This is the thing that should make using eesen easy(-er).  Clone it and build it.  Since it's a customized version of alumae's kaldi-offline-transcriber, it should install the same way.

Dependencies

Make sure you have these packages installed. I'm assuming they're needed, since kaldi-offline-transcriber requires them.
sudo apt-get install build-essential ffmpeg sox libatlas-dev python-pip
You need the OpenFST library, which Kaldi installs when you compile it.  However, since we aren't (necessarily) installing Kaldi, I don't know how to make sure you have OpenFST.  Try this, see if it works; if it doesn't, go here for as much information as I am aware of.
pip install pyfst
Next, cd into the directory where you're going to put the EESEN easy transcriber package, and clone the repository.
cd ~/tools
git clone https://github.com/srvk/srvk-eesen-offline-transcriber.git
cd into the repository you just cloned.
cd ~/tools/srvk-eesen-offline-transcriber
The documentation for the srvk-eesen-offline-transcriber is atrocious.  You can tell the author.  The next step should be to download acoustic and language models, before adding configuration options to the make file and building the transcriber (this is supposed to be based on alumae's Estonian version).  Oh, well.  Leeroy Jenkins!
make .init
Well, that did something.
cat > ./makefile.options [enter]
KALDI_ROOT=/home/$USER/tools/kaldi-master [CTRL-D]
Did nothing whatsoever.  I think I'm just missing the language models, and I don't see anywhere to download them.

Ok, this looks like a dead end.

Attempt No. 2: Using the EESEN Virtual Machine 

I'll try the repo I listed first, the one with the Vagrant VM set up. Here goes.
sudo apt-get install virtualbox vagrant
Now clone the repository.
cd ~/tools
git clone http://github.com/srvk/eesen-transcriber
and cd into it.
cd ./eesen-transcriber
This is why the method is so easy - just run
vagrant up
from inside that folder, and everything is downloaded and installed automagically.  Of course, it's downloading a whole preinstalled Ubuntu OS (Ubuntu 14.04 x86, by the look of the terminal output).  Reminds me of some very hackish python solutions I came up with when I was first learning the language.  I'm not a fan, but at least something is working.  If I can track down the setup scripts it's running, I'll try and replicate the VM on my computer's installation.

Expect a lot of output. So far, Vagrant has claimed 2 of my CPUs and has nearly filled my 8 GB of RAM; this is the only time I've ever seen my computer use swap space. Interestingly, while I watch my system resources, VirtualBox seems to keep switching which CPUs it's using. Probably a temperature thing.

Once that's done, you can run the example transcription with the following command.
vagrant ssh -c "vids2web.sh /vagrant/test2.mp3"
or you can ssh into the VM with this command
vagrant ssh
and then change directories to /home/vagrant/tools/eesen-offline-transcriber where there are readme instructions.
cd /home/vagrant/tools/eesen-offline-transcriber
You can run transcription on an arbitrary audio file (this build is designed to be friendly to a whole bunch of audio formats) with the following command.  Note that speech2text.sh is located in the directory you just changed into above (eesen-offline-transcriber).
./speech2text.sh --txt ./build/output/test2.txt /vagrant/test2.mp3
Read speech2text.sh to see how it works. In this example, the output .txt file goes to ./build/output/ and the audio file is in /vagrant, the folder Vagrant shares with the cloned repository on the host. Here's the output, so you can get an idea of the quality. This is an excerpt from King Solomon's Mines.
You're warriors much grow where we have resting on their spears introduce.
By law there was one war just after we destroyed the people that came down upon us but it was a civil war dog a dog.
How was that my lord became my half brother had a brother born at the same birth and have the same woman it is not our custom on hard to suffer twins to live the weak are always must died.
But the mother of looking hit away the people child which was born in the last for her heart and over it and that child is to all the king.
In contrast, here's the original:
"Your warriors must grow weary of resting on their spears, Infadoos."
"My lord, there was one war, just after we destroyed the people that came down upon us, but it was a civil war; dog ate dog."
"How was that?"
"My lord the king, my half-brother, had a brother born at the same birth, and of the same woman. It is not our custom, my lord, to suffer twins to live; the weaker must always die. But the mother of the king hid away the feebler child, which was born the last, for her heart yearned over it, and that child is Twala the king.
Unfortunately, the word error rate (WER) is too high to be particularly useful: 19.4%, or roughly one word in five. Try reading anything where one fifth of the words are wrong. It's not even that guessable. The other systems I was trying to make work reached word error rates of 9-13%, but that was in Estonian.
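For anyone wanting to score their own transcripts, WER is usually computed as (substitutions + deletions + insertions) divided by the number of words in the reference, which is a word-level edit distance. The sketch below is a generic implementation of that definition, not the scoring tool EESEN or Kaldi uses; its normalization (lowercasing and keeping only letters and apostrophes) is my own simplifying assumption, and real scoring scripts may normalize differently.

```python
import re


def normalize(text):
    """Lowercase and keep only word characters (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / reference length, via Levenshtein distance."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur[j] = min(prev[j] + 1,         # deletion (reference word missing)
                         cur[j - 1] + 1,      # insertion (extra hypothesis word)
                         prev[j - 1] + cost)  # substitution or match
        prev = cur
    return prev[-1] / float(len(ref))
```

Passing the original passage as `reference` and the transcriber output as `hypothesis` gives a number comparable in spirit to the figure above, though not necessarily identical, since it depends on how the text is normalized.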

There's one more thing to try 'easily', which is to add my own language model - whatever that means.

Since the issue with the kaldi-offline-transcriber from part 1 was the lack of an English language model, maybe the next step could be creating or fitting an English model from existing material to work in that context. I have no idea how large that project would be. Another option is to look at what speechkitchen.org is doing to improve accuracy. I do know they took some shortcuts to get EESEN up and running.

That's all for now.

Tuesday, July 26, 2016

Setting Up an Offline Transcriber Using Kaldi - Part 1: kaldi-offline-transcriber

This is being recorded as I go.  I'll be editing it and changing it to reflect the best way to set it up.  My goal is to be able to record a snippet of my voice and have it transcribed by a python script I'll write.

First Attempt: Kaldi-offline-transcriber

The first shot at completing this project is this GitHub repo: github.com/alumae/kaldi-offline-transcriber. The only problem is that this transcriber, though excellent in itself, is built for the Estonian language. After I successfully get it working in Estonian, I'll see what I can do about English.

I should note that the instructions in the github readme are excellent.  I've rewritten them here so I have easy access to them, and to make them a little better -- just made them cut-and-paste worthy, mostly.

Dependencies Installation

Not sure if this comes with Ubuntu 16.04 or if I'd already installed this for something else, but make sure this is installed.  
sudo apt-get install build-essential
Also install these:
sudo apt-get install ffmpeg sox libatlas-dev 
Install Kaldi.  Don't have to worry about the online extensions, but it won't hurt to have them installed (an extra file compiled in a directory is the only difference).

Make sure Python and pip are installed.
sudo apt-get install python-pip
Install the package pyfst.  One of its dependencies, OpenFst, was compiled and installed with Kaldi.  To exploit that installation, use these install flags when you install pyfst:
CPPFLAGS="-I/home/$USER/tools/kaldi-master/tools/openfst/include -L/home/$USER/tools/kaldi-master/tools/openfst/lib" pip install pyfst
Turns out you also need Java installed, which isn't mentioned in the readme file.  
sudo apt-get install default-jre

Installing the Main Package

Clone the repository.
cd ~/tools
git clone https://github.com/alumae/kaldi-offline-transcriber.git
This is Estonian, remember?  Download and unpack the Estonian language models.
cd ~/tools/kaldi-offline-transcriber
curl http://bark.phon.ioc.ee/tanel/kaldi-offline-transcriber-data-2015-12-29.tgz | tar xvz 
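Create a file in the root of the transcriber directory called Makefile.options. Inside, set the KALDI_ROOT option to the root of the Kaldi directory. Use [enter] and [CTRL-D] to complete the command. Note the parentheses in $(USER): make reads a bare $USER as the one-letter variable $U followed by "SER", so either write it this way or type out your username.
cat > ~/tools/kaldi-offline-transcriber/Makefile.options [enter]
KALDI_ROOT=/home/$(USER)/tools/kaldi-master [CTRL-D]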
Without this the compiler will throw an error wondering where the files it's trying to compile are located.  Next, compile.  This should take about 30 minutes, so use the option for multiple cores if possible.
cd ~/tools/kaldi-offline-transcriber/
make -j 4 .init
All compilations are stored under the kaldi-offline-transcriber/build/ directory.  If you want to retry the compilation, just delete that directory and try again.

Example Usage

Using the make command directly

Stick a speech file under src-audio, then execute the command to create the transcription file.  
cd src-audio
wget http://media.kuku.ee/intervjuu/intervjuu201306211256.mp3
cd ..
make build/output/intervjuu201306211256.txt
To remove the intermediate files that are generated with the build command, run:
make .intervjuu201306211256.clean
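If you're going to drive this from a Python script (which is the end goal here), the make targets follow from the audio file's name. The sketch assumes the pattern shown above for intervjuu201306211256.mp3 (build/output/NAME.txt to transcribe, .NAME.clean to clean up) holds for other file names; `make_targets` and `transcribe_with_make` are hypothetical helpers. It reads the transcript before running the clean target, just in case cleaning touches more than the intermediate files.

```python
import os
import subprocess


def make_targets(audio_path):
    """Map an audio file name to (transcription target, clean target)."""
    base = os.path.splitext(os.path.basename(audio_path))[0]
    return "build/output/%s.txt" % base, ".%s.clean" % base


def transcribe_with_make(audio_path, transcriber_dir):
    """Build the .txt target inside transcriber_dir, read it, then clean up.

    Assumes the audio file has already been copied into transcriber_dir/src-audio.
    """
    txt_target, clean_target = make_targets(audio_path)
    subprocess.check_call(["make", txt_target], cwd=transcriber_dir)
    with open(os.path.join(transcriber_dir, txt_target)) as f:
        text = f.read()
    subprocess.check_call(["make", clean_target], cwd=transcriber_dir)
    return text
```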

Using the speech2text.sh script

There is also a wrapper script for transcribing audio files located in any directory. Here's an example command that writes a plain-text transcript:
/home/$USER/tools/kaldi-offline-transcriber/speech2text.sh --txt result/test.txt audio/test.ogg
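Since the goal is to have a Python script do the transcribing, here's a minimal sketch that shells out to speech2text.sh and reads back the result. It assumes the `--txt OUTPUT AUDIO` argument order seen in the EESEN VM example (`./speech2text.sh --txt ./build/output/test2.txt /vagrant/test2.mp3`) and the install path used in this post; `speech2text_cmd` and `transcribe` are hypothetical names.

```python
import os
import subprocess

# Assumed location, matching the clone path used in this post.
SPEECH2TEXT = os.path.expanduser("~/tools/kaldi-offline-transcriber/speech2text.sh")


def speech2text_cmd(audio_path, out_txt, script=SPEECH2TEXT):
    """Build the argument list for a plain-text (--txt) transcript."""
    return [script, "--txt", out_txt, audio_path]


def transcribe(audio_path, out_txt, script=SPEECH2TEXT):
    """Run speech2text.sh on audio_path and return the transcript text."""
    out_dir = os.path.dirname(out_txt)
    if out_dir and not os.path.isdir(out_dir):
        os.makedirs(out_dir)  # create the output folder in case the script doesn't
    subprocess.check_call(speech2text_cmd(audio_path, out_txt, script))
    with open(out_txt) as f:
        return f.read()


if __name__ == "__main__":
    print(transcribe("audio/test.ogg", "result/test.txt"))
```

Passing an argument list (rather than one shell string) avoids quoting problems with file names that contain spaces.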

Tweaks

You can speed up transcription by setting another parameter in Makefile.options.
nano ~/tools/kaldi-offline-transcriber/Makefile.options
nthreads = 4
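If you end up tweaking Makefile.options repeatedly, a small helper can set one option without duplicating lines. This is a sketch of a simple "replace the existing assignment or append a new one" edit; `set_make_option` is a hypothetical helper, and it only recognizes plain `KEY = value` / `KEY=value` lines, not make's other assignment forms.

```python
import os
import re


def set_make_option(path, key, value):
    """Set `key = value` in a Makefile.options-style file.

    Replaces the first existing `key =` / `key=` line, or appends one if none exists.
    """
    line = "%s = %s\n" % (key, value)
    pattern = re.compile(r"^\s*%s\s*=" % re.escape(key))
    lines = []
    if os.path.exists(path):
        with open(path) as f:
            lines = f.readlines()
    for i, existing in enumerate(lines):
        if pattern.match(existing):
            lines[i] = line
            break
    else:
        # No existing assignment: make sure the last line ends cleanly, then append.
        if lines and not lines[-1].endswith("\n"):
            lines[-1] += "\n"
        lines.append(line)
    with open(path, "w") as f:
        f.writelines(lines)


if __name__ == "__main__":
    opts = os.path.expanduser("~/tools/kaldi-offline-transcriber/Makefile.options")
    set_make_option(opts, "KALDI_ROOT", os.path.expanduser("~/tools/kaldi-master"))
    set_make_option(opts, "nthreads", 4)
```

Writing KALDI_ROOT this way stores an already-expanded absolute path, so nothing is left for make to interpret.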



Final post here

I'm switching over to GitHub Pages. The continuation of this blog (with archives included) is at umhau.github.io. By the way, the ...