Tuesday, July 26, 2016

Setting Up an Offline Transcriber Using Kaldi - Part 1: kaldi-offline-transcriber

I'm recording this as I go, and I'll keep editing it to reflect the best setup I find.  My goal is to record a snippet of my voice and have it transcribed by a Python script I'll write.

First Attempt: Kaldi-offline-transcriber

My first shot at this project uses this GitHub repository: github.com/alumae/kaldi-offline-transcriber. The only problem is that this transcriber, though excellent in itself, is built for the Estonian language.  Once I get it working in Estonian, I'll see what I can do about English.

I should note that the instructions in the GitHub README are excellent.  I've rewritten them here so I have easy access to them, and to make them a little better, mostly by making them cut-and-paste worthy.

Dependencies Installation

I'm not sure whether build-essential ships with Ubuntu 16.04 or whether I'd already installed it for something else, so make sure it's installed.
sudo apt-get install build-essential
Also install these:
sudo apt-get install ffmpeg sox libatlas-dev 
Install Kaldi.  You don't have to worry about the online extensions, but it won't hurt to have them installed (an extra file compiled in a directory is the only difference).  The commands below assume Kaldi lives in ~/tools/kaldi-master.

Make sure Python and pip are installed.
sudo apt-get install python-pip
Install the package pyfst.  One of its dependencies, OpenFst, was compiled and installed with Kaldi.  To exploit that installation, use these install flags when you install pyfst:
CPPFLAGS="-I/home/$USER/tools/kaldi-master/tools/openfst/include -L/home/$USER/tools/kaldi-master/tools/openfst/lib" pip install pyfst
It turns out you also need Java installed, which the README doesn't mention.
sudo apt-get install default-jre
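Before moving on, it can help to confirm that everything above actually landed on the PATH.  The snippet below is a minimal sketch, not something from the README: missing_tools is a helper name I made up, and the list of commands is my guess at what the packages above provide (gcc and make from build-essential, pip from python-pip, java from default-jre).  It prints the name of each command it can't find and prints nothing if all are present.

```shell
# Hypothetical helper (not part of kaldi-offline-transcriber):
# print the name of every given command that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v returns non-zero when the tool cannot be found
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed list of commands provided by the packages installed above.
missing_tools gcc make ffmpeg sox python pip java
```

Anything it prints still needs installing.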

Installing the Main Package

Clone the repository.
cd ~/tools
git clone https://github.com/alumae/kaldi-offline-transcriber.git
This is Estonian, remember?  Download and unpack the Estonian language models.
cd ~/tools/kaldi-offline-transcriber
curl http://bark.phon.ioc.ee/tanel/kaldi-offline-transcriber-data-2015-12-29.tgz | tar xvz 
Create a file called Makefile.options in the root of the transcriber directory, and set KALDI_ROOT in it to the root of the Kaldi directory.  Let the shell expand $USER while the file is written: a literal $USER inside a makefile is read by make as the variable $U followed by the text SER, which gives the wrong path.
echo "KALDI_ROOT=/home/$USER/tools/kaldi-master" > ~/tools/kaldi-offline-transcriber/Makefile.options
Without this, the build fails because it can't locate the files it needs.  Next, compile.  This should take about 30 minutes, so use make's -j option to run on multiple cores if possible.
cd ~/tools/kaldi-offline-transcriber/
make -j 4 .init
Everything that gets compiled is stored under the kaldi-offline-transcriber/build/ directory.  If you want to retry the compilation, just delete that directory and run make again.
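If the compile dies right away, the first thing I'd check is what Makefile.options really contains.  This is a rough sketch under my own assumptions: kaldi_root_from and check_kaldi_root are hypothetical helper names, and the sed pattern assumes a plain KALDI_ROOT=... or KALDI_ROOT = ... line.  It reads the value as text and does not expand make variables.

```shell
# Hypothetical helpers (not part of kaldi-offline-transcriber).

# Print the value of the last KALDI_ROOT assignment in the given file.
kaldi_root_from() {
    sed -n 's/^[[:space:]]*KALDI_ROOT[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

# Report whether that value names an existing directory.
check_kaldi_root() {
    root=$(kaldi_root_from "$1")
    if [ -d "$root" ]; then
        echo "KALDI_ROOT ok: $root"
    else
        echo "KALDI_ROOT missing or wrong: '$root'" >&2
        return 1
    fi
}

# Usage: check_kaldi_root ~/tools/kaldi-offline-transcriber/Makefile.options
```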

Example Usage

Using the make command directly

Stick a speech file under src-audio, then run make on the matching target under build/output (same file name, with a .txt extension) to create the transcription file.
cd src-audio
wget http://media.kuku.ee/intervjuu/intervjuu201306211256.mp3
cd ..
make build/output/intervjuu201306211256.txt
To remove the intermediate files generated by the build, run:
make .intervjuu201306211256.clean
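With more than one recording, the same steps can be looped.  The sketch below assumes the pattern from the example holds in general, that is, src-audio/NAME.ext maps to the target build/output/NAME.txt; I've only seen that with the one mp3 above.  target_for and transcribe_all are names I invented, and transcribe_all has to be run from the transcriber's root directory.

```shell
# Hypothetical helpers (not part of kaldi-offline-transcriber).

# Map an audio file to its make target, assuming
# src-audio/NAME.ext -> build/output/NAME.txt as in the example above.
target_for() {
    name=$(basename "$1")
    printf 'build/output/%s.txt\n' "${name%.*}"
}

# Run make once for every file found in src-audio.
transcribe_all() {
    for audio in src-audio/*; do
        [ -f "$audio" ] || continue   # nothing matched the glob
        make "$(target_for "$audio")"
    done
}

# Usage: cd ~/tools/kaldi-offline-transcriber && transcribe_all
```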

Using the speech2text.sh script

The repository also includes a wrapper script that makes it easier to transcribe audio files located in any directory.  For example:
/home/$USER/tools/kaldi-offline-transcriber/speech2text.sh --trs result/test.txt audio/test.ogg
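The wrapper also lends itself to a loop over a whole directory.  This is only a sketch: transcribe_dir and the SPEECH2TEXT variable are my own names, the call uses exactly the form shown above (--trs, output file, input file), and naming each output after its input file is my choice rather than anything the script requires.

```shell
# Path to the wrapper; set SPEECH2TEXT beforehand to point somewhere else.
SPEECH2TEXT="${SPEECH2TEXT:-$HOME/tools/kaldi-offline-transcriber/speech2text.sh}"

# Hypothetical helper: transcribe every file in in_dir, writing one
# result per input file into out_dir.
transcribe_dir() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for audio in "$in_dir"/*; do
        [ -f "$audio" ] || continue   # nothing matched the glob
        name=$(basename "$audio")
        "$SPEECH2TEXT" --trs "$out_dir/${name%.*}.txt" "$audio"
    done
}

# Usage: transcribe_dir audio result
```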

Tweaks

You can speed up transcription by setting another parameter in Makefile.options.  Open the file and add the nthreads line:
nano ~/tools/kaldi-offline-transcriber/Makefile.options
nthreads = 4
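To skip the editor, the same setting can be written from the shell.  A minimal sketch, assuming one nthreads line in the file is what's wanted: set_nthreads is a hypothetical helper that drops any existing nthreads line and appends a fresh one, leaving the other lines alone.  Using the core count from nproc as the value is my own guess at a sensible default.

```shell
# Hypothetical helper (not part of kaldi-offline-transcriber):
# set_nthreads FILE N rewrites FILE so it holds exactly one nthreads line.
set_nthreads() {
    file=$1
    n=$2
    if [ -f "$file" ]; then
        # Keep everything except an existing nthreads line.
        # grep exits non-zero when it keeps no lines, hence "|| true".
        grep -v '^[[:space:]]*nthreads[[:space:]]*=' "$file" > "$file.tmp" || true
    else
        : > "$file.tmp"
    fi
    echo "nthreads = $n" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Usage: set_nthreads ~/tools/kaldi-offline-transcriber/Makefile.options "$(nproc)"
```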


