This is part 2, where I realize that converting an offline transcriber to a different language on my own is a semi-herculean task. In the issue tracker for alumae's github project, there's a conversation revolving around an English conversion (
https://github.com/alumae/kaldi-offline-transcriber/issues/6). I used that to find someone else's project that converts the code to work in English.
There are three similar GitHub repos that I'm going to try this time: github.com/srvk/eesen-transcriber, github.com/srvk/eesen, and github.com/srvk/srvk-eesen-offline-transcriber. I think the second is the base package, so I'm going to give that a shot first. The third is a high-level abstraction that makes it easier to transcribe something, and the first appears to be a virtual machine that you can just download and run (more or less). The only issue with the VM is that you need to dedicate 8 GB of RAM...and I don't have that much to give away. So I'm going to try the others first. The use of Vagrant is unfamiliar, but I looked at the source website and the concept is pretty cool. It solves a lot of portability issues that I was planning on kicking down the road.
Actually, here's an explanation of several of the repos by the author:
We have changed to use other models by brute force; taking out
much of the Estonian and replacing with parts of Kaldi recipes that
do decoding (for example the tedlium recipe). It mostly requires
performing surgery on the Makefile. :)
In particular, for English we do only one pass of decoding, with only
one LM and decoding graph, and skip compounding.
I recently updated a system to use even more different decoding: neural net decoding based on Yajie Miao's EESEN (github.com/yajiemiao/eesen). You could find the resulting code on the SRVK repo here: github.com/srvk/eesen-transcriber.
I think they want people to use the VMs rather than run it straight on their computers. It's certainly more consistent, but also more resource-intensive. I can't do that right now.
Attempt No. 1: Installing eesen
I did run into one hiccup. The make command runs a script to check for dependencies, and that script looks for the program libtool. The only problem is, the Ubuntu libtool package doesn't quite work like that: the actual libtool binary is shipped separately, so you need to install libtool-bin if you want that dependency check to pass. See here for details. Upshot is, install libtool-bin.
sudo apt-get install libtool-bin
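If you want to confirm the fix took, a quick sanity check (my own addition, not part of the project's dependency script) is to make sure the libtool binary is now actually on your PATH:
which libtool        # should print /usr/bin/libtool once libtool-bin is installed
libtool --version    # should report a GNU libtool version
If both of those print something sensible, the dependency check should be happy.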
Start by downloading eesen into your ~/tools directory. Rename it to eesen-master for clarity's sake. When you compile, don't forget to run make -j 4 if you can (the -j 4 just spreads the build across four cores).
cd ~/tools/
git clone https://github.com/srvk/eesen.git
mv ./eesen ./eesen-master
cd ./eesen-master/tools
make
./install_atlas.sh
./install_srilm.sh
Great! Now EESEN is installed. I don't know of any checks to perform, aside from whether the make command completed successfully.
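If you want something slightly more concrete than "make finished", one low-tech check (my own habit, not anything from the EESEN docs) is to run make a second time; if the first build really completed, the second run should finish almost immediately with nothing left to do. You can also just eyeball what the tools build pulled in:
cd ~/tools/eesen-master/tools
make    # a near-instant, no-op second run means the earlier build completed
ls      # see what the tools build downloaded and compiled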
Installing srvk-eesen-offline-transcriber
This is the thing that should make using eesen easy(-er). Clone it and build it. Since it's a customized version of alumae's kaldi-offline-transcriber, it should install the same way.
Dependencies
Make sure you have this stuff (I assume, since it's required for kaldi-offline-transcriber).
sudo apt-get install build-essential ffmpeg sox libatlas-dev python-pip
You need the OpenFST library, which Kaldi installs when you compile it. However, since we aren't (necessarily) installing Kaldi, I don't know how to make sure you have OpenFST. Try this, see if it works; if it doesn't, go here for as much information as I am aware of.
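For what it's worth, here are two checks I'd reach for myself (guesses on my part, not anything the repo documents): see whether the OpenFst command-line tools are on your PATH, and look for the copy that the EESEN tools build may have compiled in its own tree.
command -v fstinfo                            # succeeds if OpenFst binaries are installed system-wide
ls ~/tools/eesen-master/tools | grep -i fst   # shows a locally built OpenFst directory, if the tools build made one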
Next thing to do is cd into the directory where you're going to put the EESEN easy transcriber package, and clone the repository.
cd ~/tools
git clone https://github.com/srvk/srvk-eesen-offline-transcriber.git
cd into the repository you just cloned.
cd ~/tools/srvk-eesen-offline-transcriber
The documentation for the srvk-eesen-offline-transcriber is atrocious. You can tell the author. The next step should be to download acoustic and language models, before adding configuration options to the Makefile and building the transcriber (this is supposed to be based on alumae's Estonian version). Oh, well. Leeroy Jenkins!
make .init
Well, that did something.
cat > ./makefile.options [enter]
KALDI_ROOT=/home/$USER/tools/kaldi-master [CTRL-D]
Did nothing whatsoever. I think I'm just missing the language models, and I don't see anywhere to download them.
Ok, this looks like a dead end.
Attempt No. 2: Using the EESEN Virtual Machine
I'll try the repo I listed first, the one with the Vagrant VM set up. Here goes.
sudo apt-get install virtualbox vagrant
Now clone the repository.
cd ~/tools
git clone http://github.com/srvk/eesen-transcriber
and cd into it.
cd ./eesen-transcriber
This is why the method is so easy - just run
vagrant up
from inside that folder, and everything is downloaded and installed automagically. Of course, it's downloading a whole preinstalled Ubuntu OS (Ubuntu 14.04 x86, by the look of the terminal output). Reminds me of some very hackish python solutions I came up with when I was first learning the language. I'm not a fan, but at least something is working. If I can track down the setup scripts it's running, I'll try and replicate the VM on my computer's installation.
Expect a lot of output. So far, Vagrant has claimed two of my CPUs and has nearly filled my 8 GB of RAM; this is the only time I've ever seen my computer use swap space. Clever: watching my system resources, VirtualBox seems to be switching off which CPUs are being used. Probably a temperature thing.
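If the VM is squeezing your machine too hard, VirtualBox lets you inspect and shrink what the box was given. These are generic VirtualBox/Vagrant commands rather than anything the eesen-transcriber docs describe, the VM name and the 4096/1 values are placeholders to adjust for your own hardware, and the box has to be halted before you change its allocation:
vagrant halt                                              # stop the VM first
VBoxManage list vms                                       # find the VM's name or UUID
VBoxManage modifyvm "<vm-name>" --memory 4096 --cpus 1    # hypothetical smaller allocation
vagrant up                                                # bring it back with the reduced footprint
The more permanent route is to edit the memory/CPU settings in the project's Vagrantfile, since Vagrant may re-apply its own values the next time the box comes up. Either way, let the first vagrant up finish before moving on.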
Once that's done, you can run the example transcription with the following command.
vagrant ssh -c "vids2web.sh /vagrant/test2.mp3"
or you can ssh into the VM with this command
vagrant ssh
and then change directories to /home/vagrant/tools/eesen-offline-transcriber where there are readme instructions.
cd /home/vagrant/tools/eesen-offline-transcriber
You can run transcription on an arbitrary audio file (this build is designed to be friendly to a whole bunch of audio formats) with the following command. Note that speech2text.sh is located in the directory you just changed into above (eesen-offline-transcriber).
./speech2text.sh --txt ./build/output/test2.txt /vagrant/test2.mp3
Read speech2text.sh to see how it works; in this example, the output .txt file is written to ./build/output/ inside the VM, and the audio file is in /vagrant (the folder Vagrant shares with the host by default). Here's the output, so you can get an idea of the quality. This is an excerpt from King Solomon's Mines.
You're warriors much grow where we have resting on their spears introduce.
By law there was one war just after we destroyed the people that came down upon us but it was a civil war dog a dog.
How was that my lord became my half brother had a brother born at the same birth and have the same woman it is not our custom on hard to suffer twins to live the weak are always must died.
But the mother of looking hit away the people child which was born in the last for her heart and over it and that child is to all the king.
In contrast, here's the original:
"Your warriors must grow weary of resting on their spears, Infadoos."
"My lord, there was one war, just after we destroyed the people that came down upon us, but it was a civil war; dog ate dog."
"How was that?"
"My lord the king, my half-brother, had a brother born at the same birth, and of the same woman. It is not our custom, my lord, to suffer twins to live; the weaker must always die. But the mother of the king hid away the feebler child, which was born the last, for her heart yearned over it, and that child is Twala the king.
Unfortunately, the word error rate (WER) is too high to be particularly useful - 19.4%, which means roughly one word in five is wrong. Try reading anything hair one fifth of the words are smog. It's not even that guessable. The other systems I was trying to make work reported word error rates of around 9-13%; but that was in Estonian.
There's one more thing to try 'easily', which is to add my own language model - whatever that means.
Since the issue with the kaldi-speech-transcriber of part 1 was a lack of an English language model, maybe the next step could be creating / fitting an English model from existing material to work in that context. I have no idea how large that project would be. Another option is to look at what speechkitchen.org is doing about improving accuracy. I do know they took some shortcuts to get eesen up and running.
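For scale, here's a rough sketch of what "adding my own language model" would involve using the SRILM toolkit that install_srilm.sh set up earlier. The corpus and file names below are made up for illustration, and wiring the resulting model into the transcriber's decoding graph is a whole separate job that I haven't worked out:
# train a 3-gram language model from a plain-text corpus (one sentence per line)
ngram-count -text english_corpus.txt -order 3 -kndiscount -interpolate -lm english.arpa
# sanity check: score some held-out text with the new model (lower perplexity is better)
ngram -lm english.arpa -ppl heldout.txt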
That's all for now.