Showing posts with label torch. Show all posts

Wednesday, February 8, 2017

installing word-rnn on ubuntu server 16.04.

to install the pcre luarock:
sudo /mirror/$USER/torch/install/bin/luarocks install lrexlib-pcre PCRE_DIR=/usr/ PCRE_LIBDIR=/lib/x86_64-linux-gnu/
Because the PCRE files end up in some very weird places. 
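Since the right values for PCRE_DIR and PCRE_LIBDIR depend on where your distro put the files, here's a minimal sketch for hunting them down before running luarocks. The name locate_pcre is my own invention (it's not part of luarocks or lrexlib), and all it does is wrap find, so treat it as a starting point rather than the official way.

```shell
# locate_pcre is a hypothetical helper (not part of luarocks or lrexlib).
# It lists the PCRE header and shared libraries found under a directory.
locate_pcre() {
    root="${1:-/usr}"
    find "$root" \( -name 'pcre.h' -o -name 'libpcre.so*' \) 2>/dev/null | sort
}

# Usage on the server (not run here):
#   locate_pcre /usr
#   locate_pcre /lib
```

The directories that show up in the output are the candidates to try for the two variables in the luarocks command.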


Tuesday, January 24, 2017

RNN with Torch and MPI

This is being installed on machines running Ubuntu Server 16.04.1 LTS.  Does not work on Linux Mint (the torch install script doesn't detect that OS).

Most of the following installations have to be performed on each computer.  I didn't re-download everything, since it was going to be put in the same place, but I did cd in and re-run the installation procedure.  That ensured the necessary files were added to all the right places elsewhere in the system.

Here, I'm walking through the process of running Torch on a cluster.  CPUs, not GPUs.  The performance benefit comes from the slave nodes being allowed greater latitude in searching for local optima to 'solve' the neural net.  Every so often, they 'touch base' with the master node and synchronize the result of their computations.  Read the abstract of Sixin Zhang's paper to get a more detailed idea of what's happening.  As far as the implementation goes, "the idea is to transform the torch data structure (tensor, table etc) into a storage (contiguous in memory) and then send/recv [sic] it." src.

Background Sources

These are the sources where I found the info I used to figure this out.

https://bbs.archlinux.org/viewtopic.php?id=159999
http://torch.ch/docs/getting-started.html
https://groups.google.com/forum/#!topic/torch7/Xs814a5_xgI

Set up MPI (beowulf cluster)

Follow the instructions in these two posts first.  They get you to the point of a working cluster, starting from a collection of unused PCs and the relevant hardware.

https://nixingaround.blogspot.com/2017/01/a-homebrew-beowulf-cluster-part-1.html
https://nixingaround.blogspot.com/2017/01/a-homemade-beowulf-cluster-part-2.html

Prevent SSH from losing connection

I had some trouble here, where I was trying to use ssh over the same wires that were providing MPI communication in the cluster.  I kept losing connection after initializing the computations.  It may not be necessary, so I wouldn't do this unless you run into trouble of that sort.  

https://nixingaround.blogspot.com/2017/01/internet-via-ethernet-ssh-via-wireless.html

Ok, that's not an optimal solution. Better to initialize a virtual terminal and run the computations in that.  When the connection is inevitably dropped, just recover that terminal.

http://unix.stackexchange.com/questions/22781/how-to-recover-a-shell-after-a-disconnection
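One way to get a recoverable terminal is tmux. This is a rough sketch, assuming tmux is installed on the node you ssh into; the function name resume_session and the session name are made up for illustration. It relies on tmux's new-session -A option, which attaches to the named session if it already exists and creates it otherwise.

```shell
# resume_session is a hypothetical helper name; tmux itself is assumed to be
# installed (sudo apt-get install tmux).
# Attach to the named tmux session if it exists, otherwise create it.
resume_session() {
    name="${1:-torch}"
    if ! command -v tmux >/dev/null 2>&1; then
        echo "tmux not found" >&2
        return 127
    fi
    tmux new-session -A -s "$name"
}

# Usage (interactive, not run here):
#   resume_session training    # start the job inside this session
#   # ...connection drops, ssh back in...
#   resume_session training    # pick up where the terminal left off
```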

Install Torch

Note: it may be useful to install the MKL library ahead of torch.  It accelerates the math routines that I assume will be present in the computations I'm going to perform.  

Installing Torch provides the dependencies needed to install the mpiT package that lets Torch7 work with MPI.  Start in the breca home directory.  On the master node, run the following.
cd
git clone https://github.com/torch/distro.git ~/torch --recursive
Then, on all nodes (master and slave), run the following from the breca account:
cd ~/torch; bash install-deps
./install.sh
[I'm not sure, but I think MPICH has to be reinstalled after GCC 4.x is installed with the dependencies.  Leaving this note here in case of future problems.]

After the install script finished running, it told me that it had not updated my shell profile.  So, we're adding a line to the ~/.profile script.  (we're using that, and not the bashrc file, because when logging on to the breca account bash isn't automatically run.  If I ever forget and try to use Torch without bash, I could run into problems this can avoid.)

Do the following on all nodes:
echo ". /mirror/breca/torch/install/bin/torch-activate" | sudo tee -a /mirror/breca/.profile
Now re-run the file, so the code you added is executed.
source ~/.profile
Installing this way allows you to only download the package once, but use it to install the software to all nodes in the cluster.  (and as a side note, the install-deps script doesn't detect Linux Mint - it's one of the reasons this walk-through is using Ubuntu Server)
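One thing to watch: /mirror is shared between the nodes, so appending to the same .profile from every node can leave several copies of the same line in it. Here's a small sketch that only appends the line when it isn't there yet. append_once is a hypothetical helper of mine, not something that ships with Torch, and it assumes the account can write to the file without sudo.

```shell
# append_once is a hypothetical helper: add a line to a file only if that
# exact line is not already present.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (not run here):
#   append_once ". /mirror/breca/torch/install/bin/torch-activate" /mirror/breca/.profile
```

Running it a second time is harmless, which makes it safe to repeat on every node.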

Test that Torch has been installed:
th
Close the program
exit

MPI compatibility with Torch

Source: https://github.com/sixin-zh/mpiT

Do this on the master node. You'll be able to access the downloaded files from all the nodes - they're going in the /mirror directory. Download from github and install.
cd ~/
mkdir -p tools && cd tools
git clone https://github.com/sixin-zh/mpiT
cd
Now do the rest of the steps on all the nodes, master and slave.
cd 
cd tools/mpiT
By default, MPI_PREFIX should be set to /usr.  See link.
export MPI_PREFIX="/usr"
echo "export MPI_PREFIX='/usr'" >> ~/.profile
Since I'm working with MPICH rather than OpenMPI (see cluster installation notes above),
luarocks make mpit-mvapich-1.rockspec

Tests

First, figure out how many processors you have.  You did already; that's the sum of the numbers in your machinefile in the /mirror directory.  We'll say you have 12 cores.  Note that -np takes a count of processes to launch, not a zero-based index, so 12 would be accepted too; the commands below use 11.  Adjust according to your actual situation. 
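If you'd rather not add the numbers up by hand, something like this should do it. It's a sketch that assumes the machinefile uses the "host:count" layout, one node per line; count_slots is a name I made up, and the node names in the comments are placeholders.

```shell
# count_slots is a hypothetical helper. It assumes each machinefile line is
# "host:count" (a bare "host" counts as 1); blank lines and # comments are skipped.
count_slots() {
    awk -F: '!/^[[:space:]]*(#|$)/ { total += (NF > 1 ? $2 : 1) } END { print total + 0 }' "$1"
}

# Usage on the cluster (not run here):
#   count_slots /mirror/machinefile
# A file containing node0:4, node1:4 and node2:4 would give 12.
```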

Next, use a bunch of terminals and log into each of your nodes simultaneously.  Install:
sudo apt-get install htop 
And run
htop
on each machine and watch the CPU usage as you perform the following tests.  If only the master node shows activity, you have a problem.  
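A quicker check than eyeballing htop is to have every MPI process print its hostname and then count the replies per node. This is only a sketch: tally_hosts is a hypothetical helper, and the mpirun line in the comment reuses the process count and machinefile path from this walk-through, so adjust both to your setup.

```shell
# tally_hosts is a hypothetical helper: count how many processes reported
# from each host. Reads one hostname per line on stdin and prints
# "<host> <count>" for each distinct host.
tally_hosts() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on the cluster (not run here):
#   mpirun -np 11 -f /mirror/machinefile hostname | tally_hosts
```

If only the master node appears in the output, the work isn't being spread across the cluster.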

Create ~/data/torch7/mnist10 in the home directory, and then download the test data to that location.  Ensure you're logged in as the MPI user.
mkdir -p ~/data/torch7/mnist10/ && cd ~/data/torch7/mnist10
wget http://cs.nyu.edu/~zsx/mnist10/train_32x32.th7
wget http://cs.nyu.edu/~zsx/mnist10/test_32x32.th7
cd ~/tools/mpiT
Now run the tests. Sanity check: did mpiT install successfully? Note: I ran into an 'error 75' at this point, and the solution was to explicitly define the location of the files involved starting from the root directory. 
mpirun -np 11 -f /mirror/machinefile th /mirror/breca/tools/mpiT/test.lua
Check that the MPI integration is working.  Move down to the folder with the asynchronous algorithms.
cd asyncsgd
I think this test only needs to run on the master node - as long as you've installed everything to all the nodes (as appropriate), it doesn't need to be run everywhere.  I think it's just checking that Torch is successfully configured to run on a CPU.
th claunch.lua
Test bandwidth: I have no idea what this does, but it fails if the requested number of processors is odd.  I'm sticking with the default of 4 processors, which (I'm guessing) is the number on a single node.  As long as it works...?  It seems to be checking the bandwidth through the cluster.  There isn't a whole lot of documentation.
mpirun -np 4 -f ../../../../machinefile th ptest.lua 
Try parallel mnist training - this is the one that should tell you what's up.  AFAIK, you'll probably end up using a variant of this code to run whatever analysis you have planned.  If you look inside, you'll notice that what you're running is some kind of abstraction - the algorithm (such as it is for a test run) seems to be implemented in goot.lua.  In fact, this is a 'real-world' test of sorts - the MNIST data set is the handwritten character collection researchers like to use for testing their models.
mpirun -np 11 -f ../../../../machinefile th mlaunch.lua
and this is as far as I've actually made it without errors (up to this point, barring abnormalities in the PCs used, everything works perfectly for me).

Install Word RNN

Clone the software from github.
mkdir -p ~/projects
cd ~/projects
git clone https://github.com/larspars/word-rnn.git
That's actually all there is to it.  Now cd into the word-rnn directory to run the test stuff.  Before the tests and tools, though, there's a fix that you have to perform.

Sunday, January 15, 2017

A Recurrent Neural Network to generate sentences: step-by-step assembly (Mint version)

This one predicts (ok, procedurally generates) text word-by-word.  Others do it letter-by-letter, but then you have to deal with misspellings.  src. Inspiration.

Dependencies

I'm basing this off an installation script referenced in the github readme (here).  Instead of blindly running the script, I checked its contents and discovered it fails when run on linux mint.  So I simplified it, and changed/added a thing or two since it wasn't working anyway.  In case anyone wants to blindly run my version, here it is all cleaned up.

Updating your sources is always a good idea prior to installing a bunch of software.  Also, the package below makes it easy to add new repositories (see below).
sudo apt-get update
sudo apt-get install -y python-software-properties
Now things are getting serious.  This stuff includes GCC 4.9, since it seems GCC 5 isn't compatible with Torch 7.  There was a weird error where cmake wasn't installed when I put all the packages in the same apt-get install command - I split out the packages to their own lines, ran my install script again, and it worked.  Very odd.  Hence, my inefficient code below.
sudo apt-get install -y build-essential 
sudo apt-get install -y gcc 
sudo apt-get install -y g++ 
sudo apt-get install -y curl 
sudo apt-get install -y cmake 
sudo apt-get install -y libreadline-dev 
sudo apt-get install -y git-core 
sudo apt-get install -y libqt4-core 
sudo apt-get install -y libqt4-gui 
sudo apt-get install -y libqt4-dev 
sudo apt-get install -y libjpeg-dev 
sudo apt-get install -y libpng-dev 
sudo apt-get install -y ncurses-dev 
sudo apt-get install -y imagemagick 
sudo apt-get install -y libzmq3-dev 
sudo apt-get install -y gfortran 
sudo apt-get install -y unzip 
sudo apt-get install -y gnuplot 
sudo apt-get install -y gnuplot-x11 
sudo apt-get install -y ipython
sudo apt-get install -y libpcre3-dev
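The one-package-per-line approach can also be written as a loop, which has the side benefit of telling you exactly which package failed. A minimal sketch follows; install_each is a hypothetical helper, not an apt feature, and the package list in the usage comment is just the first few from the list above.

```shell
# install_each is a hypothetical helper: run an installer command once per
# package, keep going on failure, and list whatever failed at the end.
install_each() {
    installer="$1"
    shift
    failed=""
    for pkg in "$@"; do
        # $installer is left unquoted on purpose so "sudo apt-get install -y"
        # splits into separate words.
        $installer "$pkg" || failed="$failed $pkg"
    done
    if [ -n "$failed" ]; then
        echo "failed:$failed" >&2
        return 1
    fi
}

# Usage (not run here):
#   install_each "sudo apt-get install -y" build-essential gcc g++ curl cmake
```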
These are for a library called OpenBLAS, an open-source implementation of BLAS, which stands for Basic Linear Algebra Subprograms.  Here's a bit more on the concept.
sudo apt-get install -y libopenblas-dev liblapack-dev
Install Torch (finally).  Note that when running the ./install.sh command, I'm letting bash automatically enter "Y" when prompted for a user response.  It's requesting permission to add torch to the system path in the .bashrc: essentially, creating a system shortcut for each initialization of the bash terminal.
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; 
echo "Y" | ./install.sh 
source ~/.bashrc
These tools are installed with LuaRocks, the package manager for Lua, the programming language that came with Torch.
/home/$USER/torch/install/bin/luarocks install nngraph
/home/$USER/torch/install/bin/luarocks install nninit 
/home/$USER/torch/install/bin/luarocks install optim
/home/$USER/torch/install/bin/luarocks install nn
/home/$USER/torch/install/bin/luarocks install underscore.lua --from=http://marcusirven.s3.amazonaws.com/rocks/
sudo /home/$USER/torch/install/bin/luarocks install lrexlib-pcre PCRE_DIR=/lib/x86_64-linux-gnu/ PCRE_LIBDIR=/lib/x86_64-linux-gnu/ 
Make sure to add Torch to your PATH.  This ensures that when you run the command to use Torch, your computer can actually find the program.
to_path="/home/$USER/torch/install/bin"
echo "export PATH=\$PATH:$to_path" >> /home/$USER/.bashrc
source ~/.bashrc
With that, all dependencies should be satisfied. Time to install.

Installation

Clone the software from github.
mkdir -p ~/projects
cd ~/projects
git clone https://github.com/larspars/word-rnn.git
That's actually all there is to it.  Now cd into the word-rnn directory to run the test stuff.  Before the tests and tools, though, there's a fix that you have to perform.

Extra Fixes

I'm running torch on a gpu-less computer.  There's a glitch that occurs when running the test script in that scenario.  To avert it, you have to replace a torch.CudaTensor() call with torch.Tensor().  See here for details.  
cd word-rnn/util
nano SharedDropout.lua
The third line should look like this:
SharedDropout_noise = torch.CudaTensor()
Change it to this:
SharedDropout_noise = torch.Tensor()
And save.
CTRL + O
CTRL + X
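If you'd rather not open an editor (handy when repeating this on several machines), the same change can be made with sed. This is a sketch under the assumption that the line looks exactly like the one shown above; patch_shared_dropout is a name I made up, and the .bak suffix keeps a copy of the original file in case something goes wrong.

```shell
# patch_shared_dropout is a hypothetical helper: swap the GPU tensor
# constructor for the CPU one, keeping a .bak copy of the original file.
patch_shared_dropout() {
    file="${1:-SharedDropout.lua}"
    sed -i.bak 's/torch\.CudaTensor()/torch.Tensor()/' "$file"
}

# Usage from word-rnn/util (not run here):
#   patch_shared_dropout SharedDropout.lua
```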

Running word-rnn

Now you can run the test function.  Be aware that on a 4-core 2.8GHz i7 processor, this command took 21 hours to complete.  BUT it's a very cool command, and the result is amazing.
th train.lua -gpuid -1
To be continued...

Final post here

I'm switching over to GitHub Pages.  The continuation of this blog (with archives included) is at umhau.github.io.  By the way, the ...