Installing word-rnn on Ubuntu Server 16.04.
To install the PCRE luarock:
sudo /mirror/$USER/torch/install/bin/luarocks install lrexlib-pcre PCRE_DIR=/usr/ PCRE_LIBDIR=/lib/x86_64-linux-gnu/
The explicit PCRE_DIR and PCRE_LIBDIR are needed because the PCRE files end up in some very weird places.
cd
git clone https://github.com/torch/distro.git ~/torch --recursive
Then, on all nodes (master and slave), run the following from the breca account:
cd ~/torch; bash install-deps
./install.sh
[I'm not sure, but I think MPICH has to be reinstalled after GCC 4.x is installed with the dependencies. Leaving this note here in case of future problems.]
echo ". /mirror/breca/torch/install/bin/torch-activate" | sudo tee -a /mirror/breca/.profile
Now re-run the file, so the code you added is executed.
source ~/.profile
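Running the echo line above a second time leaves two copies of the activation line in .profile. Here is a minimal sketch of a version that only appends when the line is missing. The name append_once is my own invention (it is not part of Torch), and it assumes you already have write permission on the file, so there is no sudo in it.

```shell
#!/bin/sh
# append_once FILE LINE: add LINE to FILE only if it is not already there.
# (append_once is a hypothetical helper name, not part of Torch.)
append_once() {
    file=$1
    line=$2
    # -F: fixed string, -x: match the whole line, -q: quiet
    if ! grep -Fxq -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example with the paths used in this walk-through (uncomment to run):
# append_once /mirror/breca/.profile ". /mirror/breca/torch/install/bin/torch-activate"
```

Calling it twice with the same arguments leaves the file with a single copy of the line.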
Installing this way allows you to download the package only once but use it to install the software on all nodes in the cluster. (As a side note, the install-deps script doesn't detect Linux Mint - it's one of the reasons this walk-through is using Ubuntu Server.)
th
Close the program:
exit
cd ~/
mkdir -p tools && cd tools
git clone https://github.com/sixin-zh/mpiT
cd
Now do the rest of the steps on all the nodes, master and slave.
cd
cd tools/mpiT
By default, MPI_PREFIX should be set to /usr.
export MPI_PREFIX="/usr"
echo "export MPI_PREFIX='/usr'" >> ~/.profile
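Before building mpiT it may be worth checking that the prefix actually points at an MPI install. This is only a rough sketch: check_mpi_prefix is a made-up name, and it assumes the launcher lives at PREFIX/bin/mpirun. The mpiT build may look for other files (headers, libraries) that this does not cover.

```shell
#!/bin/sh
# check_mpi_prefix PREFIX: succeed if PREFIX/bin contains an executable mpirun.
# Assumption: the MPI launcher lives in PREFIX/bin. This is a hypothetical
# helper, not something mpiT provides.
check_mpi_prefix() {
    prefix=${1:-/usr}
    if [ -x "$prefix/bin/mpirun" ]; then
        echo "ok: $prefix/bin/mpirun"
    else
        echo "missing: $prefix/bin/mpirun" >&2
        return 1
    fi
}

# Example (uncomment to run): check_mpi_prefix "$MPI_PREFIX"
```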
Since I'm working with MPICH rather than OpenMPI (see cluster installation notes above), I'm using the mvapich rockspec:
luarocks make mpit-mvapich-1.rockspec
Install htop and run it on each machine:
sudo apt-get install htop
htop
Watch the CPU usage as you perform the following tests. If only the master node shows activity, you have a problem.
mkdir -p ~/data/torch7/mnist10/ && cd ~/data/torch7/mnist10
wget http://cs.nyu.edu/~zsx/mnist10/train_32x32.th7
wget http://cs.nyu.edu/~zsx/mnist10/test_32x32.th7
cd ~/tools/mpiT
mpirun -np 11 -f /mirror/machinefile th /mirror/breca/tools/mpiT/test.lua
Check that the MPI integration is working. Move down to the folder with the asynchronous algorithms.
cd asyncsgd
I think this test only needs to run on the master node - as long as you've installed everything to all the nodes (as appropriate), it doesn't need to be run everywhere. I think it's just checking that Torch is successfully configured to run on a CPU.
th claunch.lua
Test bandwidth: I have no idea what this does, but it fails if the requested number of processors is odd. I'm sticking with the default of 4 processors, which (I'm guessing) is the number on a single node. As long as it works...? It seems to be checking the bandwidth through the cluster. There isn't a whole lot of documentation.
mpirun -np 4 -f ../../../../machinefile th ptest.lua
Try parallel MNIST training - this is the one that should tell you what's up. AFAIK, you'll probably end up using a variant of this code to run whatever analysis you have planned. If you look inside, you'll notice that what you're running is some kind of abstraction - the algorithm (such as it is for a test run) seems to be implemented in goot.lua. In fact, this is a 'real-world' test of sorts - the MNIST data set is the handwritten digit collection researchers like to use for testing their models.
mpirun -np 11 -f ../../../../machinefile th mlaunch.lua
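Since the bandwidth test fell over on an odd processor count, and a wrong machinefile path is easy to type, here is a small sketch that checks both before printing the mpirun line. mpi_cmd is a name I made up. The "even" switch only reflects what I saw with ptest.lua and is not a documented requirement.

```shell
#!/bin/sh
# mpi_cmd NP MACHINEFILE SCRIPT [even]: print the mpirun command line, or fail
# if NP is not a positive integer or the machinefile is missing.
# Pass "even" as the fourth argument to also reject an odd NP.
# (mpi_cmd is a hypothetical helper, not part of mpiT or MPICH.)
mpi_cmd() {
    np=$1
    machinefile=$2
    script=$3
    parity=${4:-any}
    case $np in
        ''|*[!0-9]*) echo "np must be a positive integer" >&2; return 1 ;;
    esac
    [ "$np" -gt 0 ] || { echo "np must be a positive integer" >&2; return 1; }
    if [ "$parity" = even ]; then
        # An odd number always ends in an odd digit.
        case $np in
            *[13579]) echo "np must be even for this test" >&2; return 1 ;;
        esac
    fi
    [ -f "$machinefile" ] || { echo "no machinefile at $machinefile" >&2; return 1; }
    echo "mpirun -np $np -f $machinefile th $script"
}

# Example (uncomment to run from the asyncsgd folder):
# cmd=$(mpi_cmd 4 ../../../../machinefile ptest.lua even) && $cmd
```

Printing the command instead of running it directly makes it easy to eyeball what will be launched across the cluster first.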
mkdir -p ~/projects
cd ~/projects
git clone https://github.com/larspars/word-rnn.git
That's actually all there is to it. Now cd into the word-rnn directory to run the test stuff. Before the tests and tools, though, there's a fix that you have to perform.
sudo apt-get update
sudo apt-get install -y python-software-properties
Now things are getting serious. This stuff includes GCC 4.9, since it seems GCC 5 isn't compatible with Torch 7. There was a weird error where cmake wasn't installed when I put all the packages in the same apt-get install command - I split out the packages to their own lines, ran my install script again, and it worked. Very odd. Hence, my inefficient code below.
sudo apt-get install -y build-essential
sudo apt-get install -y gcc
sudo apt-get install -y g++
sudo apt-get install -y curl
sudo apt-get install -y cmake
sudo apt-get install -y libreadline-dev
sudo apt-get install -y git-core
sudo apt-get install -y libqt4-core
sudo apt-get install -y libqt4-gui
sudo apt-get install -y libqt4-dev
sudo apt-get install -y libjpeg-dev
sudo apt-get install -y libpng-dev
sudo apt-get install -y ncurses-dev
sudo apt-get install -y imagemagick
sudo apt-get install -y libzmq3-dev
sudo apt-get install -y gfortran
sudo apt-get install -y unzip
sudo apt-get install -y gnuplot
sudo apt-get install -y gnuplot-x11
sudo apt-get install -y ipython
sudo apt-get install -y libpcre3-dev
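The one-package-per-line approach above can also be written as a loop that keeps going after a failure and reports what went wrong at the end. This is just a sketch: install_each is my own name for it, and the installer command is passed in as the first argument so that the loop can be tried with echo before touching apt-get.

```shell
#!/bin/sh
# install_each INSTALLER PKG...: run "INSTALLER PKG" once per package and
# report which ones failed, instead of stopping at the first error.
# (install_each is a hypothetical helper name.)
install_each() {
    installer=$1
    shift
    failed=""
    for pkg in "$@"; do
        # $installer is left unquoted on purpose so that a string like
        # "sudo apt-get install -y" splits into separate words.
        if ! $installer "$pkg"; then
            failed="$failed $pkg"
        fi
    done
    if [ -n "$failed" ]; then
        echo "failed:$failed" >&2
        return 1
    fi
}

# Dry run: prints one line per package without installing anything.
install_each "echo would install" build-essential cmake libpcre3-dev
```

For the real thing, the first argument would be "sudo apt-get install -y" followed by the package names from the list above.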
These are for a library called OpenBLAS, an open-source implementation of BLAS (Basic Linear Algebra Subprograms).
sudo apt-get install -y libopenblas-dev liblapack-dev
Install Torch (finally). Note that when running the ./install.sh command, I'm piping "Y" into it so the prompt for a user response is answered automatically. It's requesting permission to add Torch to the system path in the .bashrc: essentially, creating a system shortcut for each initialization of the bash terminal.
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch;
echo "Y" | ./install.sh
source ~/.bashrc
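A quick way to see whether the install actually put th on the PATH is sketched below. have_cmd is a made-up helper name. It only checks that the shell can find the command, not that Torch itself works.

```shell
#!/bin/sh
# have_cmd NAME: succeed if NAME resolves to something the shell can run.
# (have_cmd is a hypothetical helper name.)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd th; then
    echo "th found at $(command -v th)"
else
    echo "th not found; check that install.sh updated ~/.bashrc" >&2
fi
```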
These tools are installed with luarocks, the package manager for Lua, the programming language that came with Torch.
/home/$USER/torch/install/bin/luarocks install nngraph
/home/$USER/torch/install/bin/luarocks install nninit
/home/$USER/torch/install/bin/luarocks install optim
/home/$USER/torch/install/bin/luarocks install nn
/home/$USER/torch/install/bin/luarocks install underscore.lua --from=http://marcusirven.s3.amazonaws.com/rocks/
sudo /home/$USER/torch/install/bin/luarocks install lrexlib-pcre PCRE_DIR=/lib/x86_64-linux-gnu/ PCRE_LIBDIR=/lib/x86_64-linux-gnu/
Make sure to add Torch to your PATH. This ensures that when you run the command to use Torch, your computer can actually find the program.
to_path="/home/$USER/torch/install/bin"
echo "export PATH=\$PATH:$to_path" >> /home/$USER/.bashrc
source ~/.bashrc
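If the Torch directory is already on the PATH (for example because install.sh added it), appending it again just makes the PATH longer. A minimal sketch of a duplicate-free append follows. path_add is a name I picked, and it only prints the new value, leaving it up to you to export it.

```shell
#!/bin/sh
# path_add DIR: print PATH with DIR appended, unless DIR is already in it.
# (path_add is a hypothetical helper name.)
path_add() {
    dir=$1
    # Wrapping both sides in colons lets us match whole entries only.
    case ":$PATH:" in
        *":$dir:"*) printf '%s\n' "$PATH" ;;        # already present
        *)          printf '%s\n' "$PATH:$dir" ;;   # append at the end
    esac
}

# Example (uncomment to run):
# PATH=$(path_add "$HOME/torch/install/bin"); export PATH
```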
With that, all dependencies should be satisfied. Time to install.
mkdir -p ~/projects
cd ~/projects
git clone https://github.com/larspars/word-rnn.git
cd word-rnn/util
nano SharedDropout.lua
The third line should look like this:
SharedDropout_noise = torch.CudaTensor()
Change it to this:
SharedDropout_noise = torch.Tensor()
And save.
CTRL + O
CTRL + X
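If you'd rather not edit by hand (say, when repeating this on several nodes), the same change can be sketched with sed. use_cpu_tensor is my own name for the helper. Note that it replaces every torch.CudaTensor() in the file, not just the one on the third line, so check the result before relying on it.

```shell
#!/bin/sh
# use_cpu_tensor FILE: replace torch.CudaTensor() with torch.Tensor() in FILE.
# Writes to a temporary file first so a failed sed cannot truncate the original.
# (use_cpu_tensor is a hypothetical helper name.)
use_cpu_tensor() {
    file=$1
    tmp=$(mktemp) || return 1
    if sed 's/torch\.CudaTensor()/torch.Tensor()/g' "$file" > "$tmp"; then
        # Copy the contents back so the original file keeps its permissions.
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example (run from word-rnn/util; uncomment to run):
# use_cpu_tensor SharedDropout.lua
```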
Go back up to the word-rnn directory and start training on the CPU:
cd ..
th train.lua -gpuid -1
To be continued...
I'm switching over to GitHub Pages. The continuation of this blog (with archives included) is at umhau.github.io.