Installing word-rnn on Ubuntu Server 16.04.
To install the PCRE luarock:
sudo /mirror/$USER/torch/install/bin/luarocks install lrexlib-pcre PCRE_DIR=/usr/ PCRE_LIBDIR=/lib/x86_64-linux-gnu/
The explicit PCRE_DIR and PCRE_LIBDIR are needed because the PCRE files end up in some very weird places.
ifconfig
This shows all interfaces, not just the activated ones:
ifconfig -a
To activate a hardware device after determining its name, run:
ifconfig "device name" up
Bonus: this gives you way more than you'll ever need to know about the hardware capabilities of the device.
iw list
Supported interface modes:
 * IBSS
 * managed
 * AP
 * AP/VLAN
 * monitor
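hostapd needs the wireless card to support AP mode, so the line to look for in that list is "* AP". A minimal sketch of checking for it automatically, assuming the real `iw list` output puts each mode on its own line; `supports_ap_mode` is a made-up helper name, not part of iw:

```shell
# supports_ap_mode: read `iw list` output on stdin and succeed only if a
# line consisting of "* AP" is present ("* AP/VLAN" alone does not count).
# Hypothetical helper for illustration.
supports_ap_mode() {
    grep -qE '^[[:space:]]*\* AP[[:space:]]*$'
}

# Usage on a real machine (not run here):
#   iw list | supports_ap_mode && echo "this card can act as an access point"
```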
sudo apt-get install rfkill hostapd hostap-utils iw dnsmasq
ls /sys/class/net/*
sudo cp /etc/network/interfaces /etc/network/interfaces.bak
sudo nano /etc/network/interfaces
auto lo
iface lo inet loopback
auto enp2s0
iface enp2s0 inet dhcp
auto wlp1s0
iface wlp1s0 inet static
hostapd /etc/hostapd/hostapd.conf
address 192.168.3.14
netmask 255.255.255.0
sudo cp /etc/hostapd/hostapd.conf /etc/hostapd/hostapd.conf.bak
sudo nano /etc/hostapd/hostapd.conf
interface=wlp1s0
driver=nl80211
ssid=test
hw_mode=g
channel=1
macaddr_acl=0
auth_algs=1
ignore_broadcast_ssid=0
wpa=3
wpa_passphrase=1234567890
wpa_key_mgmt=WPA-PSK
wpa_pairwise=TKIP
rsn_pairwise=CCMP
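One thing that can trip up this config: a WPA passphrase has to be between 8 and 63 characters, and hostapd will refuse to start otherwise. A small sketch of a length check; `valid_wpa_passphrase` is a hypothetical helper, not something hostapd ships:

```shell
# valid_wpa_passphrase: succeed if the argument is 8 to 63 characters long.
# Hypothetical helper for illustration only.
valid_wpa_passphrase() {
    len=${#1}
    [ "$len" -ge 8 ] && [ "$len" -le 63 ]
}

# Check the passphrase used in the config above.
if valid_wpa_passphrase "1234567890"; then
    echo "passphrase length ok"
else
    echo "passphrase must be 8-63 characters"
fi
```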
sudo cp /etc/dnsmasq.conf /etc/dnsmasq.conf.bak
sudo rm /etc/dnsmasq.conf
sudo nano /etc/dnsmasq.conf
Make it look like this:
# Never forward plain names (without a dot or domain part)
domain-needed
# Only listen for DHCP on wlp1s0
interface=wlp1s0
# Create a domain if you want, comment it out otherwise
# domain=Pi-Point.co.uk
# Create a DHCP range on your /24 wlp1s0 network with a 12 hour lease time
dhcp-range=192.168.3.15,192.168.3.254,255.255.255.0,12h
sudo ifdown wlp1s0; sudo ifup wlp1s0; sudo service hostapd restart; sudo service dnsmasq restart
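The static address in /etc/network/interfaces (192.168.3.14) and the dhcp-range in dnsmasq.conf (192.168.3.15 onward) have to sit in the same /24, or clients will get leases they can't use. A rough sketch of that sanity check; `prefix24` and `same_subnet24` are names invented here, and it only handles the 255.255.255.0 netmask used above:

```shell
# prefix24: print the first three octets of a dotted IPv4 address.
prefix24() {
    echo "$1" | cut -d. -f1-3
}

# same_subnet24: succeed if both addresses share the same /24 prefix.
# Hypothetical helper; assumes a 255.255.255.0 netmask.
same_subnet24() {
    [ "$(prefix24 "$1")" = "$(prefix24 "$2")" ]
}

# Compare the access point address with the start of the DHCP range.
if same_subnet24 192.168.3.14 192.168.3.15; then
    echo "address and DHCP range are in the same /24"
fi
```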
cd
git clone https://github.com/torch/distro.git ~/torch --recursive
Then, on all nodes (master and slave), run the following from the breca account:
cd ~/torch; bash install-deps
./install.sh
[I'm not sure, but I think MPICH has to be reinstalled after GCC 4.x is installed with the dependencies. Leaving this note here in case of future problems.]
echo ". /mirror/breca/torch/install/bin/torch-activate" | sudo tee -a /mirror/breca/.profile
Now re-run the file, so the code you added is executed.
source ~/.profile
Installing this way lets you download the package once but use it to install the software on all nodes in the cluster. (As a side note, the install-deps script doesn't detect Linux Mint; it's one of the reasons this walk-through is using Ubuntu Server.)
Launch Torch to check that it runs:
th
Close the program:
exit
cd ~/
mkdir -p tools && cd tools
git clone https://github.com/sixin-zh/mpiT
cd
Now do the rest of the steps on all the nodes, master and slave.
cd
cd tools/mpiT
By default, MPI_PREFIX should be set to /usr. See link.
export MPI_PREFIX="/usr"
echo "export MPI_PREFIX='/usr'" >> ~/.profile
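A caveat with appending to ~/.profile like this: if you run the step twice, the line ends up in the file twice. A small sketch of an append that only happens when the exact line is missing; `append_once` is a hypothetical helper name:

```shell
# append_once LINE FILE: append LINE to FILE unless an identical line is
# already there. Hypothetical helper for illustration.
append_once() {
    line=$1
    file=$2
    # -q quiet, -x match the whole line, -F treat LINE as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage (not run here):
#   append_once "export MPI_PREFIX='/usr'" ~/.profile
```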
Since I'm working with MPICH rather than OpenMPI (see cluster installation notes above), I'm building with the MVAPICH rockspec:
luarocks make mpit-mvapich-1.rockspec
sudo apt-get install htop
htop
Run it on each machine and watch the CPU usage as you perform the following tests. If only the master node shows activity, you have a problem.
mkdir -p ~/data/torch7/mnist10/ && cd ~/data/torch7/mnist10
wget http://cs.nyu.edu/~zsx/mnist10/train_32x32.th7
wget http://cs.nyu.edu/~zsx/mnist10/test_32x32.th7
cd ~/tools/mpiT
mpirun -np 11 -f /mirror/machinefile th /mirror/breca/tools/mpiT/test.lua
Check that the MPI integration is working. Move down to the folder with the asynchronous algorithms.
cd asyncsgd
I think this test only needs to run on the master node: as long as you've installed everything to all the nodes (as appropriate), it doesn't need to be run everywhere. I think it's just checking that Torch is successfully configured to run on a CPU.
th claunch.lua
Test bandwidth: I have no idea what this does, but it fails if the requested number of processors is odd. I'm sticking with the default of 4 processors, which (I'm guessing) is the number on a single node. As long as it works...? It seems to be checking the bandwidth through the cluster. There isn't a whole lot of documentation.
mpirun -np 4 -f ../../../../machinefile th ptest.lua
Try parallel mnist training - this is the one that should tell you what's up. AFAIK, you'll probably end up using a variant of this code to run whatever analysis you have planned. If you look inside, you'll notice that what you're running is some kind of abstraction - the algorithm (such as it is for a test run) seems to be implemented in goot.lua. In fact, this is a 'real-world' test of sorts - the MNIST data set is the handwritten digit collection researchers like to use for testing their models.
mpirun -np 11 -f ../../../../machinefile th mlaunch.lua
mkdir ~/projects
cd ~/projects
git clone https://github.com/larspars/word-rnn.git
That's actually all there is to it. Now cd into the word-rnn directory to run the test stuff. Before the tests and tools, though, there's a fix that you have to perform.
ssh beowulf@grendel-[X]
Recall that the password is hrunting.
Create a password for the root account.
sudo passwd root
hrunting
sudo passwd -u root
Note: if you ever feel like locking the root account again, run this:
sudo passwd -l root
Now you have to allow the root user to log in via ssh. Change an option in this file:
sudo nano /etc/ssh/sshd_config
Find the line that says:
PermitRootLogin prohibit-password
Comment it out (so you have a record of the default configuration) and add a new line below it. They should look like this:
#PermitRootLogin prohibit-password
PermitRootLogin yes
[CTRL-O] to save and [CTRL-X] to exit, then run:
sudo service ssh restart
That's it! Now we can use sshpass to automatically log in to the machines and modify root files. Be careful; there is nothing between you and total destruction of your cluster.
bash create_hosts_file.sh [MACHINE_COUNT] [PASSWORD] [HOSTNAME_BASE]
Where HOSTNAME_BASE is the standard part of the hostname of each computer: if the computers were named grendel-a, grendel-b, and grendel-c, then the base would be "grendel-".
mkdir -p ~/scripts && cd ~/scripts
wget https://raw.githubusercontent.com/umhau/cluster/master/create_hosts_file.sh
bash create_hosts_file.sh 5 "hrunting" "grendel-"
If you don't get any errors, then you're all set! You can check the files were created by ssh'ing into one of the machines and checking /etc/hosts.
ssh beowulf@grendel-a
cat /etc/hosts
The output should look something like this:
127.0.0.1 localhost
192.168.133.100 grendel-a
192.168.133.101 grendel-b
192.168.133.102 grendel-c
192.168.133.103 grendel-d
If it doesn't look like that, with a line for localhost and one line after it for each machine, you're in trouble. Google is your friend; it worked for me.
sudo apt-get install nfs-server
sudo apt-get install nfs-client
Ok, we need to define a folder that can be standardized across all the machines: same idea as having a folder labeled "Dropbox" on each computer that you want your Dropbox account synced to, except in this case the syncing is a little different. Anything you put in the /mirror folder of the master node will be shared across all the other computers, but anything you put in a /mirror folder of the other nodes will be ignored. That's why it's called a 'mirror': there's a single folder that's being 'mirrored' by other folders.
sudo mkdir /mirror
Now go back to the master machine, and tell it to share the /mirror folder to the network: add a line to the /etc/exports file, and then restart the service.
echo "/mirror *(rw,sync)" | sudo tee -a /etc/exports
sudo service nfs-kernel-server restart
Maybe also add the following to the (rw,sync) options above:
echo "grendel-a:/mirror /mirror nfs" | sudo tee -a /etc/fstab
That line is telling the OS a) to look for a drive located at grendel-a:/mirror, b) to mount it at the location /mirror, and c) that the drive is a 'network file system'. If you're using your own naming scheme, remember to change 'grendel-a' to whatever the hostname of your master node is.
sudo mount -a
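After the mount -a, it's worth confirming that /mirror on a slave node really is the network share and not just the empty local folder. A minimal sketch that reads the kernel's mount table; `is_nfs_mount` is a made-up helper, and the optional second argument exists only so it can be pointed at a file other than /proc/mounts:

```shell
# is_nfs_mount MOUNTPOINT [MOUNTS_FILE]: succeed if MOUNTPOINT appears in
# the mount table with a filesystem type starting with "nfs" (nfs, nfs4).
# Hypothetical helper; MOUNTS_FILE defaults to /proc/mounts.
is_nfs_mount() {
    awk -v mp="$1" '$2 == mp && $3 ~ /^nfs/ { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Usage on a slave node (not run here):
#   is_nfs_mount /mirror && echo "/mirror is coming from the master"
```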
useradd --uid 1010 -m -d /mirror/breca breca
Set a password. Run on all nodes.
passwd breca
Add breca to the sudo group.
sudo adduser breca sudo
Since 'breca' will be handling all the files in the /mirror directory, we'll make that user the owner. Run this only on the master node.
sudo chown -R breca: /mirror
sudo apt-get install openssh-server
Next, log in to the new user on the master node.
su - breca
On the master node, generate an RSA key pair for the breca user. Keep the default location. If you feel like it, you can enter a 'strong' passphrase, but we've already been working under the assumption security isn't important here. Do what you like; nobody is going after your cluster (you hope).
ssh-keygen -t rsa
Add the key to your 'authorized keys'.
cd .ssh
cat id_rsa.pub >> authorized_keys
cd
And the nice thing is, what you've just done is being automatically mirrored to the other nodes.
su - breca
After logging in to your breca account on all of your machines, test your passwordless ssh capabilities by running, say, from your master node to your first slave node:
ssh grendel-b
or from your second slave node into your master node:
ssh grendel-a
The only thing you should have to do is type 'yes' to confirm that some kind of fingerprint is authentic, and that's a first-time-only sort of thing. However, because confirmation is requested, you have to perform the first login manually between each machine. Otherwise communication could/will fail. I haven't checked if it's necessary to ensure communication between slave nodes, so I did those too.
sudo mount -a
exit
These are tools the cluster will need to perform computations. It's important to install all of this stuff prior to the MPICH2 software that ties it all together - I think the latter has to configure itself with reference to the available software.
sudo apt-get install build-essential
I'm probably going to want Fortran as well, so I'm including that.
sudo apt-get install gfortran
sudo apt-get install mpich
You can test that the install completed successfully by running:
which mpiexec
which mpirun
The output should be:
/usr/bin/mpiexec
and
/usr/bin/mpirun
nproc
Next, log back into the breca user on the master node:
su - breca
Create a new file in the /mirror directory of the master node and open it:
touch machinefile && nano machinefile
The order of the machines in the file determines which will be accessed first. The format of the file lists the hostnames with the number of cores they have available.
grendel-c:4
grendel-b:4
grendel-a:4
You might want to remove one of the master node's cores for control purposes. Who knows? Up for experimentation. I put the master node last for a similar reason. The other stuff can get tied up first.
ssh beowulf@grendel-a
su - breca
cd into the /mirror folder.
cd /mirror
Create a new file called mpi_hello.c:
touch mpi_hello.c && nano mpi_hello.c
Put the following code into the file, and [ctrl-o] and [ctrl-x] to save and exit.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int myrank, nprocs;

    /* Start MPI, then ask how many processes exist and which one we are. */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello from processor %d of %d\n", myrank, nprocs);

    MPI_Finalize();
    return 0;
}
Compile the code with the custom MPI C compiler:
mpicc mpi_hello.c -o mpi_hello
And run.
mpiexec -n 11 -f ./machinefile ./mpi_hello
Here's a breakdown of the command:
mpiexec              command to execute an mpi-compatible binary
-n 11                the number of cores to ask for - this should not be
                     more than the sum of cores listed in the machinefile
-f ./machinefile     the location of the machinefile
./mpi_hello          the name of the binary to run
If all went as hoped, the output should look like this:
Hello from processor 0 of 11
Hello from processor 1 of 11
Hello from processor 2 of 11
Hello from processor 3 of 11
Hello from processor 4 of 11
Hello from processor 5 of 11
Hello from processor 6 of 11
Hello from processor 7 of 11
Hello from processor 8 of 11
Hello from processor 9 of 11
Hello from processor 10 of 11
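The number passed to -n has to fit within the cores listed in the machinefile, so it helps to add those up rather than count by hand. A quick sketch, assuming the hostname:cores format shown earlier; `machinefile_cores` is a hypothetical helper name:

```shell
# machinefile_cores FILE: sum the core counts after the colon on each line
# of an MPICH machinefile (hostname:cores). Hypothetical helper.
machinefile_cores() {
    awk -F: 'NF >= 2 { total += $2 } END { print total + 0 }' "$1"
}

# Usage (not run here):
#   machinefile_cores /mirror/machinefile
# With the three 4-core nodes listed earlier this prints 12, so -n 11 fits.
```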
Make sure the sum of the number of processors you listed in your machinefile is at least the number you asked for in the mpiexec command.
sudo tasksel install manual
This command will find the install media, and 'install' the manual package selection 'package'. In other cases, it would actually install stuff; in this case, it just gives you a shell prompt. At this prompt, you somehow have full internet access - I don't know how it happened. It was automagical. In my case, the next step was editing the sources list to enable the universe repository (it already seemed enabled - automagic of the tasksel command?). Source on the actual internet driver fix.
sudo nano /etc/apt/sources.list
Find a line that looks like this:
deb http://us.archive.ubuntu.com/ubuntu/ xenial main restricted
and add universe to the end, so it looks like this:
deb http://us.archive.ubuntu.com/ubuntu/ xenial main restricted universe
sudo apt-get install r8168-dkms
PACKAGENAME=<The name of the Package to install>
and then
apt-get -qqs install $PACKAGENAME | grep Inst | awk '{print $2}' | xargs apt-cache show | grep 'Filename: ' | awk '{print $2}' | while read filepath; do echo "wget \"http://archive.ubuntu.com/ubuntu/${filepath}\""; done >downloader.sh
A ready-to-use downloader for the package has now been created in the home folder. Open your home directory in the file browser and move the file downloader.sh to the top-level directory of your flash drive. Then eject your flash drive.
[CTRL]-L
[CTRL]-C
Move into the directory of the flash drive. In a terminal this time, type:
cd [CTRL]+[SHIFT]+V
Run the downloader:
bash ./downloader.sh
Wait for the download to complete and eject your flash drive.
[CTRL]-L
[CTRL]-C
Move into the directory of the flash drive. In a terminal this time, type:
cd [CTRL]+[SHIFT]+V
sudo dpkg --install *.deb
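The long one-liner that built downloader.sh is easier to follow if you pull out its last stage: turning the "Filename:" lines from apt-cache show into wget commands. Here's a sketch of just that stage as a function you can feed text to; `filenames_to_wget` is a name invented here, and the .deb path in the usage comment is a made-up example:

```shell
# filenames_to_wget: read `apt-cache show` output on stdin and print one
# wget command per "Filename:" line, using the same archive.ubuntu.com
# base URL as the one-liner. Hypothetical helper for illustration.
filenames_to_wget() {
    awk -v base="http://archive.ubuntu.com/ubuntu/" \
        '$1 == "Filename:" { printf "wget \"%s%s\"\n", base, $2 }'
}

# Usage (the path below is a hypothetical example, not real output):
#   printf 'Filename: pool/main/x/example/example_1.0_amd64.deb\n' | filenames_to_wget
```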
That's it!
hostname: grendel-a
username: beowulf
password: hrunting
lsblk
For example, this is the output when I run that command with a flash drive plugged in. Note on the right where it specifies the mount point (at /media/me/storage), and on the left where it shows me the name is sdb.
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 167.7G 0 disk
├─sda1 8:1 0 487M 0 part /boot
└─sda5 8:5 0 167.2G 0 part
├─mint--vg-root
│ 252:0 0 159.3G 0 lvm /
└─mint--vg-swap_1
252:1 0 7.9G 0 lvm [SWAP]
sdb 8:16 1 1.9G 0 disk
└─sdb1 8:17 1 1.9M 0 part /media/me/storage
If your USB is mounted, it has to be unmounted first - else weird things can happen in the next step. Trust me: once I didn't unmount a partition before copying it with dd, and my MBR was wiped out instead. In the example above, unmounting would work like this:
umount "/media/me/storage"
If we pretend your USB is the sdb device, this is the command you'd run (I'm assuming the ISO was saved to the default "~/Downloads" location). Swap out the 'sdb' part with what the lsblk command indicated. Also be aware that if you mess this up, you will probably destroy whatever computer you're running the command with. Just FYI.
sudo dd if=~/Downloads/ubuntu-16.04.1-server-amd64.iso of=/dev/sdb bs=4M status=progress
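Given how badly dd can go wrong, a guard that refuses to continue while any partition of the target is still mounted is cheap insurance. A rough sketch; `device_is_mounted` is a hypothetical helper, it does a simple prefix match (so /dev/sdb also matches /dev/sdb1), and the second argument is only there so it can read something other than /proc/mounts:

```shell
# device_is_mounted DEVICE [MOUNTS_FILE]: succeed if DEVICE or any of its
# partitions shows up as a mounted source. Hypothetical helper; uses a
# plain prefix match on the first column of the mount table.
device_is_mounted() {
    awk -v dev="$1" 'index($1, dev) == 1 { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Usage before running dd (not run here):
#   device_is_mounted /dev/sdb && echo "unmount /dev/sdb partitions first"
```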
"how to choose startup device BIOS" & [your computer type]e.g., "ThinkPad T420".
sudo nano /etc/systemd/logind.conf
#HandleLidSwitch=suspend
HandleLidSwitch=ignore
sudo service systemd-logind restart
ping grendel-a