Friday, January 27, 2017

Command line system resource monitor

Shows cpu usage, memory, swap.
sudo apt-get install htop
htop

Show CPU info via command line

This gives a ton of information - way more than I generally ever need.
less /proc/cpuinfo
This is the tidy version.
lscpu
These are the max and min clock speeds of the CPU:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq 
This is a cool command to keep track of the current CPU clock speed.
sudo watch -n 1  cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
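If the cpufreq files aren't there (not every machine exposes them), the MHz readings in /proc/cpuinfo are a rough substitute - something like this should work:
watch -n 1 "grep MHz /proc/cpuinfo"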


Thursday, January 26, 2017

Ubuntu Server 16.04 not detecting wifi card

This turns out to be a relatively simple issue.

This command is my go-to for internet connection diagnostics, but it wasn't showing my wifi card.
ifconfig
This shows all interfaces, not just the activated ones.
ifconfig -a
To activate a hardware device after determining its name, run:
ifconfig "device name" up
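For example, if the wireless card showed up as wlp1s0 (yours will probably differ), that would be:
sudo ifconfig wlp1s0 up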
Bonus: this gives you way more than you'll ever need to know about the hardware capabilities of the device.
iw list

(Internet via Ethernet) + (SSH via wireless)

= (sustained SSH during MPICH computing)

I've been losing my SSH connection after starting process jobs on my new beowulf cluster.  This is my current fix.  My theory is that the network switch gets so clogged with MPI-related communication (which does take place via ssh) that there's no bandwidth left for my administrative SSH connection.  That theory is supported by the observation that when I plug an unrelated control machine into the switch, it can't ping Google.

assumptions

  • Ubuntu 16.04.1 LTS
  • working wireless and ethernet: I had to do this and this.
  • interface names: enp2s0 (ethernet) and wlp1s0 (wireless) - substitute your own names throughout.

sources

check wifi hardware capability

Run the command 
iw list
And look for a section like the following.  If it includes 'AP' (as in the list below), you're golden.  If not, look for a different wireless card.
Supported interface modes:  
         * IBSS 
         * managed  
         * AP 
         * AP/VLAN  
         * monitor 

install dependencies

sudo apt-get install rfkill hostapd hostap-utils iw dnsmasq   

identify interface names

As of Ubuntu 16.04, the standard wlan0 and eth0 interface names are no longer in use.  You'll have to identify your machine's names specifically.  Use the following command, which lists the contents of the folder for each interface device, and look for the device that has a folder named 'wireless'. src.
ls /sys/class/net/*
Observe the assumptions above to see what I'm calling them.
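If you'd rather have the shell pick it out for you, this little loop (just a convenience - it's the same 'wireless' folder check as above) prints only the wireless interface names:
for d in /sys/class/net/*; do [ -d "$d/wireless" ] && basename "$d"; done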

configure wifi settings

There are three files you'll have to configure.  Since I'm logged in via ssh, I don't want to interrupt my connection until I've created a new access point I can connect to.  So I'll walk through editing each file in turn, and then there's one command at the end that activates all the changes.

configure wireless interface: /etc/network/interfaces

Backup your current interface file.
sudo cp /etc/network/interfaces /etc/network/interfaces.bak
and then edit the original
sudo nano /etc/network/interfaces
replace the contents of the file - change the interface names as appropriate.
auto lo
iface lo inet loopback

auto enp2s0
iface enp2s0 inet dhcp

auto wlp1s0
iface wlp1s0 inet static
hostapd /etc/hostapd/hostapd.conf
address 192.168.3.14
netmask 255.255.255.0
Normally I'd say that here's where you restart the interface, but we're saving that for the end.

configure the access point: /etc/hostapd/hostapd.conf

backup the original file - it's ok if there's nothing there.
sudo cp /etc/hostapd/hostapd.conf /etc/hostapd/hostapd.conf.bak
edit the original
sudo nano /etc/hostapd/hostapd.conf
put this in:
interface=wlp1s0
driver=nl80211
ssid=test
hw_mode=g
channel=1
macaddr_acl=0
auth_algs=1
ignore_broadcast_ssid=0
wpa=3
wpa_passphrase=1234567890
wpa_key_mgmt=WPA-PSK
wpa_pairwise=TKIP
rsn_pairwise=CCMP
Inexplicably, this only seems to produce a detectable wifi access point when the ssid is 'test'.  I tried several other non-keyword names, and none of them worked; switch back to 'test', and it works again.  Did it several times...magic.

Save and exit.
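If you want to sanity-check the access point config before the big restart command at the end (optional - this is just how I'd debug it), hostapd can be run in the foreground with debug output.  Ctrl-C stops it; don't worry if it complains at this stage, since the interface hasn't been reconfigured yet.
sudo hostapd -dd /etc/hostapd/hostapd.conf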

configure the DHCP server

This is where the access point actually becomes something you can access.  Back up the original:
sudo cp /etc/dnsmasq.conf /etc/dnsmasq.conf.bak
edit original - since the file is so big, I rm'd the original and pasted the contents below into an empty file.
sudo rm /etc/dnsmasq.conf
sudo nano /etc/dnsmasq.conf
make it look like this:
# Never forward plain names (without a dot or domain part)
domain-needed

# Only listen for DHCP on the wireless interface
interface=wlp1s0

# Create a domain if you want, comment it out otherwise
# domain=Pi-Point.co.uk

# Create a DHCP range on your /24 wlp1s0 network with a 12 hour lease time
dhcp-range=192.168.3.15,192.168.3.254,255.255.255.0,12h
Save and exit.

implement changes

This is going to be one big command.  If it works, you're in business...if it doesn't, you'll have to log in directly to the machine for troubleshooting.
sudo ifdown wlp1s0; sudo ifup wlp1s0; sudo service hostapd restart; sudo service dnsmasq restart
Worked for me: I now have a secondary wireless access to my beowulf cluster for when the ethernet gets clogged with MPI signals.
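To double-check from the server side that everything came up (just a sanity check, not part of the original walkthrough), something like this should do:
sudo systemctl status hostapd dnsmasq
ip addr show wlp1s0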



Wednesday, January 25, 2017

Basic Vim

The cheat sheets and guides out there don't seem to provide a practical intro to Vim.  I'm not able to use VS Code on one of my primary interfaces, so I'm looking for the next best thing.  Vim, so I've heard, is probably it.  This is a great little tutorial to introduce the basics.

There are two modes: command mode and insert mode.  Command mode is where you do things that would normally be accessed via the cursor, arrow keys, or a menu; insert mode is where you type letters and they appear on the screen, and you can use the arrow keys like you're used to.  When you open vim, you start in command mode.

This should get you to about a nano level of proficiency.

Basic Usage

open foo                          | vim foo
save file                         | :w
quit file                         | :q
Command mode                      | [ESC]
Move cursor left, down, up, right | h, j, k, l
Insert here                       | i
Insert new line below             | o
Delete char under cursor          | x
Here's a nice cheat sheet for further use.

Tuesday, January 24, 2017

RNN with Torch and MPI

This is being installed on machines running Ubuntu Server 16.04.1 LTS.  Does not work on Linux Mint (the torch install script doesn't detect that OS).

Most of the following installations have to be performed on each computer.  I didn't re-download everything, since it was going to be put in the same place, but I did cd in and re-run the installation procedure.  That ensured the necessary files were added to all the right places elsewhere in the system.

Here, I'm walking through the process of running Torch on a cluster.  CPUs, not GPUs.  The performance benefit comes from the slave nodes being allowed greater latitude in searching for local optima to 'solve' the neural net.  Every so often, they 'touch base' with the master node and synchronize the result of their computations.  Read the abstract of Sixin Zhang's paper to get a more detailed idea of what's happening.  As far as the implementation goes, "the idea is to transform the torch data structure (tensor, table etc) into a storage (contiguous in memory) and then send/recv [sic] it." src.

Background Sources

Keep track of where I found the info I used to figure this out.

https://bbs.archlinux.org/viewtopic.php?id=159999
http://torch.ch/docs/getting-started.html
https://groups.google.com/forum/#!topic/torch7/Xs814a5_xgI

Set up MPI (beowulf cluster)

Follow the instructions in these two posts first.  They get you to the point of a working cluster, starting from a collection of unused PCs and the relevant hardware.

https://nixingaround.blogspot.com/2017/01/a-homebrew-beowulf-cluster-part-1.html
https://nixingaround.blogspot.com/2017/01/a-homemade-beowulf-cluster-part-2.html

prevent SSH from losing connection

I had some trouble here, where I was trying to use ssh over the same wires that were providing MPI communication in the cluster.  I kept losing connection after initializing the computations.  It may not be necessary, so I wouldn't do this unless you run into trouble of that sort.  

https://nixingaround.blogspot.com/2017/01/internet-via-ethernet-ssh-via-wireless.html

Ok, that's not an optimal solution. Better to initialize a virtual terminal (e.g., a screen or tmux session) and run the computations in that.  When the connection is inevitably dropped, just recover that terminal.

http://unix.stackexchange.com/questions/22781/how-to-recover-a-shell-after-a-disconnection

Install Torch

Note: it may be useful to install the MKL library ahead of torch.  It accelerates the math routines that I assume will be present in the computations I'm going to perform.  

This provides dependencies needed to install the mpiT package that lets Torch7 work with MPI.  Start in the breca home directory.  On the master node, run the following.
cd
git clone https://github.com/torch/distro.git ~/torch --recursive
Then, on all nodes (master and slave), run the following from the breca account:
cd ~/torch; bash install-deps
./install.sh
[I'm not sure, but I think MPICH has to be reinstalled after GCC 4.x is installed with the dependencies.  Leaving this note here in case of future problems.]

After the install script finished running, it told me that it had not updated my shell profile.  So we're adding a line to the ~/.profile script.  (We're using that rather than .bashrc because bash isn't automatically run when logging in to the breca account; if I ever forget and try to use Torch without bash, this avoids problems.)

Do the following on all nodes:
echo ". /mirror/breca/torch/install/bin/torch-activate" | sudo tee -a /mirror/breca/.profile
Now re-run the file, so the code you added is executed.
source ~/.profile
Installing this way allows you to only download the package once, but use it to install the software to all nodes in the cluster.  (and as a side note, the install-deps script doesn't detect Linux Mint - it's one of the reasons this walk-through is using Ubuntu Server)

Test that Torch has been installed:
th
Close the program
exit

MPI compatibility with Torch

Source: https://github.com/sixin-zh/mpiT

Do this on the master node. You'll be able to access the downloaded files from all the nodes - they're going in the /mirror directory. Download from github and install.
cd ~/
mkdir -p tools && cd tools
git clone https://github.com/sixin-zh/mpiT
cd
Now do the rest of the steps on all the nodes, master and slave.
cd 
cd tools/mpiT
By default, MPI_PREFIX should be set to /usr.  See link.
export MPI_PREFIX="/usr"
echo "export MPI_PREFIX='/usr'" >> ~/.profile
Since I'm working with MPICH rather than OpenMPI (see cluster installation notes above), use the mvapich rockspec:
luarocks make mpit-mvapich-1.rockspec

Tests

First, figure out how many processors you have.  You did already; that's the sum of the numbers in your machinefile in the /mirror directory.  We'll say you have 12 cores.  I ask for 11, leaving one core free for administrative overhead.  Adjust according to your actual situation.
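If you'd rather not add it up by hand, a quick sketch (assuming the hostname:cores format used in the machinefile from part 2) would be:
awk -F: '{ total += $2 } END { print total }' /mirror/machinefile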

Next, use a bunch of terminals and log into each of your nodes simultaneously.  Install:
sudo apt-get install htop 
And run
htop
on each machine and watch the CPU usage as you perform the following tests.  If only the master node shows activity, you have a problem.  

Create ~/data/torch7/mnist10 in the home directory, and then download the test data to that location.  Ensure you're logged in as the MPI user.
mkdir -p ~/data/torch7/mnist10/ && cd ~/data/torch7/mnist10
wget http://cs.nyu.edu/~zsx/mnist10/train_32x32.th7
wget http://cs.nyu.edu/~zsx/mnist10/test_32x32.th7
cd ~/tools/mpiT
Now run the tests. Sanity check: did mpiT install successfully? Note: I ran into an 'error 75' at this point, and the solution was to explicitly define the location of the files involved starting from the root directory. 
mpirun -np 11 -f /mirror/machinefile th /mirror/breca/tools/mpiT/test.lua
Check that the MPI integration is working.  Move down to the folder with the asynchronous algorithms.
cd asyncsgd
I think this test only needs to run on the master node - as long as you've installed everything to all the nodes (as appropriate), it doesn't need to be run everywhere.  I think it's just checking that Torch is successfully configured to run on a CPU.
th claunch.lua
Test bandwidth: I have no idea what this does, but it fails if the requested number of processors is odd.  I'm sticking with the default of 4 processors, which (I'm guessing) is the number on a single node.  As long as it works...?  It seems to be checking the bandwidth through the cluster.  There isn't a whole lot of documentation.
mpirun -np 4 -f ../../../../machinefile th ptest.lua 
Try parallel mnist training - this is the one that should tell you what's up.  AFAIK, you'll probably end up using a variant of this code to run whatever analysis you have planned.  If you look inside, you'll notice that what you're running is some kind of abstraction - the algorithm (such as it is for a test run) seems to be implemented in goot.lua.  In fact, this is a 'real-world' test of sorts - the MNIST data set is the handwritten character collection researchers like to use for testing their models.
mpirun -np 11 -f ../../../../machinefile th mlaunch.lua
and this is as far as I've actually made it without errors (up to this point, barring abnormalities in the PCs used, everything works perfectly for me).

Install Word RNN

Clone the software from github.
mkdir ~/projects
cd ~/projects
git clone https://github.com/larspars/word-rnn.git
That's actually all there is to it.  Now cd into the word-rnn directory to run the test stuff.  Before the tests and tools, though, there's a fix that you have to perform.

Saturday, January 21, 2017

A Homemade Beowulf Cluster: Part 2, Machine Configuration

This section starts with a set of machines all tied together with an ethernet switch and running Ubuntu Server 16.04.1.  If the switch is plugged into the local network router, then the machines can be ssh'd into.

This should be picking up right where Part 1 left off. src.  So.

Enabling Scripted, Sudo Remote Access

The first step in the configuration process is to modify the root-owned host files on each machine.  I'm not doing that by hand, and I've already spent way too long trying to find a way to edit root-owned files through ssh automatically.

It's not possible without "security risks".  Since this is a local cluster, and my threat model doesn't include -- or care about -- people hacking in to the machines or me messing things up, I'm going the old fashioned way.  I also don't care about wiping my cluster accidentally, since I'm documenting the exact process I used to achieve it (and I'm making backups of any data I create).

Log into each machine in turn, and enter the password when prompted.
ssh beowulf@grendel-[X]
Recall that the password is
hrunting
Create a password for the root account.
sudo passwd root
At the prompt, enter your password.  We'll assume it's the same as the previously-defined user.
hrunting
Now the root account has a password, but it's still locked.  Time to unlock it.
sudo passwd -u root 
Note: if you ever feel like locking the root account again, run this:
sudo passwd -l root
Now you have to allow the root user to login via ssh.  Change an option in this file:
sudo nano /etc/ssh/sshd_config
Find the line that says:
PermitRootLogin prohibit-password
and comment it out (so you have a record of the default configuration) and add a new line below it. They should look like this:
#PermitRootLogin prohibit-password
PermitRootLogin yes
[CTRL-O] and [CTRL-X] to exit, then run:
sudo service ssh restart
That's it!  Now we can use sshpass to automatically login to the machines and modify root files.  Be careful; there is nothing between you and total destruction of your cluster.
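Just to illustrate the kind of thing sshpass enables (this is a hypothetical one-off, not the actual script used below):
sudo apt-get install sshpass
sshpass -p 'hrunting' ssh -o StrictHostKeyChecking=no root@grendel-b "echo '192.168.133.102 grendel-c' >> /etc/hosts"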

Upload a custom /etc/hosts file to each machine

I created a script to do this for me.  If I could have found a simple way to set static IPs that would have been preferable, but this way I don't have to manually rebuild the file every time the cluster is restarted.

Note: for now, this isn't compatible with my example - it only uses node increments of digits, while my example is using letters (grendel-b vs grendel-1).  I'll fix that later.  For now, I'd recommend reading all the way to the end of the walkthrough before starting, and just using numbers for your node increments.

Run the script from a separate computer that's on the local network (i.e., that can ssh into the machines), but which isn't one of the machines in the cluster.  Usage of the script goes like this:
bash create_hosts_file.sh [MACHINE_COUNT] [PASSWORD] [HOSTNAME_BASE]
Where HOSTNAME_BASE is the standard part of the hostname of each computer - if the computers were named grendel-a, grendel-b, and grendel-c, then the base would be "grendel-".

So, continuing the example used throughout and pretending there's 5 machines in total, this is what the command would look like:
mkdir -p ~/scripts && cd ~/scripts
wget https://raw.githubusercontent.com/umhau/cluster/master/create_hosts_file.sh
bash create_hosts_file.sh 5 "hrunting" "grendel-"
If you don't get any errors, then you're all set! You can check the files were created by ssh'ing into one of the machines and checking /etc/hosts.
ssh beowulf@grendel-a
cat /etc/hosts
The output should look something like this:
127.0.0.1     localhost
192.168.133.100 grendel-a
192.168.133.101 grendel-b
192.168.133.102 grendel-c
192.168.133.103 grendel-d
If it doesn't look like that, with a line for localhost and one line after it for each machine, you're in trouble.  Google is your friend; it worked for me.

Creating a Shared Folder Between Machines

This way, I can put my script with fancy high-powered code in one place, and all the machines will be able to access it.

First, dependencies.  Install this one just on the 'master' node/computer (generally, the most powerful computer in the cluster, and definitely the one you labelled #1).
sudo apt-get install nfs-server
Next, install this on all the other machines:
sudo apt-get install nfs-client
Ok, we need to define a folder that can be standardized across all the machines: same idea as having a folder labeled "Dropbox" on each computer that you want your Dropbox account synced to - except in this case, the syncing is a little different.  Anything you put in the /mirror folder of the master node will be shared across all the other computers, but anything you put in a /mirror folder of the other nodes will be ignored.  That's why it's called a 'mirror' - there's a single folder that's being 'mirrored' by other folders.

We'll put it in the root directory.  Since we're mirroring it across all the machines, call it 'mirror'. Do this on all the machines:
sudo mkdir /mirror
Now go back to the master machine, and tell it to share the /mirror folder to the network: add a line to the /etc/exports file, and then restart the service.
echo "/mirror *(rw,sync)" | sudo tee -a /etc/exports
sudo service nfs-kernel-server restart
Maybe also add the following to the (rw,sync) options above:

  • no_subtree_check: This option prevents the subtree checking. When a shared directory is the subdirectory of a larger filesystem, nfs performs scans of every directory above it, in order to verify its permissions and details. Disabling the subtree check may increase the reliability of NFS, but reduce security.
  • no_root_squash: This allows the root account to connect to the folder.

Great!  Now there's a folder on the master node on the network that we can mount and automatically get stuff from.  Time to mount it.

There are two ways to go about this: one, we could manually mount on every reboot, or two, we could automatically mount the folder on each of the 'slave' nodes.  I like the second option better.

There's a file called fstab in the /etc directory.  It stands for 'file system table'.  This is what the OS uses on startup to know which partitions to mount.  What we're going to do is add another entry to that file - on every startup, it'll know to mount the network folder and present it like another external drive.

On each non-master machine (i.e., all the slave machines) run this command to append a new entry to the bottom of the fstab file.  The bit in quotes is the part getting added.
echo "grendel-a:/mirror    /mirror    nfs" | sudo tee -a /etc/fstab
That line is telling the OS a) to look for a drive located at grendel-a:/mirror, b) to mount it at the location /mirror, and c) that the drive is a 'network file system'.  Remember that if you're using your own naming scheme to change 'grendel-a' to whatever the hostname of your master node is.

Now, in lieu of rebooting the machines, run this command on each slave machine to go back through the fstab and remount everything according to whatever it (now) says.
sudo mount -a

Establishing A Seamless Communication Protocol Between Machines

Create a new user

This user will be used specifically for performing computations.  If beowulf is the administrative user, and root is being used as the setting-stuff-up-remotely-via-automated-scripts user, then this is the day-to-day-heavy-computations user.

The home folder for this user will be inside /mirror, and it's going to be given the same userid across all the accounts (I picked '1010') - we're making it as identical as possible for the purposes of using all the machines in the cluster as a single computational device.

We'll call the new user 'breca'.  Just for giggles, let's make the password 'acerb'.  Run the first command on the master node first, and the slaves afterwards.
useradd --uid 1010 -m -d /mirror/breca breca
Set a password.  Run on all nodes.
passwd breca
Add breca to the sudo group.
sudo adduser breca sudo
Since 'breca' will be handling all the files in the /mirror directory, we'll make that user the owner.  Run this only on the master node.
sudo chown -R breca: /mirror

Setting up passwordless SSH for inter-node communication

Next, a dependency.  Install this to each node (master and slaves):
sudo apt-get install openssh-server
Next, login to the new user on the master node.
su - breca
On the master node, generate an RSA key pair for the breca user.  Keep the default location.  If you feel like it, you can enter a 'strong' passphrase, but we've already been working under the assumption security isn't important here.  Do what you like; nobody is going after your cluster (you hope).
ssh-keygen -t rsa
Add the key to your 'authorized keys'.
cd .ssh
cat id_rsa.pub >> authorized_keys
cd
And the nice thing is, what you've just done is being automatically mirrored to the other nodes.

With that, you should have passwordless ssh communication between all of your nodes.  Login to your breca account on each machine:
su - breca
 After logging in to your breca account on all of your machines, test your passwordless ssh capabilities by running -- say, from your master node to your first slave node --
ssh grendel-b
or from your second slave node into your master node:
ssh grendel-a
The only thing you should have to do is type 'yes' to confirm that some kind of fingerprint is authentic, and that's a first-time-only sort of thing.  However, because confirmation is requested, you have to perform the first login manually between each machine.  Otherwise communication could/will fail.  I haven't checked if it's necessary to ensure communication between slave nodes, so I did those too.

Note that since the same known_hosts file is shared among all the machines, it's only ever necessary to confirm a machine once.  So you could just log into all the machines consecutively from the master node, and once into the master node from one of the slaves, and all the nodes would thereafter have seamless ssh communication.
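If you'd rather script those first-time confirmations than do them by hand, something like this (a sketch, run as breca from the master node, assuming the grendel-* names used in this guide) should pre-accept the host keys:
for h in grendel-a grendel-b grendel-c; do
    ssh -o StrictHostKeyChecking=no "$h" hostname
done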

Troubleshooting

This process worked for me, following this guide exactly, so there's no reason it wouldn't work for you as well.  If a package is changed since the time of writing, however, it may fail in the future.  See section 7 of this guide to set up a keychain, which is the likely solution.

If, after rebooting, you can no longer automatically log into your breca account within the node (master-to-slave, etc.), the /mirror mounting procedure may have been interrupted - e.g., a network disconnect when /etc/fstab was processed, such that grendel-a:/mirror couldn't be found.  If that's the case, the machines can't connect without passwords because they don't have access to the RSA key stored in the missing /mirror/breca/.ssh directory.  Log into each of the affected machines and remount everything in the fstab.
sudo mount -a

Installing Software Tools

You've been in and out of the 'beowulf' and 'breca' user accounts while setting up ssh.  Now it's time to go back to the 'beowulf' account.  If you're still in the breca account, run:
exit
These are tools the cluster will need to perform computations.  It's important to install all of this stuff prior to the MPICH software that ties it all together - I think the latter has to configure itself with reference to the available software.

If you're going to be using any compilers besides GCC, this is the time to install them.

This installs GCC.  Run it on each computer.
sudo apt-get install build-essential
I'm probably going to want Fortran as well, so I'm including that.
sudo apt-get install gfortran

Installing MPICH

And now, what we've all been waiting for: the commands that will actually make these disparate machines act as a single cluster.  Run this on each machine:
sudo apt-get install mpich
You can test that the install completed successfully by running:
which mpiexec
which mpirun
The output should be:
/usr/bin/mpiexec
and
/usr/bin/mpirun

The Machinefile

This 'machinefile' tells the mpich software what computers to use for computations, and how many processors are on each of those computers.  The code you run on the cluster will specify how many processors it needs, and the master node (which uses the machinefile) will start at the top of the file and work downwards until it has found enough processors to fulfill the code's request.  

First, find out how many processors you have available on each machine (the output of this command will include virtual cores).  Run this on each machine.
nproc
Next, log back into the breca user on the master node:
su - breca
Create a new file in the /mirror directory of the master node and open it:
touch machinefile && nano machinefile 
The order of the machines in the file determines which will be accessed first.  The format of the file lists the hostnames with the number of cores they have available.
grendel-c:4
grendel-b:4
grendel-a:4
You might want to remove one of the master node's cores for control purposes.  Who knows?  Up for experimentation.  I put the master node last for a similar reason: the other machines get tied up first.

You should be up and running!  What follows is a short test to make sure everything is actually up and running.

Testing the Configuration

Go to your master node, and log into the breca account.
ssh beowulf@grendel-a
su - breca
cd into the /mirror folder.
cd /mirror
Create a new file called mpi_hello.c
touch mpi_hello.c && nano mpi_hello.c
Put the following code into the file, and [ctrl-o] and [ctrl-x] to save and exit.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int myrank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello from processor %d of %d\n", myrank, nprocs);

    MPI_Finalize();
    return 0;
}
Compile the code with the custom MPI C compiler:
mpicc mpi_hello.c -o mpi_hello
And run.
mpiexec -n 11 -f ./machinefile ./mpi_hello
Here's a breakdown of the command:
mpiexec              command to execute an mpi-compatible binary

-n 11                the number of cores to ask for - this should not be
                     more than the sum of cores listed in the machinefile

-f ./machinefile     the location of the machinefile

./mpi_hello          the name of the binary to run
If all went as hoped, the output should look like this:
Hello from processor 0 of 11
Hello from processor 1 of 11
Hello from processor 2 of 11
Hello from processor 3 of 11
Hello from processor 4 of 11
Hello from processor 5 of 11
Hello from processor 6 of 11
Hello from processor 7 of 11
Hello from processor 8 of 11
Hello from processor 9 of 11
Hello from processor 10 of 11
Make sure the sum of the processors listed in your machinefile covers the number you asked for in the mpiexec command.

Note that you can totally ask for more processors than you actually listed - the MPICH software will just assign multiple processes to each core to fulfill the request.  It's not efficient, but better than errors.
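For instance (purely illustrative), asking for twice the cores listed would still run, just with extra processes sharing cores:
mpiexec -n 24 -f ./machinefile ./mpi_hello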

And that's it!  You have a working, tested beowulf cluster.

Thursday, January 19, 2017

When Ubuntu Server doesn't let you keep the internet device drivers

So I ran into a problem while setting up Ubuntu Server 16.04.1 LTS for my cluster: one of my computers had restricted hardware drivers for the wifi and ethernet devices, and while the drivers were available on the install media, they weren't transferred over.  Something about non-free stuff.  So after installation, I was stuck with a computer that had no access to the web.

After a ton of searching, I a) found what the hardware problem was, and b) realized that I can use the install media as the installation source - that might sound obvious, but the keyword there is "can", not "should be able to". It's badly documented and very not-obvious.

Anyway, here's the source for using the install media.  Part of the install process is choosing which sets of packages you want to include - things like OpenSSH Server stuff or Mail Server stuff.  There's also one called Manual Package Selection.  It didn't seem to do anything during the install (though I did walk away, and it might have timed out), but you can reenter the tool after the installation is finished and you've rebooted into the new OS.

After rebooting, log in and plug the ubuntu server install media into the computer.  Run:
sudo tasksel install manual
This command will find the install media, and 'install' the manual package selection 'package'.  In other cases, it would actually install stuff - in this case, it just gives you a shell prompt.  At this prompt, you somehow have full internet access - I don't know how it happened.  It was automagical.  In my case, the next step was editing the sources list to enable the universe repository (already seemed enabled - automagic of the tasksel command?).  Source on the actual internet driver fix.
sudo nano /etc/apt/sources.list
And a line that looks like this
deb http://us.archive.ubuntu.com/ubuntu/ xenial main restricted
Should be changed to look like this:
deb http://us.archive.ubuntu.com/ubuntu/ xenial main restricted universe
[ctrl]-o to save and [ctrl]-x to leave.  Then, run 
sudo apt-get install r8168-dkms
And the driver is installed.  Reboot and the internet should work.

Install a package without internet

So first of all, this isn't original.  Credit goes here.  But it's fantastic, and I wish I'd known about this a long time ago.  As usual, for my own memory/use: and actually, I'm just going to clean up what the other guy said.  He did a great job.

On the Internet-less computer:

In the terminal enter:
PACKAGENAME=<The name of the Package to install>
and then
apt-get -qqs install $PACKAGENAME | grep Inst | awk '{print $2}' | xargs apt-cache show | grep 'Filename: ' | awk '{print $2}' | while read filepath; do echo "wget \"http://archive.ubuntu.com/ubuntu/${filepath}\""; done >downloader.sh
A ready-to-use downloader for the package has now been created in the home folder.  Open your home directory in the file browser and move the file downloader.sh to the top-level directory of your flash drive.  Then eject your flash drive.
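For example, if the package you were after was htop (just an illustration), the first step would simply be:
PACKAGENAME=htop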

On the computer with Internet:

Insert your flash drive, and open your flash drive in the file browser.  Copy the location of your flash drive:
[CTRL]-L
[CTRL]-C
Move into the directory of the flash drive.  In a terminal this time, type:
cd [CTRL]+[SHIFT]+V 
Run the downloader:
bash ./downloader.sh
Wait for the download to complete and eject your flash drive.

Back to the Internet-less computer:

Open your flash drive in the file browser.  In the browser, type the following to copy the file location of the flash drive.
[CTRL]-L
[CTRL]-C
Move into the directory of the flash drive.  In a terminal this time, type:
cd [CTRL]+[SHIFT]+V 
sudo dpkg --install *.deb
That's it!

Wednesday, January 18, 2017

A Homemade Beowulf Cluster: Part 1, Hardware Assembly

A beowulf cluster lets me tie miscellaneous computers together and use their cpus like one large processor...I think.  Never done this before, still working on the details.

I'm building this with random laptops: generally i5s, and I think one's a Core Duo - it's a half-decent thing from 2011.  Might even throw in an RPi2 for good measure.

Make sure you read through this before starting.  You want to know what you're getting into.  Watch out, though - this is a long post.

Notes

Since we're working on multiple computers here, not everything is going to be cut-and-paste.  I will make sure that it's as clear as possible, however.  There won't be any hand-waving or assumptions of prior knowledge.

I'm doing this with Linux Mint 18 on my primary laptop.

Primary Sources

Setting up the cluster: src 1,  src 2, src 3

Hardware Ingredients

  • Since the benefit of this tool is sharing computations between computers, you need a way to route that information.  Hence, an ethernet switch.  Go for a gigabit, since you don't want the switch to be your bottleneck.  I was cheap and got myself the 5-port version, and I'm already kicking myself.  Go for the 8-port version at least.  Here's the 5-port version I got, for consistency's sake. 
  • A ton of ethernet cables.  You can do with short ones, but you'll want to connect the switch to your router so you can access the computers from outside their own tiny network.
  • Leftover computers.  You won't be using these for anything else, so make sure you don't need them. 

Computer Preparation

Install Ubuntu Server to each computer. I used 16.04.1 LTS.  For those with a penchant for funny names (or who need to deal with annoying, obtuse colloquialisms on the web), that's the Xenial Xerus edition.

You'll need to keep track of computer names ("hostnames"), and install consistent users (so the username and password are the same on each machine).  I like to increment the computer names with letters.  For the purposes of this guide, we'll use:
hostname:    grendel-a
username:    beowulf
password:    hrunting

Create installation USB flash drive

Go here to download the 16.04.1 LTS server ISO.  That link initiates the download; here's a bit of context.

Burn the ISO to a USB flash drive. Note that you'll lose everything on the USB you use.  Run the following command twice, once before plugging in your USB and then a few seconds after.  The new entry when you run it the second time is your flash drive.
lsblk
For example, this is the output when I run that command with a flash drive plugged in.  Note on the right where it specifies the mount point (at /media/me/storage), and on the left where it shows me the name is sdb.
NAME                MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                   8:0    0 167.7G  0 disk 
├─sda1                8:1    0   487M  0 part /boot
└─sda5                8:5    0 167.2G  0 part 
  ├─mint--vg-root   252:0    0 159.3G  0 lvm  /
  └─mint--vg-swap_1 252:1    0   7.9G  0 lvm  [SWAP]
sdb                   8:16   1   1.9G  0 disk 
└─sdb1                8:17   1   1.9M  0 part /media/me/storage
If your USB is mounted, it has to be unmounted first - else weird things can happen in the next step.  Trust me: once I didn't unmount a partition before copying it with dd, and my MBR was wiped out instead.  In the example above, unmounting would work like this:
umount "/media/me/storage"
If we pretend your USB is the sdb device, this is the command you'd run (I'm assuming the ISO was saved to the default "~/Downloads" location).  Swap out the 'sdb' part with what the lsblk command indicated.  Also be aware that if you mess this up, you will probably destroy whatever computer you're running the command with.  Just FYI.
sudo dd if=~/Downloads/ubuntu-16.04.1-server-amd64.iso of=/dev/sdb bs=4M status=progress
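Once dd reports completion, it doesn't hurt to flush the write buffers before pulling the drive out (just a precaution):
sync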

Install to each computer

You're gonna have to follow this procedure with each computer in the cluster.  There isn't a simple way around it, that I'm aware of.

Plug in the flash drive, and reboot/turn on the computer.  Make sure it's connected to the web.  Press ESC, F1, F2, F11 or F12 to choose a startup device...if those don't work, Google:
"how to choose startup device BIOS" & [your computer type]
e.g., "ThinkPad T420".

When it boots up, there's a few settings to make sure of.  Most of them should be straightforward: choosing a keyboard layout and a default language, for instance.  Just in case of problems, here's a full walkthrough.
All you want is a simple installation.
Don't bother with detecting the layout...if you're in the US or have an english keyboard, you can just stick with the defaults.  Worst case, start over.  There's a few more images after that one to do keyboard stuff, but you get the idea.
grendel-a, grendel-b, etc.  Makes it easy to keep the computers straight. After you've entered the hostname as above, press [enter].
I'm using the same thing for the full name and the username.  Makes it simple. Press [enter].
Since the username is the same as the full name, just press [enter].
That's the standard password for the computers.  I used the [down arrow] and pressed [space] to select that radio button and show the password.  Just press [enter].  It'll ask you to reenter the password for verification; do so, and press [enter] again.
Don't bother with encryption; this is a cluster designed for speed, and that just makes disk access slower.  For the next screen, check your time zone.  If it's wrong, the adjustment screen is pretty intuitive.
Press [enter] here to go with the default option.  You want to use the entire disk, and LVM could be useful (it's also an incredible pain).  Press [enter] again on the next screen after verifying the destination for the install.  It's probably the largest drive - the small one is probably your flash drive.

It asks for confirmation; [right arrow] and [enter].
I like to use a standard size for the installation that leaves lots of room to spare for other things on the drives.  Might be able to do some kind of shared network storage for the computations with the rest.  If it lets you, go for 50.0 GB.   [enter].

Another confirmation.  [right arrow] + [enter].

Unless you have weird internet, leave this blank and press [enter].  If you do have a weird internet setting, then I can't help you.
I skipped the auto updates, since they won't positively impact the performance of the cluster, and might slow it down at times.
Ok, this is the one that really matters.  Addendum: also select 'Manual Package Selection'.  For some machines with obscure drivers, this seems to ensure the drivers get installed.  You need the OpenSSH server in order to communicate with your cluster via the terminal. (at least, I'm pretty sure you do.  I didn't want to spend the time to find out for sure.)  [arrow down] to it, then [space] to select.  [enter] to move on.
Yes; you want GRUB installed to the MBR.  [enter].
...and that's it.  Reboot, remove the flash drive, and the new OS is installed.

Final Adjustment: power management

There is one more thing, though: you'll be running a bunch of these computers in the cluster, and you don't want to have to deal with them individually.  For one thing, that would be a huge mess.  So to keep them consolidated, you'll want to keep their lids shut while they're running: and if you don't change a setting, they'll just go to sleep when you close the lid.  src
sudo nano /etc/systemd/logind.conf
Find the line:
#HandleLidSwitch=suspend
and change it to:
HandleLidSwitch=ignore
Then restart that bit of the system:
sudo service systemd-logind restart

Assembling the Cluster Hardware

This is pretty straightforward: start with your ethernet switch.
  1. Run a cable from your router to the #1 port on the switch.
  2. Run a cable from each computer in the cluster to a port on the switch.  I'm not going to try and correlate port numbers and computer numbers.  Pretty sure it doesn't matter.
And that's all it takes to assemble the cluster hardware!  You should be able to ping your computers and get a response:
ping grendel-a
If you don't get replies, you have a problem.  Google is your friend.  Worked for me.

Sunday, January 15, 2017

A Recurrent Neural Network to generate sentences: step-by-step assembly (Mint version)

This one predicts (ok, procedurally generates) text word-by-word.  Others do it letter-by-letter, but then you have to deal with misspellings.  src. Inspiration.

Dependencies

I'm basing this off an installation script referenced in the github readme (here).  Instead of blindly running the script, I checked its contents and discovered it fails when run on linux mint.  So I simplified it, and changed/added a thing or two since it wasn't working anyway.  In case anyone wants to blindly run my version, here it is all cleaned up.

Updating your sources is always a good idea prior to installing a bunch of software.  Also, the package below makes it easy to add new repositories (see below).
sudo apt-get update
sudo apt-get install -y python-software-properties
Now things are getting serious.  This stuff includes GCC 4.9, since it seems GCC 5 isn't compatible with Torch 7.  There was a weird error where cmake wasn't installed when I put all the packages in the same apt-get install command - I split out the packages to their own lines, ran my install script again, and it worked.  Very odd.  Hence, my inefficient code below.
sudo apt-get install -y build-essential 
sudo apt-get install -y gcc 
sudo apt-get install -y g++ 
sudo apt-get install -y curl 
sudo apt-get install -y cmake 
sudo apt-get install -y libreadline-dev 
sudo apt-get install -y git-core 
sudo apt-get install -y libqt4-core 
sudo apt-get install -y libqt4-gui 
sudo apt-get install -y libqt4-dev 
sudo apt-get install -y libjpeg-dev 
sudo apt-get install -y libpng-dev 
sudo apt-get install -y ncurses-dev 
sudo apt-get install -y imagemagick 
sudo apt-get install -y libzmq3-dev 
sudo apt-get install -y gfortran 
sudo apt-get install -y unzip 
sudo apt-get install -y gnuplot 
sudo apt-get install -y gnuplot-x11 
sudo apt-get install -y ipython
sudo apt-get install -y libpcre3-dev
These are for OpenBLAS, an implementation of BLAS (Basic Linear Algebra Subprograms), plus LAPACK.  Here's a bit more on the concept.
sudo apt-get install -y libopenblas-dev liblapack-dev
Install Torch (finally).  Note that when running the ./install.sh command, I'm letting bash automatically enter "Y" when prompted for a user response.  It's requesting permission to add torch to the system path in the .bashrc: essentially, creating a system shortcut for each initialization of the bash terminal.
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; 
echo "Y" | ./install.sh 
source ~/.bashrc
These tools are installed with Lua, a programming language that came with Torch.
/home/$USER/torch/install/bin/luarocks install nngraph
/home/$USER/torch/install/bin/luarocks install nninit 
/home/$USER/torch/install/bin/luarocks install optim
/home/$USER/torch/install/bin/luarocks install nn
/home/$USER/torch/install/bin/luarocks install underscore.lua --from=http://marcusirven.s3.amazonaws.com/rocks/
sudo /home/$USER/torch/install/bin/luarocks install lrexlib-pcre PCRE_DIR=/lib/x86_64-linux-gnu/ PCRE_LIBDIR=/lib/x86_64-linux-gnu/ 
Make sure to add Torch to your PATH.  This ensures that when you run the command to use Torch, your computer can actually find the program.
to_path="/home/$USER/torch/install/bin"
echo 'export PATH=$PATH:'"$to_path" >> /home/$USER/.bashrc
source ~/.bashrc
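As a quick sanity check that the PATH change took (optional), this should print the torch install location:
which th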
With that, all dependencies should be satisfied. Time to install.

Installation

Clone the software from github.
mkdir ~/projects
cd ~/projects
git clone https://github.com/larspars/word-rnn.git
That's actually all there is to it.  Now cd into the word-rnn directory to run the test stuff.  Before the tests and tools, though, there's a fix that you have to perform.

Extra Fixes

I'm running torch on a gpu-less computer.  There's a glitch that occurs when running the test script in that scenario.  To avert it, you have to change the name of a function from CudaTensor to Tensor.  See here for details.  
cd word-rnn/util
nano SharedDropout.lua
The third line should look like this:
SharedDropout_noise = torch.CudaTensor()
Change it to this:
SharedDropout_noise = torch.Tensor()
And save.
CTRL + O
CTRL + X

Running word-rnn

Now you can run the test function.  Be aware that on a 4-core 2.8GHz i7 processor, this command took 21 hours to complete.  BUT it's a very cool command, and the result is amazing.
th train.lua -gpuid -1
To be continued...

Final post here

I'm switching over to GitHub Pages.  The continuation of this blog (with archives included) is at umhau.github.io.