Saturday, January 21, 2017

A Homemade Beowulf Cluster: Part 2, Machine Configuration

This section starts with a set of machines all tied together with an ethernet switch and running Ubuntu Server 16.04.1.  If the switch is plugged into the local network router, then the machines can be ssh'd into.

This picks up right where Part 1 left off.

Enabling Scripted, Sudo Remote Access

The first step in the configuration process is to modify the root-owned /etc/hosts file on each machine.  I'm not doing that by hand, and I've already spent way too long trying to find a way to edit root-owned files over ssh automatically.

It's not possible without "security risks".  Since this is a local cluster, and my threat model doesn't include -- or care about -- people hacking into the machines or me messing things up, I'm going the old-fashioned way and enabling root login over ssh.  I also don't care about wiping my cluster accidentally, since I'm documenting the exact process I used to build it (and I'm making backups of any data I create).

Log into each machine in turn, and enter the password when prompted.
ssh beowulf@grendel-[X]
Recall that the password is
hrunting
Create a password for the root account.
sudo passwd root
At the prompt, enter a password.  We'll assume it's the same as the one for the previously-created user.
hrunting
Now the root account has a password, but it's still locked.  Time to unlock it.
sudo passwd -u root 
Note: if you ever feel like locking the root account again, run this:
sudo passwd -l root
Now you have to allow the root user to log in via ssh.  Change an option in this file:
sudo nano /etc/ssh/sshd_config
Find the line that says:
PermitRootLogin prohibit-password
and comment it out (so you have a record of the default configuration) and add a new line below it. They should look like this:
#PermitRootLogin prohibit-password
PermitRootLogin yes
[CTRL-O] to save and [CTRL-X] to exit, then run:
sudo service ssh restart
That's it!  Now we can use sshpass to automatically log in to the machines and modify root-owned files.  Be careful; there is nothing between you and total destruction of your cluster.
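For example, once sshpass is installed on the computer you'll be scripting from (sudo apt-get install sshpass), a quick sanity check of the new root access might look like this - the hostname is just from the running example:
sshpass -p 'hrunting' ssh -o StrictHostKeyChecking=no root@grendel-a 'whoami'
If that prints 'root' without asking for a password, scripted root access is working.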

Upload a custom /etc/hosts file to each machine

I created a script to do this for me.  If I could have found a simple way to set static IPs, that would have been preferable; as it is, the script means I don't have to manually rebuild the file every time the cluster is restarted.

Note: for now, this isn't compatible with my example - the script only increments node names with digits, while my example uses letters (grendel-b vs grendel-1).  I'll fix that later.  For now, I'd recommend reading all the way to the end of the walkthrough before starting, and just using numbers for your node suffixes.

Run the script from a separate computer that's on the local network (i.e., one that can ssh into the machines) but that isn't itself one of the machines in the cluster.  Usage of the script goes like this:
bash create_hosts_file.sh [MACHINE_COUNT] [PASSWORD] [HOSTNAME_BASE]
Where HOSTNAME_BASE is the standard part of the hostname of each computer - if the computers were named grendel-a, grendel-b, and grendel-c, then the base would be "grendel-".

So, continuing the example used throughout and pretending there are 5 machines in total, this is what the commands would look like:
mkdir -p ~/scripts && cd ~/scripts
wget https://raw.githubusercontent.com/umhau/cluster/master/create_hosts_file.sh
bash create_hosts_file.sh 5 "hrunting" "grendel-"
If you don't get any errors, then you're all set! You can check that the file was created by ssh'ing into one of the machines and looking at /etc/hosts.
ssh beowulf@grendel-a
cat /etc/hosts
The output should look something like this:
127.0.0.1     localhost
192.168.133.100 grendel-a
192.168.133.101 grendel-b
192.168.133.102 grendel-c
192.168.133.103 grendel-d
192.168.133.104 grendel-e
If it doesn't look like that, with a line for localhost and one line after it for each machine, you're in trouble.  Google is your friend; it worked for me.
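If you're curious what's going on under the hood, here's a rough sketch of how such a hosts-file generator could work.  This is not the actual create_hosts_file.sh from the repo, just an illustration; it assumes each node's current IP is resolvable by hostname on the local network (which it must be, since we've been ssh'ing in by hostname) and that sshpass is installed.
#!/bin/bash
# Hypothetical sketch only -- not the real create_hosts_file.sh.
COUNT="$1"      # number of machines in the cluster
PASSWORD="$2"   # the root password set earlier
BASE="$3"       # e.g. "grendel-"

HOSTS_FILE="$(mktemp)"
printf '127.0.0.1\tlocalhost\n' > "$HOSTS_FILE"

# build one "<ip> <hostname>" line per node
for i in $(seq 1 "$COUNT"); do
    ip="$(getent hosts "${BASE}${i}" | awk '{ print $1 }')"
    printf '%s\t%s\n' "$ip" "${BASE}${i}" >> "$HOSTS_FILE"
done

# push the finished file to every node as root
for i in $(seq 1 "$COUNT"); do
    sshpass -p "$PASSWORD" scp -o StrictHostKeyChecking=no "$HOSTS_FILE" "root@${BASE}${i}:/etc/hosts"
done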

Creating a Shared Folder Between Machines

This way, I can put my script with fancy high-powered code in one place, and all the machines will be able to access it.

First, dependencies.  Install this one just on the 'master' node/computer (generally, the most powerful computer in the cluster, and definitely the one you labelled #1).
sudo apt-get install nfs-kernel-server
Next, install this on all the other machines:
sudo apt-get install nfs-common
Ok, we need to define a folder that's standardized across all the machines: same idea as having a folder labeled "Dropbox" on each computer you want your Dropbox account synced to - except in this case, the syncing works a little differently.  The master node's /mirror folder is the real one; the other nodes just mount it over the network.  Anything you put in /mirror on the master shows up on every other computer, and since the share is read-write, changes made from the other nodes land back in the master's copy.  That's why it's called a 'mirror' - there's a single folder that's being 'mirrored' by the other machines.

We'll put it in the root directory.  Since we're mirroring it across all the machines, call it 'mirror'. Do this on all the machines:
sudo mkdir /mirror
Now go back to the master machine, and tell it to share the /mirror folder to the network: add a line to the /etc/exports file, and then restart the service.
echo "/mirror *(rw,sync)" | sudo tee -a /etc/exports
sudo service nfs-kernel-server restart
You may also want to add the following to the (rw,sync) options above (the full line is shown after this list):

  • no_subtree_check: This option disables subtree checking.  When a shared directory is a subdirectory of a larger filesystem, NFS scans every directory above it in order to verify its permissions and details.  Disabling the subtree check can make NFS more reliable, at a slight cost in security.
  • no_root_squash: This lets the root account on the client machines act as root on the shared folder (by default, NFS maps remote root to an unprivileged user).
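With both options added, the line in /etc/exports would look like this:
/mirror *(rw,sync,no_subtree_check,no_root_squash)
If you edit the file after the fact, restart the NFS server again (sudo service nfs-kernel-server restart) so the change takes effect.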

Great!  Now there's a folder on the master node on the network that we can mount and automatically get stuff from.  Time to mount it.

There are two ways to go about this - we could manually mount the folder after every reboot, or we could have each of the 'slave' nodes mount it automatically.  I like the second option better.

There's a file called fstab in the /etc directory.  The name means 'file system table'.  This is what the OS uses on startup to know which partitions to mount.  What we're going to do is add another entry to that file - on every startup, the OS will know to mount the network folder and present it like another external drive.

On each non-master machine (i.e., all the slave machines) run this command to append a new entry to the bottom of the fstab file.  The bit in quotes is the part getting added.
echo "grendel-a:/mirror    /mirror    nfs" | sudo tee -a /etc/fstab
That line is telling the OS a) to look for a drive located at grendel-a:/mirror, b) to mount it at the location /mirror, and c) that the drive is a 'network file system'.  Remember that if you're using your own naming scheme, you need to change 'grendel-a' to whatever the hostname of your master node is.
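If you prefer to spell out all six fstab fields explicitly, an equivalent entry (the extra fields are just the conventional defaults) would be:
grendel-a:/mirror    /mirror    nfs    defaults    0    0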

Now, in lieu of rebooting the machines, run this command on each slave machine to go back through the fstab and remount everything according to whatever it (now) says.
sudo mount -a

Establishing A Seamless Communication Protocol Between Machines

Create a new user

This user will be used specifically for performing computations.  If beowulf is the administrative user, and root is being used as the setting-stuff-up-remotely-via-automated-scripts user, then this is the day-to-day-heavy-computations user.

The home folder for this user will be inside /mirror, and it's going to be given the same userid on every machine (I picked '1010') - we're making the account as identical as possible across nodes, for the purpose of using all the machines in the cluster as a single computational device.

We'll call the new user 'breca'.  Just for giggles, let's make the password 'acerb'.  Run this command on the master node first, and on the slaves afterwards.
sudo useradd --uid 1010 -m -d /mirror/breca breca
Set a password.  Run on all nodes.
sudo passwd breca
Add breca to the sudo group.  Group membership is per-machine, just like the password, so run this on all nodes too.
sudo adduser breca sudo
Since 'breca' will be handling all the files in the /mirror directory, we'll make that user the owner.  Run this only on the master node.
sudo chown -R breca: /mirror
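As a quick sanity check, you can confirm the user ended up with the same userid everywhere by running this on each node:
id breca
The output should start with uid=1010(breca) on every machine.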

Setting up passwordless SSH for inter-node communication

Next, a dependency.  Install this on each node (master and slaves):
sudo apt-get install openssh-server
Next, login to the new user on the master node.
su - breca
On the master node, generate an RSA key pair for the breca user.  Keep the default location, and leave the passphrase empty - if you set one, the logins won't actually be passwordless unless you also set up something like keychain.  Nobody is going after your cluster anyway (you hope).
ssh-keygen -t rsa
Add the key to your 'authorized keys'.
cd .ssh
cat id_rsa.pub >> authorized_keys
cd
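Depending on your umask, you may also want to make sure the new file isn't writable by anyone else - sshd will refuse to use a group- or world-writable authorized_keys file:
chmod 600 ~/.ssh/authorized_keys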
And the nice thing is, what you've just done is being automatically mirrored to the other nodes.

With that, you should have passwordless ssh communication between all of your nodes.  Log in to your breca account on each machine:
su - breca
After logging in to the breca account on all of your machines, test the passwordless ssh by running -- say, from your master node to your first slave node --
ssh grendel-b
or from your second slave node into your master node:
ssh grendel-a
The only thing you should have to do is type 'yes' to confirm that the host's fingerprint is authentic, and that's a first-time-only sort of thing.  Because that confirmation is requested, though, you have to perform the first login manually between each pair of machines - otherwise automated communication can fail.  I haven't checked whether slave-to-slave logins are strictly necessary, so I did those too, just in case.

Note that since the same known_hosts file is shared among all the machines, it's only ever necessary to confirm a machine once.  So you could just log into all the machines consecutively from the master node, and once into the master node from one of the slaves, and all the nodes would thereafter have seamless ssh communication.
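One way to knock out all of those first-time confirmations in a single pass is a loop like this, run from the breca account on the master node (adjust the hostnames to your own naming scheme; StrictHostKeyChecking=no just auto-accepts each fingerprint):
for node in grendel-a grendel-b grendel-c grendel-d grendel-e; do ssh -o StrictHostKeyChecking=no "$node" hostname; done
Because known_hosts lives in the shared home directory, running this once covers every node.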

Troubleshooting

This process worked for me, following this guide exactly, so there's no reason it shouldn't work for you as well.  If a package has changed since the time of writing, however, it may fail.  See section 7 of this guide to set up a keychain, which is the likely solution.

If, after rebooting, you can no longer log into the breca account between nodes without a password (master-to-slave, etc.), the /mirror mounting procedure may have been interrupted - e.g., a network hiccup at boot meant grendel-a:/mirror couldn't be found when /etc/fstab was processed.  In that case the machines can't connect without passwords because they don't have access to the RSA key stored in the (missing) /mirror/breca/.ssh directory.  Log into each of the affected machines and remount everything in the fstab.
sudo mount -a

Installing Software Tools

You've been in and out of the 'beowulf' and 'breca' user accounts while setting up ssh.  Now it's time to go back to the 'beowulf' account.  If you're still in the breca account, run:
exit
These are tools the cluster will need to perform computations.  It's important to install all of this stuff prior to the MPICH software that ties it all together - I think the latter has to configure itself with reference to the available compilers.

If you're going to be using any compilers besides GCC, this is the time to install them.

This installs GCC.  Run it on each computer.
sudo apt-get install build-essential
I'm probably going to want Fortran as well, so I'm including that.
sudo apt-get install gfortran

Installing MPICH

And now, what we've all been waiting for: the commands that will actually make these disparate machines act as a single cluster.  Run this on each machine:
sudo apt-get install mpich
You can test that the install completed successfully by running:
which mpiexec
which mpirun
The output should be:
/usr/bin/mpiexec
and
/usr/bin/mpirun
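If you want to double-check which MPI implementation and version you ended up with, you can also run the following (it should report an MPICH/Hydra build):
mpiexec --version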

The Machinefile

This 'machinefile' tells the MPICH software which computers to use for computations, and how many processors are on each of them.  When you launch a job, you specify how many processes it needs, and the master node (which reads the machinefile) starts at the top of the file and works downwards until it has found enough processors to fulfill the request.

First, find out how many processors you have available on each machine (the output of this command will include virtual cores).  Run this on each machine.
nproc
Next, log back into the breca user on the master node:
su - breca
Move into the /mirror directory of the master node, then create the new file and open it:
cd /mirror
touch machinefile && nano machinefile
The order of the machines in the file determines which will be accessed first.  The format of the file lists the hostnames with the number of cores they have available.
grendel-c:4
grendel-b:4
grendel-a:4
You might want to leave one of the master node's cores off the list, so it has a free core for coordination.  Who knows?  It's up for experimentation.  I put the master node last for a similar reason - the other machines get tied up first.

You should be up and running!  What follows is a short test to make sure everything is actually up and running.

Testing the Configuration

Go to your master node, and log into the breca account.
ssh beowulf@grendel-a
su - breca
cd into the /mirror folder.
cd /mirror
Create a new file called mpi_hello.c
touch mpi_hello.c && nano mpi_hello.c
Put the following code into the file, and [ctrl-o] and [ctrl-x] to save and exit.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int myrank, nprocs;

    MPI_Init(&argc, &argv);                  /* start up the MPI environment */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of processes in the job */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* this process's rank (0 to nprocs-1) */

    printf("Hello from processor %d of %d\n", myrank, nprocs);

    MPI_Finalize();                          /* shut MPI down cleanly */
    return 0;
}
Compile the code with the MPI C compiler wrapper:
mpicc mpi_hello.c -o mpi_hello
And run.
mpiexec -n 11 -f ./machinefile ./mpi_hello
Here's a breakdown of the command:
mpiexec              command to execute an mpi-compatible binary

-n 11                the number of processes to start - ideally not more
                     than the sum of cores listed in the machinefile

-f ./machinefile     the location of the machinefile

./mpi_hello          the name of the binary to run
If all went as hoped, the output should look like this (the lines may come back in any order):
Hello from processor 0 of 11
Hello from processor 1 of 11
Hello from processor 2 of 11
Hello from processor 3 of 11
Hello from processor 4 of 11
Hello from processor 5 of 11
Hello from processor 6 of 11
Hello from processor 7 of 11
Hello from processor 8 of 11
Hello from processor 9 of 11
Hello from processor 10 of 11
Make sure the sum of the processors you listed in your machinefile covers the number you asked for in the mpiexec command.

Note that you can totally ask for more processes than the cores you actually listed - the MPICH software will oversubscribe, assigning more than one process to each core to fulfill the request.  It's not efficient, but it's better than errors.
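For example, with the 12 cores listed in the machinefile above, something like this should still run (each core just handles two of the processes):
mpiexec -n 24 -f ./machinefile ./mpi_hello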

And that's it!  You have a working, tested beowulf cluster.
