Showing posts with label CPU. Show all posts

Friday, January 27, 2017

Show CPU info via command line

This gives a ton of information - way more than I generally need.
less /proc/cpuinfo
This is the tidy version.
lscpu
This is the max and min clock speed of the CPU (in kHz):
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
This is a cool command to keep track of the current CPU clock speed.
sudo watch -n 1  cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
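The cpufreq files report values in kHz, which is a little hard to read at a glance.  Here's a minimal sketch that prints each core's speed in MHz instead.  The function name khz_to_mhz is made up for this example, and it reads scaling_cur_freq (normally readable without sudo) rather than cpuinfo_cur_freq.  On a machine that doesn't expose cpufreq, the loop just prints nothing.

```shell
#!/bin/sh
# khz_to_mhz is a hypothetical helper: integer kHz in, integer MHz out.
khz_to_mhz() {
    echo $(( $1 / 1000 ))
}

# Print the current scaling frequency of each core, if the kernel exposes it.
for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq; do
    [ -r "$f" ] || continue
    echo "$f: $(khz_to_mhz "$(cat "$f")") MHz"
done
```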


Saturday, January 21, 2017

A Homemade Beowulf Cluster: Part 2, Machine Configuration

This section starts with a set of machines all tied together with an ethernet switch and running Ubuntu Server 16.04.1.  If the switch is plugged into the local network router, then the machines can be ssh'd into.

This picks up right where Part 1 left off. src.

Enabling Scripted, Sudo Remote Access

The first step in the configuration process is to modify the root-owned host files on each machine.  I'm not doing that by hand, and I've already spent way too long trying to find a way to edit root-owned files through ssh automatically.

It's not possible without "security risks".  Since this is a local cluster, and my threat model doesn't include -- or care about -- people hacking into the machines or me messing things up, I'm going the old-fashioned way.  I also don't care about wiping my cluster accidentally, since I'm documenting the exact process I used to achieve it (and I'm making backups of any data I create).

Log into each machine in turn, and enter the password when prompted.
ssh beowulf@grendel-[X]
Recall that the password is
hrunting
Create a password for the root account.
sudo passwd root
At the prompt, enter your password.  We'll assume it's the same as the previously-defined user.
hrunting
Now the root account has a password, but it's still locked.  Time to unlock it.
sudo passwd -u root 
Note: if you ever feel like locking the root account again, run this:
sudo passwd -l root
Now you have to allow the root user to login via ssh.  Change an option in this file:
sudo nano /etc/ssh/sshd_config
Find the line that says:
PermitRootLogin prohibit-password
and comment it out (so you have a record of the default configuration) and add a new line below it. They should look like this:
#PermitRootLogin prohibit-password
PermitRootLogin yes
[CTRL-O] to save and [CTRL-X] to exit, then run:
sudo service ssh restart
That's it!  Now we can use sshpass to automatically login to the machines and modify root files.  Be careful; there is nothing between you and total destruction of your cluster.
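If you'd rather not open nano on every machine, the sshd_config change above can probably be scripted with sed.  This is only a sketch: it works on a scratch file standing in for /etc/ssh/sshd_config, and it assumes GNU sed (the one Ubuntu ships), since the \n in the replacement is a GNU feature.

```shell
#!/bin/sh
# Work on a scratch copy so nothing real is touched; the two-line file below
# is a stand-in for /etc/ssh/sshd_config.
cfg=$(mktemp)
printf 'Port 22\nPermitRootLogin prohibit-password\n' > "$cfg"

# Comment out the default and add the new setting on the line below it.
# The \n in the replacement is a GNU sed feature.
sed -i 's/^PermitRootLogin prohibit-password$/#&\nPermitRootLogin yes/' "$cfg"

cat "$cfg"
```

To apply it for real, you'd point the sed command at /etc/ssh/sshd_config with sudo and then restart ssh as shown above.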

Upload a custom /etc/hosts file to each machine

I created a script to do this for me.  If I could have found a simple way to set static IPs that would have been preferable, but this way I don't have to manually rebuild the file every time the cluster is restarted.

Note: for now, this isn't compatible with my example - it only uses node increments of digits, while my example is using letters (grendel-b vs grendel-1).  I'll fix that later.  For now, I'd recommend reading all the way to the end of the walkthrough before starting, and just using numbers for your node increments.

Run the script from a separate computer that's on the local network (i.e., that can ssh into the machines), but which isn't one of the machines in the cluster.  Usage of the script goes like this:
bash create_hosts_file.sh [MACHINE_COUNT] [PASSWORD] [HOSTNAME_BASE]
Where HOSTNAME_BASE is the standard part of the hostname of each computer - if the computers were named grendel-a, grendel-b, and grendel-c, then the base would be "grendel-".

So, continuing the example used throughout and pretending there are 5 machines in total, this is what the command would look like:
mkdir -p ~/scripts && cd ~/scripts
wget https://raw.githubusercontent.com/umhau/cluster/master/create_hosts_file.sh
bash create_hosts_file.sh 5 "hrunting" "grendel-"
If you don't get any errors, then you're all set! You can check the files were created by ssh'ing into one of the machines and checking /etc/hosts.
ssh beowulf@grendel-a
cat /etc/hosts
The output should look something like this:
127.0.0.1     localhost
192.168.133.100 grendel-a
192.168.133.101 grendel-b
192.168.133.102 grendel-c
192.168.133.103 grendel-d
If it doesn't look like that, with a line for localhost and one line after it for each machine, you're in trouble.  Google is your friend; it worked for me.
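If you'd rather not eyeball the file on every machine, a check along these lines might help.  It's a sketch, not part of create_hosts_file.sh: check_hosts is a name I made up, and it only verifies the two things described above (a localhost line, plus the expected number of lines for the cluster hostnames).

```shell
#!/bin/sh
# check_hosts FILE BASE COUNT
# Succeeds when FILE has a localhost line and exactly COUNT lines whose
# hostname starts with BASE. (Hypothetical helper, not part of the script above.)
check_hosts() {
    file=$1; base=$2; expected=$3
    grep -q '^127\.0\.0\.1[[:space:]][[:space:]]*localhost' "$file" || return 1
    found=$(grep -c "[[:space:]]${base}" "$file")
    [ "$found" -eq "$expected" ]
}

# Example run against a scratch file with two cluster entries.
sample=$(mktemp)
printf '127.0.0.1     localhost\n192.168.133.100 grendel-a\n192.168.133.101 grendel-b\n' > "$sample"
check_hosts "$sample" "grendel-" 2 && echo "hosts file looks right"
```

On a real node you'd call it as check_hosts /etc/hosts "grendel-" followed by your machine count.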

Creating a Shared Folder Between Machines

This way, I can put my script with fancy high-powered code in one place, and all the machines will be able to access it.

First, dependencies.  Install this one just on the 'master' node/computer (generally, the most powerful computer in the cluster, and definitely the one you labelled #1).
sudo apt-get install nfs-server
Next, install this on all the other machines:
sudo apt-get install nfs-client
Ok, we need to define a folder that can be standardized across all the machines: same idea as having a folder labeled "Dropbox" on each computer that you want your Dropbox account synced to - except in this case, the syncing is a little different.  Anything you put in the /mirror folder of the master node will be shared across all the other computers, but anything you put in a /mirror folder of the other nodes will be ignored.  That's why it's called a 'mirror' - there's a single folder that's being 'mirrored' by other folders.

We'll put it in the root directory.  Since we're mirroring it across all the machines, call it 'mirror'. Do this on all the machines:
sudo mkdir /mirror
Now go back to the master machine, and tell it to share the /mirror folder to the network: add a line to the /etc/exports file, and then restart the service.
echo "/mirror *(rw,sync)" | sudo tee -a /etc/exports
sudo service nfs-kernel-server restart
Maybe also add the following to the (rw,sync) options above:

  • no_subtree_check: This option disables subtree checking. When a shared directory is a subdirectory of a larger filesystem, NFS checks on every request that the file being accessed really is inside the shared directory. Disabling the subtree check may increase the reliability of NFS, but reduce security.
  • no_root_squash: This lets the root account on a client act as root inside the shared folder, instead of being mapped to an unprivileged user.
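With both of those options added, the exports line would look like what this prints.  This is just a sketch that builds the line in a variable so you can look at it before committing to anything.

```shell
#!/bin/sh
# Build the /etc/exports entry with the two optional flags included.
opts="rw,sync,no_subtree_check,no_root_squash"
line="/mirror *(${opts})"
echo "$line"
```

To use it, you'd pipe that echo into sudo tee -a /etc/exports as above and restart the NFS service.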

Great!  Now there's a folder on the master node on the network that we can mount and automatically get stuff from.  Time to mount it.

There are two ways to go about this - one, we could manually mount on every reboot, or two, we could automatically mount the folder on each of the 'slave' nodes.  I like the second option better.

There's a file called the fstab in the /etc directory.  The name is short for 'file systems table'.  This is what the OS uses on startup to know which partitions to mount.  What we're going to do is add another entry to that file - on every startup, it'll know to mount the network folder and present it like another external drive.

On each non-master machine (i.e., all the slave machines) run this command to append a new entry to the bottom of the fstab file.  The bit in quotes is the part getting added.
echo "grendel-a:/mirror    /mirror    nfs" | sudo tee -a /etc/fstab
That line is telling the OS a) to look for a drive located at grendel-a:/mirror, b) to mount it at the location /mirror, and c) that the drive is a 'network file system'.  If you're using your own naming scheme, remember to change 'grendel-a' to whatever the hostname of your master node is.

Now, in lieu of rebooting the machines, run this command on each slave machine to go back through the fstab and remount everything according to whatever it (now) says.
sudo mount -a
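One thing to watch: running the tee -a command twice leaves two identical entries in the fstab.  Here's a sketch of a guard against that.  The helper name append_once is made up, and it's demonstrated on a scratch file rather than the real /etc/fstab.

```shell
#!/bin/sh
# append_once LINE FILE: add LINE to FILE unless an identical line is there.
# (Hypothetical helper name.)
append_once() {
    grep -qxF -- "$1" "$2" 2>/dev/null || echo "$1" >> "$2"
}

# Demonstrated on a scratch file rather than the real /etc/fstab.
tab=$(mktemp)
entry="grendel-a:/mirror    /mirror    nfs"
append_once "$entry" "$tab"
append_once "$entry" "$tab"   # second call is a no-op
wc -l < "$tab"
```

Writing to the real /etc/fstab needs root, so there you'd run the function from a root shell or swap the >> redirect for sudo tee -a.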

Establishing A Seamless Communication Protocol Between Machines

Create a new user

This user will be used specifically for performing computations.  If beowulf is the administrative user, and root is being used as the setting-stuff-up-remotely-via-automated-scripts user, then this is the day-to-day-heavy-computations user.

The home folder for this user will be inside /mirror, and it's going to be given the same userid across all the accounts (I picked '1010') - we're making it as identical as possible for the purposes of using all the machines in the cluster as a single computational device.

We'll call the new user 'breca'.  Just for giggles, let's make the password 'acerb'.  Run the first command on the master node first, and the slaves afterwards.
sudo useradd --uid 1010 -m -d /mirror/breca breca
Set a password.  Run on all nodes.
sudo passwd breca
Add breca to the sudo group.
sudo adduser breca sudo
Since 'breca' will be handling all the files in the /mirror directory, we'll make that user the owner.  Run this only on the master node.
sudo chown -R breca: /mirror

Setting up passwordless SSH for inter-node communication

Next, a dependency.  Install this to each node (master and slaves):
sudo apt-get install openssh-server
Next, login to the new user on the master node.
su - breca
On the master node, generate an RSA key pair for the breca user.  Keep the default location.  If you feel like it, you can enter a 'strong' passphrase, but we've already been working under the assumption security isn't important here.  Do what you like; nobody is going after your cluster (you hope).
ssh-keygen -t rsa
Add the key to your 'authorized keys'.
cd .ssh
cat id_rsa.pub >> authorized_keys
cd
And the nice thing is, what you've just done is being automatically mirrored to the other nodes.

With that, you should have passwordless ssh communication between all of your nodes.  Login to your breca account on each machine:
su - breca
After logging in to your breca account on all of your machines, test your passwordless ssh capabilities by running -- say, from your master node to your first slave node --
ssh grendel-b
or from your second slave node into your master node:
ssh grendel-a
The only thing you should have to do is type 'yes' to confirm that some kind of fingerprint is authentic, and that's a first-time-only sort of thing.  However, because confirmation is requested, you have to perform the first login manually between each machine.  Otherwise communication could/will fail.  I haven't checked if it's necessary to ensure communication between slave nodes, so I did those too.

Note that since the same known_hosts file is shared among all the machines, it's only ever necessary to confirm a machine once.  So you could just log into all the machines consecutively from the master node, and once into the master node from one of the slaves, and all the nodes would thereafter have seamless ssh communication.

Troubleshooting

This process worked for me, following this guide exactly, so there's no reason it wouldn't work for you as well.  If a package is changed since the time of writing, however, it may fail in the future.  See section 7 of this guide to set up a keychain, which is the likely solution.

If, after rebooting, you can no longer automatically log into your breca account between nodes (master-to-slave, etc.), the /mirror mounting procedure may have been interrupted: for example, a network disconnect when /etc/fstab was processed at boot, such that grendel-a:/mirror couldn't be found.  If that's the case, the machines can't connect without passwords because they don't have access to the RSA key stored in the missing /mirror/breca/.ssh directory.  Log into each of the affected machines and remount everything in the fstab.
sudo mount -a

Installing Software Tools

You've been in and out of the 'beowulf' and 'breca' user accounts while setting up ssh.  Now it's time to go back to the 'beowulf' account.  If you're still in the breca account, run:
exit
These are tools the cluster will need to perform computations.  It's important to install all of this stuff prior to the MPICH2 software that ties it all together - I think the latter has to configure itself with reference to the available software.

If you're going to be using any compilers besides GCC, this is the time to install them.

This installs GCC.  Run it on each computer.
sudo apt-get install build-essential
I'm probably going to want Fortran as well, so I'm including that.
sudo apt-get install gfortran

Installing MPICH

And now, what we've all been waiting for: the commands that will actually make these disparate machines act as a single cluster.  Run this on each machine:
sudo apt-get install mpich
You can test that the install completed successfully by running:
which mpiexec
which mpirun
The output should be:
/usr/bin/mpiexec
and
/usr/bin/mpirun

The Machinefile

This 'machinefile' tells the mpich software what computers to use for computations, and how many processors are on each of those computers.  The code you run on the cluster will specify how many processors it needs, and the master node (which uses the machinefile) will start at the top of the file and work downwards until it has found enough processors to fulfill the code's request.  

First, find out how many processors you have available on each machine (the output of this command will include virtual cores).  Run this on each machine.
nproc
Next, log back into the breca user on the master node:
su - breca
Create a new file in the /mirror directory of the master node and open it (the cd matters, since logging in as breca drops you in /mirror/breca):
cd /mirror && touch machinefile && nano machinefile
The order of the machines in the file determines which will be accessed first.  The format of the file lists the hostnames with the number of cores they have available.
grendel-c:4
grendel-b:4
grendel-a:4
You might want to remove one of the master node's cores for control purposes.  Who knows?  Up for experimentation.  I put the master node last for a similar reason.  The other stuff can get tied up first.
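Since the number of processes you ask for later gets compared against the cores listed here, it's handy to have the total.  Here's a small sketch that adds up the numbers after the colons.  total_cores is a made-up name, and the scratch file just repeats the example machinefile above.

```shell
#!/bin/sh
# total_cores FILE: sum the number after the colon on every line.
total_cores() {
    awk -F: '{ sum += $2 } END { print sum + 0 }' "$1"
}

# Scratch copy of the example machinefile.
mf=$(mktemp)
printf 'grendel-c:4\ngrendel-b:4\ngrendel-a:4\n' > "$mf"
total_cores "$mf"
```

For the three 4-core machines listed above this comes to 12.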

You should be up and running!  What follows is a short test to make sure everything is actually up and running.

Testing the Configuration

Go to your master node, and log into the breca account.
ssh beowulf@grendel-a
su - breca
cd into the /mirror folder.
cd /mirror
Create a new file called mpi_hello.c
touch mpi_hello.c && nano mpi_hello.c
Put the following code into the file, and [ctrl-o] and [ctrl-x] to save and exit.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int myrank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello from processor %d of %d\n", myrank, nprocs);

    MPI_Finalize();
    return 0;
}
Compile the code with the custom MPI C compiler:
mpicc mpi_hello.c -o mpi_hello
And run.
mpiexec -n 11 -f ./machinefile ./mpi_hello
Here's a breakdown of the command:
mpiexec              command to execute an mpi-compatible binary

-n 11                the number of cores to ask for - this should not be
                     more than the sum of cores listed in the machinefile

-f ./machinefile     the location of the machinefile

./mpi_hello          the name of the binary to run
If all went as hoped, the output should look like this:
Hello from processor 0 of 11
Hello from processor 1 of 11
Hello from processor 2 of 11
Hello from processor 3 of 11
Hello from processor 4 of 11
Hello from processor 5 of 11
Hello from processor 6 of 11
Hello from processor 7 of 11
Hello from processor 8 of 11
Hello from processor 9 of 11
Hello from processor 10 of 11
Make sure the number you asked for in the mpiexec command is covered by the sum of the processors listed in your machinefile.

Note that you can totally ask for more processors than you actually listed - the MPICH software will start more than one process per core to fulfill the request.  It's not efficient, but better than errors.

And that's it!  You have a working, tested beowulf cluster.

Sunday, January 15, 2017

A Recurrent Neural Network to generate sentences: step-by-step assembly (Mint version)

This one predicts (ok, procedurally generates) text word-by-word.  Others do it letter-by-letter, but then you have to deal with misspellings.  src. Inspiration.

Dependencies

I'm basing this off an installation script referenced in the github readme (here).  Instead of blindly running the script, I checked its contents and discovered it fails when run on linux mint.  So I simplified it, and changed/added a thing or two since it wasn't working anyway.  In case anyone wants to blindly run my version, here it is all cleaned up.

Updating your sources is always a good idea prior to installing a bunch of software.  Also, the package below makes it easy to add new repositories (see below).
sudo apt-get update
sudo apt-get install -y python-software-properties
Now things are getting serious.  This stuff includes GCC 4.9, since it seems GCC 5 isn't compatible with Torch 7.  There was a weird error where cmake wasn't installed when I put all the packages in the same apt-get install command - I split out the packages to their own lines, ran my install script again, and it worked.  Very odd.  Hence, my inefficient code below.
sudo apt-get install -y build-essential 
sudo apt-get install -y gcc 
sudo apt-get install -y g++ 
sudo apt-get install -y curl 
sudo apt-get install -y cmake 
sudo apt-get install -y libreadline-dev 
sudo apt-get install -y git-core 
sudo apt-get install -y libqt4-core 
sudo apt-get install -y libqt4-gui 
sudo apt-get install -y libqt4-dev 
sudo apt-get install -y libjpeg-dev 
sudo apt-get install -y libpng-dev 
sudo apt-get install -y ncurses-dev 
sudo apt-get install -y imagemagick 
sudo apt-get install -y libzmq3-dev 
sudo apt-get install -y gfortran 
sudo apt-get install -y unzip 
sudo apt-get install -y gnuplot 
sudo apt-get install -y gnuplot-x11 
sudo apt-get install -y ipython
sudo apt-get install -y libpcre3-dev
These are for a program called OpenBLAS, an open-source implementation of BLAS, which stands for Basic Linear Algebra Subprograms.  Here's a bit more on the concept.
sudo apt-get install -y libopenblas-dev liblapack-dev
Install Torch (finally).  Note that when running the ./install.sh command, I'm letting bash automatically enter "Y" when prompted for a user response.  It's requesting permission to add torch to the system path in the .bashrc: essentially, creating a system shortcut for each initialization of the bash terminal.
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch
echo "Y" | ./install.sh 
source ~/.bashrc
These tools are installed with LuaRocks, the package manager for Lua, which is the programming language that came with Torch.
/home/$USER/torch/install/bin/luarocks install nngraph
/home/$USER/torch/install/bin/luarocks install nninit 
/home/$USER/torch/install/bin/luarocks install optim
/home/$USER/torch/install/bin/luarocks install nn
/home/$USER/torch/install/bin/luarocks install underscore.lua --from=http://marcusirven.s3.amazonaws.com/rocks/
sudo /home/$USER/torch/install/bin/luarocks install lrexlib-pcre PCRE_DIR=/lib/x86_64-linux-gnu/ PCRE_LIBDIR=/lib/x86_64-linux-gnu/ 
Make sure to add Torch to your PATH.  This ensures that when you run the command to use Torch, your computer can actually find the program.
to_path="/home/$USER/torch/install/bin"
echo "export PATH=\$PATH:$to_path" >> /home/$USER/.bashrc
source ~/.bashrc
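The Torch installer may already have added itself to your PATH, so it can be worth checking before appending anything to .bashrc.  Here's a sketch of such a check.  path_has is a made-up helper name, and the directory is the same one used in to_path above.

```shell
#!/bin/sh
# path_has DIR: succeed if DIR is already one of the entries in $PATH.
path_has() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

to_path="$HOME/torch/install/bin"
if path_has "$to_path"; then
    echo "already on PATH"
else
    echo "not on PATH yet"
fi
```

Wrapping both ends of $PATH in colons means the first and last entries match the same pattern as the ones in the middle.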
With that, all dependencies should be satisfied. Time to install.

Installation

Clone the software from github.
mkdir -p ~/projects
cd ~/projects
git clone https://github.com/larspars/word-rnn.git
That's actually all there is to it.  Now cd into the word-rnn directory to run the test stuff.  Before the tests and tools, though, there's a fix that you have to perform.

Extra Fixes

I'm running torch on a gpu-less computer.  There's a glitch that occurs when running the test script in that scenario.  To avert it, you have to change the name of a function from CudaTensor to Tensor.  See here for details.  
cd word-rnn/util
nano SharedDropout.lua
The third line should look like this:
SharedDropout_noise = torch.CudaTensor()
Change it to this:
SharedDropout_noise = torch.Tensor()
And save.
CTRL + O
CTRL + X

Running word-rnn

Now you can run the test function.  Be aware that on a 4-core 2.8GHz i7 processor, this command took 21 hours to complete.  BUT it's a very cool command, and the result is amazing.
th train.lua -gpuid -1
To be continued...

Final post here

I'm switching over to github pages.  The continuation of this blog (with archives included) is at umhau.github.io.  By the way, the ...