Monday, October 30, 2017

Final post here

I'm switching over to github pages.  The continuation of this blog (with archives included) is at umhau.github.io

By the way, the long absence while the blog was weirdly down was only partly my fault... it happened after I tried to make blogger redirect to my new site, and the whole thing completely died.  You saw the result.  I finally realised I could reset the theme to something completely different to make the site work again.  Sorry it's such a weird theme.


Monday, June 5, 2017

Fast User Switching

This is for Ubuntu GNOME 17.04, and probably a lot of other Linux versions as well.

To get the main login menu: CTRL + ALT + F1
To switch to the first logged in user: CTRL + ALT + F2
To switch to the second logged in user: CTRL + ALT + F3
Etc.
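Bonus: on a systemd-based release like this one, you can also list the active sessions from a terminal to see who's logged in before switching:
loginctl list-sessions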

Sound Device Chooser

Personal memo, because I need this so much.  On Ubuntu GNOME 17.04, the OS fails to switch between sound devices when I plug in USB headphones.  I always have to go into the sound menu in the settings and manually switch.  At least this way, it's fewer clicks.

https://extensions.gnome.org/extension/906/sound-output-device-chooser/

Friday, May 12, 2017

Installing GNU APL

Two parts: the keyboard layout and the program itself.  This is on Ubuntu GNOME 17.04.

Install GNU APL.
cd ~/Downloads
wget ftp://ftp.gnu.org/gnu/apl/apl-1.7.tar.gz
tar xzf apl-1.7.tar.gz
cd apl-1.7
./configure
make 
sudo make install
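To sanity-check the build, start the interpreter and then leave it again; )OFF is the standard way out of GNU APL.
apl
)OFF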
Set up the keyboard (for all the weird symbols). On Ubuntu GNOME 17.04, go to:
settings -> region and language -> [+] input source -> English (region) -> APL (dyalog)
Then set a good fixed-width font so the symbols show up correctly.  Go here and download the recommended font, open and install it.  Then open the GNOME Tweak Tool:
fonts -> monospace -> APL385 Unicode Regular -> Select
And you're done.  Use
win + space
to switch between keyboard layouts.

Saturday, May 6, 2017

Detaching modal dialog boxes from windows

This had been bugging me for a long time.  GNOME Shell decided that it would be nice if users couldn't get to an application while its dialog boxes (like a save menu) were open.  Here's a fix to undo that decision.  src.  This is functional on 17.04.

Detach dialog
dconf write /org/gnome/shell/overrides/attach-modal-dialogs false
Attach dialog
dconf write /org/gnome/shell/overrides/attach-modal-dialogs true
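Note: on later GNOME releases this override has reportedly moved under mutter, so if the path above has no effect, the equivalent key there is worth a try:
dconf write /org/gnome/mutter/attach-modal-dialogs false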

Wednesday, April 12, 2017

Set External Monitor as Default in Debian Console

I have a copy of Debian running on a busted ThinkPad without an internal monitor.  It would be nice if the command line didn't revert to a 640x480 resolution on the external.  Solution: completely disable the internal monitor, so Linux auto-sets the resolution according to the specs of the external monitor.  src.

Find the names of your monitors.  My graphics card is an Intel, so I can look in /sys for the EDID file - the path to it contains the connector name, which is what we want.  src.
find /sys -name edid
Based on the output of that command, the name of my internal display is
LVDS-1
With that information, I'm going into GRUB and disabling the display.  Note that the internal display will not work at all after this, unless you change the setting back.
sudo nano /etc/default/grub
edit the line from
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
(or whatever it was to begin with) to
GRUB_CMDLINE_LINUX_DEFAULT="quiet video=LVDS-1:d"
Keep whatever settings were already there.  Update GRUB, and reboot the computer.  
sudo update-grub
reboot
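Side note: the same video= kernel parameter can force a specific mode on an output instead of disabling it, in case that's ever the better fix - here VGA-1 is a placeholder for whatever connector name your own EDID search turns up:
GRUB_CMDLINE_LINUX_DEFAULT="quiet video=LVDS-1:d video=VGA-1:1024x768@60"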

Saturday, February 18, 2017

tmux cheat sheet

A few commands that are useful to know. src.

managing sessions

tmux new -s foobar          | creates a new tmux session with given name foobar
tmux attach -t foobar       | attaches to an existing tmux session named foobar
tmux list-sessions          | list all available tmux sessions
tmux switch -t foobar       | switches to a session named foobar
tmux detach (ctrl + b, + d) | detach from the current session

managing windows

tmux new-window (ctrl + b, + c)     | create a new tmux window
tmux select-window -t :0-9 (ctrl + b, + 0-9) | choose an existing tmux window
tmux rename-window (ctrl + b, + ,)  | rename an existing tmux window
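For reference, a typical round trip with the above - start a named session, kick off a long job, detach, and pick it back up later from any terminal (the session name 'work' is arbitrary):
tmux new -s work
tmux detach
tmux attach -t work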

Wednesday, February 8, 2017

installing word-rnn on ubuntu server 16.04.

to install the pcre luarock:
sudo /mirror/$USER/torch/install/bin/luarocks install lrexlib-pcre PCRE_DIR=/usr/ PCRE_LIBDIR=/lib/x86_64-linux-gnu/
Because the PCRE files end up in some very weird places. 


download changes from git

So this should work when I've got multiple copies of the repo on different computers.  Run these (the first two commands are optional) to refresh the local repo to the latest version.
git reset --hard HEAD
git clean -f
Those two will remove any local changes.  The last one actually gets the new version.
git pull
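An equivalent, more explicit route is to fetch and then hard-reset to the remote branch - this assumes the branch you want is origin's master:
git fetch origin
git reset --hard origin/master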

Thursday, February 2, 2017

branches on github

While github recommends that branches be made locally, I prefer the data security of keeping them off-site.  This is how to manage and merge a branch that's kept remote, on github.

https://github.com/Kunena/Kunena-Forum/wiki/Create-a-new-branch-with-git-and-manage-branches
https://try.github.io/levels/1/challenges/19
http://stackoverflow.com/a/6232535

Checkout a new branch on your computer (this is not something on github)
git checkout -b <branch>
Push the branch to github - now there's a new branch up there.
git push origin <branch>
And to switch to a different branch, use
git checkout <branch>
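Once a remote branch has been merged and is no longer needed, it can be deleted from github and then locally:
git push origin --delete <branch>
git branch -d <branch>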

git cheat sheet

Still learning the system.
git init                |  Tell git to start watching the directory
git clone <repo>        |  Get a local copy of the repository
.gitignore              |  Contains patterns of files to ignore
git rm --cached <file>  |  Stop tracking the file in git
git add <file>          |  Stage a snapshot of the file
git rm <file>           |  Stage the file's removal
git mv <a> <b>          |  Rename 'a' to 'b' and stage the change
git status              |  Staging status of files
git status -s           |  Simpler version of status
git commit              |  Record the staged snapshot (local until pushed)
git commit -m "txt"     |  Commit, with inline message

basic github - working with a cloned repository

How to use github properly.  Just the basics: cloning a repo, making a change, and uploading the change back to github.

sources

https://git-scm.com/book/en/v2/Git-Basics-Getting-a-Git-Repository
https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository
https://git-scm.com/book/en/v2/Git-Basics-Working-with-Remotes

how-to: summary

This is the super simple version with no commentary, and just a bit of explanatory text.  Remember, the warnings come after the spells.

Getting set up

Install git

sudo apt-get install git

configure login email address

Set your github name.
git config --global user.name "YOUR NAME"
Set your email address to private in github.  Go to:
profile --> settings --> emails --> [] keep my email address private
Set the global github private email address.
git config --global user.email "username@users.noreply.github.com"
Double-check the email address.
git config --global user.email

connect to github with https

Hold onto the github password for a while.
git config --global credential.helper cache
Extend the password timeout period.
git config --global credential.helper 'cache --timeout=1800'

making changes to a current project

First, create a local clone of your fork: I'm using the MPI Torch project as an example.
mkdir ~/projects && cd ~/projects
git clone https://github.com/umhau/mpiT.git
cd mpiT
Create or modify a file in your local repository, then stage it.
echo "Extra! Extra! Read all about it!" >> README.md
touch laughing.txt && echo "hahahaha" > laughing.txt
git add README.md
git add laughing.txt
Monitor the staging process.
git status -s
Commit your staged files prior to uploading.
git commit -m "write a commit message here"
Check that 'origin' is the correct short name of the github server.
git remote -v
Push your commit to the github server
git push origin master

how-to: annotated version

Getting set up

Install git

sudo apt-get install git

configure login email address

Tell git your name - who to credit your work to.
git config --global user.name "YOUR NAME"
configure your email address - you can keep your real one private by combining your github username with a special github email domain.  First, change github settings to keep the email address private.  Go to:
profile --> settings --> emails --> [] keep my email address private
Now configure your email address.  This tells git and github to use the private email address for all repositories downloaded to the computer.
git config --global user.email "username@users.noreply.github.com"
You can confirm the email address with this command:
git config --global user.email

connect to github

We're going to use HTTPS to download, we're not going to deal with 2-factor authentication, and we are going to use git's credential cache to avoid repeatedly entering passwords.

This tells git to hold onto the github password for a while.
git config --global credential.helper cache
And this controls how long that 'while' is - the default is 15 minutes, but I prefer 30 (the timeout is counted in seconds).
git config --global credential.helper 'cache --timeout=1800'

making changes to a current project

We'll assume, for the moment, that you have an ongoing project - this is going to be either a copy ("fork") of someone else's project, or something you've already started uploading to github and want to keep working on The Right Way.  Since this is your ongoing project, you own it and have full permissions to mess with it.  You're also not going to be merging it with some other repository.

The idea behind the many steps involved with making changes (staging, the head, etc.) is to allow for both minor and major changes - a simple tweak by one guy vs. a complete overhaul vs. a new feature.  The really big changes can be made in forks of the project that get merged with the original (upstream) version, the sizable changes can be made in branches of the current fork (or original repository), and the small changes can be verified by the rest of the team prior to being added directly to a branch (master/primary or otherwise) of the current repository.

It's a great system, but a real pity there's no simple way to get started.

First, create a local clone of your fork: I'm using the MPI Torch project as an example.  Make sure you have a place to put it.
mkdir ~/projects && cd ~/projects
The .git link you need to download with is on the project's github page - github shows it when you open the clone menu.  After you've got it, run the command below (or your equivalent, on a project you can make changes to).
git clone https://github.com/umhau/mpiT.git
cd mpiT
Now you've "created a local clone" of your repository/fork.  Nice!  You've got the code on your computer.  I'm pretty sure that everything else you do with github related to that 'local clone' has to be done while you're 'within the directory' - cd'd inside mpiT (in this case).

As you change files, git will be watching - anything you change will be marked modified, anything you don't change will be marked unmodified.  You can stage a file whenever you want to record a snapshot of your current progress (and you can undo snapshots, too, but that's not needed here).  Staged files are added to your next commit.

A commit is a collection of files that are being prepared for upload - until the commit is pushed, it's all local to your machine.

Let's pretend you go ahead and make some changes to the code you downloaded.  Maybe you added something to the README.md.  Remember, you're still in the ~/projects/mpiT/ directory:
echo "This is super important stuff!" >> README.md
If we check which tracked files have been changed, git will tell us that README.md has been modified, and has not yet been 'staged' for the next commit to the server.
git status
if you're ready to stage the file, run
git add README.md
you can also use the git add command on a directory, and it will recursively stage everything in the directory for the next commit.

If you want to add a new file to the repository
echo "hahahaha" > laughing.txt
you have to tell git to track it -
git add laughing.txt
and it will be automatically staged (what else was git supposed to do with it?).

Also note that if you stage a file, and then edit it again, the staged version will remain whatever version you had when you ran git add.  If you want to stage the new version, you have to run git add again.

Use git status to monitor the version of each file being staged.  Use
git status -s
to get a less 'verbose' version of the status output.  When everything has been edited and modified and staged properly, use
git commit -m "write a commit message here"
to record your changes.  Keep the quotation marks when you write your message, which is intended to tell others what changes you made.  If you don't include the -m "commit message" bit, then you'll be prompted for a longform message in a command line text editor.  Note that a commit is still local - nothing reaches github until you push.

While in the directory, git will automatically name the server you cloned from "origin".  That way, when you need to do something related to the online github version of your code, you can reference it with "origin".

Here, you can check the name of the remote repository you're working with.  A "remote repository" is what you call the online (i.e., not-on-your-laptop) version of the code.   You can add more, but that's not needed here.
git remote
You should see the name used to refer to the repository -
origin
just as discussed above.  And if you want to see exactly what server 'origin' refers to, you can run
git remote -v
to see a list - in my case, probably something like this:
origin https://github.com/umhau/mpiT.git (fetch)
origin https://github.com/umhau/mpiT.git (push)
Back on track: let's upload the commit you assembled (the commit: your collection of staged files).  It's called pushing, and you do it like this:
git push origin master
Now the changes you made on your computer have been uploaded to github.  That's it!


Accelerate an OpenBSD NFS server

The NFS server on OpenBSD has been super slow.  I thought it was the result of wifi + an old computer until I tried an scp download: the difference was several orders of magnitude.  Anyway, the solution was to cut down the size of the data packets used by the protocol.

Here's my /etc/fstab - the change is the rsize and wsize options.  Apparently there's a sweet spot of not-too-small and not-too-big.  These sizes are measured in bytes, by the way - that's a packet size of 4 KB that I'm specifying.
one:/home/admin/storage     /storage  nfs  rsize=4096,wsize=4096  0 0
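To apply a change like this without rebooting the client (assuming nothing is using the mount at the time):
sudo umount /storage
sudo mount -a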

Wednesday, February 1, 2017

NFS on RPI2

Just a quick note - following the instructions I've put up elsewhere here, the only thing to add for an RPI is that rpc.statd needs to be started for the network mount to work.  Run this before the sudo mount -a command:
sudo service rpcbind restart
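And to have rpcbind come up on every boot instead of restarting it by hand (this should stick on Raspbian's sysvinit-style setup):
sudo update-rc.d rpcbind enable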



Remove known host from Secure Shell

So, I have an old chromebook that I use as a thin client (first generation - no, not a CR-48).  Problem is, RSA keys change and I don't have a proper shell.  Just 'secure shell'.  This command removes a known host from the known_hosts file.  src.

Open up Secure Shell as a browser tab.  Go
'that-funny-button-for-misc-options' --> More Tools --> Developer Tools
Go to the Console tab.  In the console, enter the following, where foo is the index number of the host record that you need to remove.
term_.command.removeKnownHostByIndex(foo)

NFS server on OpenBSD

I still use OpenBSD for my server, so here goes setting up an NFS on it.  This is a fantastic resource.

Note: if having trouble with the vi text editor, I listed a few simple commands elsewhere on the blog.  For now, a few quick reminders:
i        -->       enters Insert mode (where you can enter text)
[ESC]    -->       enters command mode
:w       -->       saves the file (have to be in command mode)
:q       -->       exits the file (have to be in command mode)

Let's call the shared folder 'storage'.  We'll put it in the home directory of the user 'admin'.  We have to create the user, and the folder.
useradd -b /home -s /bin/ksh -m admin
passwd admin
Let's use the password
Password!
since this is a local machine, and I'm assuming that the folks on the LAN are friendly.

Next, make sure the new user owns their home directory.
chown -R admin: /home/admin
And then create the shared directory inside /home/admin,
mkdir /home/admin/storage
and make sure that everyone can read/write to that spot.
chmod -R 777 /home/admin/storage
I think nfs can be activated by adding the following to rc.conf.local. (src; might not be updated, and I can't see more than the first bit)
vi /etc/rc.conf.local
add to the end of the file:
portmap=YES
nfs_server=YES
Or, I can run these two commands.  nfsd can run multiple server instances to handle concurrent requests, which is what the flags below control - and rcctl enable persists the services across reboots (it writes to rc.conf.local).
rcctl enable portmap mountd nfsd
rcctl set nfsd flags -tun 4
Once the services are started, the /etc/exports file needs an entry.  This is how the machine knows to share the folder - and who to share it to.
vi /etc/exports
add to the bottom (this gives everyone accessing the folder root access),
/home/admin/storage -alldirs -mapall=root
Now the nfs service can be started
rcctl start portmap mountd nfsd
and in case you edited the /etc/exports file while the NFS was running, restart the service.
rcctl reload mountd
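Before moving to a client, you can sanity-check the export list from the server itself - showmount is in the OpenBSD base system:
showmount -e localhost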

Mount

to mount the new network folder, you have to create your own location for the folder to present itself, and then set the folder to automount.

Also, don't forget to install the nfs software.  It comes by default on OpenBSD, but not so much on Linux Mint - and since this machine is the client, it needs the client package.
sudo apt-get install nfs-common -y
create /storage in the root directory
sudo mkdir /storage
give everyone all permissions
sudo chmod 777 /storage
edit the fstab to automount the folder - we'll assume the hostname of the OpenBSD server is 'server1'
sudo nano /etc/fstab
and add this to the bottom of the file
server1:/home/admin/storage     /storage  nfs  defaults  0 0
and remount everything
sudo mount -a
that's it!  You should have full access to the network folder.

Friday, January 27, 2017

Command line system resource monitor

Shows cpu usage, memory, swap.
sudo apt-get install htop
htop

Show CPU info via command line

This gives a ton of information - way more than I generally ever need.
less /proc/cpuinfo
This is the tidy version.
lscpu
This is the min and max clock speed of the CPU:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq 
This is a cool command to keep track of the current CPU clock speed.
sudo watch -n 1  cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq


Thursday, January 26, 2017

Ubuntu Server 16.04 not detecting wifi card

This turns out to be a relatively simple issue.

This command is my go-to for internet connection diagnostics, but it wasn't showing my wifi card.
ifconfig
This shows all interfaces, not just the activated ones.
ifconfig -a
to activate a hardware device after determining its name, run:
ifconfig "device name" up
Bonus: this gives you way more than you'll ever need to know about the hardware capabilities of the device.
iw list

(Internet via Ethernet) + (SSH via wireless)

= (sustained SSH during MPICH computing)

I've been losing my SSH connection after starting compute jobs on my new beowulf cluster.  This is my current fix: my theory is that the network switch is so clogged with MPI-related communication (which does take place via ssh) that there's no bandwidth left for my administrative SSH connection.  The theory is supported by the observation that when I plug my unrelated control machine into the switch, it can't ping google.

assumptions

  • Ubuntu 16.04.1 LTS
  • working wireless and ethernet: I had to do this and this.

sources

check wifi hardware capability

Run the command 
iw list
And look for a section like the following.  If it includes 'AP' (see the entry in the listing below), you're golden.  If not, look for a different wireless card.
Supported interface modes:  
         * IBSS 
         * managed  
         * AP 
         * AP/VLAN  
         * monitor 

install dependencies

sudo apt-get install rfkill hostapd hostap-utils iw dnsmasq   

identify interface names

As of Ubuntu 16.04, the standard wlan0 and eth0 interface names are no longer in use.  You'll have to identify them specifically.  Use the following command, which lists the contents of the folder for each interface device, and look for the device that has a folder named 'wireless'.  src.
ls /sys/class/net/*
Observe the assumptions above to see what I'm calling them.
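As a shortcut, this pattern match prints only the interfaces that have a 'wireless' folder (no output means no wireless device was detected):
ls -d /sys/class/net/*/wireless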

configure wifi settings

There are three files you'll have to configure.  Since I'm logged in via ssh, I don't want to interrupt my connection until I've created a new access point I can connect to.  So I'll walk through editing each file in turn, and then give one command at the end that activates all the changes.

configure wireless interface: /etc/network/interfaces

Backup your current interface file.
sudo cp /etc/network/interfaces /etc/network/interfaces.bak
and then edit the original
sudo nano /etc/network/interfaces
replace the contents of the file - change the interface names as appropriate.
auto lo
iface lo inet loopback

auto enp2s0
iface enp2s0 inet dhcp

auto wlp1s0
iface wlp1s0 inet static
hostapd /etc/hostapd/hostapd.conf
address 192.168.3.14
netmask 255.255.255.0
Normally I'd say that here's where you restart the interface, but we're saving that for the end.

configure the access point: /etc/hostapd/hostapd.conf

backup the original file - it's ok if there's nothing there.
sudo cp /etc/hostapd/hostapd.conf /etc/hostapd/hostapd.conf.bak
edit the original
sudo nano /etc/hostapd/hostapd.conf
put this in:
interface=wlp1s0
driver=nl80211
ssid=test
hw_mode=g
channel=1
macaddr_acl=0
auth_algs=1
ignore_broadcast_ssid=0
wpa=3
wpa_passphrase=1234567890
wpa_key_mgmt=WPA-PSK
wpa_pairwise=TKIP
rsn_pairwise=CCMP
Inexplicably, this only seems to produce a detectable wifi access point when the ssid is 'test'.  I tried several other non-keyword names, and none of them worked.  I went back to 'test', and it worked.  Did it several times... magic.

Save and exit.

configure the DHCP server

This is where the access point actually becomes something you can access.  Backup:
sudo cp /etc/dnsmasq.conf /etc/dnsmasq.conf.bak
edit original - since the file is so big, I rm'd the original and pasted the contents below into an empty file.
sudo rm /etc/dnsmasq.conf
sudo nano /etc/dnsmasq.conf
make it look like this:
# Never forward plain names (without a dot or domain part)
domain-needed

# Only listen for DHCP on wlp1s0
interface=wlp1s0

# create a domain if you want, comment it out otherwise
# domain=Pi-Point.co.uk

# Create a dhcp range on your /24 wlp1s0 network with a 12 hour lease time
dhcp-range=192.168.3.15,192.168.3.254,255.255.255.0,12h
Save and exit.

implement changes

This is going to be one big command.  If it works, you're in business... if it doesn't, you'll have to log in directly to the machine for troubleshooting.
sudo ifdown wlp1s0; sudo ifup wlp1s0; sudo service hostapd restart; sudo service dnsmasq restart
Worked for me: I now have a secondary wireless access to my beowulf cluster for when the ethernet gets clogged with MPI signals.



Wednesday, January 25, 2017

Basic Vim

The cheat sheets and guides out there don't seem to provide a practical intro to Vim.  I'm not able to use VS Code on one of my primary interfaces, so I'm looking for the next best thing.  Vim, so I've heard, is probably it.  This is a great little tutorial to introduce the basics.

There are two modes: command mode and insert mode.  Command mode is where you do things that would normally be accessed via cursor, arrow keys, or a menu; insert mode is where you type letters and they appear on the screen, and you can use the arrow keys like you're used to.  When you open vim, you start in command mode.

This should get you to about a nano level of proficiency.

Basic Usage

open foo  | vim foo
save file | :w
quit file | :q
Command mode | [ESC]
                                  |   k
Move cursor left, right, up, down | h   l
                                  |   j
Insert here | i
Insert new line below | o
Delete char under cursor | x
Here's a nice cheat sheet for further use.

Tuesday, January 24, 2017

RNN with Torch and MPI

This is being installed on machines running Ubuntu Server 16.04.1 LTS.  Does not work on Linux Mint (the torch install script doesn't detect that OS).

Most of the following installations have to be performed on each computer.  I didn't re-download everything, since it was going to be put in the same place, but I did cd in and re-run the installation procedure.  That ensured the necessary files were added to all the right places elsewhere in the system.

Here, I'm walking through the process of running Torch on a cluster.  CPUs, not GPUs.  The performance benefit comes from the slave nodes being allowed greater latitude in searching for local optima to 'solve' the neural net.  Every so often, they 'touch base' with the master node and synchronize the result of their computations.  Read the abstract of Sixin Zhang's paper to get a more detailed idea of what's happening.  As far as the implementation goes, "the idea is to transform the torch data structure (tensor, table etc) into a storage (contiguous in memory) and then send/recv [sic] it." src.

Background Sources

Keep track of where I found the info I used to figure this out.

https://bbs.archlinux.org/viewtopic.php?id=159999
http://torch.ch/docs/getting-started.html
https://groups.google.com/forum/#!topic/torch7/Xs814a5_xgI

Set up MPI (beowulf cluster)

Follow the instructions in these two posts first.  They get you to the point of a working cluster, starting from a collection of unused PCs and the relevant hardware.

https://nixingaround.blogspot.com/2017/01/a-homebrew-beowulf-cluster-part-1.html
https://nixingaround.blogspot.com/2017/01/a-homemade-beowulf-cluster-part-2.html

prevent SSH from losing connection

I had some trouble here, where I was trying to use ssh over the same wires that were providing MPI communication in the cluster.  I kept losing connection after initializing the computations.  It may not be necessary, so I wouldn't do this unless you run into trouble of that sort.  

https://nixingaround.blogspot.com/2017/01/internet-via-ethernet-ssh-via-wireless.html

Ok, that's not an optimal solution. Better to initialize a virtual terminal and run the computations in that.  When the connection is inevitably dropped, just recover that terminal.

http://unix.stackexchange.com/questions/22781/how-to-recover-a-shell-after-a-disconnection

Install Torch

Note: it may be useful to install the MKL library ahead of torch.  It accelerates the math routines that I assume will be present in the computations I'm going to perform.  

This provides dependencies needed to install the mpiT package that lets Torch7 work with MPI.  Start in the breca home directory.  On the master node, run the following.
cd
git clone https://github.com/torch/distro.git ~/torch --recursive
Then, on all nodes (master and slave), run the following from the breca account:
cd ~/torch; bash install-deps
./install.sh
[I'm not sure, but I think MPICH has to be reinstalled after GCC 4.x is installed with the dependencies.  Leaving this note here in case of future problems.]

After the install script finished running, it told me that it had not updated my shell profile.  So, we're adding a line to the ~/.profile script.  (We're using that, and not the bashrc file, because bash isn't automatically run when logging on to the breca account.  If I ever forget and try to use Torch without bash, this avoids the problem.)

Do the following on all nodes:
echo ". /mirror/breca/torch/install/bin/torch-activate" | sudo tee -a /mirror/breca/.profile
Now re-run the file, so the code you added is executed.
source ~/.profile
Installing this way allows you to only download the package once, but use it to install the software to all nodes in the cluster.  (and as a side note, the install-deps script doesn't detect Linux Mint - it's one of the reasons this walk-through is using Ubuntu Server)

Test that Torch has been installed:
th
Close the program
exit
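For a slightly stronger smoke test, trepl can evaluate a one-liner directly - assuming the install put th on your PATH, this should print 6:
th -e "print(torch.Tensor{1,2,3}:sum())"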

MPI compatibility with Torch

Source: https://github.com/sixin-zh/mpiT

Do this on the master node. You'll be able to access the downloaded files from all the nodes - they're going in the /mirror directory. Download from github and install.
cd ~/
mkdir -p tools && cd tools
git clone https://github.com/sixin-zh/mpiT
cd
Now do the rest of the steps on all the nodes, master and slave.
cd 
cd tools/mpiT
By default, MPI_PREFIX should be set to /usr.  See link.
export MPI_PREFIX="/usr"
echo "export MPI_PREFIX='/usr'" >> ~/.profile
Since I'm working with MPICH rather than OpenMPI (see cluster installation notes above),
luarocks make mpit-mvapich-1.rockspec

Tests

First, figure out how many processors you have.  You did that already; it's the sum of the numbers in your machinefile in the /mirror directory.  We'll say you have 12 cores.  Since our counting starts at 0, tell the computer you have 11.  Adjust according to your actual situation.

Next, use a bunch of terminals and log into each of your nodes simultaneously.  Install:
sudo apt-get install htop 
And run
htop
on each machine and watch the CPU usage as you perform the following tests.  If only the master node shows activity, you have a problem.  

Create ./data/torch7 in the home directory, and then download the test data to that location.  Ensure you're logged in as the MPI user.
mkdir -p ~/data/torch7/mnist10/ && cd ~/data/torch7/mnist10
wget http://cs.nyu.edu/~zsx/mnist10/train_32x32.th7
wget http://cs.nyu.edu/~zsx/mnist10/test_32x32.th7
cd ~/tools/mpiT
Now run the tests. Sanity check: did mpiT install successfully? Note: I ran into an 'error 75' at this point, and the solution was to explicitly define the location of the files involved starting from the root directory. 
mpirun -np 11 -f /mirror/machinefile th /mirror/breca/tools/mpiT/test.lua
Check that the MPI integration is working.  Move down to the folder with the asynchronous algorithms.
cd asyncsgd
I think this test only needs to run on the master node - as long as you've installed everything to all the nodes (as appropriate), it doesn't need to be run everywhere.  I think it's just checking that Torch is successfully configured to run on a CPU.
th claunch.lua
Test bandwidth: I have no idea what this does, but it fails if the requested number of processors is odd.  I'm sticking with the default of 4 processors, which (I'm guessing) is the number on a single node.  As long as it works...?  It seems to be checking the bandwidth through the cluster.  There isn't a whole lot of documentation.
mpirun -np 4 -f ../../../../machinefile th ptest.lua 
Try parallel mnist training - this is the one that should tell you what's up.  AFAIK, you'll probably end up using a variant of this code to run whatever analysis you have planned.  If you look inside, you'll notice that what you're running is some kind of abstraction - the algorithm (such as it is for a test run) seems to be implemented in goot.lua.  In fact, this is a 'real-world' test of sorts - the MNIST data set is the handwritten character collection researchers like to use for testing their models.
mpirun -np 11 -f ../../../../machinefile th mlaunch.lua
and this is as far as I've actually made it without errors (up to this point, barring abnormalities in the PCs used, everything works perfectly for me).

Install Word RNN

Clone the software from github.
mkdir ~/projects
cd ~/projects
git clone https://github.com/larspars/word-rnn.git
That's actually all there is to it.  Now cd into the word-rnn directory to run the test stuff.  Before the tests and tools, though, there's a fix that you have to perform.

Saturday, January 21, 2017

A Homemade Beowulf Cluster: Part 2, Machine Configuration

This section starts with a set of machines all tied together with an ethernet switch and running Ubuntu Server 16.04.1.  If the switch is plugged into the local network router, then the machines can be ssh'd into.

This should be picking up right where Part 1 left off. src.  So.

Enabling Scripted, Sudo Remote Access

The first step in the configuration process is to modify the root-owned host files on each machine.  I'm not doing that by hand, and I've already spent way too long trying to find a way to edit root-owned files through ssh automatically.

It's not possible without "security risks".  Since this is a local cluster, and my threat model doesn't include -- or care about -- people hacking in to the machines or me messing things up, I'm going the old fashioned way.  I also don't care about wiping my cluster accidentally, since I'm documenting the exact process I used to achieve it (and I'm making backups of any data I create).

Log into each machine in turn, and enter the password when prompted.
ssh beowulf@grendel-[X]
Recall that the password is
hrunting
Create a password for the root account.
sudo passwd root
At the prompt, enter your password.  We'll assume it's the same as the previously-defined user.
hrunting
Now the root account has a password, but it's still locked.  Time to unlock it.
sudo passwd -u root 
Note: if you ever feel like locking the root account again, run this:
sudo passwd -l root
Now you have to allow the root user to login via ssh.  Change an option in this file:
sudo nano /etc/ssh/sshd_config
Find the line that says:
PermitRootLogin prohibit-password
and comment it out (so you have a record of the default configuration) and add a new line below it. They should look like this:
#PermitRootLogin prohibit-password
PermitRootLogin yes
[CTRL-O] and [CTRL-X] to exit, then run:
sudo service ssh restart
That's it!  Now we can use sshpass to automatically login to the machines and modify root files.  Be careful; there is nothing between you and total destruction of your cluster.
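For example, here's a hypothetical smoke test of scripted root access (the first-time host-key confirmation still applies):
sshpass -p 'hrunting' ssh root@grendel-1 whoami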

Upload a custom /etc/hosts file to each machine

I created a script to do this for me.  If I could have found a simple way to set static IPs that would have been preferable, but this way I don't have to manually rebuild the file every time the cluster is restarted.

Note: for now, this isn't compatible with my example - it only uses node increments of digits, while my example is using letters (grendel-b vs grendel-1).  I'll fix that later.  For now, I'd recommend reading all the way to the end of the walkthrough before starting, and just using numbers for your node increments.

Run the script from a separate computer that's on the local network (i.e., that can ssh into the machines), but which isn't one of the machines in the cluster.  Usage of the script goes like this:
bash create_hosts_file.sh [MACHINE_COUNT] [PASSWORD] [HOSTNAME_BASE]
Where HOSTNAME_BASE is the standard part of the hostname of each computer - if the computers were named grendel-a, grendel-b, and grendel-c, then the base would be "grendel-".

So, continuing the example used throughout and pretending there's 5 machines in total, this is what the command would look like:
mkdir -p ~/scripts && cd ~/scripts
wget https://raw.githubusercontent.com/umhau/cluster/master/create_hosts_file.sh
bash create_hosts_file.sh 5 "hrunting" "grendel-"
If you don't get any errors, then you're all set! You can check the files were created by ssh'ing into one of the machines and checking /etc/hosts.
ssh beowulf@grendel-a
cat /etc/hosts
The output should look something like this:
127.0.0.1     localhost
192.168.133.100 grendel-a
192.168.133.101 grendel-b
192.168.133.102 grendel-c
192.168.133.103 grendel-d
If it doesn't look like that, with a line for localhost and one line after it for each machine, you're in trouble.  Google is your friend; it worked for me.

Creating a Shared Folder Between Machines

This way, I can put my script with fancy high-powered code in one place, and all the machines will be able to access it.

First, dependencies.  Install this one just on the 'master' node/computer (generally, the most powerful computer in the cluster, and definitely the one you labelled #1).
sudo apt-get install nfs-kernel-server
Next, install this on all the other machines:
sudo apt-get install nfs-common
Ok, we need to define a folder that can be standardized across all the machines: same idea as having a folder labeled "Dropbox" on each computer that you want your Dropbox account synced to - except in this case, the syncing is a little different.  Anything you put in the /mirror folder of the master node will be shared across all the other computers, but anything you put in a /mirror folder of the other nodes will be ignored.  That's why it's called a 'mirror' - there's a single folder that's being 'mirrored' by other folders.

We'll put it in the root directory.  Since we're mirroring it across all the machines, call it 'mirror'. Do this on all the machines:
sudo mkdir /mirror
Now go back to the master machine, and tell it to share the /mirror folder to the network: add a line to the /etc/exports file, and then restart the service.
echo "/mirror *(rw,sync)" | sudo tee -a /etc/exports
sudo service nfs-kernel-server restart
Maybe also add the following to the (rw,sync) options above:

  • no_subtree_check: This option prevents the subtree checking. When a shared directory is the subdirectory of a larger filesystem, nfs performs scans of every directory above it, in order to verify its permissions and details. Disabling the subtree check may increase the reliability of NFS, but reduce security.
  • no_root_squash: This allows the root account to connect to the folder.

Great!  Now there's a folder on the master node on the network that we can mount and automatically get stuff from.  Time to mount it.

There are two ways to go about this - one, we could manually mount on every reboot, or two, we could automatically mount the folder on each of the 'slave' nodes.  I like the second option better.

There's a file called fstab in the /etc directory.  The name stands for 'file system table'.  This is what the OS uses on startup to know which partitions to mount.  What we're going to do is add another entry to that file - on every startup, it'll know to mount the network folder and present it like another external drive.

On each non-master machine (i.e., all the slave machines) run this command to append a new entry to the bottom of the fstab file.  The bit in quotes is the part getting added.
echo "grendel-a:/mirror    /mirror    nfs" | sudo tee -a /etc/fstab
That line is telling the OS a) to look for a drive located at grendel-a:/mirror, b) to mount it at the location /mirror, and c) that the drive is a 'network file system'.  Remember that if you're using your own naming scheme to change 'grendel-a' to whatever the hostname of your master node is.

Now, in lieu of rebooting the machines, run this command on each slave machine to go back through the fstab and remount everything according to whatever it (now) says.
sudo mount -a
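You can confirm the share actually mounted on each slave:
df -h /mirror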

Establishing A Seamless Communication Protocol Between Machines

Create a new user

This user will be used specifically for performing computations.  If beowulf is the administrative user, and root is being used as the setting-stuff-up-remotely-via-automated-scripts user, then this is the day-to-day-heavy-computations user.

The home folder for this user will be inside /mirror, and it's going to be given the same userid across all the accounts (I picked '1010') - we're making it as identical as possible for the purposes of using all the machines in the cluster as a single computational device.

We'll call the new user 'breca'.  Just for giggles, let's make the password 'acerb'.  Run the first command on the master node first, and the slaves afterwards.
sudo useradd --uid 1010 -m -d /mirror/breca breca
Set a password.  Run on all nodes.
sudo passwd breca
Add breca to the sudo group.
sudo adduser breca sudo
Since 'breca' will be handling all the files in the /mirror directory, we'll make that user the owner.  Run this only on the master node.
sudo chown -R breca: /mirror

Setting up passwordless SSH for inter-node communication

Next, a dependency.  Install this to each node (master and slaves):
sudo apt-get install openssh-server
Next, login to the new user on the master node.
su - breca
On the master node, generate an RSA key pair for the breca user.  Keep the default location.  If you feel like it, you can enter a 'strong' passphrase, but we've already been working under the assumption security isn't important here.  Do what you like; nobody is going after your cluster (you hope).
ssh-keygen -t rsa
Add the key to your 'authorized keys'.
cd .ssh
cat id_rsa.pub >> authorized_keys
cd
And the nice thing is, what you've just done is being automatically mirrored to the other nodes.

With that, you should have passwordless ssh communication between all of your nodes.  Login to your breca account on each machine:
su - breca
After logging in to your breca account on all of your machines, test your passwordless ssh capabilities by running -- say, from your master node to your first slave node --
ssh grendel-b
or from your second slave node into your master node:
ssh grendel-a
The only thing you should have to do is type 'yes' to confirm that a host fingerprint is authentic, and that's a first-time-only sort of thing.  However, because confirmation is requested, you have to perform the first login manually between each pair of machines - otherwise communication will fail.  I haven't checked whether it's necessary to ensure communication between slave nodes, so I did those too.

Note that since the same known_hosts file is shared among all the machines, it's only ever necessary to confirm a machine once.  So you could just log into all the machines consecutively from the master node, and once into the master node from one of the slaves, and all the nodes would thereafter have seamless ssh communication.

Troubleshooting

This process worked for me, following this guide exactly, so there's no reason it wouldn't work for you as well.  If a package is changed since the time of writing, however, it may fail in the future.  See section 7 of this guide to set up a keychain, which is the likely solution.
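Another quick thing to check is permissions - sshd refuses to use keys kept in a world-accessible directory, so tightening them is a harmless first step:
chmod 700 /mirror/breca/.ssh
chmod 600 /mirror/breca/.ssh/authorized_keys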

If, after rebooting, you can no longer automatically log into your breca account within the nodes (master-to-slave, etc.), the /mirror mounting procedure may have been interrupted - i.e., possibly a network disconnect when /etc/fstab was processed, such that grendel-a:/mirror couldn't be found.  If that's the case, the machines can't connect without passwords because they don't have access to the RSA key stored in the missing /mirror/breca/.ssh directory.  Log into each of the affected machines and remount everything in the fstab.
sudo mount -a

Installing Software Tools

You've been in and out of the 'beowulf' and 'breca' user accounts while setting up ssh.  Now it's time to go back to the 'beowulf' account.  If you're still in the breca account, run:
exit
These are tools the cluster will need to perform computations.  It's important to install all of this stuff prior to the MPICH2 software that ties it all together - I think the latter has to configure itself with reference to the available software.

If you're going to be using any compilers besides GCC, this is the time to install them.

This installs GCC.  Run it on each computer.
sudo apt-get install build-essential
I'm probably going to want Fortran as well, so I'm including that.
sudo apt-get install gfortran

Installing MPICH

And now, what we've all been waiting for: the commands that will actually make these disparate machines act as a single cluster.  Run this on each machine:
sudo apt-get install mpich
You can test that the install completed successfully by running:
which mpiexec
which mpirun
The output should be:
/usr/bin/mpiexec
and
/usr/bin/mpirun

The Machinefile

This 'machinefile' tells the mpich software what computers to use for computations, and how many processors are on each of those computers.  The code you run on the cluster will specify how many processors it needs, and the master node (which uses the machinefile) will start at the top of the file and work downwards until it has found enough processors to fulfill the code's request.  

First, find out how many processors you have available on each machine (the output of this command will include virtual cores).  Run this on each machine.
nproc
Next, log back into the breca user on the master node:
su - breca
Create a new file in the /mirror directory of the master node and open it:
touch machinefile && nano machinefile 
The order of the machines in the file determines which will be accessed first.  The format of the file lists the hostnames with the number of cores they have available.
grendel-c:4
grendel-b:4
grendel-a:4
You might want to remove one of the master node's cores for control purposes.  Who knows?  Up for experimentation.  I put the master node last for a similar reason.  The other stuff can get tied up first.

You should be up and running!  What follows is a short test to make sure everything is actually up and running.

Testing the Configuration

Go to your master node, and log into the breca account.
ssh beowulf@grendel-a
su - breca
cd into the /mirror folder.
cd /mirror
Create a new file called mpi_hello.c
touch mpi_hello.c && nano mpi_hello.c
Put the following code into the file, and [ctrl-o] and [ctrl-x] to save and exit.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int myrank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello from processor %d of %d\n", myrank, nprocs);

    MPI_Finalize();
    return 0;
}
Compile the code with the custom MPI C compiler:
mpicc mpi_hello.c -o mpi_hello
And run.
mpiexec -n 11 -f ./machinefile ./mpi_hello
Here's a breakdown of the command:
mpiexec              command to execute an mpi-compatible binary

-n 11                the number of cores to ask for - this should not be
                     more than the sum of cores listed in the machinefile

-f ./machinefile     the location of the machinefile

./mpi_hello          the name of the binary to run
If all went as hoped, the output should look like this:
Hello from processor 0 of 11
Hello from processor 1 of 11
Hello from processor 2 of 11
Hello from processor 3 of 11
Hello from processor 4 of 11
Hello from processor 5 of 11
Hello from processor 6 of 11
Hello from processor 7 of 11
Hello from processor 8 of 11
Hello from processor 9 of 11
Hello from processor 10 of 11
Make sure the sum of the processors listed in your machinefile corresponds to the number you asked for in the mpiexec command.

Note that you can totally ask for more processors than you actually listed - the MPICH software will assign multiple threads to each core to fulfill the request.  It's not efficient, but better than errors.

And that's it!  You have a working, tested beowulf cluster.
