http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
http://www.petur.eu/blog/?p=59
http://webappl.blogspot.ca/2011/05/setting-up-mpich2-cluster-with-ubuntu.html
So how exactly did I set up my cluster? The steps are below. Detailed information on each of them can be found in the sources listed above; here is just the list:
1. Install MPICH2 - sudo aptitude install mpich2
2. Edit the /etc/hosts file to include entries for the cluster hosts. Don't forget to remove the 127.0.1.1 address for your hostname.
3. Enable ssh-key login for a dedicated user (I used one named cluster).
4. Create an MPICH host file. I named mine mpd.hosts, which is not a default name (I don't know whether a default exists at all), so I pass it to mpiexec explicitly with the -f parameter.
5. Set up .mpd.conf with the password.
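As a rough illustration of step 4, here is a minimal sketch of creating the host file. The host names d.lan and x.lan are taken from the tcpdump capture further down; the :2 suffix (two processes per host) is my assumption about the syntax, based on the host:count form the Hydra documentation describes, and write_hostfile is just a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: write one "host:processes" entry per line.
# Usage: write_hostfile FILE ENTRY...
write_hostfile() {
    out=$1
    shift
    printf '%s\n' "$@" > "$out"
}

# Assumed layout: two hosts, two cores each (adjust names and counts).
HOSTFILE="${HOSTFILE:-./mpd.hosts}"
write_hostfile "$HOSTFILE" d.lan:2 x.lan:2
cat "$HOSTFILE"
```

The resulting file is what gets passed to mpiexec with -f, as in the commands below.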
That's all it takes to have a functional cluster. But how do you prove that the work is really spread across the hosts? I used three tools to demonstrate this.
1. mpi_hello.c. There are many variations of this simple program; just choose one of them.
2. John the Ripper, as described in the petur.eu blog
3. tcpdump
Now the results. Compile mpi_hello.c and John. On one of the hosts, run tcpdump host other_host. On the other, execute mpi_hello like this:
cluster@d:~$ mpiexec -f mpd.hosts -n 4 ./mpi_hello
Hello from processor 2 of 4
Hello from processor 0 of 4
Hello from processor 3 of 4
Hello from processor 1 of 4
Well, that works even with a single host and gives the same result. The real proof is the tcpdump output on the other host:
tcpdump host host_2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:54:34.439518 ARP, Request who-has x.lan tell d.lan, length 46
16:54:34.439588 ARP, Reply x.lan is-at 08:00:27:dc:9d:cc (oui Unknown), length 28
16:54:34.440101 IP d.lan.33199 > x.lan.ssh: Flags [S], seq 922409158, win 14600, options [mss 1460,sackOK,TS val 2205427 ecr 0,nop,wscale 3], length 0
16:54:34.440157 IP x.lan.ssh > d.lan.33199: Flags [S.], seq 1120118912, ack 922409159, win 14480, options [mss 1460,sackOK,TS val 2210524 ecr 2205427,nop,wscale 3], length 0
16:54:34.441520 IP d.lan.33199 > x.lan.ssh: Flags [.], ack 1, win 1825, options [nop,nop,TS val 2205428 ecr 2210524], length 0
16:54:34.456768 IP x.lan.ssh > d.lan.33199: Flags [P.], seq 1:40, ack 1, win 1810, options [nop,nop,TS val 2210528 ecr 2205428], length 39
16:54:34.457608 IP d.lan.33199 > x.lan.ssh: Flags [.], ack 40, win 1825, options [nop,nop,TS val 2205432 ecr 2210528], length 0
16:54:34.458207 IP d.lan.33199 > x.lan.ssh: Flags [P.], seq 1:40, ack 40, win 1825, options [nop,nop,TS val 2205432 ecr 2210528], length 39
16:54:34.458381 IP x.lan.ssh > d.lan.33199: Flags [.], ack 40, win 1810, options [nop,nop,TS val 2210529 ecr 2205432], length 0
16:54:34.460435 IP x.lan.ssh > d.lan.33199: Flags [P.], seq 40:1024, ack 40, win 1810, options [nop,nop,TS val 2210529 ecr 2205432], length 984
16:54:34.462420 IP d.lan.33199 > x.lan.ssh: Flags [P.], seq 40:1312, ack 1024, win 2071, options [nop,nop,TS val 2205433 ecr 2210529], length 1272
16:54:34.502874 IP x.lan.ssh > d.lan.33199: Flags [.], ack 1312, win 2172, options [nop,nop,TS val 2210540 ecr 2205433], length 0
--- cut ---
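A lighter check than a packet capture, offered only as a sketch: mpiexec can launch any executable, so running hostname under it should show where each process lands. The ranks_per_host helper is a hypothetical name of mine, and the guard simply skips the run when mpiexec or mpd.hosts is not available.

```shell
#!/bin/sh
# Hypothetical helper: read one host name per line on stdin and
# print how many processes reported each host.
ranks_per_host() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Only attempt the real run when mpiexec and the host file exist.
if command -v mpiexec >/dev/null 2>&1 && [ -f mpd.hosts ]; then
    mpiexec -f mpd.hosts -n 4 hostname | ranks_per_host
fi
```

If more than one host name shows up in the output, the processes really are distributed.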
No need to show more of the capture: the first host is clearly talking to the second (over ssh) at the moment mpi_hello executes. Let's do one more test with John. First run the standalone test, then the MPI-enabled one.
cluster@d:~/src/john-1.7.2-bp17-mpi8$ run/john -format=DES -test
Benchmarking: Traditional DES [128/128 BS SSE2]... DONE
Many salts: 1112K c/s real, 1112K c/s virtual
Only one salt: 1011K c/s real, 1013K c/s virtual
cluster@d:~/src/john-1.7.2-bp17-mpi8$ mpiexec -f ~/mpd.hosts -n 4 run/john -format=DES -test
Benchmarking: Traditional DES [128/128 BS SSE2]... DONE
Many salts: 2804K c/s real, 2806K c/s virtual
Only one salt: 3790K c/s real, 3806K c/s virtual
Definitely different results. The first run uses one core only. In my config the second one runs on two hosts, two cores each. The result is not x4, but it is more than x2 (roughly x2.5 for many salts and x3.7 for one salt), which proves the point. Run tcpdump again if you want, or better, start top on the second host and watch john kick in like this:
top - 17:11:50 up 2:49, 1 user, load average: 0.10, 0.06, 0.05
Tasks: 93 total, 3 running, 90 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.5%us, 0.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 507796k total, 462940k used, 44856k free, 52220k buffers
Swap: 1046524k total, 272k used, 1046252k free, 330820k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6106 cluster 20 0 13248 1692 1420 R 96 0.3 0:03.05 john
6105 cluster 20 0 13248 1720 1448 R 96 0.3 0:03.15 john
988 lightdm 20 0 99028 11m 9492 S 1 2.3 0:35.98 lightdm-gtk-gre
5959 root 20 0 2832 1232 984 R 1 0.2 0:00.10 top
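Coming back to the John benchmark numbers, the speedup can be worked out with a few lines of shell. This is only a sketch: speedup is a hypothetical helper name, the inputs are the "real" c/s figures (in thousands) from the two runs above, and efficiency here simply means speedup divided by the number of MPI processes.

```shell
#!/bin/sh
# Hypothetical helper.
# $1 = single-process c/s, $2 = MPI c/s, $3 = number of MPI processes
speedup() {
    awk -v base="$1" -v par="$2" -v n="$3" 'BEGIN {
        s = par / base
        printf "%.2fx speedup, %.0f%% efficiency\n", s, 100 * s / n
    }'
}

speedup 1112 2804 4   # many salts
speedup 1011 3790 4   # only one salt
```

With the figures above this prints 2.52x speedup, 63% efficiency for many salts and 3.75x speedup, 94% efficiency for one salt.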
That's it. Now you can put your cluster to some real work.