Sunday, April 21, 2013

The art of cabling

The challenge of organising your cables behind your TV is nothing compared to that of a large computing cluster.

One of our standard racks contains 12 Dell R510 servers (for storage) and 6 Dell C6100 chassis (providing 24 compute nodes). Each of the 36 nodes is connected with a 10 Gb (SFP+), a 1 Gb (backup) and a 100 Mb (IPMI) network cable, running to three different network switches at the top of the rack. In addition, the 18 "boxes" need a total of 36 power connections. That makes 144 cables per rack!
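
The cable count can be sanity-checked with a little arithmetic (a quick sketch using the per-rack numbers above):

```python
# Cable count for one standard rack, using the numbers above.
nodes = 12 + 6 * 4          # 12 R510 storage servers + 6 C6100 chassis x 4 compute nodes
network_cables = nodes * 3  # one 10 Gb, one 1 Gb and one 100 Mb cable per node
boxes = 12 + 6              # physical enclosures (R510s + C6100 chassis)
power_cables = boxes * 2    # two power connections per box
total = network_cables + power_cables
print(nodes, network_cables, power_cables, total)  # 36 108 36 144
```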

How to cope? Separate the network cables from the power cables, a possible source of noise. Use different colour cables for the different traffic types and add a unique ID number to each cable. Use loose, removable cable ties. When a cable breaks, don't remove it; just add a new cable.
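
As an illustration, a colour-plus-ID labelling scheme like the one described could be generated along these lines (the label format here is hypothetical, not our actual scheme):

```python
# Hypothetical cable label generator: colour encodes traffic type,
# the ID encodes rack, node and network. Not our actual scheme.
COLOURS = {"10g": "orange", "1g": "green", "ipmi": "yellow"}

def cable_label(rack: int, node: int, network: str) -> str:
    """Return e.g. 'R03-N12-10G (orange)' for rack 3, node 12, the 10 Gb network."""
    colour = COLOURS[network]
    return f"R{rack:02d}-N{node:02d}-{network.upper()} ({colour})"

print(cable_label(3, 12, "10g"))  # R03-N12-10G (orange)
```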

The 10 Gb switches, in our case Dell S4810s, connect using four 40 Gb QSFP+ cables to two Dell Z9000 core switches. Having two core switches allows us to take one unit out of service without downtime (we use the VLT protocol and it works!). However, this does add cable complexity. The backup 1 Gb switches connect to each other in a daisy chain using 10 Gb CX4 cables left over from before our 10 Gb upgrade. Finally, the IPMI switches connect to a front-end switch using 1000BASE-T cables.

The picture shows the inter-switch links. Visible are the orange 40 Gb connections and blue 10 Gb CX4 cables. In addition, each 40 Gb cable has an ID indicating which rack it came from and which core switch it's going to.

We have one rack full of critical, world-facing servers. These servers need to be available at all times, making it very difficult to reorganise the cabling. As a result, over time, as we add and remove servers, the cabling becomes a mess. This is starting to become a risk! We are just going to have to accept some downtime to sort it out in the near future.

Monday, April 15, 2013

virtualization performance hit

Like the rest of the world, there is a lot of discussion going on in GridPP about the use of clouds and virtualization.

Using virtualization will have a performance impact, so using it for our type of computing (HPC/HTC) may not be the best solution. But just what impact does it have? A quick search of the web suggests anywhere between 3% and 30%. Most of the overhead appears to be in the kernel and in I/O.

I decided that I wanted to do some of my own tests, focused on the type of work we do in GridPP.

Testbed: a 24-thread Westmere system running at 2.66 GHz with 48 GB of memory, using Scientific Linux 6.3 (basically RHEL 6). I'm using the default install of KVM, with the virtual machine image as a local file, set up to use all 24 threads.

Benchmarks: 1) I unpack and make the ROOT analysis package using 24 threads; 2) as 1, but using only one thread; 3) I generate 500,000 Monte Carlo events using the Herwig++ generator; 4) as 3, but I also include the time taken to unpack and install Herwig++; 5) I run the HEP-SPEC06 benchmark. For tests 1 to 4 I use the time command to obtain the real (wall-clock) time taken (smaller is better); for 5 I report the HEP-SPEC06 score (larger is better). I run the benchmarks on the bare-metal install and on the VM on the same hardware and compare the results.


Out-of-the-box performance of KVM shows a ~3% (CPU-intensive) to 20% (system-call-intensive) reduction in performance. There is some indication of a correlation with the ratio of sys time to user time (a particular effect with make/tar/gzip?). This is not seen in the HEP-SPEC06 result. Sys time is the CPU time spent within the kernel, and from previous studies we expect this to incur a high performance hit under virtualization.
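
For reference, the overhead percentages are just the relative increase in wall-clock time on the VM versus bare metal (the times below are illustrative, not the measured results):

```python
# Virtualization overhead as relative slowdown in wall-clock ("real") time.
# The example times below are made up for illustration.
def overhead_pct(bare_metal_s: float, vm_s: float) -> float:
    """Percentage slowdown of the VM run relative to bare metal."""
    return 100.0 * (vm_s - bare_metal_s) / bare_metal_s

print(round(overhead_pct(1000.0, 1030.0), 1))  # 3.0  -> a CPU-intensive job
print(round(overhead_pct(1000.0, 1200.0), 1))  # 20.0 -> a syscall-intensive job
```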

If I get the time, I intend to: repeat the analysis using optimisations (e.g. the guest image on LVM); repeat the analysis using Fedora 18 (~RHEL 7); repeat it using a Sandy Bridge CPU; and look at network performance (e.g. IOzone over Lustre).

Wednesday, April 10, 2013

The Queen Mary Grid Cluster

The QMUL grid/HTC cluster is a high-throughput (HTC) research computing cluster based at Queen Mary, University of London. We primarily serve the scientific grid community and are funded by the GridPP collaboration (i.e. the UK STFC research council). By high throughput we mean the ability to run lots of individual, separate jobs. Our main workload is data analysis for the ATLAS experiment at CERN. We are the top site in the UK for this type of work, and one of the leading sites in the world for the ATLAS LHC experiment. We are part of the LondonGrid (hence the post to this blog!).

Our cluster comprises:

For running the actual jobs:
30 Dell C6100s using X5650 processors, contributing a total of 2880 job slots, and
60 older Streamline nodes using E5420 processors, contributing a total of 480 job slots.
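
The slot counts follow from the hardware specs, assuming the usual figures for these parts (X5650: 6 cores/12 threads per socket; E5420: 4 cores per socket, no hyper-threading; both dual-socket; 4 nodes per C6100 chassis, as in the cabling post above):

```python
# Job-slot arithmetic for the two node generations.
c6100_slots = 30 * 4 * 2 * 12   # 30 chassis x 4 nodes x 2 sockets x 12 threads (X5650)
e5420_slots = 60 * 2 * 4        # 60 nodes x 2 sockets x 4 cores (E5420, no HT)
print(c6100_slots, e5420_slots)  # 2880 480
```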

For storage, we run the Lustre parallel file system using:
72 Dell R510s with 1800 TB of disk, and
12 older Dell 1950s with MD1000 disk arrays, with 360 TB of disk.
Our actual provision is about 1600 TB due to the use of RAID 6 and "real" disk sizes.
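
The gap between raw and usable capacity is roughly accounted for by RAID 6 parity plus decimal-versus-binary disk sizes (a back-of-the-envelope sketch; the 12-disk array size is an assumption, not a stated configuration):

```python
# Rough usable-capacity estimate for the storage above.
# Assumptions: 12-disk RAID 6 arrays (2 parity disks each), and
# "real" sizes meaning marketing TB (10**12 bytes) vs binary TiB (2**40 bytes).
raw_tb = 1800 + 360                 # R510s + 1950/MD1000 arrays
raid6_factor = (12 - 2) / 12        # capacity left after 2 parity disks per array
decimal_factor = 10**12 / 2**40     # ~0.909: one "TB" of disk is ~0.909 TiB
usable = raw_tb * raid6_factor * decimal_factor
print(round(usable))  # 1637, close to the ~1600 TB quoted
```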

We have a lot of development work to do over the next year, which I hope to describe in this blog over the coming months, including:

  • A new monitoring system, probably based on OpenNMS.
  • A new deployment system to replace our hand-made Perl/Mason/Kickstart system, probably using Razor and Puppet.
  • A cloud stack: we've been doing scientific computing using the grid software, but this model of computing is likely to be replaced with a cloud-type model, so we will need to look at the various options (OpenStack, CloudStack or OpenNebula).

The 11 racks of the QMUL cluster