Friday, March 11, 2011

RHUL cluster expands


Yesterday, RHUL took delivery of new storage and compute nodes to beef up its Tier2 cluster.
The GridPP and CIF funded kit was supplied by Dell and is being installed and configured by Alces.
The extra 6.3 kHS06 and 420 TB will more than double the capacity of the cluster.
Once the installation is complete and accepted, work to integrate it with the existing cluster and bring up the gLite services will begin.

Friday, February 19, 2010

RHUL 'Newton' cluster comes home

After two years hosted by Imperial College, our 'Newton' Grid computing cluster has finally been relocated to Royal Holloway's new state-of-the-art computer centre. The move was carried out by Clustervision and everything went smoothly. Before the cluster goes back into production, analysing LHC data, a software upgrade to SL5 is planned.

A small part of Newton remains at IC: the racks were donated to become part of the particle physics cluster.

Friday, July 31, 2009

Comparing ATLAS analysis at RHUL using the file-staging and RFIO approaches

I have been looking at the performance of the Royal Holloway cluster during Hammercloud tests in which data was accessed directly from the DPM pool nodes using the RFIO protocol and comparing it to the recent UK-wide file-staging test (540).

For the RFIO approach, two identical tests (537 and 538) were requested in order to ensure enough jobs arrived on site. The RFIO IOBUFSIZE was set to 4 KB. Job CPU efficiencies and cluster throughput (the product of the number of running jobs and the average job efficiency) were extracted using Sam and Dug's script. The job throughput climbed steadily up to a peak at around 320 running jobs. At this point the throughput started to decline, probably compounded by the fact that one of the disk servers lost a disk and became overloaded.
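The throughput figure used here is simple to reproduce. The following is a minimal sketch of the definition given above, not Sam and Dug's script: the function names are hypothetical, and the sample values are made-up toy numbers rather than measurements from tests 537 and 538.

```python
from typing import Iterable, Tuple

# One sample: (number of running jobs, mean CPU efficiency as a fraction 0..1)
Sample = Tuple[int, float]


def cluster_throughput(running_jobs: int, mean_efficiency: float) -> float:
    """Throughput as defined in the post: running jobs x average job efficiency."""
    if running_jobs < 0:
        raise ValueError("running_jobs must be non-negative")
    if not 0.0 <= mean_efficiency <= 1.0:
        raise ValueError("mean_efficiency must be a fraction between 0 and 1")
    return running_jobs * mean_efficiency


def peak_sample(samples: Iterable[Sample]) -> Sample:
    """Return the sample at which the throughput is highest."""
    return max(samples, key=lambda s: cluster_throughput(s[0], s[1]))


# Toy data for illustration only (hypothetical, not taken from the tests).
toy_samples = [(10, 0.9), (20, 0.8), (30, 0.5)]
best = peak_sample(toy_samples)
```

With these toy values the throughput rises from 9 to 16 and then falls to 15, so the peak sits at 20 running jobs even though more jobs were running later. That is the same shape as the turnover described above.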


The CPU efficiency declined relatively consistently as the number of running jobs increased:


Each job was reading data at about 1 MB/s, so at the peak the total bandwidth was around 350 MB/s, or roughly 30 MB/s per disk server. The disk servers were working hard, however: the iostat %util values were around 100%, with high CPU iowait values.
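As a back-of-envelope check of these rounded figures, here is a minimal sketch. The function names are hypothetical, the job count of 350 is an assumption chosen to match the quoted total, and the number of disk servers is not stated in this post; it is only implied by the two quoted rates.

```python
def total_bandwidth_mb_s(running_jobs: int, per_job_mb_s: float) -> float:
    """Aggregate read rate if every running job reads at the same rate."""
    return running_jobs * per_job_mb_s


def implied_server_count(total_mb_s: float, per_server_mb_s: float) -> int:
    """Number of disk servers implied by a total rate and a per-server rate."""
    if per_server_mb_s <= 0:
        raise ValueError("per_server_mb_s must be positive")
    return round(total_mb_s / per_server_mb_s)


# Assumed inputs: about 350 jobs at about 1 MB/s each (rounded values).
total = total_bandwidth_mb_s(350, 1.0)
# Rounded rates quoted in the post: ~350 MB/s total, ~30 MB/s per server.
servers = implied_server_count(total, 30.0)
```

Because both quoted rates are approximate, the resulting server count is only a rough consistency check and not a statement about the actual pool layout.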

So how do these results compare to those obtained when staging files to the worker node prior to analysis? This graph shows the same RFIO throughput data together with results from the recently run file-staging test:



The throughput during file-staging leveled off earlier, at around 175 running jobs. Similarly, the average job efficiency dropped more steeply:


The job failure rate for the RFIO tests was 4% compared to 17% for the file-staging test.

Wednesday, May 13, 2009

RHUL getting good rates into MCDISK from RAL

RHUL has regularly got good rates, by which I mean 80-100 MB/s, from Fermilab when downloading CMS data. It is nice now to see similarly high rates downloading ATLAS data into the MCDISK space token from RAL.


Wednesday, April 16, 2008

Exercised space token creation at UCL-HEP

I thought it would be neat to give it a try, so as a test I created a small reservation for dteam, following the instructions on the LCG Twiki. All went well, and all the tweaks for SL3 / gLite 3.0 worked. The only oddity was that:
[root@pc55 root]# dpm-reservespace --gspace 10M --lifetime Inf --group lcgdteam --token_desc dteam_10M
send2nsd: NS009 - fatal configuration error: Host unknown: UNUSED
invalid group: lcgdteam
but:
[root@pc55 root]# dpm-reservespace --gspace 10M --lifetime Inf --gid 2688 --token_desc dteam_10M
worked well. Perhaps this is because the group id is not the same as the VO name? (I also tried 'dteam' in place of 'lcgdteam', but got the same error.)

Wednesday, March 26, 2008

RHUL aircon problems

Our machine room aircon system broke down last week and the temperatures have been all over the place.

After a few days of summer clothing and a few nights of temperature alarms, it was diagnosed to be a refrigerant gas leak from the chiller on the roof. The bad news is that this takes 2 weeks to fix. Luckily the estates engineer was very efficient and organised the delivery and connection of a backup chiller on the last day of term, then personally looked in over Easter to keep an eye on it.

It has been stable the last few days so I've just brought the cluster back up. The site will come out of downtime this evening.

Tuesday, August 07, 2007

UCL-HEP APEL accounting fixed

After upgrading to gLite r27 on the 4th of July, APEL stopped publishing to the central RGMA registry. The apel-publisher failed with an unhandled
RGMABufferFullException
To fix this, we had to update to the latest version of the APEL RPMs (2.0.5-1) on the MON and CE and re-run YAIM on both.