Tuesday, February 20, 2007
RB Wrestling: the comeback
This morning, looking at the monitoring, our RB does not look happy. You can judge for yourself from the plot below. It seems clear that when the submission rate is too high, the workload manager cannot consume the jobs fast enough to reduce the queue length. I have asked Maarten for help; we'll see what he comes up with. I think I will have a look at the RB code to find out what is going on...
Monday, February 19, 2007
Certificates and Mars
I was worried by the low number of jobs running at LeSC.
It is very difficult to get hold of the output files of failing jobs. Thanks to SGE we can find out where they are located:
- qstat -j jobid prints the location of the std.err and std.out of the given jobid
The error turned out to be:
globus_i_gsi_gss_utils.c:2155: globus_i_gsi_gssapi_init_ssl_context: Error with openssl: Couldn't open bio for reading on file: /homes/lt2-lcg/grid-security/certificates/47d3d1a0.0
This is because, when the files were untarred into the certificates directory, one of the certificates was not readable by the lt2-[users]. This is now fixed, and I will chase up LHCb to check that they can run there without problems.
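To catch this kind of permission problem before jobs start failing, one could scan the certificates directory for files that other users cannot read. The following is only a minimal sketch: the function name unreadable_by_others is hypothetical, and it assumes that the missing world-read bit is what blocked the lt2 users (group permissions could equally be involved on a given setup).

```python
import os
import stat


def unreadable_by_others(directory):
    """Return sorted paths of regular files under `directory` without the world-read bit.

    Assumption: the pool accounts reach the files through the "other"
    permission class; adapt the bit test if access is granted by group.
    """
    bad = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                mode = os.stat(path).st_mode  # follows symlinks
            except OSError:
                # Broken symlink or vanished file: report it as unreadable.
                bad.append(path)
                continue
            if stat.S_ISREG(mode) and not mode & stat.S_IROTH:
                bad.append(path)
    return sorted(bad)
```

Calling unreadable_by_others("/homes/lt2-lcg/grid-security/certificates") would then list any file in the same state as 47d3d1a0.0 was.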
Wrestling with our RB
We are still seeing very long times (several minutes) for a job to go from the waiting state to the scheduled state. This means that the network server of the RB is accepting the job, but the workload manager is running out of steam to process it and do the matchmaking.
- I monitored the RB by looking at the number of entries in the input queue (/var/log/edgwl/workload_manager/input.fl), counting the entries that match the regular expression ("g$").
- I then plotted the number of entries waiting to be accepted by the workload manager as a function of time. The result is here.
- You can see a clear drop at the end of the x range. I think this is because I reduced the number of threads for the network server and increased the number for the workload manager. The file to look at is /opt/edg/etc/edg_wl.conf.
- For the NetWorkServer:
- MasterThreads = 4;
- DispatcherThreads = 6;
- For the WorkLoadManager:
- NumberOfWorkerThreads = 10;
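The queue monitoring described above can be sketched in a few lines of Python. This is an illustration, not the script actually used: the function names count_pending and sample are hypothetical, and it assumes that each entry of input.fl occupies one line, so that counting the lines matching "g$" gives the number of waiting entries.

```python
import re
import time

# The regular expression quoted in the post; its meaning (entry still
# waiting for the workload manager) is taken from the post, not verified.
PATTERN = re.compile(r"g$")


def count_pending(path, pattern=PATTERN):
    """Count the lines of the file at `path` that match `pattern`."""
    with open(path, errors="replace") as handle:
        return sum(1 for line in handle if pattern.search(line.rstrip("\n")))


def sample(path, samples, interval_s):
    """Return `samples` pairs of (unix_time, count), taken `interval_s` seconds apart."""
    points = []
    for i in range(samples):
        points.append((time.time(), count_pending(path)))
        if i < samples - 1:
            time.sleep(interval_s)
    return points
```

For example, sample("/var/log/edgwl/workload_manager/input.fl", 60, 60) would collect one point per minute for an hour, which can then be plotted as queue length against time.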
Monday, February 12, 2007
Resurrection of the Blog
ICT the Grid
Today we are back in business with ICT to get their cluster on the Grid.
- They will provide one machine and install RHEL3 i386 so that we avoid the RHEL4 problem.
- We have to find out how to modify the information system, since they are running PBS Pro, which does not have exactly the same commands as PBS.
- They will create the pool accounts, and we have yet to make sure that we can run prolog scripts to set up the LCG environment correctly.
- ATLAS cannot install the tags. They tried to install the new software, but it is not published correctly.
- Maybe this is because we are publishing another subcluster for the 64-bit queues. I'll make a wiki entry explaining how this was done. The dynamic information does not seem to be correct, though.
- Mona has enabled camont and total on our RB. We now need a site to test it, and we also need to enable it on the LeSC and IC-HEP CEs.