Tuesday, June 26, 2007

bdii counts


Promised to monitor the BDII. This is the plot of the BDII count from a while ago; I'll have to redo it for a longer period. It seems clear that it is not the entire site BDII that disappears but only individual entries, which is very probably correlated with load. We have seen the same behaviour with the CE MDS.
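A crude way to monitor this is to poll the BDII periodically and count how many entries it returns; a dip in the count without the whole site vanishing points to individual entries dropping out. This is a sketch: the host, port 2170 and base DN `o=grid` are assumptions that need adjusting for the actual site BDII.

```python
# Sketch: count LDAP entries returned by a BDII, to spot individual
# entries dropping out under load. Host/port/base are assumptions.
import subprocess

def count_bdii_entries(ldif_text):
    """Count entries (lines starting with 'dn:') in LDIF output."""
    return sum(1 for line in ldif_text.splitlines() if line.startswith("dn:"))

def query_bdii(host, port=2170, base="o=grid"):
    """Run ldapsearch against a BDII and return its raw LDIF output."""
    result = subprocess.run(
        ["ldapsearch", "-x", "-LLL", "-H", f"ldap://{host}:{port}",
         "-b", base, "dn"],
        capture_output=True, text=True, timeout=60)
    return result.stdout
```

Run from cron and logged with a timestamp, the counts give exactly the kind of plot mentioned above.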

Tuesday, June 19, 2007

RB very slow

Yesterday I was wrestling with our RB. It takes several hours for a job to go from Waiting to Scheduled, which means that the matchmaking process is overloaded. I think the reason was that the database had grown very big (4 GB, exactly 2^32 bytes). As suggested here, I cleaned the database and it seems better now. The problem is that I never got to the root of what was going wrong...
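Since the slowdown coincided with the database hitting 4 GB, a simple cron check on the on-disk size of the database directory would flag the problem before matchmaking degrades again. A minimal sketch; the directory to watch and the 4 GB threshold are assumptions, not anything the RB itself provides.

```python
# Sketch: warn when a database directory grows past a threshold.
# The path and threshold are assumptions; adjust for the actual RB host.
import os

def dir_size_bytes(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def check_db_size(path, limit_bytes=4 * 1024**3):
    """Return True if the directory is still under the limit."""
    return dir_size_bytes(path) < limit_bytes
```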

dCache failures (dcache-server-1.7.0-36)

Again this morning we have pools going down with a memory allocation problem:
--
06/19 00:45:58 Cell(sedsk01_5@sedsk01Domain) : Thread : ping got : java.lang.OutOfMemoryError: Java heap space
--
I think the only way we will solve this is to get hold of a dCache developer who can have a look. Clearly we did not have this problem when we were running the previous version (release 35).
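Until a developer can look at it, a small log scanner can at least tell us which cells are hitting OutOfMemoryError and how often, so a pool heading for trouble is noticed before it dies. This is a sketch matching the log format in the excerpts above; the log file path would be the relevant domain log (e.g. sedsk01Domain.log).

```python
# Sketch: tally java.lang.OutOfMemoryError occurrences per dCache cell
# from a domain log. The line format matches the log excerpts above.
import re
from collections import Counter

OOM_RE = re.compile(r"Cell\(([^)]+)\).*java\.lang\.OutOfMemoryError")

def count_oom_by_cell(lines):
    """Return a Counter mapping cell name -> number of OOM log lines."""
    counts = Counter()
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```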

Monday, June 18, 2007

dCache pools went down

Since Friday afternoon several dCache pools have gone down. They ran out of memory; here is the content of the sedsk01Domain.log file.

06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at java.lang.Thread.run(Thread.java:595)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : Storing incomplete file : 0003000000000000006E0B80 with 2756018417
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : Stacked Exception (Original) for : 0003000000000000006E0B80 <-P---------(0)[0]> 2756018417 si={cms:cms} : CacheException(rc=10006;msg=Pnfs request timed out)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : Stacked Throwable (Resulting) for : 0003000000000000006E0B80 <-P---------(0)[0]> 2756018417 si={cms:cms} : CacheException(rc=33;msg=Illegal State Transition -P-------- -> -P--------)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : CacheException(rc=33;msg=Illegal State Transition -P-------- -> -P--------)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at diskCacheV111.repository.CacheRepository2$CacheEntry.setPrimaryState(CacheRepository2.java:107)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at diskCacheV111.repository.CacheRepository2$CacheEntry.setPrecious(CacheRepository2.java:219)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at diskCacheV111.repository.CacheRepository2$CacheEntry.setPrecious(CacheRepository2.java:215)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at diskCacheV111.pools.MultiProtocolPool2$RepositoryIoHandler.run(MultiProtocolPool2.java:1538)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:64)
06/15 16:32:13 Cell(sedsk01_1@sedsk01Domain) : at java.lang.Thread.run(Thread.java:595)
06/15 16:35:02 Cell(c-100@sedsk01Domain) : runIO : java.lang.OutOfMemoryError: Java heap space
06/15 16:35:02 Cell(c-100@sedsk01Domain) : java.lang.OutOfMemoryError: Java heap space
06/15 16:35:02 Cell(c-100@sedsk01Domain) : java.lang.OutOfMemoryError: Java heap space
06/15 16:38:25 Cell(c-100@sedsk01Domain) : runIO : java.lang.OutOfMemoryError: Java heap space


dCache is started with these parameters:
-server -Xmx512m -XX:MaxDirectMemorySize=512m

We don't know what happened.

Friday, June 15, 2007

Dataset access problem at IC-HEP

Some users are experiencing dataset access problems at IC-HEP. The ticket in question is GGUS 22106. The puzzle is that our CMS users don't have the problem for the same dataset.
This raises the question of how to debug such problems when you don't have the affected users at hand. In this case the only solution will be to work through it interactively with the user.

SAM Failures in London

Summary of SAM failures and solutions
  • mars-ce2: CA certificates updated, but permissions were wrong for the lt2-lcg group and hence the certs were not readable. Fixed now.
  • hep-ce:
    • Update of the images: missing ssl and uuid libraries caused the lcg-cp tools to fail. Matt solved this.
    • Updated the CA certificates, but unfortunately the CRL cron job did not run since it is run by mona. Now fixed.
  • gw-2 (UCL-CENTRAL): Investigated intermittent failures and discovered that the SAM jobs are sometimes killed by SGE, which enforces a vmem limit of 2 GB. The problem is that when Python creates a new thread, it tries to use the maximum stack size of the parent process. Since SGE sets this to a very high value, every new thread tries to allocate a big stack and the vmem limit is reached. The solution is to change the maximum stack size in the SGE configuration. We tried a ulimit -s 10 in the jobmanager, but since then gw-2 has been failing the ops test consistently. William has been contacted to revert this change and make the modification in the SGE queue configuration instead.
    • Note: this problem was seen on the ic-hep cluster (ce00) and fixed using the stack size limit.
  • ce1.pp (RHUL): gatekeeper problem; it seems I cannot get access with the ssh keys I am using at home. Have to check from IC.
It's a black week for the availability in London...
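The gw-2 thread-stack problem can also be worked around inside the Python job itself: `threading.stack_size()` sets an explicit per-thread stack size, so new threads no longer reserve the parent's huge `ulimit -s` value. A minimal sketch; the 512 kB figure is an arbitrary choice, not something SAM or SGE prescribes.

```python
# Sketch: cap the stack size of new Python threads so a batch system's
# vmem limit is not exhausted by inherited, oversized thread stacks.
import threading

# Must be called before the threads are created; 512 kB is our choice.
threading.stack_size(512 * 1024)

results = []

def worker():
    # Trivial payload just to show the thread runs with the small stack.
    results.append(threading.current_thread().name)

t = threading.Thread(target=worker, name="probe")
t.start()
t.join()
```

This only helps jobs we control; for arbitrary user jobs the fix still belongs in the SGE queue configuration, as noted above.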