12:25, Gerd Behrmann: Received reply from NBI that heap size has been increased.
Gerd Behrmann: Sent mail to NBI asking them to increase the heap size of the pools (I ran out of memory when trying to migrate data to UiO).
Saturday
Gerd Behrmann: The deadlock in dCache was identified and fixed upstream. A new RPM has been built and sent to Hans for immediate upgrade.
Friday
Gerd Behrmann: ftp03_dcsc_ku_dk_1 (atlas_tape) has been migrated to Oslo. On an attempt to migrate another pool, the pools at UiO deadlocked.
Gerd Behrmann: We got word from Daniel Kalici that the data files have been recovered; however, the file names have been lost. We are currently computing checksums of all files. These can be mapped to PNFS IDs. We hope to recover all files next week.
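Note: the checksum-to-PNFS-ID mapping mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure actually used: it assumes Adler-32 checksums and a pre-exported catalogue keyed by (file size, checksum). The names adler32_of, map_recovered_files and the catalogue layout are hypothetical.

```python
import os
import zlib


def adler32_of(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")


def map_recovered_files(paths, catalogue):
    """Match recovered files to PNFS IDs.

    catalogue is a hypothetical dict mapping (size, checksum_hex) to a
    list of PNFS IDs, exported beforehand from the name space.
    Returns (matched, unresolved): matched maps path -> PNFS ID for
    unambiguous hits; unresolved lists paths with zero or several hits.
    """
    matched, unresolved = {}, []
    for path in paths:
        key = (os.path.getsize(path), adler32_of(path))
        ids = catalogue.get(key, [])
        if len(ids) == 1:
            matched[path] = ids[0]
        else:
            unresolved.append(path)
    return matched, unresolved
```

Keying on size as well as checksum reduces the chance of false matches, since Adler-32 is only 32 bits wide; anything left in the unresolved list would still need manual inspection.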
Thursday
12:50, Gerd Behrmann: Pools at UiO are slowly coming back online.
10:50, Gerd Behrmann: Pools have been marked read-only for maintenance.
10:00, Gerd Behrmann: Oslo will deploy a second front-end (se02) for their pools at 11am. se01 will be upgraded to a patched version of dCache that should have better performance on GPFS.
Wednesday
17:00, Leif Nixon: Summary of NBI situation: two pools lost. Alex is working on mapping file names to DQ2 data sets and identifying files that are permanently lost. The remaining NBI pools will be migrated to Oslo, since the SAN issues put them at risk. EGEE broadcast sent to Tier-1 representatives and production site admins.
10:53, Gerd Behrmann: The periodic I/O wait at se01 has been identified as being caused by statistics collection. Metadata operations are slow on GPFS, causing high I/O wait. A patch has been submitted to DESY.
10:42, Gerd Behrmann: Daniel called me on the phone. He is unsure about the source of the problem. Possibly the fibre channel switch has a fault, but he still needs to check that. He wasn't happy about copying the files to another host (at least not as the first step). He will initiate backing up everything to tape and then repair the machine. Depending on the time frame, we can decide whether to stage data back on the same host or on another host. xfs_repair reports thousands of lost inodes, which it moves to lost+found, but strangely enough, when mounting the file system, no files have been moved, and running xfs_repair again makes it report the same files. It is currently unknown on which pools we have lost files. If we are very lucky, only local test pools are affected, but it is too early to tell.
Tuesday
21:22, Leif Nixon: Reply from Daniel Kalici: XFS corruption on ftp02_dcsc_ku_dk_2. Pool taken offline and repairs begun. EGEE broadcast sent.
21:03, Leif Nixon: The pool usage page reports "java.io.IOException: Input/output error" for the ftp02_dcsc_ku_dk_2 pool. Mail sent to DCSC support.
16:17, Leif Nixon: Interesting phenomenon on se01.titan.uio.no; every hour at *:55 throughput stops and the machine enters a state of high I/O wait for about 10 minutes. No explanation so far.
14:56, Leif Nixon: Daniel reports that the troublesome pool machine (ftp02.dcsc.ku.dk) has been restarted and that both its pools eventually came back online. EGEE broadcast sent to announce return to service.
12:52, Leif Nixon: Word of mouth (Daniel (phone)> Gerd (jabber)> me) is that due to administrative mistakes, NBI can no longer su to the dCache user, so they can't restart the offline pool at the moment. ETA down from "some days" to "4-5 hours".
10:54, Leif Nixon: Daniel Kalici reports all NBI ftp node installations "broken". Awaiting clarification of what that means. EGEE broadcast sent to WLCG T1 contacts with cc to Alex Read. Downtime entered in GOCdb.
08:54, Leif Nixon: Gerd traced Andrej's problem to an offline pool at NBI.
Monday
23:28, Leif Nixon: titan pools restarted by Hans Eide, bringing them back online.
21:45, Leif Nixon: The GPFS file system on se01.titan.uio.no hung briefly, bringing the dCache pools offline.
11:50, Leif Nixon: Andrej Filipcic reports hanging transfers from srm.ndgf.org.