Operation-Procedures Pools

Added by Wikiuser -, last edited by Gerd Behrmann on Oct 23, 2008


Abbreviations

  • RO = Resource owner
  • OoD = Operator on Duty

Reporting problems to NDGF

If your pool suffers from problems like hardware failure, file system corruption, spontaneous reboots, etc., please follow these steps:

  • As it says on the cover of The Hitchhiker's Guide to the Galaxy: Don't panic.
  • Notify support@ndgf.org of the problem, giving as many details as you can about the nature of the problem, symptoms etc. Please do not unnecessarily exaggerate the severity of the problem.
  • The OoD will take action based on the nature of the problem. This may involve one or more of the following steps:
    • Announcing service degradation as an EGEE Broadcast, registering it in GOCDB and as a NUNOC ticket
    • Marking the pool read-only
    • Disabling the pool
    • Migrating data to other pools

Please do not announce the problem yourself, as premature, speculative or imprecise announcements tend to do more harm than good.

Shutting down a pool at a site

  • RO notifies the OoD via support@ndgf.org with reason, start time of service interruption and planned duration well ahead of the break.
  • RO waits for OK from OoD

Note that pools read their metadata when starting, which may take 15 to 30 minutes for large pools. Remember to add this time to the announced duration of the planned downtime.

  • OoD enters the degraded service (some files will be unavailable) in GOCDB, which will trigger an email broadcast
  • OoD makes sure the pool is marked read-only (rdonly) 15 minutes before the planned shutdown. To mark a pool read-only, log in to the dCache admin interface on dcache.ndgf.org:
 ssh -c blowfish -p 22223 -l admin localhost

and run

 cd PoolManager
 psu set pool <name> rdonly
 psu ls pool -l

where <name> is the same as the cell name in https://chaperon.ndgf.org:2288/usageInfo, and the last command lists all pools together with their status.

You can verify that the command succeeded by checking that rdOnly=true for that pool.

  • RO emails support@ndgf.org when the pool is available again
  • OoD makes sure the pool is enabled
 cd PoolManager
 psu set pool <name> notrdonly
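
The read-only check described above can be scripted once the output of "psu ls pool -l" has been saved or piped. The following is only a minimal sketch: the function name pool_is_rdonly is hypothetical, and it assumes that each pool is listed on a single line that starts with the pool name and contains either rdOnly=true or rdOnly=false. Check the actual output of your dCache version before relying on it.

```shell
#!/bin/sh
# pool_is_rdonly NAME
# Reads the output of "psu ls pool -l" on stdin and succeeds only if the
# line for pool NAME contains rdOnly=true.
# Assumption: one pool per line, with the pool name as the first field.
pool_is_rdonly() {
    awk -v pool="$1" '
        $1 == pool && /rdOnly=true/ { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Example (psu-ls-pool.txt is a hypothetical file holding the saved output):
#   pool_is_rdonly my_pool_1 < psu-ls-pool.txt && echo "pool is read-only"
```

The function fails both when the pool is listed with rdOnly=false and when the pool is not listed at all, so a missing pool is never mistaken for a read-only one.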

Recovering a pool that hangs or is off-line

A pool may go off-line for many reasons. When it does, it will be marked as "lost" on the "Pool usage" page in the dCache monitor, or disappear completely. If you suspect that the pool hangs (i.e. there are no other obvious explanations like connectivity problems), then email the resource owner (cc to support@ndgf.org) and ask them to

  • Run the following
 /opt/d-cache/bin/dcache dump threads pool
  • Email a copy of the log file to behrmann@ndgf.org
  • Restart the pool
  • Notice that restarting a pool with a lot of data takes a while (approx. 30 minutes)

How to restart individual pool domains

If several pools run on the same host, but in different domains, it is possible to restart them individually. The dcache init script can either start and stop all pools, or be used to restart individual domains.

To start or stop individual pool domains, first list the domains and their status:

 /etc/init.d/dcache status

Identify which domain contains the broken pool, then:

 /etc/init.d/dcache restart pool_nodeDomain0x

Instead of "restart" you can also use "stop" or "start". Replace "pool_nodeDomain0x" with the appropriate domain identified from the status output.
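
To avoid typing mistakes in the action or the domain name, the init script call can be wrapped in a small helper. This is a sketch only: the function pool_domain and the DRY_RUN and DCACHE_INIT variables are hypothetical names introduced for illustration; the only thing taken from this page is the "/etc/init.d/dcache <action> <domain>" invocation.

```shell
#!/bin/sh
# Path to the init script as used on this page; override it for testing.
DCACHE_INIT="${DCACHE_INIT:-/etc/init.d/dcache}"

# pool_domain ACTION DOMAIN
# Runs the init script for a single pool domain. ACTION must be start,
# stop or restart. With DRY_RUN=1 the command is printed, not executed.
pool_domain() {
    action="$1"
    domain="$2"
    case "$action" in
        start|stop|restart) ;;
        *) echo "unknown action: $action" >&2; return 2 ;;
    esac
    if [ -z "$domain" ]; then
        echo "usage: pool_domain start|stop|restart DOMAIN" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$DCACHE_INIT $action $domain"
    else
        "$DCACHE_INIT" "$action" "$domain"
    fi
}
```

Setting DRY_RUN=1 first and calling "pool_domain restart pool_nodeDomain0x" prints the command that would be run, which is a cheap way to double-check the domain name before touching a production pool.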

Moving a pool to a new host

It is possible to move a pool to another host (say, because of fatal hardware failure). It should be noted that even though pool names at NDGF are derived from host names, pools must not be renamed during this procedure (there is a separate procedure for renaming pools).

Do not move pools without notifying NDGF and negotiating a time window for the procedure.

Important: Do not attempt to merge two pool directories! When moving a pool to a new host, that pool will run alongside other pools that may already exist on that host.

A pool is defined by the following items (also read Technical-DCache PoolInstallation):

  • The pool directory. It contains a control/ directory with metadata, a data/ directory with data, and one or more setup files.
  • An entry in the pool_name_n.poollist file stored in the config/ directory of the dCache installation. This entry contains the path to the pool directory, the name of the pool and various settings.
  • An entry in the hostname.domains file stored in the config/ directory of the dCache installation. This file contains an entry per poollist file.

Follow these steps to move a pool. Let's assume you move pool POOL_1 from SOURCE to DEST. Depending on context, SOURCE and DEST may refer to the hostname, the FQDN, or the FQDN with underscores instead of dots. If in doubt, refer to Technical-DCache PoolInstallation.

  • Make sure the pool is shut down (follow the procedure for shutting down a pool).
  • Copy the directory containing POOL_1 from SOURCE to DEST (take care not to overwrite files at DEST!).
  • Copy the SOURCE.poollist file from SOURCE to DEST (if you followed the NDGF pool naming convention, then that file name should already be unique).
  • Edit the DEST.domains file and add a line for SOURCE
  • Restart the pools at DEST

Important: Do not start the pool at SOURCE after you moved the files. The importance of this item cannot be stressed enough!
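
Two of the steps above are easy to get wrong by hand: copying without overwriting anything at DEST, and adding the SOURCE line to DEST.domains exactly once. The helpers below are a minimal sketch to be run on DEST. The function names are hypothetical, and the sketch assumes that the domains file holds one entry per line; the exact form of the entry is described in Technical-DCache PoolInstallation and is not checked here.

```shell
#!/bin/sh
# require_absent PATH
# Fails if PATH already exists, so that a later copy cannot overwrite or
# merge into an existing pool directory or poollist file at DEST.
require_absent() {
    if [ -e "$1" ]; then
        echo "refusing to overwrite existing path: $1" >&2
        return 1
    fi
}

# add_domains_entry FILE ENTRY
# Appends ENTRY as a new line to the domains file unless an identical
# line is already present.
# Assumption: the domains file holds one entry per line.
add_domains_entry() {
    file="$1"
    entry="$2"
    if [ -f "$file" ] && grep -qxF -- "$entry" "$file"; then
        return 0
    fi
    printf '%s\n' "$entry" >> "$file"
}
```

Call require_absent for the target pool directory and for the target poollist file before copying, and abort the move if either call fails. Running add_domains_entry twice with the same entry leaves a single line in the file.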

Upgrading dCache

  • Never upgrade to a different release series unless told to do so by NDGF! All pools need to stay at the same release series (not even dot releases are allowed to get out of sync), although differences in patch level are acceptable.
  • Schedule a time window for the upgrade by following the procedure for shutting down pools
  • Follow the upgrade instructions on Technical-DCache PoolInstallation
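
The rule about release series can be checked mechanically before an upgrade. The sketch below assumes that version strings have the form SERIES-PATCH, for example 1.9.0-3, so that everything before the first "-" is the release series and everything after it is the patch level. This format and the function names are assumptions made for illustration; adjust them to the version strings your packages actually report.

```shell
#!/bin/sh
# release_series VERSION
# Prints the part of VERSION before the first "-".
# Assumption: versions look like SERIES-PATCH, for example 1.9.0-3.
release_series() {
    printf '%s\n' "${1%%-*}"
}

# same_release_series A B
# Succeeds if both versions belong to the same release series, i.e. they
# differ at most in patch level.
same_release_series() {
    [ "$(release_series "$1")" = "$(release_series "$2")" ]
}
```

With these definitions, "same_release_series 1.9.0-3 1.9.0-5" succeeds because only the patch level differs, while "same_release_series 1.9.0-3 1.9.1-1" fails because the dot release differs.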

Tape pool

[this documentation is not complete yet, maswan has all the details]

  • dCache needs both read and write pools. These can be colocated on the same filesystem, or on separate filesystems, depending on the local situation
  • Tape pools should not have lfs=precious in the poollist definition
  • Read pools can have the full filesystem size as pool size
  • Write pools also need space for the "out" directory (size determined by maxusage in endit.conf)
  • For delete to work you need to set hsmCleaner=enabled in dCacheSetup
  • The hsminstance is a unique identifier of the TSM namespace and should be the same in all pools that share the same tsm-node and out directory
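
Two of the settings above can be checked with a short script. This is a sketch under stated assumptions: the function name check_tape_pool_config is hypothetical, both files are assumed to be plain text, and the settings are assumed to appear literally as hsmCleaner=enabled and lfs=precious. The paths to dCacheSetup and to the poollist file are passed in by the caller.

```shell
#!/bin/sh
# check_tape_pool_config SETUP POOLLIST
# Reports violations of two rules from the list above and succeeds only
# if both hold:
#   - SETUP (the dCacheSetup file) contains hsmCleaner=enabled
#   - POOLLIST does not contain lfs=precious
# Assumption: the settings appear literally in plain-text files.
check_tape_pool_config() {
    setup="$1"
    poollist="$2"
    status=0
    if ! grep -q '^[[:space:]]*hsmCleaner=enabled' "$setup"; then
        echo "hsmCleaner=enabled not found in $setup" >&2
        status=1
    fi
    if grep -q 'lfs=precious' "$poollist"; then
        echo "lfs=precious found in $poollist" >&2
        status=1
    fi
    return "$status"
}
```

The check is deliberately simple and does not interpret comments beyond requiring that hsmCleaner=enabled starts a line, so treat a passing result as a hint and not as proof that the tape pool is configured correctly.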