| Author | Gerd Behrmann, NDGF |
| Created | 24th of October, 2006 |
| Last modified | 14th of November, 2006 |
| Document status | Not reviewed |
| Disclaimer | This document represents the author's current understanding of the world. It may be and most likely is inaccurate. |
The following describes initial ideas/plans for NDGF contributions to
dCache. These ideas are a result of internal discussions of our goals
followed by reading through the dCache source code.
Cell communication and authentication
dCache is split into cells (active objects) running in domains (Java VMs). The cells communicate by message passing. Messages are routed between cells using the location manager: At startup each cell registers itself with the location manager and after that the location manager forwards messages between cells.
Communication between cells is unencrypted and unauthenticated. This is unacceptable for deployment over different security domains. It is unclear if host-level authentication is sufficient or whether we require separate authentication of each cell. The latter will require much more administration. Host-level authentication should be easy to obtain by using the host certificate of each host. Communication could be encrypted by using SSL, which is well supported by the Java platform.
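A minimal sketch of the host-level option, assuming the host certificate and the trusted CA certificates have been imported into Java key and trust stores that are handed to the JVM through the standard javax.net.ssl.keyStore and javax.net.ssl.trustStore system properties. The class and method names are made up for illustration and are not part of dCache. The point is only that requiring a client certificate on the accepting side gives mutual host-level authentication on top of encryption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLServerSocket;

// Hypothetical helper; not an existing dCache class.
final class CellTunnelSketch {

    /**
     * Creates an unbound TLS server socket that refuses peers which do not
     * present a certificate. Key and trust material come from the JVM-wide
     * default SSLContext (assumption: configured via system properties).
     */
    static SSLServerSocket newAuthenticatedServerSocket() {
        try {
            SSLContext context = SSLContext.getDefault();
            SSLServerSocket socket = (SSLServerSocket)
                    context.getServerSocketFactory().createServerSocket();
            // Without this, only the accepting host proves its identity.
            socket.setNeedClientAuth(true);
            return socket;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("no default SSL context", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Per-cell authentication would need more than this, e.g. one certificate per cell, which is exactly the administrative overhead mentioned above.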
PNFS
PNFS is an NFS server, which dCache uses to store metadata. Historically, the PNFS file system had to be mounted on all hosts running dCache cells. This requirement has been relaxed in recent releases of dCache. Obviously, we cannot allow NFS to cross security domains; thus at least the following requirements need to be resolved:
- PNFS has to be mounted on pools capable of writing to tape
- PNFS may be required at GridFTP and SRM doors (unconfirmed)
We have to remove these requirements. I wonder if - over time - we could get rid of PNFS completely. dCache will remain a centralized system as long as it depends on PNFS. If we can get rid of PNFS, this may pave the way for a peer-to-peer version of dCache.
SRM: Door selection
The comments below apply to both SRM get and SRM put operations.
The current SRMDoor selects randomly among the least loaded doors
providing the requested protocol (e.g. GridFTP). Such doors however
are not necessarily co-located with storage pools. There is obviously
a good reason for this: The GridFTP door is much more than just a
backend for the SRMDoor. It provides access to the whole logical name
space of dCache and can be used to access any file in the dCache
system.
For active transfers (server opens data connection to client), this
situation is not so bad, since the GridFTP door asks the storage pool
to connect back to the client. Thus data is transferred directly from
its location to the client. For passive transfers (client opens data
connection to the server), data is streamed from the storage pool to
the GridFTP door and from there to the client. If the
storage pool and the GridFTP door are not placed next to each other
(i.e. not in the same room), the transfer overhead may be
substantial.
The existing GridFTP door cannot resolve the aforementioned issues for
passive transfers, as these are inherent limitations of the FTP
protocol.
I see two possible solutions (one being a variation of the other):
- Co-locating GridFTP doors with storage pools
- New GridFTP mover
Co-locating GridFTP doors with storage pools
If the GridFTP door is co-located with the storage pool (either on
the same host or nearby), then all we need to do is to let the SRM
door select the appropriate GridFTP door. Thus for each pool we have
to generate a list of allowed GridFTP doors (or any other kind of
door providing a transfer protocol). The SRM door will then need to
resolve the logical file name to a pool, map the pool to a list of
doors and then select a door with a low load from that list. Rather
than associating the pools with a list of available doors, we could
assign weights to each door influencing the decision. The SRM door
should of course select a pool based on load information and
integrate with the hotspot detection of dCache.
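The weighted variant could look roughly like the following sketch. Everything in it is an assumption made for illustration: the class name, the representation of the pool-to-door table as nested maps, the load being a number where lower is better, and the cost function load / weight. A door that is missing from a pool's table is simply not allowed for that pool.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; not an existing dCache class.
final class DoorSelectionSketch {

    /**
     * Picks a door for a transfer served by the given pool.
     *
     * @param pool        name of the pool the logical file name resolved to
     * @param doorWeights pool name -> (door name -> weight); a higher weight
     *                    marks a door that is nearer to the pool
     * @param doorLoads   door name -> current load (lower is better)
     * @return the allowed door with the lowest load / weight, if any
     */
    static Optional<String> selectDoor(String pool,
                                       Map<String, Map<String, Double>> doorWeights,
                                       Map<String, Double> doorLoads) {
        Map<String, Double> allowed =
                doorWeights.getOrDefault(pool, Collections.emptyMap());
        return allowed.entrySet().stream()
                // Ignore doors without load information or with a non-positive weight.
                .filter(e -> doorLoads.containsKey(e.getKey()) && e.getValue() > 0)
                .min(Comparator.comparingDouble(
                        (Map.Entry<String, Double> e) ->
                                doorLoads.get(e.getKey()) / e.getValue()))
                .map(Map.Entry::getKey);
    }
}
```

With weights 1 for a remote door and 4 for a co-located one, the co-located door is preferred until its load is four times that of the remote door.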
This change by itself is not enough, because the GridFTP door will
itself resolve the logical filename to a list of pools and select
one of them; this may be a different pool than the one the SRM door
selected. Thus the GridFTP server should select a nearby pool.
If a pool is on the same host and a passive transfer is requested, it should
obviously select that pool rather than any other pool (regardless
of the load of that pool - here I assume that the load on the pool
will not be higher than the load on the GridFTP door for acting as a
proxy for the passive transfer). This assumes that the file is
actually online.
If the file is not online at the local pool we have to assume that
the file is not online at any pool - otherwise another pool would
have been selected (I am guessing here). However, the local pool may
not be the correct location to bring the file online (certainly not
in the general case where the GridFTP door was not selected by the
SRM door). Thus the SRM door should probably bring the file online
first and then select a nearby GridFTP door (TODO: Check if the SRM
door is already doing this).
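The pool choice at the GridFTP door described above can be sketched as follows. The Pool class and its fields are hypothetical stand-ins for whatever the pool manager actually reports. An empty result means the file is not online anywhere, in which case the staging decision is left to someone else (the SRM door or the pool manager).

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical helper; not an existing dCache class.
final class PoolChoiceSketch {

    /** Minimal, made-up view of a pool holding (or not holding) the file. */
    static final class Pool {
        final String name;
        final String host;
        final double load;        // lower is better
        final boolean fileOnline; // true if the file is on disk at this pool

        Pool(String name, String host, double load, boolean fileOnline) {
            this.name = name;
            this.host = host;
            this.load = load;
            this.fileOnline = fileOnline;
        }
    }

    /**
     * For passive transfers a pool on the door's own host wins regardless of
     * its load, since the door would otherwise have to proxy the data.
     * Otherwise the least loaded pool holding the file online is chosen.
     */
    static Optional<Pool> choosePool(String doorHost, boolean passive,
                                     List<Pool> candidates) {
        if (passive) {
            for (Pool p : candidates) {
                if (p.fileOnline && p.host.equals(doorHost)) {
                    return Optional.of(p);
                }
            }
        }
        return candidates.stream()
                .filter(p -> p.fileOnline)
                .min(Comparator.comparingDouble((Pool p) -> p.load));
    }
}
```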
New GridFTP mover
Some of the challenges above are caused by the GridFTP door operating on
the logical name space. There is an alternative however. To
understand this one must first realize how data is moved in and out
of a storage pool: Any request to transfer data in or out of a pool
is mapped to a MoverProtocol implementation. This MoverProtocol
implements door specific transfer routines. Thus e.g. the HTTPDoor
transfers data out of the pool by redirecting the client to a
temporary HTTP server implemented by the HTTP MoverProtocol. The
GridFTPDoor on the other hand causes the GridFTP MoverProtocol
implementation to connect directly to the client.
Thus an alternative solution would be to add MoverProtocol
implementations specific to the SRMDoor. For GridFTP, this
MoverProtocol implementation would implement a minimal GridFTP
server providing access to the file being requested. The SRMDoor
would then generate a transfer URL pointing directly to the physical
file on a specific storage pool (in contrast to a URL to a logical
name on a GridFTP door).
The obvious benefit is that even passive transfers can be
implemented with excellent performance. The downside is that we
would need to handle each protocol differently, i.e., we would need
several MoverProtocol implementations. The current approach can be
used as a fallback solution for rarely used protocols.
This solution obviously provides better performance than co-locating the existing GridFTP doors:
Even when the GridFTP door and the storage pool are co-located on the same host,
passive transfers cause data to be moved between the two domains, thus
inducing an increased load (mental note: When GridFTP door and storage
pool are on the same host, the GridFTP MoverProtocol
implementation could handle passive transfers directly!).
GridFTP Features
- Feature: Restart (REST) is not implemented, at least not for RETR. This should be rather easy to add since ERETR with partial retrieve mode is implemented.
- Compatibility: Currently, only the alternative syntax for extended retrieve and extended store is implemented. The syntax described in the April 2003 GridFTP spec. should be implemented.
- Feature: Striped retrieve should be possible to implement by reading from several pools at the same time. The opposite direction is not possible, since dCache cannot stripe files over several pools (right?).
- Minor detail: ac_eretr() calls ac_retr(), however ac_retr() uses RETR as the command name in error messages. The same is the case for STOR.
- Minor detail: GFtpProtocol_1.sendStream(), lines 262 and 415: This test should be moved out after the while loop; otherwise it is non-deterministic whether we throw an exception or we silently jump out of the loop (depending on when the thread is interrupted).
- Minor detail in AbstractFtpDoorV1.java: askForReadPool() can ask for both a read and a write pool, thus the name is rather misleading. Also, RETR does not even use this method; instead it asks the pool manager directly. This seems to be a historic cut'n'paste problem. The same seems to be true for askForFile(); it may be used for transfers in both directions, however only STOR uses it.
- Performance comment: Both HttpConnectionHandler and GFtpProtocol_1 copy a file to a socket by repeatedly reading data from the file into a buffer and writing the buffer to the socket. Using NIO and the transferTo() method of the FileChannel class, the overhead of sending a file can be considerably reduced: transferTo() is potentially (depending on the platform) implemented via the sendfile() system call which avoids copying the data to user space and back to kernel space. The same is probably true in all other movers.
- Additional comment: There is already GFtpProtocol_1_nio, however it does not use transferTo(), thus probably not gaining that much performance.
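- Additional comment (speculative): The counterpart of transferTo() for receiving data is FileChannel.transferFrom(), although whether it avoids the copy through user space for a socket source depends on the platform and the JVM. An alternative approach is to map (parts of) the target file into memory (using a MappedByteBuffer) and read data from the socket directly into that buffer.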
Notice that the main benefits of the NIO changes are not directly
higher throughput, but rather lower CPU load. This however will free
resources that may lead to higher throughput.
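For reference, sending a file with transferTo() boils down to something like the sketch below. The class and method names are made up; only FileChannel.transferTo() itself is the real API. Whether the call ends up in sendfile() depends on the platform and on the target being a socket channel, and the loop assumes a blocking target (a non-blocking one could make it spin).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper; not an existing dCache class.
final class ZeroCopySendSketch {

    /**
     * Sends the whole file to the target channel without an intermediate
     * user-space buffer in our own code.
     *
     * @return the number of bytes sent
     */
    static long sendFile(Path file, WritableByteChannel target) {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = source.size();
            long position = 0;
            while (position < size) {
                // transferTo() may move fewer bytes than requested, so loop.
                position += source.transferTo(position, size - position, target);
            }
            return position;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a mover, the target would be the SocketChannel of the data connection.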
Distributed dCache
Terminology: A site is a non-empty set of hosts, typically situated
physically close to each other. Let S be the set of
sites. Let l in S be a special site which we call the
local site. We call the sites in R = S \ {l} remote sites. Let c be
a client host, which may or may not be in any of the
sites of S.
Ideally, we would want to have a dCache installation at l, which can
use remote pools at sites in R that are either other dCache
installations or simply storage sites providing GridFTP and/or SRM.
Currently, this can only be achieved by using the HSM support in
dCache. l would see sites in R as different tertiary storage systems,
staging data in and out via scripts using, for instance,
globus-url-copy to move the data. Since data has to be staged to local
disk at l before being accessible and since data has to physically
flow through l, l obviously becomes a major bottleneck in the
system. We would need a relatively large number of hosts at l to keep
up with data flowing in and out of the system.
Instead, our idea is to add another pool implementation; let's call it
RemotePool. The RemotePool cell would run on a host at l, but be
configured to store data at a remote site r in R. We would need to
follow the current pool cell interface, including the MoverProtocol
interfaces. Our MoverProtocols would then talk to r and set up the
transfer directly between r and the client c or between r and the door
at l which c uses (in the latter case data would obviously still flow
through l).
We would most likely need a RemotePool implementation for each type of
remote system; communication with the remote site will depend a lot on
whether we have SRM, GridFTP or dCache interfaces. If the remote
system is dCache, we could even have a specialised door on the remote
system (if needed).
Let's assume we have a remote site, r, with an SRM interface. We would
implement an SRMRemotePool cell at l. An SRMDoor at l would talk to
the SRMRemotePool to put data into or get data out of the pool. There
would be an SRMRemotePool specific SRMMover to do the dirty work:
Interesting aspects are bringing data online, performing a space
reservation, putting data, getting data. All these operations can be
performed via the SRM interface of the remote system. For putting or
getting data, the SRM at r would generate a TURL that the SRMDoor at l
can return to the client, c. Thus data movement would be directly
between c and r. Transfer protocol support will be limited to what is
supported by the SRM at r. An HTTPDoor at l would be limited to read
access (like the current door). l would use SRM on r to obtain an HTTP
TURL for direct access to r and return a redirect reply to c. c would
then use the TURL to retrieve the file directly from r. This obviously
only works if r supports HTTP. A GridFTP door can implement direct
transfer between c and r in active mode. l would use SRM at r to
obtain a GridFTP TURL, log in to r using GridFTP and set up an active
transfer asking r to connect directly to c. Passive transfers will
have to go through the GridFTP door at l.
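To summarise the cases above, the following sketch encodes where the data would flow for each kind of door at l. It is only a restatement of the paragraph as code. The enum and method names are invented, and remoteOffersProtocol stands for "the SRM at r hands out a TURL for the protocol this door needs".

```java
// Hypothetical summary of the SRMRemotePool cases; not dCache code.
final class RemoteDataPathSketch {

    enum Door { SRM, HTTP, GRIDFTP_ACTIVE, GRIDFTP_PASSIVE }

    enum DataPath { DIRECT, VIA_LOCAL_SITE, UNSUPPORTED }

    /**
     * @param door                 the door at l that the client c talks to
     * @param write                true for put, false for get
     * @param remoteOffersProtocol whether the SRM at r supports the protocol
     * @return DIRECT if data moves between c and r without passing through l
     */
    static DataPath dataPath(Door door, boolean write, boolean remoteOffersProtocol) {
        if (!remoteOffersProtocol) {
            return DataPath.UNSUPPORTED;
        }
        switch (door) {
            case SRM:
                return DataPath.DIRECT;          // TURL from r is returned to c
            case HTTP:
                // Read-only, like the current door; reads are redirected to r.
                return write ? DataPath.UNSUPPORTED : DataPath.DIRECT;
            case GRIDFTP_ACTIVE:
                return DataPath.DIRECT;          // r connects back to c
            case GRIDFTP_PASSIVE:
                return DataPath.VIA_LOCAL_SITE;  // door at l has to proxy
            default:
                throw new IllegalArgumentException("unknown door: " + door);
        }
    }
}
```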
TODO: I noticed that the pool cell has "proxy support". What is this
about?
Packaging
Packaging for Debian and Ubuntu would be nice to have. If we invest time in
this, we should do it right: The existing RPMs do little more than copy
the files into /opt/d-cache/. Our .deb should perform most steps described in the
basic installation guide, i.e.
- create required PostgreSQL databases (Debian/Ubuntu have infrastructure for letting packages create and initialise new databases)
- configure java paths
- perform basic setup for dCacheSetup and node_config files
Server packages should be divided into at least:
- common files
- admin cells
- pool cells
- srm door
- gridftp door
The admin packages could be further subdivided. This would substantially ease installation on Debian and Ubuntu.
Other (non-tier 1) issues
Threading
It seems that dCache uses a massive number of threads for I/O: Consider moving to
NIO, asynchronous I/O, and a common I/O multiplexing framework decoupling threads from I/O.
In many places threading and functionality are mixed. E.g., the SocketAdapter spawns a new
thread for itself. Code quality would improve if these issues were decoupled, e.g., SocketAdapter
would just contain its functionality and then the client code would decide whether to spawn a new
thread or not (the client code may already run in a task-specific thread, not needing to spawn a
new thread). The client can also choose to submit the task to an executor service (a JDK 1.5 feature),
utilising a thread pool.
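A minimal sketch of the decoupling, with a made-up RelayTaskSketch standing in for something like the SocketAdapter: the task only implements Runnable and knows nothing about threads, and the caller decides how to run it. The counter merely simulates work so that the example is self-contained.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical task; stands in for a class like SocketAdapter.
final class RelayTaskSketch implements Runnable {

    private final int chunks;
    private final AtomicInteger relayed = new AtomicInteger();

    RelayTaskSketch(int chunks) {
        this.chunks = chunks;
    }

    /** Pure functionality: no thread is created in here. */
    @Override
    public void run() {
        for (int i = 0; i < chunks; i++) {
            relayed.incrementAndGet(); // placeholder for relaying one chunk
        }
    }

    int relayed() {
        return relayed.get();
    }

    /** One possible caller policy: run the task on a shared pool and wait. */
    static void runOnPool(ExecutorService pool, Runnable task) {
        try {
            pool.submit(task).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

The same task can be run inline with task.run() when the caller already owns a suitable thread, or with new Thread(task).start() when a dedicated thread is really wanted.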
Message passing and RMI
The cell architecture is rather nice; however, much of the logic is cluttered by message passing
issues being intermixed with functionality. A proper abstraction over the message passing interface
would clean up things. This could be done by hand for each cell by writing stubs on the client side
and skeletons on the server side. To be fair, this seems to be done for some cells, e.g., PnfsHandler
seems to be the client side of the PoolManager. This process can however be automated by using reflection.
Automatic skeleton generation using reflection is already implemented
for string messages (which are typically sent from the CellShell). What I
propose is that the same is done for RMI.
A prototype of such an RMI abstraction over the cell architecture has already been implemented.
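The general idea can be sketched with java.lang.reflect.Proxy. This is not the prototype mentioned above, just an illustration under assumptions: the Call class, the SpaceQuery interface and the transport function are all invented, and the transport stands in for sending a message to the target cell and waiting for the reply.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.function.Function;

// Hypothetical sketch; none of these types exist in dCache.
final class CellRmiSketch {

    /** Example service interface, made up for this sketch. */
    public interface SpaceQuery {
        long freeBytes(String poolName);
    }

    /** The message that travels between cells: which method, which arguments. */
    static final class Call {
        final String method;
        final Class<?>[] parameterTypes;
        final Object[] arguments;

        Call(String method, Class<?>[] parameterTypes, Object[] arguments) {
            this.method = method;
            this.parameterTypes = parameterTypes;
            this.arguments = arguments;
        }
    }

    /** Client side: a generated stub that turns each method call into a Call. */
    @SuppressWarnings("unchecked")
    static <T> T stub(Class<T> iface, Function<Call, Object> transport) {
        InvocationHandler handler = (proxy, method, args) ->
                transport.apply(new Call(method.getName(), method.getParameterTypes(), args));
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface }, handler);
    }

    /** Server side: a generic skeleton that dispatches a Call via reflection. */
    static <T> Object dispatch(Class<T> iface, T target, Call call) {
        try {
            Method method = iface.getMethod(call.method, call.parameterTypes);
            return method.invoke(target, call.arguments);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot dispatch " + call.method, e);
        }
    }
}
```

A client would then write `SpaceQuery q = CellRmiSketch.stub(SpaceQuery.class, transport)` and call `q.freeBytes("pool1")` like a local method, while the cell on the other end passes each incoming Call to dispatch(). Error handling, timeouts and asynchronous replies are left out.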