Interfacing Panda with NorduGrid ARC
With the ATLAS experiment moving to a unified production and analysis system based on the OSG Panda system, we must consider how best to interface Panda and ARC.
The basic assumptions made in the following are:
- The effort to create one single system for all three grid middlewares used in ATLAS is an effort that should be supported. Especially for analysis, having all three grids accessible through a single interface will be a great advantage to physicists.
- NorduGrid ARC has in the past proven to be one of the best-performing grid middlewares. In particular, its decentralized structure has been shown to be robust and scalable, and its mechanisms for staging and caching input files have resulted in a high overall success rate for job completion with minimal use of CPU time at the compute nodes.
- NorduGrid ARC sites running with user pool groups have had system-supported separation between jobs from different grid users, even on shared-memory systems.
The following is a short recap of the Panda system based on https://twiki.cern.ch/twiki/bin/view/Atlas/Panda, which will be used in assessing how to interface Panda and ARC in the most suitable way.
The overall structure of Panda is depicted in the following figure (from [1]): 
This figure looks to some extent quite similar to what we know from ARC. At a site there is a frontend node running some kind of storage element and some kind of information system (site info services and site capability service), but there is no real grid-manager responsible for staging input files, controlling jobs and staging output files.
In order to illustrate the Panda job flow, let's run through a simple example.
- Some job A is submitted and ends up in the task buffer at the Panda server (one open question: is there only one Panda server?).
- Since (as stated in the document on which this text is based) "Data pre-placement is a strict precondition for job execution: jobs are not released for processing until the data arrives at the processing site. When data dispatch completes, jobs are made available to a job dispatcher", the Panda server must decide which site(s) should run job A and start uploading the needed input files to the site.
- All needed input files have arrived at the execution site, and a pilot job starts running the job.
- Upon job completion, the generated output is copied from the site. It is not clear from the text whether this is performed by the Panda server or by the job itself.
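The job flow above can be summarised as a small state machine. The following is only an illustrative sketch, not Panda code: the state names, the `Job` class and the helper functions are all hypothetical and chosen to mirror the four steps in the example. The point it shows is that a pilot cannot pick up a job until every input file has arrived at the site.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional, Set


class JobState(Enum):
    SUBMITTED = auto()    # waiting in the task buffer at the server
    DISPATCHING = auto()  # site chosen, input files being pushed to it
    ACTIVATED = auto()    # all inputs at the site, released to the dispatcher
    RUNNING = auto()      # picked up by a pilot job
    FINISHED = auto()     # output copied away from the site


@dataclass
class Job:
    name: str
    inputs: Set[str]
    site: Optional[str] = None
    state: JobState = JobState.SUBMITTED
    arrived: Set[str] = field(default_factory=set)


def assign_site(job: Job, site: str) -> None:
    """The server picks a site and starts pushing input files to it."""
    job.site = site
    job.state = JobState.DISPATCHING


def file_arrived(job: Job, filename: str) -> None:
    """Record one input file as present; release the job once all are there."""
    job.arrived.add(filename)
    if job.inputs <= job.arrived:
        job.state = JobState.ACTIVATED


def pilot_pickup(job: Job) -> None:
    """A pilot may only start a job whose data pre-placement is complete."""
    if job.state is not JobState.ACTIVATED:
        raise RuntimeError(f"job {job.name} is not released yet")
    job.state = JobState.RUNNING


def finish(job: Job) -> None:
    """Mark the job done; who copies the output is left open, as in the text."""
    job.state = JobState.FINISHED
```

Calling `pilot_pickup` on a job that is still in `DISPATCHING` raises an error, which is the "strict precondition" from the quoted Panda description.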
The most notable differences between the Panda system and NorduGrid ARC are
- Data is pushed in Panda and pulled in ARC.
- Pilot jobs are used in Panda to control which job is executed at a site.
The first difference (data push versus pull) is not that significant. Whether data is pushed to a site by some external entity (this is in fact supported by ARC, by enabling it in arc.conf with some URL substitution to avoid copying the file again at the site later) or pulled by the grid-manager into the session directory makes little difference, as long as we assume that the job itself is not started before all input data is available. One difference is that the pull model allows the site to control the download rate. One can easily imagine that the Panda server, instead of starting the upload of input files to a site, submits an xRSL job description to that site and lets the site download the input files itself.
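To make the pull model concrete, here is a minimal sketch of how a server-side component could generate an xRSL job description that lists the input files by URL, so that the site's grid-manager downloads them itself. The helper `build_xrsl`, the script name `run_job.sh` and all host names and paths are hypothetical; only the xRSL attribute names (`executable`, `jobName`, `inputFiles`, `outputFiles`, `stdout`, `stderr`) are standard ARC attributes. How the executable itself reaches the site is left out of the sketch.

```python
def build_xrsl(executable, job_name, input_files, output_files):
    """Return an xRSL string; input_files/output_files map file name -> URL."""

    def pairs(mapping):
        # xRSL file lists are sequences of ("name" "url") pairs
        return "".join(f'("{name}" "{url}")' for name, url in mapping.items())

    parts = [
        f'(executable="{executable}")',
        f'(jobName="{job_name}")',
        f"(inputFiles={pairs(input_files)})",
        f"(outputFiles={pairs(output_files)})",
        '(stdout="stdout.txt")',
        '(stderr="stderr.txt")',
    ]
    return "&" + "".join(parts)


# Hypothetical example: one input pulled by the site, one output uploaded.
xrsl = build_xrsl(
    executable="run_job.sh",
    job_name="panda-job-A",
    input_files={"input.root": "srm://se.example.org/atlas/input.root"},
    output_files={"output.root": "srm://se.example.org/atlas/output.root"},
)
print(xrsl)
```

With this approach the job is still not started before all input data is available, but the site rather than the central server decides when and how fast the files are transferred.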
The second difference, then: what do pilot jobs imply? It is often said that the advantage of pilot jobs is that they make it easier to realize some kind of semi-fixed load share between types of jobs and an inter-job priority. This is, however, not really the case.
- First, observe that it is the central queue, and the fact that only the central queue is in use, that makes these two things easier, not the pilot job.
- Secondly, it is only true if a pilot job gets to run. Remember, however, that a job can only be allocated to a site if the input data is available there. This means that there is a limited number of sites where a job can possibly run, and that the decision on which site to upload files to is made some time before the job gets to run. On a site dedicated to ATLAS this is not necessarily a big problem, but if the site is dedicated to ATLAS, it is equally easy to keep the local queue short and submit jobs from the central queue. On sites supporting more than ATLAS, pilot jobs do not really improve the situation, as jobs from other user groups can prevent the pilots from running.
- Furthermore, it is not clear whether the Panda way of running pilots provides any system-supported separation between different grid users. (Note from gronager: It does not. This is left to the underlying grid system: OSG/LCG/ARC.)
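The first two points can be illustrated with a small sketch of a central priority queue that tops up a short local queue at a site, without any pilot jobs. This is an assumption-laden toy model, not Panda or ARC code: `CentralQueue`, `refill_site` and the `max_queued` limit are hypothetical names. It shows that the ordering by priority comes from the central queue itself, provided the site queue is kept short.

```python
import heapq
import itertools


class CentralQueue:
    """Single central queue ordered by priority (higher runs first)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, job_name, priority):
        # heapq is a min-heap, so store the negated priority
        heapq.heappush(self._heap, (-priority, next(self._counter), job_name))

    def pop_highest(self):
        """Return the highest-priority job name, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def refill_site(queue, site_queue, max_queued):
    """Top up a site's local queue to at most max_queued jobs."""
    while len(site_queue) < max_queued:
        job = queue.pop_highest()
        if job is None:
            break
        site_queue.append(job)
    return site_queue


central = CentralQueue()
central.submit("low", priority=1)
central.submit("high", priority=10)
central.submit("mid", priority=5)

# Only two jobs are handed to the site; the rest stay in the central queue.
site = refill_site(central, [], max_queued=2)
print(site, len(central))
```

Because the site queue holds only a couple of jobs at a time, a high-priority job submitted later still overtakes everything that remains in the central queue.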
The conclusion is therefore that the Panda way of managing jobs has no real advantages compared to NorduGrid ARC, and that the interfacing between Panda and ARC should be based on ARC mechanisms. <list advantages of arc way of doing thing>
Other issues
- Use of LFC - not an issue. ARC can be compiled with LFC support, and it is reported to work.
- Detector conditions data - can be supported through the caching mechanism in ARC, by direct upload to a local SE, or perhaps by using a dynamic RTE.