MonaLisa-SGAS Client notes

Added by Csaba Anderlik, last edited by Csaba Anderlik on Dec 05, 2007

MonaLisa Client monitoring ALICE JobAgents (JAs) running on NDGF sites  (documentation still under development ...)

The implementation follows the same principle as the SGAS JARM component and monitors all the JAs running at the NDGF sites: Aalborg, CSC, DCSC_KU, Jyvaskyla, LUNARC, NSC, UiB. Whenever a JA changes status to DONE, a corresponding JobUsageRecord is created as an XML file on the local disk. At regular intervals these XML files are concatenated and published to the NDGF LUTS:
*https://i099.hpc2n.umu.se:8443/wsrf/services/sgas/LUTS*
using a modified version of the sgas-publish utility from the sgas-client.jar package. For publishing to work, the usual X509 security framework needs to be in place: the CA certificate files and a valid proxy certificate whose DN is allowed to administer the NDGF LUTS.
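
The aggregation step described above can be sketched in a few lines of Python. This is only an illustration and not the code shipped in the package: the function name, the directory layout, and in particular the wrapper element name `UsageRecords` are assumptions.

```python
import glob
import os
import xml.etree.ElementTree as ET


def aggregate_records(record_dir, out_path, wrapper_tag="UsageRecords"):
    """Merge every *.xml JobUsageRecord found in record_dir into one file.

    wrapper_tag is an assumed name for the enclosing element; the real
    client may use a different one. Returns the number of records merged.
    """
    root = ET.Element(wrapper_tag)
    for path in sorted(glob.glob(os.path.join(record_dir, "*.xml"))):
        record = ET.parse(path).getroot()
        # Skip any XML file that is not a single job usage record.
        if record.tag == "JobUsageRecord":
            root.append(record)
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
    return len(root)
```

Parsing each file before merging, rather than concatenating raw text, means that a truncated or malformed record raises an error instead of silently producing an invalid set.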

All the above components are contained in the MLSGASclient-04.12.2007.tar.gz package.

Structure of the directory after unpacking the client

./src: contains the source file for the client and the compile.sh script to compile it
./conf: contains the configuration file for the client
./bin: contains the client binary and the scripts for running the client, updating the libraries, and publishing the XML files to the NDGF LUTS
./lib: contains the supporting libraries from MonaLisa
./sgas: contains the SGAS components (all the libraries plus one script for publishing). The sgas-client-2.0.jar in sgas/lib carries a modified version of the class se.sgas.client.Publish that allows publishing UsageRecordSets from XML files to the LUTS
./log: the default location where the client stores the JobUsageRecord XML files; its subdirectory holds the aggregated and published UsageRecordSets
./certificates: contains the NorduGrid CA certificate files

 
Installation only requires unpacking the archive (untar/unzip). To run the client, cd to the bin directory and execute the run.sh script.

The directory in which the XML UsageRecords are stored, the logging level, and the names of the sites to be monitored can all be set in the client's configuration file (./conf/App.properties). By default the records are stored in the log subdirectory.

The information in ML is structured using so-called predicates. Each predicate has the following syntax:

(farm name, cluster name, node name, time interval, actual parameter name)

The client monitors the following parameters related to the *_JobAgent nodes:
cpu_*, run_*, mem_*, job_id (here * is the usual wildcard).
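
The wildcard selection above can be illustrated with a small Python sketch. It does not use the MonaLisa API; `Predicate`, `WATCHED_PARAMETERS`, and `is_watched` are hypothetical names, and the field order simply mirrors the predicate syntax quoted above.

```python
from collections import namedtuple
from fnmatch import fnmatchcase

# Field order follows the predicate syntax quoted above.
Predicate = namedtuple("Predicate", "farm cluster node interval parameter")

# Parameter patterns the client watches, as listed above.
WATCHED_PARAMETERS = ("cpu_*", "run_*", "mem_*", "job_id")


def is_watched(node, parameter):
    """True if a value reported for this node/parameter is of interest."""
    return fnmatchcase(node, "*_JobAgent") and any(
        fnmatchcase(parameter, pattern) for pattern in WATCHED_PARAMETERS
    )
```

For example, `is_watched("NSC_JobAgent", "cpu_time")` is true, while a parameter such as `disk_free` or a node whose name does not end in `_JobAgent` is ignored (both names here are made up for the example).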

Whenever the status of a JA becomes 5.0, meaning DONE, the client creates an SGAS-style JobUsageRecord whose fields are filled in from the last values of the parameters recorded for that JA. These values are cumulative. Monitoring JAs rather than individual jobs is more appropriate, because it also accounts for the resources an agent uses while waiting for jobs, not only for the resource usage of the individual jobs.
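
A minimal Python sketch of this trigger is shown here. The function and argument names are illustrative, and the field mapping (Charge taken from run_ksi2k, WallDuration from run_time, record id built from site name and start time in milliseconds) is inferred from the sample record on this page rather than taken from the client source.

```python
DONE = 5.0  # ja_status value that the text above maps to DONE


def build_usage_record(site, start_ms, last_values):
    """Return JobUsageRecord fields for a finished JobAgent, or None.

    last_values holds the most recent (cumulative) parameter values
    reported by the JobAgent, keyed by parameter name.
    """
    if last_values.get("ja_status") != DONE:
        return None  # agent has not finished, nothing to record yet
    record_id = "%s_%d" % (site, start_ms)
    return {
        "recordId": record_id,
        "GlobalJobId": record_id,
        "Status": "DONE",
        "Charge": last_values["run_ksi2k"],
        "WallDuration": last_values["run_time"],
    }
```

Because the reported values are cumulative, only the last set of values is needed; no summation over earlier readings is performed.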

A sample JobUsageRecord is given below:

<JobUsageRecord>
  <RecordIdentity recordId="NSC_1196328334976"/>
  <JobIdentity>
    <GlobalJobId>NSC_1196328334976</GlobalJobId>
    <LocalJobId>NSC_n59.bluesmoke.nsc.liu.se:8085</LocalJobId>
  </JobIdentity>
  <Charge>16304.568000000001</Charge>
  <Status>DONE</Status>
  <UserIdentity>
    <GlobalUserId>aliprod</GlobalUserId>
    <LocalUserId>aliprod</LocalUserId>
  </UserIdentity>
  <WallDuration>18056.0</WallDuration>
  <NodeCount>1</NodeCount>
  <StartTime>2007-11-29T10:25:34.976Z</StartTime>
  <SGASStartTime>1196328334976</SGASStartTime>
  <ProjectName>alice</ProjectName>
  <SGASProjectName>alice</SGASProjectName>
  <SubmitHost/>
  <Host>NSC_n59.bluesmoke.nsc.liu.se:8085</Host>
  <Queue/>
  <SGASLog>cpu_usage=0.4, Id=NSC_n59.bluesmoke.nsc.liu.se:8085, ja_status=5.0, Site=NSC, run_ksi2k=16304.568000000001, Status=DONE, JobIds=[8073916.0], run_time=18056.0, cpu_ksi2k=15962.331, mem_usage=3.9000000000000004, cpu_time=17677.0</SGASLog>
</JobUsageRecord>
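
To check a stored record by hand, a record like the one above can be read back with a short Python sketch. The helper names are hypothetical, and the key=value layout of the SGASLog element is assumed to be as in the sample.

```python
import re
import xml.etree.ElementTree as ET

# One "key=value" item; a bracketed list such as JobIds=[...] is kept whole.
_LOG_ITEM = re.compile(r"(\w+)=(\[[^\]]*\]|[^,]*)")


def parse_sgas_log(text):
    """Split the SGASLog string into a dict of string values."""
    return {key: value.strip() for key, value in _LOG_ITEM.findall(text)}


def summarize_record(xml_text):
    """Extract a few fields from one JobUsageRecord document."""
    record = ET.fromstring(xml_text)
    log = parse_sgas_log(record.findtext("SGASLog", default=""))
    return {
        "id": record.find("RecordIdentity").get("recordId"),
        "wall_s": float(record.findtext("WallDuration")),
        "cpu_s": float(log["cpu_time"]) if "cpu_time" in log else None,
    }
```

Comparing `cpu_s` with `wall_s` gives a quick view of how much of an agent's lifetime was spent using the CPU.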

To publish the records to the LUTS, the script ./bin/publish.sh needs to be run at regular intervals (daily is fine). Change to the bin directory before running the script.
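
If the publishing step is driven from a scheduler, the working-directory requirement can be handled by a small wrapper. The following Python sketch is an assumption about how one might do this, not part of the package; `run_publish` and `client_root` are hypothetical names, and publish.sh is assumed to be executable.

```python
import os
import subprocess


def run_publish(client_root):
    """Run bin/publish.sh with bin as the working directory.

    Returns the script's exit code; a non-zero value means the
    publishing step failed and should be investigated.
    """
    bin_dir = os.path.abspath(os.path.join(client_root, "bin"))
    script = os.path.join(bin_dir, "publish.sh")
    # cwd= takes the place of the manual "cd bin" step.
    result = subprocess.run([script], cwd=bin_dir)
    return result.returncode
```

Calling this once a day from the scheduler of choice matches the interval suggested above. A valid proxy certificate still has to be available at that time, as described for the publishing step.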