Summary of the 2004 status of GRIDPP funded contributions to CMS

Introduction

The GRIDPP-funded, UK-based activities contributing to the CMS computing model fall into three areas: workload management, data replication and application monitoring. This document summarises the status of each area following a major data challenge undertaken by CMS during 2004. More details on the status and future prospects can be found in a number of recent publications, which are linked from these pages.

Data Challenge 2004 (DC04)

In 2004 CMS undertook a large-scale challenge aiming to simulate production and distribution of data at 25% of the full start-up rate of the LHC. A second aspect of the challenge was to analyse the Monte Carlo data produced as a result. A great deal of valuable experience was gained from the use of prototype tools; the UK contribution to data replication and analysis is summarised below. The LCG grid was used as the testbed for the final versions of the tools to which the UK contributed.

Data replication from T0 to T1 to T2 regional centres

Data transfer from the T0 centre at CERN used three distinct strategies: the LCG-2 replica manager, native Storage Resource Manager (SRM) and the Storage Resource Broker (SRB) system. During DC04 the LCG-2 distribution chain showed reasonable overall performance. The SRB chain did not reach production quality, for a number of technical reasons: the metadata catalogue was unavailable for some periods, there were bugs in the code, and there were problems dealing with large numbers of small files. More positively, it was demonstrated that sustained T0 to T1 data transfer rates of 80 Mb per second could be achieved for large files. The experience of the SRM chain, which provided adequate access to data based in the USA, was encouraging and provided a useful model for future developments in this area. PhEDEx, a new CMS data replication project in which the UK plays a leading role, builds on the DC04 experience. It is currently feeding back into the EGEE project, influencing its development of catalogues and the further development of SRM.

Batch analysis of CMS data on LCG2

A software package, known as GROSS, was produced as part of the CMS prototyping effort to deploy physics analysis tools across a computing Grid. The aim was to provide end users with a single interface enabling them to run analysis tasks on the Grid. GROSS provides for multi-user analysis, and stores persistent metadata associated with these tasks in a database. Typically a user wishes to analyse a data set (for example W+jet events) which will in practice be split across hundreds of individual experimental runs. GROSS handles the splitting of the single user task into individual jobs and archives these sub-jobs for later submission to LCG (or indeed a local batch system). Following the production of data sets in DC04, GROSS was successfully tested on data stored in the UK, Italy and Taiwan. The GROSS code and documentation have recently been made available, and GROSS informs the current development of the final CMS production job run-time environment.
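
The splitting step described above can be illustrated with a minimal Python sketch. It is not GROSS code and does not reflect the GROSS interface: the names SubJob and split_task, the "archived" status and the run numbers are all hypothetical, chosen only to show one task being turned into one sub-job per run and held for later submission.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubJob:
    """One unit of work covering a single run (hypothetical structure)."""
    task_id: int
    run_number: int
    status: str = "archived"  # held until submitted to LCG or a local batch system


def split_task(task_id: int, run_numbers: List[int]) -> List[SubJob]:
    """Create one sub-job per distinct run in the requested data set."""
    return [SubJob(task_id=task_id, run_number=run)
            for run in sorted(set(run_numbers))]


# Hypothetical example: a data set spread over five runs.
jobs = split_task(task_id=1, run_numbers=[1003, 1001, 1002, 1005, 1004])
print(len(jobs), "sub-jobs, first run:", jobs[0].run_number)
```

In a real system the sub-job records would be stored in a database, as the text describes for the task metadata, rather than held in memory.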

Monitoring the Grid Jobs on LCG2

R-GMA is a monitoring and information management system for distributed resources. We have been evaluating it as a possible mechanism to allow users to monitor the progress of their jobs submitted to Grid resources. In particular, we have been working closely with the developers of R-GMA in the EDG and EGEE projects to provide the performance needed for full production during LHC running. Recently R-GMA has become part of the standard LCG release, and we have been measuring its performance and assisting with testing its deployment on the LCG2 Grid.

Our conclusions (testing is still in progress) are that the version released on the LCG 2.2.0 test-bed is likely to be suitable as a means of sending back the messages that are produced when standard CMS jobs, submitted using BOSS, are run on remote resources. Fully operational implementations of R-GMA are currently being rolled out onto LCG, and we are optimistic that it will achieve adequate performance under the full message rate expected (3000 simultaneous jobs, each with a duration of about 10 hours and each producing 74 messages).
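
To give a feel for the load implied by these figures, the short Python calculation below converts them into an average message rate. It assumes, purely for illustration, that all 3000 jobs run over the same 10 hour window and that each job spreads its messages evenly over that time; real traffic is likely to be burstier, so the result is a rough average and not a peak requirement.

```python
JOBS = 3000               # simultaneous jobs (from the text)
MESSAGES_PER_JOB = 74     # messages produced by each job (from the text)
JOB_DURATION_HOURS = 10   # approximate job duration (from the text)

total_messages = JOBS * MESSAGES_PER_JOB
duration_seconds = JOB_DURATION_HOURS * 3600

# Assumption: messages are spread evenly over the whole job duration.
average_rate = total_messages / duration_seconds

print(f"{total_messages} messages over {duration_seconds} s, "
      f"about {average_rate:.1f} messages per second on average")
```

Under these assumptions the total is 222000 messages over 36000 seconds, or roughly 6 messages per second.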

Publications