Testing OSG Storage Elements

Introduction

A client similar to the one used for the Grid Client test described in TestDQ2Client080725 is used here to test the different Grid sites in US-ATLAS. I'm trying to emulate the conditions of a Grid user sitting far from everything, submitting jobs with Pathena and trying to retrieve the output files of her/his jobs (and possibly some other files).

First, I give a personal evaluation of the configuration described in TiersOfATLASCache as of today (with some suggestions for future changes). Then I try to submit a simple Athena job (hoping that it runs fine everywhere) and to retrieve the output files back to my client machine.

Checking TiersOfATLASCache

This file describes the configuration of all ATLAS SEs (Storage Elements). The URL is: http://atlas.web.cern.ch/Atlas/GROUPS/DATABASE/project/ddm/releases/TiersOfATLASCache.py

For each SE the file gives the base URL used for external access to the files in its catalog. Some statistics about these URLs (a rough sketch of how such counts can be reproduced follows the list):
  • there are 370 SRM entries
  • 253 use the extended URL for SRM v2 (/srm/managerv2?SFN), none uses it for SRM v1 (/srm/managerv1?SFN)
  • 18 use the Bestman version of the extended URL for SRM v2 (/srm/v2/server?SFN), none uses it for SRM v1 (/srm/v1/server?SFN)
  • 142 specify explicitly port 8443
  • 127 specify explicitly port 8446
  • roughly 100 URLs (the remaining ones) do not specify a port
  • 264 use a token. Some token names are ATLASDATADISK, ATLASDATA, DATADISK, ATLASMCDISK, MCDISK, ATLASUSER
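
A rough sketch of how such counts can be reproduced is below. This is my own illustration, not the exact procedure used for this page: it scans the raw text of TiersOfATLASCache.py with regular expressions, and the patterns (in particular the token:NAME: prefix convention) are assumptions.

import re
import urllib.request

# Scan the raw ToA text for srm:// endpoint strings and count the patterns
# discussed above. Regular expressions over the raw text are an assumption,
# not the official way to read the file (which is a Python module).
TOA_URL = ("http://atlas.web.cern.ch/Atlas/GROUPS/DATABASE/project/"
           "ddm/releases/TiersOfATLASCache.py")
text = urllib.request.urlopen(TOA_URL).read().decode("utf-8", "replace")

srm_urls = re.findall(r"srm://[^'\"\s]+", text)
no_port = [u for u in srm_urls if ":" not in u.split("://", 1)[1].split("/", 1)[0]]

print("SRM entries:           %d" % len(srm_urls))
print("SRM v2 extended URLs:  %d" % sum("/srm/managerv2?SFN" in u for u in srm_urls))
print("Bestman SRM v2 URLs:   %d" % sum("/srm/v2/server?SFN" in u for u in srm_urls))
print("explicit port 8443:    %d" % sum(":8443" in u for u in srm_urls))
print("explicit port 8446:    %d" % sum(":8446" in u for u in srm_urls))
print("no explicit port:      %d" % len(no_port))
# Space tokens, assuming the 'token:NAME:srm://...' prefix convention
print("entries with a token:  %d" % len(re.findall(r"token:[A-Za-z0-9_]+:srm://", text)))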

Some recommendations:
  • add the port number to all URLs that lack one. Even if SRM supports a default port, some clients do not; problems are likely when trying to retrieve files registered in SEs whose ToA entry has no port number
  • if an SE supports only SRM v1 or only SRM v2, it is recommended to use the extended URL with the correct manager (see the example after this list)
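
As an illustration of the last recommendation (hypothetical host and path, not an actual ToA entry), here is the same file referenced by a short SRM URL and by the extended SRM v2 URL with an explicit port:

srm://se.example.edu/pnfs/example.edu/atlasdatadisk/somefile.root
srm://se.example.edu:8443/srm/managerv2?SFN=/pnfs/example.edu/atlasdatadisk/somefile.root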

Test 1: file retrieval

Some of the most commonly requested files are the FDR and DBRelease files.

In this test I used dq2-get to retrieve an FDR file:
dq2-get -L UCT3 -s %(site_from)s %(prot_par)s -f fdr08_run2.0052304.physics_Jet.merge.AOD.o3_f8_m10._0020.1 fdr08_run2.0052304.physics_Jet.merge.AOD.o3_f8_m10 >& dq2gettest.log
and collected timing information both with the time command and by measuring how long the execution took to return. The commands were obtained by substituting the parameters in the string above: %(site_from)s is the site from which the dataset is fetched, %(prot_par)s is either empty or specifies the optional protocol suite for dq2-get (-p lcg or -p srm in the test, see the summary spreadsheet). Each command was executed in a subprocess, one at a time, and killed if not completed after 10 minutes (a sketch of this driver loop is shown after the list below). Other test info:
  • host tier2-06.uchicago.edu, on the UofC campus but separate from the Tier2 cluster
  • clients wlcg-client 0.13rc and dq2-clients 0.1.17
  • SEs tested (see spreadsheet) are the ones in the USA from the list returned by dq2-ls for the dataset: SWT2_CPB_DATADISK, SLACXRD_DATADISK, WISC, AGLT2_DATADISK, MWT2_DATADISK, NET2_DATADISK, BNL-OSG2_DATADISK, BNLXRDHDD1
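
A minimal sketch of the driver loop described above follows. It is my own illustration, not the original test script: the command template, site list, protocol options and 10-minute limit come from the text, while the use of Python's subprocess module and the output layout are assumptions.

import subprocess
import time

# Run each dq2-get command in its own subprocess, one at a time, time it,
# and kill it if it has not finished after 10 minutes.
DATASET = "fdr08_run2.0052304.physics_Jet.merge.AOD.o3_f8_m10"
FDRFILE = DATASET + "._0020.1"
CMD_TMPL = ("dq2-get -L UCT3 -s %(site_from)s %(prot_par)s "
            "-f " + FDRFILE + " " + DATASET)
SITES = ["SWT2_CPB_DATADISK", "SLACXRD_DATADISK", "WISC", "AGLT2_DATADISK",
         "MWT2_DATADISK", "NET2_DATADISK", "BNL-OSG2_DATADISK", "BNLXRDHDD1"]
PROTOCOLS = ["", "-p lcg", "-p srm"]   # "" lets dq2-get choose its default protocol

for site in SITES:
    for prot in PROTOCOLS:
        cmd = CMD_TMPL % {"site_from": site, "prot_par": prot}
        start = time.time()
        try:
            proc = subprocess.run(cmd, shell=True, capture_output=True,
                                  text=True, timeout=600)
            status = "rc=%d" % proc.returncode
        except subprocess.TimeoutExpired:
            status = "TIMEOUT"
        print("%-22s %-10s %-8s %6.1f s" % (site, prot or "(default)",
                                            status, time.time() - start))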

Results are summarized in the attached spreadsheet (sitetest080806.csv, sitetest080806.ods):
  • only one test timed out (probably SE problems; I was able to retrieve the same file in 1m10s on 8/7/08)
  • some transfers failed, reporting [Errno 2] No such file or directory: 'fdr08_run2.0052304.physics_Jet.merge.AOD.o3_f8_m10/fdr08_run2.0052304.physics_Jet.merge.AOD.o3_f8_m10._0020.1' in the log file (even though the dq2-get exit code was 0)
  • lcg-cp works only with complete URLs (no BDII involved) and is generally faster (3-4 MB/s)
  • srmcp is generally slower (2 MB/s) but also works with short URLs
  • srmcp from UofC was much faster (11 MB/s), but the path is entirely within the campus

Test 2: ATLAS job and output retrieval

ATLAS jobs have been submitted to all available analysis sites in Panda (names starting with ANALY_).
  • CE used: ANALY_SLAC, ANALY_SWT2_CPB, ANALY_NET2, ANALY_OU_OCHEP_SWT2, ANALY_AGLT2, ANALY_MWT2_SHORT, ANALY_MWT2, ANALY_BNL_ATLAS_1, default
  • The job submitted was a simple Pathena evgen job (from the Pathena twiki): no input files, ATLAS release 14.1.0, short job, 1 output file and 1 log file
  • Each job produced 3 datasets:
    • datasetname_sub02196186 and datasetname_shadow: two or more temporary datasets that are used for the job and end up empty at the end
    • datasetname: the only one that counts for the output; the name is the one specified on the pathena command line. This dataset is registered as incomplete at the CE where the job ran
  • Output dataset names are user08.MarcoMambelli.test.evgen.080813.xfer._CEname_

Results:
CE name              SE name         # files  dq2-get  -p lcg  -p srm
ANALY_SLAC           SLACXRD            2        2        2       2
ANALY_SWT2_CPB       SWT2_CPB           2        2        2       2
ANALY_NET2           BU                 2        2        2       0
ANALY_OU_OCHEP_SWT2  OU                 2        0        0       0
ANALY_AGLT2          AGLT2_PRODDISK     -        -        -       -
ANALY_MWT2_SHORT     MWT2_UC            2        2        2       2
ANALY_MWT2           MWT2_UC            2        2        2       2
ANALY_BNL_ATLAS_1    BNLPANDA           2        0        0       2
default              SWT2_CPB           2        2        2       2

  • # files as reported by: dq2-ls -L UCT3 -f dsname (a sketch of this check follows this list)
  • Job to ANALY_AGLT2 failed with error: Put error: No such file or directory: 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 bytes 0.00 KB/sec avg 0.00 KB/sec instglobus_ftp_client_state.c:globus_i_ftp_client_response_callback:3616: the server respo
  • srmcp from BU hung for a long time (tens of minutes)
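
A sketch of the '# files' check is below (illustration only: the dq2-ls -L UCT3 -f command and the dataset-name prefix come from the text; appending the CE name literally and reading the count from the listing are assumptions):

import subprocess

# Build the output dataset name for each CE and list its files as seen
# from UCT3; the '# files' column in the table above is read from this output.
DS_PREFIX = "user08.MarcoMambelli.test.evgen.080813.xfer."
CES = ["ANALY_SLAC", "ANALY_SWT2_CPB", "ANALY_NET2", "ANALY_OU_OCHEP_SWT2",
       "ANALY_AGLT2", "ANALY_MWT2_SHORT", "ANALY_MWT2", "ANALY_BNL_ATLAS_1",
       "default"]

for ce in CES:
    dsname = DS_PREFIX + ce
    result = subprocess.run("dq2-ls -L UCT3 -f " + dsname, shell=True,
                            capture_output=True, text=True)
    print("=== %s ===" % dsname)
    print(result.stdout.strip() or result.stderr.strip())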

Summary

To summarize the results of the tests I prepared a PowerPoint presentation for the facilities meeting of 08/12: USATLAS-SEreport.pdf

-- MarcoMambelli - 28 Jul 2008
