Testing OSG Storage Elements
Introduction
A client similar to the one used for the Grid Client test described in
TestDQ2Client080725 is used here to test the different Grid sites in US-ATLAS.
I'm trying to emulate the situation of a Grid user sitting far from everything, submitting jobs with Pathena and then retrieving the output files of her/his jobs (and maybe some other files).
First, I give a personal evaluation of the configuration described in TiersOfATLASCache as of today (and some suggestions for future changes).
Then I submit a simple Athena job (hoping that it runs fine everywhere) and
try to retrieve the output files back to my client machine.
Checking TiersOfATLASCache
This file describes the configuration of all ATLAS SEs (Storage Elements).
The URL is:
http://atlas.web.cern.ch/Atlas/GROUPS/DATABASE/project/ddm/releases/TiersOfATLASCache.py
For each SE, the file contains the URL used as the base entry for external access to all files in the catalog of that SE.
Some statistics about these URLs:
- there are 370 SRM entries
- 253 use the extended URL for SRM v2 (/srm/managerv2?SFN); none uses it for SRM v1 (/srm/managerv1?SFN)
- 18 use the Bestman version of the extended URL for SRM v2 (/srm/v2/server?SFN); none uses it for SRM v1 (/srm/v1/server?SFN)
- 142 explicitly specify port 8443
- 127 explicitly specify port 8446
- the remaining 101 URLs do not specify a port
- 264 use a token. Some token names are ATLASDATADISK, ATLASDATA, DATADISK, ATLASMCDISK, MCDISK, ATLASUSER
Some recommendations:
- add the port number to all URLs that lack one. Even if SRM supports a default port, some clients do not, so problems are likely when trying to retrieve files registered in SEs that have no port number in ToA
- if a SE supports only SRM v1 or only SRM v2, it is recommended to use the extended URL with the correct manager
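The port check recommended above can be automated. The following is a minimal sketch (not part of the DDM tools) that flags SRM URLs without an explicit port; the example URLs are made up for illustration, while real entries would come from TiersOfATLASCache.py:

```python
# Sketch: flag ToA-style SRM URLs that do not carry an explicit port.
# Example URLs below are invented; real ones come from TiersOfATLASCache.py.
from urllib.parse import urlsplit

def has_explicit_port(srm_url):
    """Return True if the SRM URL specifies a port number explicitly."""
    # urlsplit parses the netloc generically, so .port works for srm:// too
    return urlsplit(srm_url).port is not None

urls = [
    "srm://se.example.edu:8443/srm/managerv2?SFN=/pnfs/example.edu/atlas/",
    "srm://se.example.edu/srm/managerv2?SFN=/pnfs/example.edu/atlas/",
]
for u in urls:
    if not has_explicit_port(u):
        print("no port:", u)
```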
Test1: file retrieval
Some of the most common files are the FDR and the DBRelease files.
In this test I used dq2-get to retrieve an FDR file:
dq2-get -L UCT3 -s %(site_from)s %(prot_par)s -f fdr08_run2.0052304.physics_Jet.merge.AOD.o3_f8_m10._0020.1 fdr08_run2.0052304.physics_Jet.merge.AOD.o3_f8_m10 >& dq2gettest.log
and collected timing information both using the time command and by measuring the time for the execution to return.
The commands were obtained by substituting the parameters in the string above: %(site_from)s is the site from which the dataset is fetched; %(prot_par)s is empty or specifies the optional protocol suite option for dq2-get (-p lcg or -p srm in the test, see the summary spreadsheet).
Each command was executed in a subprocess, one at a time, and killed if not completed within 10 minutes.
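The harness logic just described can be sketched in Python as follows. The template and parameter names mirror the command above; run_with_timeout is an illustrative helper of mine, not part of the dq2 clients, and actually invoking the built command of course requires dq2-get to be installed:

```python
# Sketch of the test harness: fill the dq2-get command template, run it in
# a subprocess, and kill it after a 10-minute timeout. run_with_timeout is
# a hypothetical helper, not part of the dq2 clients.
import shlex
import subprocess
import time

TEMPLATE = ("dq2-get -L UCT3 -s %(site_from)s %(prot_par)s "
            "-f fdr08_run2.0052304.physics_Jet.merge.AOD.o3_f8_m10._0020.1 "
            "fdr08_run2.0052304.physics_Jet.merge.AOD.o3_f8_m10")

def run_with_timeout(cmd, timeout=600):
    """Run cmd; return (returncode, elapsed seconds). rc is None on timeout."""
    start = time.time()
    proc = subprocess.Popen(shlex.split(cmd),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.STDOUT)
    try:
        rc = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # give up on transfers stuck past the timeout
        proc.wait()
        rc = None
    return rc, time.time() - start

# Build one concrete command (running it needs the dq2 clients installed):
cmd = TEMPLATE % {"site_from": "MWT2_DATADISK", "prot_par": "-p srm"}
```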
Other test info:
- host tier2-06.uchicago.edu. On the UofC campus but separate from the Tier 2 cluster
- clients: wlcg-client 0.13rc and dq2-clients 0.1.17
- SEs tested (see spreadsheet) are the ones in the USA from the list returned by dq2-ls for the dataset: SWT2_CPB_DATADISK, SLACXRD_DATADISK, WISC, AGLT2_DATADISK, MWT2_DATADISK, NET2_DATADISK, BNL-OSG2_DATADISK, BNLXRDHDD1
Results are summarized in the attached spreadsheets (sitetest080806.csv, sitetest080806.ods):
- only one test timed out (probably SE problems; I was able to retrieve the file in 1m10s on 8/7/08)
- some transfers failed, reporting [Errno 2] No such file or directory: 'fdr08_run2.0052304.physics_Jet.merge.AOD.o3_f8_m10/fdr08_run2.0052304.physics_Jet.merge.AOD.o3_f8_m10._0020.1' in the log file (but the dq2-get exit code was 0)
- lcg-cp works only with complete URLs (no BDII involved) and is generally faster (3-4 MB/s)
- srmcp is generally slower (2 MB/s) but also works with short URLs
- srmcp from UofC was much faster (11 MB/s), but the path is entirely within the campus
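Since dq2-get can return exit code 0 even when a file is missing (the [Errno 2] case above), a script driving these transfers should verify the expected files on disk rather than trust the exit code. A minimal sketch of such a check, using the dataset and file names from this test (the helper itself is my illustrative addition, not part of dq2-get):

```python
# Sketch: after a dq2-get run, verify that the expected files actually
# landed in the dataset directory, since the exit code alone is unreliable.
import os

def missing_files(dataset_dir, filenames):
    """Return the subset of filenames not present in dataset_dir."""
    return [f for f in filenames
            if not os.path.isfile(os.path.join(dataset_dir, f))]

dataset = "fdr08_run2.0052304.physics_Jet.merge.AOD.o3_f8_m10"
wanted = ["fdr08_run2.0052304.physics_Jet.merge.AOD.o3_f8_m10._0020.1"]
gone = missing_files(dataset, wanted)
if gone:
    print("transfer incomplete, missing:", gone)
```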
Test2: ATLAS job and output retrieval
ATLAS jobs have been submitted to all available analysis sites in Panda (names starting with ANALY_).
- CE used: ANALY_SLAC, ANALY_SWT2_CPB, ANALY_NET2, ANALY_OU_OCHEP_SWT2, ANALY_AGLT2, ANALY_MWT2_SHORT, ANALY_MWT2, ANALY_BNL_ATLAS_1, default
- The job submitted was a simple Pathena evgen job (from the Pathena twiki): no input files, ATLAS release 14.1.0, short job, 1 output file and 1 log file
- Each job produced 3 datasets:
  - datasetname_sub02196186 and datasetname_shadow, two or more temporary datasets that are used for the job and end up being empty at the end
  - datasetname, the only one that counts for the output; the name is the one specified in the pathena command line. This dataset is registered as incomplete in the CE where the job ran
- Output dataset names are user08.MarcoMambelli.test.evgen.080813.xfer._CEname_
Results:
CE name             | SE name        | # files | dq2-get | -p lcg | -p srm
ANALY_SLAC          | SLACXRD        | 2       | 2       | 2      | 2
ANALY_SWT2_CPB      | SWT2_CPB       | 2       | 2       | 2      | 2
ANALY_NET2          | BU             | 2       | 2       | 2      | 0
ANALY_OU_OCHEP_SWT2 | OU             | 2       | 0       | 0      | 0
ANALY_AGLT2         | AGLT2_PRODDISK |         |         |        |
ANALY_MWT2_SHORT    | MWT2_UC        | 2       | 2       | 2      | 2
ANALY_MWT2          | MWT2_UC        | 2       | 2       | 2      | 2
ANALY_BNL_ATLAS_1   | BNLPANDA       | 2       | 0       | 0      | 2
default             | SWT2_CPB       | 2       | 2       | 2      | 2
- # files is the count reported by: dq2-ls -L UCT3 -f dsname
- Job to ANALY_AGLT2 failed with error:
Put error: No such file or directory: 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 bytes 0.00 KB/sec avg 0.00 KB/sec instglobus_ftp_client_state.c:globus_i_ftp_client_response_callback:3616: the server respo
- srmcp from BU has been hanging for a while (tens of minutes)
Summary
To summarize the results of the test I prepared a PowerPoint presentation for the facilities meeting of 08/12:
USATLAS-SEreport.pdf
--
MarcoMambelli - 28 Jul 2008