Evaluation of Direct Access

Introduction

Tests were conducted from uct3-edge5, in the directory /ecache/marco/wdir_pathena/, using an ATLAS release Pathena setup modified to run skim jobs, according to

Setup

Pathena is already installed and patched. This is just to set up the environment to run it:
# Set up ATLAS release 13.0.40.3 (AtlasProduction) and its runtime environment
. /osg/app/atlas_app/atlas_rel/13.0.40/cmtsite/setup.sh -tag=13.0.40.3,AtlasProduction
. /osg/app/atlas_app/atlas_rel/13.0.40/AtlasProduction/13.0.40.3/AtlasProductionRunTime/cmt/setup.sh
# Set up the OSG client (grid tools)
. /share/osg-client/setup.sh
# Put the working directory first in CMTPATH
cd /ecache/marco/wdir_pathena/
export CMTPATH=`pwd`:${CMTPATH}
# CVS access and the grid setup script used by Pathena
export CVSROOT=:ext:mambelli@atlas-sw.cern.ch:/atlascvs
export CVS_RSH=ssh
export PATHENA_GRID_SETUP_SH=/share/osg-client/setup.sh
# Set up and build PandaTools, which provides pathena
pushd PhysicsAnalysis/DistributedAnalysis/PandaTools/cmt
. ./setup.sh
make
...
popd

ROOT 5.18 is available in ~antonk/usr/root

Pathena invocations

Here are some examples of Pathena invocations. A complete list with all the job IDs is attached here.
> pathena -v --inputFileList=filelist.txt --shipInput --outDS=user.MarcoMambelli.test.tag.080324._01 AnalysisSkeleton_topOptions.py

===================
 JobID  : 32
 Status : 0
  > build
    PandaID=8822419
  > run
    PandaID=8822420

> pathena -v --inputFileList=filelist.txt --shipInput --outDS=user.MarcoMambelli.test.tag.080324._01 --site=ANALY_AGLT2 AnalysisSkeleton_topOptions.py
> pathena -v --inputFileList=filelist.txt --shipInput --outDS=user.MarcoMambelli.test.tag.080324._02 --site=ANALY_AGLT2 AnalysisSkeleton_topOptions.py

Checking job status

Panda monitoring (Marco Mambelli's analysis jobs): http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?ui=user&name=Marco%20Mambelli

Retrieving the output

Using dq2_get

After setting up DQ2 I noticed that the grid commands were no longer available in the environment. Something changed the environment; this could be a problem worth investigating.
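One way to narrow down which setup step drops the grid commands is to check for them after each step. The following is only a sketch: the function name missing_cmds is hypothetical, and the commands checked in the usage comment are simply the ones used on this page.

```shell
# Hypothetical helper: print each of the given commands that is NOT found in
# the current environment; return non-zero if at least one is missing.
missing_cmds() {
    local rc=0
    local cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            rc=1
        fi
    done
    return $rc
}

# Usage, in the shell being debugged (run the check after each setup step):
#   . /share/osg-client/setup.sh; missing_cmds grid-proxy-init globus-url-copy
#   . /share/dq2/dq2.sh;          missing_cmds grid-proxy-init globus-url-copy
```

If the second check prints a command that the first one did not, the DQ2 setup is the step that removed it (for example by resetting PATH).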

Opened a new terminal
. /share/osg-client/setup.sh 
. /share/dq2/dq2.sh 
grid-proxy-init 
> cd /ecache/marco/wdir_pathena/tag_pathena/output/
> cd 080324/

Transfer example
[uct3-edge5] /ecache/marco/wdir_pathena/tag_pathena/output/080324 >  dq2_get -rv user.MarcoMambelli.test.tag.080324._07 
2 files are missing in the local SE
http://gk02.atlas-swt2.org:8000/dq2/lrc/PoolFileCatalog
lfns=user.MarcoMambelli.test.tag.080324._07.AANT._00001.root+user.MarcoMambelli.test.tag.080324._07._8823055.log.tgz
(0, '\n\n\n\n\n  \n\n  \n\n  \n\n  \n\n  \n\n  \n    \n      \n    \n    \n      \n    \n    \n    \n    \n    \n    \n  \n\n  \n    \n      \n    \n    \n      \n    \n    \n    \n    \n    \n    \n  \n\n\n')
globus-url-copy gsiftp://gk01.atlas-swt2.org/xrd/dq2/user.MarcoMambelli.test.tag.080324._07_sub01430593/user.MarcoMambelli.test.tag.080324._07.AANT._00001.root file:////ecache/marco/wdir_pathena/tag_pathena/output/080324/./user.MarcoMambelli.test.tag.080324._07.AANT._00001.root
globus-url-copy gsiftp://gk01.atlas-swt2.org/xrd/dq2/user.MarcoMambelli.test.tag.080324._07_sub01430593/user.MarcoMambelli.test.tag.080324._07._8823055.log.tgz file:////ecache/marco/wdir_pathena/tag_pathena/output/080324/./user.MarcoMambelli.test.tag.080324._07._8823055.log.tgz
Done
Total:2 - Failed:0

Results

Some jobs, like 8823055 at SWT2_CPB (and likewise the job executed at MWT2), produce a small output file; the AANT contains a CollectionTree with few entries, and no plot can be made from it. These jobs have a log file like this, with many errors such as:
Error in : file /xrd/dq2/fdr08_run1/AOD/fdr08_run1.0003070.StreamEgamma.merge.AOD.o1_r12_t1/fdr08_run1.0003070.StreamEgamma.merge.AOD.o1_r12_t1._0001.2 does not exist
Error in : file /xrd/dq2/fdr08_run1/AOD/fdr08_run1.0003070.StreamEgamma.merge.AOD.o1_r12_t1/fdr08_run1.0003070.StreamEgamma.merge.AOD.o1_r12_t1._0001.2 does not exist
Error in : file /xrd/dq2/fdr08_run1/AOD/fdr08_run1.0003070.StreamEgamma.merge.AOD.o1_r12_t1/fdr08_run1.0003070.StreamEgamma.merge.AOD.o1_r12_t1._0001.2 does not exist
...
Error in : file /xrd/dq2/fdr08_run1/AOD/fdr08_run1.0003070.StreamEgamma.merge.AOD.o1_r12_t1/fdr08_run1.0003070.StreamEgamma.merge.AOD.o1_r12_t1._0002.1 does not exist
Error in : file /xrd/dq2/fdr08_run1/AOD/fdr08_run1.0003070.StreamEgamma.merge.AOD.o1_r12_t1/fdr08_run1.0003070.StreamEgamma.merge.AOD.o1_r12_t1._0002.1 does not exist
There are plenty of such lines, but they all seem to complain about the same 2 missing files. In fact, the PoolFileCatalog.xml includes only 2 files.
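To confirm that the many error lines reduce to a handful of distinct files, the log can be filtered as sketched below. The function name missing_inputs is hypothetical, and the pattern assumes the "file <path> does not exist" wording shown in the excerpt above; the log must already be extracted from the .log.tgz archive.

```shell
# Hypothetical helper: list the distinct files reported as missing in a job log.
# Assumes error lines of the form "...: file <path> does not exist".
missing_inputs() {
    grep 'does not exist' "$1" \
        | sed -e 's/^.*: file //' -e 's/ does not exist.*$//' \
        | sort -u
}

# Usage (the log file name is a placeholder):
#   missing_inputs job.log
```

For the jobs with the small output, this should print one line per missing AOD file.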

Other jobs, like 8823051 at NET2, produce a bigger output file that appears complete in ROOT and yields some plots. These jobs have log files like this, which do not include the errors above.

I think the jobs with the small output file are unable to find the AOD files used as input because direct input access is not working at those sites (or the references used by Pathena are wrong). However, instead of terminating with an error, they end with an empty output file and exit code 0.

Pathena requires that all files accessed using back-navigation (the AODs for the skim are considered back-navigation from the TAG file) be directly accessible (on a shared file system). They are not prestaged in the run directory as is done for the normal input files.
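Whether the back-navigation files are really reachable on a given node can be checked against the catalog, as in the sketch below. It rests on assumptions: the function name unreadable_pfns is hypothetical, the catalog is assumed to hold one <pfn ... name="..."/> element per line, and the PFNs are assumed to be plain file system paths (as in the /xrd/dq2/... errors above) rather than URLs.

```shell
# Hypothetical helper: print every PFN listed in a POOL file catalog that is
# not readable as a local path on this node.
# Assumes one <pfn ... name="..."/> element per line.
unreadable_pfns() {
    sed -n 's/.*<pfn [^>]*name="\([^"]*\)".*/\1/p' "$1" |
    while IFS= read -r pfn; do
        [ -r "$pfn" ] || echo "$pfn"
    done
}

# Usage, from the job's run directory on a worker node:
#   unreadable_pfns PoolFileCatalog.xml
```

An empty result means every listed file is readable from that node; any path printed is a candidate for the "does not exist" errors.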

Summary

Summary of job completion:
  • ANALY_MWT2: 40, succ (user.MarcoMambelli.test.tag.080324._08), copied, small file
  • ANALY_SWT2_CPB: 39, succ (user.MarcoMambelli.test.tag.080324._07), copied, small file
  • ANALY_NET2: 38, succ (user.MarcoMambelli.test.tag.080324._06), copied, OK
  • ANALY_LONG_BNL_ATLAS: 37, succ (user.MarcoMambelli.test.tag.080324._05), 0 length
  • ANALY_UTA: 36, ? (blank) (user.MarcoMambelli.test.tag.080324._04), ERROR : no constituent files
  • ANALY_SLAC: 35, succ (user.MarcoMambelli.test.tag.080324._03), ERROR : SLACXRD LRC returned invalid response (knows the names but they are not found)
  • ANALY_AGLT2 (2 jobs): 33, 34, succ (both) (user.MarcoMambelli.test.tag.080324._02/01) dq2_get hanging? OK
  • default (ANALY_BNL_ATLAS_1): 32, succ (user.MarcoMambelli.test.tag.080324._01) dq2_get hanging

I'm surprised that jobs 32 and 33 both completed successfully: their output datasets have the same name. In any case, I was not able to retrieve the output.
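A reused output dataset name can be caught before submission if the invocations are kept in a file. The sketch below assumes one pathena command per line; the function name duplicate_outds and the file name in the usage comment are hypothetical.

```shell
# Hypothetical helper: print every --outDS value that appears more than once
# in a file of pathena command lines (one invocation per line).
duplicate_outds() {
    sed -n 's/.*--outDS=\([^ ]*\).*/\1/p' "$1" | sort | uniq -d
}

# Usage (the file name is a placeholder):
#   duplicate_outds pathena_commands.txt
```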

All jobs (except one) seem to complete successfully (exit code 0), but several of them actually failed because they could not find some input files (merge AOD, referred to as direct input by Pathena using back-navigation). The ntuple they produce seems empty.

Only 2 CEs produced what seems to be valid data: ANALY_NET2 and ANALY_AGLT2 (in the run where I was able to retrieve the output).

The successful completion of the Panda job is not a significant indication that the real job completed successfully (all jobs were successful according to Panda).
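Since the Panda status alone is not enough, a retrieved job can be given a basic sanity check such as the sketch below. Everything in it is an assumption made for illustration: the function name output_looks_valid, the optional minimum size, and the use of the "does not exist" message as the failure marker. It does not open the file in ROOT, so it cannot tell whether the CollectionTree has a sensible number of entries.

```shell
# Hypothetical sanity check: succeed only if the output file is at least
# min_bytes long (default 1) and the already extracted job log reports no
# missing input files.
output_looks_valid() {
    local out="$1"
    local log="$2"
    local min_bytes="${3:-1}"
    local size
    [ -f "$out" ] || { echo "no output file: $out"; return 1; }
    size=$(( $(wc -c < "$out") ))
    [ "$size" -ge "$min_bytes" ] || { echo "output too small: $size bytes"; return 1; }
    if grep -q 'does not exist' "$log"; then
        echo "log reports missing input files"
        return 1
    fi
    echo "OK"
}

# Usage (file names are placeholders):
#   output_looks_valid output.AANT.root job.log 100000
```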

-- MarcoMambelli - 24 Mar 2008
Topic revision: r3 - 25 Mar 2008, MarcoMambelli