Execution of Analysis jobs using TAG selection on ANALY_MWT2

Introduction

Jobs were submitted using Pathena (from uct3-edge5) and ran from 1/12 to 1/13/2009. The resulting files were checked using the DQ2 end-user clients.
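For reference, a minimal sketch of the kind of Pathena submission used from uct3-edge5; the job options file and dataset names come from this page, while the exact flags are an assumption:

pathena AODtoDPDvDecmm2.py \
    --inDS fdr08_run2.0052283.physics_Jet.merge.TAG.o3_f47_m26 \
    --outDS user09.MarcoMambelli.test.090112.tag.bm1f.XX \
    --site ANALY_MWT2 \
    --nFilesPerJob 1    # one run job per input file (XX is the per-submission suffix)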

Job description

The job executed is an example available in DPD production, but it uses TAGs for event selection:
  • Each Pathena job has 1 build job.
  • Each job is split into 1 job per input file; this makes 21 jobs (the input dataset has 21 files).
  • Each job copies one input TAG file with dccp to the run directory (in /scratch).
  • The AOD file is available (in PoolFileCatalog.xml) and accessed from the SE using dcap: dcache:/pnfs/uchicago.edu/...
  • There is one AOD file for each TAG file.
  • Each job runs AODtoDPDvDecmm2.py using ATLAS rel 14.4.0 on the events selected from the TAG file.
  • The algorithm processes the events and also does further filtering before writing the output ("ttbarFilter" to accept events).
  • At the end each job copies the output file back to USERDISK (using lcg-cp); the data movement is sketched after this list.
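A hedged sketch of that per-job data movement; every path, file name and variable below (e.g. $RUNDIR) is an illustrative assumption, not the exact value used:

# Stage the input TAG file into the run directory with dccp:
dccp /pnfs/uchicago.edu/path/to/someTAG.root /scratch/$RUNDIR/
# The AOD is not copied: Athena opens it directly over dcap, via the
# dcache:/pnfs/... entry registered in PoolFileCatalog.xml.
# After the run, the output DPD is copied back to USERDISK with lcg-cp:
lcg-cp file:///scratch/$RUNDIR/someDPD.root \
    srm://uct2-dc1.uchicago.edu/path/to/USERDISK/someDPD.root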


Some changes were necessary in Schedconfig to make the execution of Pathena jobs possible using the LFC and indirect access to files (references from TAGs or back-navigation):
proxy='donothide' 
copysetup=srm://uct2-dc1.uchicago.edu(:[0-9]+)*(/srm/managerv2?SFN=)*/pnfs/^dcache:/pnfs/^False^False
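
The copysetup value tells the pilot how to turn the SRM replica URLs registered in the LFC into dcap paths for direct access. A minimal illustration of that rewrite, assuming the first copysetup field acts as a pattern matched against the URL and the second as its replacement (the sample SURL is hypothetical):

echo 'srm://uct2-dc1.uchicago.edu:8443/srm/managerv2?SFN=/pnfs/uchicago.edu/data/someAOD.root' \
    | sed -E 's#srm://uct2-dc1\.uchicago\.edu(:[0-9]+)*(/srm/managerv2\?SFN=)*/pnfs/#dcache:/pnfs/#'
# -> dcache:/pnfs/uchicago.edu/data/someAOD.root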

Input Dataset

Each of the 100 Pathena jobs used the same input dataset:

fdr08_run2.0052283.physics_Jet.merge.TAG.o3_f47_m26
  • total files: 21
  • total size: 44.6 MB (44654379 bytes)
  • avg file size: 2.1 MB
Indirectly, however, the real input dataset is fdr08_run2.0052283.physics_Jet.merge.AOD.o3_f47_m26:
  • 21 files
  • total size: 37.5 GB (37530871801 bytes)
  • avg file size: 1.8 GB
  • 270615 events
  • it was read by each job that passed the build phase (see the DQ2 check below)
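These file counts and sizes can be retrieved with the DQ2 end-user clients mentioned in the introduction; a sketch, assuming the dq2-ls file-listing option -f:

dq2-ls -f fdr08_run2.0052283.physics_Jet.merge.TAG.o3_f47_m26    # TAG dataset
dq2-ls -f fdr08_run2.0052283.physics_Jet.merge.AOD.o3_f47_m26    # AOD dataset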

Results

Queries to the MySQL DB at BNL had to be changed (different server and structure, possibly due to the migration to the CERN DB).

From the Panda Monitor I get:
  • Panda jobs: 2361
    • finished: 2299
    • failed: 62

Build jobs completed in about 5 minutes; run jobs took from 11 minutes to about 2 hours.

CPU Use (kSI2k seconds)
Job                  AVG     min   Max    Total
finished RUN jobs    288.1   98    683    596114
failed RUN jobs      80.6    17    347    2498

The CPU types are Quad-Core AMD Opteron(tm) Processor 2350 (512 KB cache) and Dual-Core AMD Opteron(tm) Processor 285 (1024 KB cache).

Wall Time Use (seconds)
Job                  AVG      min   Max     Total
finished RUN jobs    6694.7   812   29302   13851408
failed RUN jobs      2622.3   306   18390   81292

Failed jobs

Of the 62 failed jobs:

The Athena crashes all occurred on uct2-c185.mwt2.org and were due to a failure to access the AOD file using dcap (dcache:/pnfs/...). The dcap library was missing on the node; the problem has been fixed (by Charles on 1/19).

The other errors come from different nodes and seem to be transient errors due to load: requests for the same file succeeded in other jobs, writing the output to the same directory succeeded, and other jobs from those nodes succeeded.

Output files

Each Pathena job completing successfully reads one dataset with 2 files and accesses another file from the AOD dataset; it produces 3 datasets, one output directory, 21 root files and 21 log files:
  • One dataset is used for the input files (DSname_shadow) and has no replicas at the end of the job.
  • The other 2 datasets contain the same 42 files (21 root files and 21 log files): DSname and DSname_subXXX.
  • Most of the root files are around 4 MB (except the last one of each job).
  • Log file sizes vary (and they are generally smaller).
  • Below are statistics for the whole sample.
  • Estimated total events written: 270K (26M read events, ttbar filter, excluding failures).
  • The input DS was read by each job that passed the build phase.
  • File sizes are always measured in MB (10^6 bytes) unless otherwise specified.

File type     AVG   min   Max    Total
Root files    4.5   2.5   13.0   9409.0

Output datasets like user09.MarcoMambelli.test.090112.tag.bm1f.XX have (DQ2 commands to verify them follow the list):
  • 42 files (21 DPD and 21 log files)
  • total size: 99.7 MB (99670041 bytes = 95380659 DPD + 4289382 log)
  • avg file size: DPD 4.5 MB, log 0.2 MB
  • ~2700 events
  • avg event size: 34.5 KB
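A sketch of that verification with the DQ2 end-user clients, assuming the dq2-ls options -f (file listing) and -r (replica listing), with XX standing for the per-submission suffix as above:

dq2-ls -f user09.MarcoMambelli.test.090112.tag.bm1f.XX    # files and sizes
dq2-ls -r user09.MarcoMambelli.test.090112.tag.bm1f.XX    # sites holding replicas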

Some queries and commands

Jobs per output dataset:

mysql> select destinationDBlock, count(*)  from jobsArchived_Dec2008 where prodUserID='/DC=org/DC=doegrids/OU=People/CN=Marco Mambelli 325802/CN=proxy' and computingSite='ANALY_MWT2' and destinationDBlock like 'user09.MarcoMambelli.test.090112.tag.bm1f%' group by destinationDBlock;

Transformations used by these jobs:

mysql> select distinct transformation from jobsArchived_Dec2008 where prodUserID='/DC=org/DC=doegrids/OU=People/CN=Marco Mambelli 325802/CN=proxy' and computingSite='ANALY_MWT2' and destinationDBlock like 'user09.MarcoMambelli.test.090112.tag.bm1f%';

CPU consumption statistics by job status (source of the CPU Use table above):

mysql> select jobStatus, count(*), avg(cpuConsumptionTime), min(cpuConsumptionTime), max(cpuConsumptionTime), sum(cpuConsumptionTime) from jobsArchived_Dec2008  where prodUserID='/DC=org/DC=doegrids/OU=People/CN=Marco Mambelli 325802/CN=proxy' and computingSite='ANALY_MWT2' and destinationDBlock like 'user09.MarcoMambelli.test.090112.tag.bm1f%' group by jobStatus;

Wall time statistics by job status (source of the Wall Time Use table above):

mysql> select jobStatus, count(*), avg(endTime-startTime), min(endTime-startTime), max(endTime-startTime), sum(endTime-startTime)  from jobsArchived_Dec2008  where prodUserID='/DC=org/DC=doegrids/OU=People/CN=Marco Mambelli 325802/CN=proxy' and computingSite='ANALY_MWT2' and destinationDBlock like 'user09.MarcoMambelli.test.090112.tag.bm1f%'  and endTime!=0 and startTime!=0  group by jobStatus;


Events processed per job, extracted from the last "done processing" line in each job's Athena stdout:

for i in tarball_PandaJob_231034*; do grep "done processing"  $i/athena_stdout.txt| tail -n 1; done > ptp
awk '{print $10}' ptp                       # events processed by each job
awk '{total+=$10}END{print total}' ptp      # total events processed


-- MarcoMambelli - 29 Jan 2009