Execution of Analysis jobs using TAG selection on ANALY_MWT2
Introduction
Jobs were submitted using Pathena (from uct3-edge5).
Jobs ran from 1/12/2009 to 1/13/2009.
File results were checked using the DQ2 end-user clients.
Job description
The job executed is an example available in DPD production, but it uses TAGs for event selection (a submission sketch follows the list below):
- Each Pathena job has 1 build job.
- Each job is split into 1 sub-job per input file, which makes 21 jobs (the input dataset has 21 files).
- Each sub-job copies one input TAG file with dccp to the run directory (in /scratch).
- The AOD file is available (in PoolFileCatalog.xml) and is accessed from the SE using dcap: dcache:/pnfs/uchicago.edu/...
- There is one AOD file for each TAG file.
- It runs AODtoDPDvDecmm2.py (ATLAS release 14.4.0) on the events selected from the TAG file.
- The algorithm processes the events and also applies further filtering before writing the output ("ttbarFilter" to accept events).
- At the end it copies the output file back to the USERDISK (using lcg-cp).
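The exact submission command is not recorded on this page; the following is only a minimal sketch, assuming the standard Pathena options --inDS, --outDS, --site and --nFilesPerJob were used with the job options named above (--nFilesPerJob 1 matches the 1-sub-job-per-input-file splitting):
# Sketch of a submission (not the exact command used); XX is the per-submission index
pathena AODtoDPDvDecmm2.py \
  --inDS fdr08_run2.0052283.physics_Jet.merge.TAG.o3_f47_m26 \
  --outDS user09.MarcoMambelli.test.090112.tag.bm1f.XX \
  --nFilesPerJob 1 --site ANALY_MWT2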
Some changes were necessary in SchedConfig to make it possible to execute Pathena jobs using the LFC and indirect access to files (references from TAGs or back-navigation); the copysetup rewrite is illustrated below the settings:
proxy='donothide'
copysetup=srm://uct2-dc1.uchicago.edu(:[0-9]+)*(/srm/managerv2?SFN=)*/pnfs/^dcache:/pnfs/^False^False
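The copysetup value is a caret-separated list: the first two fields look like an SRM SURL pattern and the dcap prefix that replaces it (so that SURLs resolved through the LFC become directly readable dcap paths), and the two trailing fields are boolean pilot flags. A sketch of the rewrite, using a made-up SURL as input:
# Illustrative rewrite implied by the copysetup pattern (the SURL below is a made-up example)
echo 'srm://uct2-dc1.uchicago.edu:8443/srm/managerv2?SFN=/pnfs/uchicago.edu/example/AOD.pool.root' \
  | sed -E 's#srm://uct2-dc1\.uchicago\.edu(:[0-9]+)*(/srm/managerv2\?SFN=)*/pnfs/#dcache:/pnfs/#'
# -> dcache:/pnfs/uchicago.edu/example/AOD.pool.root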
For each of the 100 Pathena jobs the input dataset is fdr08_run2.0052283.physics_Jet.merge.TAG.o3_f47_m26 (file counts and sizes were checked with the DQ2 end-user clients; see the sketch after this list):
- total files: 21
- total size: 44.6 MB (44654379)
- avg 2.1 MB
Indirectly, however, the real input dataset is fdr08_run2.0052283.physics_Jet.merge.AOD.o3_f47_m26:
- 21 files
- total size: 37.5 GB (37530871801)
- avg 1.8 GB
- 270615 events
- it has been read by each job (that passed the build phase)
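A sketch of the kind of listing used to check these numbers with the DQ2 end-user clients, assuming the dq2-ls client and its -f option for per-file listings:
# List the files (with sizes) of the TAG input dataset and of the AOD dataset read through back-navigation
dq2-ls -f fdr08_run2.0052283.physics_Jet.merge.TAG.o3_f47_m26
dq2-ls -f fdr08_run2.0052283.physics_Jet.merge.AOD.o3_f47_m26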
Results
Queries to the MySQL DB at BNL had to be changed (different server and structure; this may be due to the migration to the CERN DB).
From the Panda Monitor I get:
- Build jobs completed in about 5 minutes.
- Run jobs took from 11 minutes to about 2 hours.
CPU Use (kSI2k seconds)
| Job | AVG | min | Max | Total |
| finished RUN Jobs | 288.1 | 98 | 683 | 596114 |
| failed RUN Jobs | 80.6 | 17 | 347 | 2498 |
CPU types are:
- Quad-Core AMD Opteron(tm) Processor 2350 (512 KB cache)
- Dual Core AMD Opteron(tm) Processor 285 (1024 KB cache)
Wall Time Use (seconds)
| Job | AVG | min | Max | Total |
| finished RUN Jobs | 6694.7 | 812 | 29302 | 13851408 |
| failed RUN Jobs | 2622.3 | 306 | 18390 | 81292 |
Failed jobs
Of the failed jobs:
- 26 failed because of a dccp timeout while getting the input file.
- The others got:
- 4 (1+3) log put errors:
- pilot: Log put error: No such file or directory: 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 byte ddm: Adder._updateOutputs() XML is inconsistent with filesTable exe: Put error: Error in copying the file from job workdir to localSE
- http://panda.cern.ch:25880/server/pandamon/query?job=23057658
- (23098503) pilot: Log put error: No such file or directory: 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 byte exe: Put error: Error in copying the file from job workdir to localSE
- 32 Athena crashes
The Athena crashes were all on uct2-c185.mwt2.org and were due to failures to access the AOD file using dcap (dcache:/pnfs/...). The dcap library was missing on the node; the problem has been fixed (by Charles on 1/19). A quick check for this kind of problem is sketched below.
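A minimal check, assuming the dcap client library is installed as a shared library registered with the dynamic linker (library names may differ per installation):
# Check whether a dcap client library is visible on the worker node
ldconfig -p | grep -i dcap || echo "no dcap library registered with the dynamic linker"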
The other errors come from different nodes and seem to be transient errors due to load: requests for the same file succeeded in other jobs, writing the output to the same directory succeeded, and other jobs succeeded from those nodes.
Output files
Each Pathena job that completes successfully reads one input dataset (the 21 TAG files), accesses the corresponding files from the AOD dataset, and produces 3 datasets, one output directory, 21 root files and 21 log files:
- One dataset is used for the input files (DSname_shadow) and has no replicas at the end of the job
- The other 2 datasets contain the same 42 files (21 root files and 21 log files): DSname and DSname_subXXX
- most of the root files are around 4MB (except the last one of the job)
- log files size varies (and are generally smaller)
- below are statistics for the whole sample
- Estimated total events written: 270K (26M read events, ttbar filter, excluding failures)
- The input DS has been read by each job (that passed the build phase)
- File sizes are always measured in MB (10^6 bytes) unless otherwise specified
Output datasets like user09.MarcoMambelli.test.090112.tag.bm1f.XX (checked with the DQ2 clients; see the sketch after this list) have:
- 42 files (21 DPD and 21 log files)
- total size: 99.7 MB (99670041 = 95380659 DPD + 4289382 log)
- avg: DPD 4.5 MB, log 0.2 MB
- ~2700 events
- avg event size 34.5 KB
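A minimal sketch for listing or retrieving one output dataset with the DQ2 end-user clients (XX is the per-submission index, as above):
# List the files of one output dataset, or download them locally
dq2-ls -f user09.MarcoMambelli.test.090112.tag.bm1f.XX
dq2-get user09.MarcoMambelli.test.090112.tag.bm1f.XX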
Some queries and commands
Jobs per output dataset:
mysql> select destinationDBlock, count(*) from jobsArchived_Dec2008 where prodUserID='/DC=org/DC=doegrids/OU=People/CN=Marco Mambelli 325802/CN=proxy' and computingSite='ANALY_MWT2' and destinationDBlock like 'user09.MarcoMambelli.test.090112.tag.bm1f%' group by destinationDBlock;
Transformations used:
mysql> select distinct transformation from jobsArchived_Dec2008 where prodUserID='/DC=org/DC=doegrids/OU=People/CN=Marco Mambelli 325802/CN=proxy' and computingSite='ANALY_MWT2' and destinationDBlock like 'user09.MarcoMambelli.test.090112.tag.bm1f%';
CPU time statistics per job status (source of the CPU table above):
mysql> select jobStatus, count(*), avg(cpuConsumptionTime), min(cpuConsumptionTime), max(cpuConsumptionTime), sum(cpuConsumptionTime) from jobsArchived_Dec2008 where prodUserID='/DC=org/DC=doegrids/OU=People/CN=Marco Mambelli 325802/CN=proxy' and computingSite='ANALY_MWT2' and destinationDBlock like 'user09.MarcoMambelli.test.090112.tag.bm1f%' group by jobStatus;
Wall time statistics per job status (source of the wall time table above):
mysql> select jobStatus, count(*), avg(endTime-startTime), min(endTime-startTime), max(endTime-startTime), sum(endTime-startTime) from jobsArchived_Dec2008 where prodUserID='/DC=org/DC=doegrids/OU=People/CN=Marco Mambelli 325802/CN=proxy' and computingSite='ANALY_MWT2' and destinationDBlock like 'user09.MarcoMambelli.test.090112.tag.bm1f%' and endTime!=0 and startTime!=0 group by jobStatus;
# Extract the last "done processing" line from each job's athena stdout
for i in tarball_PandaJob_231034*; do grep "done processing" $i/athena_stdout.txt | tail -n 1; done > ptp
# Print the running event count (field 10 of that line) per job
awk '{print $10}' ptp
# Sum the per-job event counts
awk '{total+=$10}END{print total}' ptp
--
MarcoMambelli - 29 Jan 2009