Execution of Analysis jobs using TAG selection on ANALY_MWT2
Introduction
Jobs were submitted using Pathena (from uct3-edge5).
Jobs ran from 1/12/2009 to 1/13/2009.
File results were checked using the DQ2 end-user clients.
Job description
The job executed is an example available in DPD production, but it uses TAGs for event selection (a submission sketch follows the list below):
- Each Pathena job has 1 build job.
- Each job is split into 1 sub-job per input file, which makes 21 jobs (the input dataset has 21 files).
- Each sub-job copies one input TAG file with dccp to the run directory (in /scratch).
- The AOD file is available (in PoolFileCatalog.xml) and is accessed from the SE using dcap: dcache:/pnfs/uchicago.edu/...
- There is one AOD file for each TAG file.
- It runs AODtoDPDvDecmm2.py (ATLAS release 14.4.0) on the events selected from the TAG file.
- The algorithm processes the events and also applies further filtering before writing the output ("ttbarFilter" to accept events).
- At the end it copies the output file back to the USERDISK (using lcg-cp).
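The exact submission command is not recorded on this page; the following is only a minimal sketch, assuming the standard Pathena options --inDS, --outDS, --site and --nFilesPerJob were used with the job options named above (--nFilesPerJob 1 matches the 1-sub-job-per-input-file splitting):
# Sketch of a submission (not the exact command used); XX is the per-submission index
pathena AODtoDPDvDecmm2.py \
  --inDS fdr08_run2.0052283.physics_Jet.merge.TAG.o3_f47_m26 \
  --outDS user09.MarcoMambelli.test.090112.tag.bm1f.XX \
  --nFilesPerJob 1 --site ANALY_MWT2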
Some changes were necessary in SchedConfig to make it possible to execute Pathena jobs using the LFC and indirect access to files (references from TAGs or back-navigation); the copysetup rewrite is illustrated below the settings:
proxy='donothide'
copysetup=srm://uct2-dc1.uchicago.edu(:[0-9]+)*(/srm/managerv2?SFN=)*/pnfs/^dcache:/pnfs/^False^False
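The copysetup value is a caret-separated list: the first two fields look like an SRM SURL pattern and the dcap prefix that replaces it (so that SURLs resolved through the LFC become directly readable dcap paths), and the two trailing fields are boolean pilot flags. A sketch of the rewrite, using a made-up SURL as input:
# Illustrative rewrite implied by the copysetup pattern (the SURL below is a made-up example)
echo 'srm://uct2-dc1.uchicago.edu:8443/srm/managerv2?SFN=/pnfs/uchicago.edu/example/AOD.pool.root' \
  | sed -E 's#srm://uct2-dc1\.uchicago\.edu(:[0-9]+)*(/srm/managerv2\?SFN=)*/pnfs/#dcache:/pnfs/#'
# -> dcache:/pnfs/uchicago.edu/example/AOD.pool.root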
For each of the 100 Pathena jobs the input dataset is fdr08_run2.0052283.physics_Jet.merge.TAG.o3_f47_m26 (file counts and sizes were checked with the DQ2 end-user clients; see the sketch after this list):
- total files: 21
- total size: 44.6 MB (44654379)
- avg 2.1 MB
Indirectly, however, the real input dataset is fdr08_run2.0052283.physics_Jet.merge.AOD.o3_f47_m26:
- 21 files
- total size: 37.5 GB (37530871801)
- avg 1.8 GB
- 270615 events
- it has been read by each job (that passed the build phase)
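A sketch of the kind of listing used to check these numbers with the DQ2 end-user clients, assuming the dq2-ls client and its -f option for per-file listings:
# List the files (with sizes) of the TAG input dataset and of the AOD dataset read through back-navigation
dq2-ls -f fdr08_run2.0052283.physics_Jet.merge.TAG.o3_f47_m26
dq2-ls -f fdr08_run2.0052283.physics_Jet.merge.AOD.o3_f47_m26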
Results
Queries to the MySQL DB at BNL had to be changed (different server and structure; this may be due to the migration to the CERN DB).
From the Panda Monitor I get:
- Build jobs completed in about 5 minutes.
- Run jobs took from 11 minutes to about 2 hours.
CPU Use (kSI2k seconds)
| Job | AVG | min | Max | Total |
| finished RUN Jobs | 288.1 | 98 | 683 | 596114 |
| failed RUN Jobs | 80.6 | 17 | 347 | 2498 |
CPU types are:
- Quad-Core AMD Opteron(tm) Processor 2350 (512 KB cache)
- Dual Core AMD Opteron(tm) Processor 285 (1024 KB cache)
Wall Time Use (seconds)
| Job | AVG | min | Max | Total |
| finished RUN Jobs | 6694.7 | 812 | 29302 | 13851408 |
| failed RUN Jobs | 2622.3 | 306 | 18390 | 81292 |
Failed jobs
Of the failed jobs:
- 26 failed because of a dccp timeout while getting the input file.
- The others got:
- 4 (1+3) log put errors:
- pilot: Log put error: No such file or directory: 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 byte ddm: Adder._updateOutputs() XML is inconsistent with filesTable exe: Put error: Error in copying the file from job workdir to localSE
- http://panda.cern.ch:25880/server/pandamon/query?job=23057658
- (23098503) pilot: Log put error: No such file or directory: 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 bytes 0.00 KB/sec avg 0.00 KB/sec inst 0 byte exe: Put error: Error in copying the file from job workdir to localSE
- 32 Athena crashes
The Athena crashes were all on uct2-c185.mwt2.org and were due to failures to access the AOD file using dcap (dcache:/pnfs/...). The dcap library was missing on the node; the problem has been fixed (by Charles on 1/19). A quick check for this kind of problem is sketched below.
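A minimal check, assuming the dcap client library is installed as a shared library registered with the dynamic linker (library names may differ per installation):
# Check whether a dcap client library is visible on the worker node
ldconfig -p | grep -i dcap || echo "no dcap library registered with the dynamic linker"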
The other errors come from different nodes and seem to be transient errors due to load: requests for the same file succeeded in other jobs, writing the output to the same directory succeeded, and other jobs succeeded from those nodes.
Output files
Each Pathena job that completes successfully reads one input dataset (the 21 TAG files), accesses the corresponding files from the AOD dataset, and produces 3 datasets, one output directory, 21 root files and 21 log files:
- One dataset is used for the input files (DSname_shadow) and has no replicas at the end of the job
- The other 2 datasets contain the same 42 files (21 root files and 21 log files): DSname and DSname_subXXX
- most of the root files are around 4MB (except the last one of the job)
- log files size varies (and are generally smaller)
- below are statistics for the whole sample
- Estimated total events written: 270K (26M read events, ttbar filter, excluding failures)
- The input DS has been read by each job (that passed the build phase)
- File sizes are always measured in MB (10^6 bytes) unless otherwise specified
Output datasets like user09.MarcoMambelli.test.090112.tag.bm1f.XX (checked with the DQ2 clients; see the sketch after this list) have:
- 42 files (21 DPD and 21 log files)
- total size: 99.7 MB (99670041 = 95380659 DPD + 4289382 log)
- avg: DPD 4.5 MB, log 0.2 MB
- ~2700 events
- avg event size 34.5 KB
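A minimal sketch for listing or retrieving one output dataset with the DQ2 end-user clients (XX is the per-submission index, as above):
# List the files of one output dataset, or download them locally
dq2-ls -f user09.MarcoMambelli.test.090112.tag.bm1f.XX
dq2-get user09.MarcoMambelli.test.090112.tag.bm1f.XX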
Some queries and commands
Jobs per output dataset:
mysql> select destinationDBlock, count(*) from jobsArchived_Dec2008 where prodUserID='/DC=org/DC=doegrids/OU=People/CN=Marco Mambelli 325802/CN=proxy' and computingSite='ANALY_MWT2' and destinationDBlock like 'user09.MarcoMambelli.test.090112.tag.bm1f%' group by destinationDBlock;
Transformations used:
mysql> select distinct transformation from jobsArchived_Dec2008 where prodUserID='/DC=org/DC=doegrids/OU=People/CN=Marco Mambelli 325802/CN=proxy' and computingSite='ANALY_MWT2' and destinationDBlock like 'user09.MarcoMambelli.test.090112.tag.bm1f%';
CPU time statistics per job status (source of the CPU table above):
mysql> select jobStatus, count(*), avg(cpuConsumptionTime), min(cpuConsumptionTime), max(cpuConsumptionTime), sum(cpuConsumptionTime) from jobsArchived_Dec2008 where prodUserID='/DC=org/DC=doegrids/OU=People/CN=Marco Mambelli 325802/CN=proxy' and computingSite='ANALY_MWT2' and destinationDBlock like 'user09.MarcoMambelli.test.090112.tag.bm1f%' group by jobStatus;
Wall time statistics per job status (source of the wall time table above):
mysql> select jobStatus, count(*), avg(endTime-startTime), min(endTime-startTime), max(endTime-startTime), sum(endTime-startTime) from jobsArchived_Dec2008 where prodUserID='/DC=org/DC=doegrids/OU=People/CN=Marco Mambelli 325802/CN=proxy' and computingSite='ANALY_MWT2' and destinationDBlock like 'user09.MarcoMambelli.test.090112.tag.bm1f%' and endTime!=0 and startTime!=0 group by jobStatus;
# Extract the last "done processing" line from each job's athena stdout
for i in tarball_PandaJob_231034*; do grep "done processing" $i/athena_stdout.txt | tail -n 1; done > ptp
# Print the running event count (field 10 of that line) per job
awk '{print $10}' ptp
# Sum the per-job event counts
awk '{total+=$10}END{print total}' ptp
--
MarcoMambelli - 29 Jan 2009