Execution of Analysis jobs on ANALY_MWT2
Introduction
Jobs were submitted with Pathena (from uct3-edge5).
Results were checked in the Panda DB by typing queries with the mysql client; jobsArchived4 is the table containing the analysis jobs.
File results were checked using the DQ2 end-user clients.
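As a rough illustration, the status check could be done with a query like the one sketched below. The table name jobsArchived4 comes from the text; the column names jobStatus and computingSite are assumptions about the Panda DB schema, and the query is only assembled here, not run against a database.

```python
# Sketch of the kind of query typed into the mysql client.
# jobsArchived4 is from the text; jobStatus and computingSite are
# assumed column names in the Panda DB schema.
def status_count_query(site: str) -> str:
    return (
        "SELECT jobStatus, COUNT(*) "
        "FROM jobsArchived4 "
        f"WHERE computingSite = '{site}' "
        "GROUP BY jobStatus"
    )

print(status_count_query("ANALY_MWT2"))
```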
Job description
The job executed is an example available in DPD production:
- Each Pathena job has 1 build job.
- Each job is split into 1 job per input file, which makes 21 jobs (the input dataset has 21 files).
- It copies the input AOD with dccp to the run directory (in /scratch).
- It runs AnalyzeJpsiphi.py.
- At the end it copies the output files back to USERDISK (using lcg-cp).
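The per-job file movement described above can be sketched as command lines. All paths, URLs, and file names below are hypothetical placeholders; the commands are only assembled for illustration, not executed.

```python
# Assemble the stage-in / run / stage-out commands described above.
# All paths and storage URLs are hypothetical placeholders.
def stage_in(aod_file: str, rundir: str = "/scratch/run") -> list[str]:
    # Copy the input AOD from dCache to the local run directory.
    return ["dccp", aod_file, rundir]

def run_job(jobo: str = "AnalyzeJpsiphi.py") -> list[str]:
    # Run the analysis job options with athena.
    return ["athena", jobo]

def stage_out(local_file: str, se_url: str) -> list[str]:
    # Copy the output back to USERDISK with lcg-cp.
    return ["lcg-cp", f"file://{local_file}", se_url]

cmds = [
    stage_in("dcap://example.org/pnfs/aod/AOD.pool.root"),
    run_job(),
    stage_out("/scratch/run/out.root", "srm://example.org/userdisk/out.root"),
]
for c in cmds:
    print(" ".join(c))
```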
Results
Generic statistics:
- number of submissions: 250 Pathena jobs (repetitions of the same job)
- resulting Panda jobs: 5250
- 250 build jobs
- 5000 run jobs
- finished: 5138
- failed: 112 (63 of which never started due to build job failures)
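A quick consistency check on the counts above (all numbers copied from this section; the build-failure count is implied by dividing the never-started run jobs by the 21 run jobs per submission described earlier):

```python
# Sanity-check the job counts reported above.
build_jobs, run_jobs = 250, 5000
panda_jobs = build_jobs + run_jobs
finished, failed = 5138, 112

assert panda_jobs == 5250               # resulting Panda jobs
assert finished + failed == panda_jobs  # every job either finished or failed

# 63 run jobs never started because their build job failed; with the
# 21 run jobs per submission described above, that is 3 failed builds.
never_started = 63
print(never_started // 21)
```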
CPU Use (kSI2k-seconds)

| Job | AVG | min | Max | Total |
| finished BUILD Jobs | 453.34 | 388 | 933 | 111974 |
| failed BUILD Jobs | 271.00 | 0 | 416 | 816 |
| finished RUN Jobs | 1299.36 | 481 | 3803 | 6676092 |
| failed RUN Jobs | 322.15 | 0 | 1989 | 36081 |
| Total | 1240.90 | 0 | 3803 | 6824960 |
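The rows above are internally consistent: dividing Total by AVG recovers the number of jobs in each row. A minimal sketch of how such a table row is computed from per-job CPU values (the sample values are invented for illustration):

```python
# Compute AVG/min/Max/Total for a list of per-job CPU times, as in the
# table above; the sample values are made up for illustration.
def row_stats(values):
    total = sum(values)
    return {
        "AVG": round(total / len(values), 2),
        "min": min(values),
        "Max": max(values),
        "Total": total,
    }

sample = [388, 453, 933, 500]   # hypothetical kSI2k-seconds
print(row_stats(sample))

# Cross-check from the table: Total / AVG gives the row's job count.
print(round(111974 / 453.34))   # finished BUILD jobs -> 247
```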
CPU types are:
Quad-Core AMD Opteron(tm) Processor 2350 (512 KB cache) and
Dual Core AMD Opteron(tm) Processor 285 (1024 KB cache)
Wall Time Use (seconds)

| Job | AVG | min | Max | Total |
| finished BUILD Jobs | 4045.72 | 2226 | 6673 | 999293 |
| failed BUILD Jobs | 4289.33 | 1 | 6533 | 12868 |
| finished RUN Jobs | 10818.82 | 873 | 767584 | 55576278 |
| failed RUN Jobs | 16042.62 | 1 | 50218 | 770046 |
| Total | 10867.19 | 1 | 767584 | 56346324 |
The 63 failed run jobs that never started are excluded from the wall-time count because of wrong entries in the DB.
Each Pathena job completing successfully reads one dataset with 21 files and produces 3 datasets, one output directory, 21 root files and 21 log files:
- One dataset is used for the input files (DSname_shadow) and has no replicas at the end of the job.
- The other 2 datasets contain the same 42 files (21 root files and 21 log files): DSname and DSname_subXXX.
- Most of the root files are around 63 MB (except the last one of the job).
- Log file sizes vary (and are generally smaller).
- Below are statistics both for 1 successfully completed job (1J) and for the whole sample.
- File sizes are always measured in MB (10^6 bytes) unless otherwise specified.
- Estimated total events written: 66210470 (one per event read, excluding failures).
| File type | AVG | min | Max | Total |
| Root files 1J | 63.0 | 38.4 | 68.8 | 1323.1 |
| LOG files 1J | 0.21 | 0.16 | 0.22 | 4.4 |
| Total 1J | 31.6 | 0.16 | 68.8 | 1327.5 |
| Root files | 63.0 | 0 | 68.9 | 322771.3 |
| LOG files | 0.21 | 0 | 0.57 | 1080.5 |
| Total | 30.7 | 0 | 68.9 | 326171.5 |
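The single-job (1J) rows above are internally consistent: 21 root files plus 21 log files give 42 output files, and the totals add up. A quick check with the numbers copied from the table:

```python
# Check internal consistency of the 1J rows in the table above.
root_total, log_total = 1323.1, 4.4      # MB, from the table
grand_total = root_total + log_total
n_files = 21 + 21                        # 21 root + 21 log files per job

assert abs(grand_total - 1327.5) < 1e-9  # matches "Total 1J"
print(round(grand_total / n_files, 1))   # 31.6 MB average, as in the table
```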
The input dataset is fdr08_run2.0052283.physics_Jet.merge.AOD.o3_f47_m26:
- 21 files
- 37.5 GB
- 270615 events
- it has been read by each job that passed the build phase
- total events read: 66841905
The job is not really a skim: the skim ratio is 100% (all events are written to the output).
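The event numbers above are consistent with the job counts: each of the 247 submissions whose build job succeeded read the full 270615-event dataset, and since the skim ratio is 100% the written events fall short of the read events only by the failed run jobs. A quick arithmetic check (the 247 successful builds are implied by dividing the total events read by the dataset size):

```python
# Event bookkeeping from the numbers above.
events_per_dataset = 270615
successful_builds = 247            # implied: 66841905 / 270615
events_read = successful_builds * events_per_dataset
print(events_read)                 # 66841905, as reported

events_written = 66210470
print(events_read - events_written)  # events lost to failed run jobs
```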
Plot from Charles
A nice plot (from Charles) showing the 5000 run jobs completing:
Conclusion
The jobs caused some trouble in the cluster, especially for the gatekeeper and the NFS server hosting the home directories.
However, it is not possible to check now whether pathena is abusing the GASS cache, since there is no record of the data flow; that has to be checked while the job is running.
These analysis jobs are nothing special compared to others:
- pathena stages the pilot and its auxiliary files using the Globus GASS cache
- the jobs use the movers to copy input and output files
--
MarcoMambelli - 26 Nov 2008