Performance evaluation of dCache using TAG Jobs

Introduction

TAG Jobs were observed to be very slow when running through Pathena in queues like ANALY_MWT2. To isolate the problem here I'm doing some local executions of the job, accessing the files locally and from dCache.

dCache is the primary suspect, specially since there is literature about the effect of dCache ReadAhead buffer and this buffer is not set at MWT2: http://t2.unl.edu/Members/brian/examining-dcache-read-ahead

The job

It is an analysis jobs with ~1600 events selected with ELSSI our of the Egamma stream of FDR2. Files involved are:
 1500007934  fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10._0001.1
 1467692918  fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10._0002.1
   966444180  fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10._0003.1
Located in dCache: /pnfs/uchicago.edu/atlasdatadisk/fdr08_run2/AOD/fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10/ and locally: /scratch/marco/infiles

Attached is the ELSSI file and the jobOption file.

Jobs are run using an auxiliary script , run_job, attached to the page.

To find the input files referenced by the TAG file

I'm using a mix of Athena POOL utilities and sed/grep:
CollListToken -src fdr08_run2_physics_Egamma_o3_f8_m10_1220541569_user RootCollection > tokens.txt
grep "NAME: Token" tokens.txt | grep "VALUE:" | sed "s|^.*DB=||" | sed "s|\]\[CNT.*||" | sort | uniq -c 
grep "NAME: Token" tokens.txt | grep "VALUE:" | sed "s|^.*DB=||" | sed "s|\]\[CNT.*||" | sort -u 
There are ~1600 events all coming from e AOD files. To create the PoolFileCatalog.xml (the 3 guid are printed by the command above):
curl http://uct2-grid1.uchicago.edu:8000/dq2/lrc/PoolFileCatalog/?guids=04E7EB10-C037-DD11-8EB3-0030489455E2+1054722A-C037-DD11-BA45-0030487CD938+D4802C11-C037-DD11-9315-0030487CD722 > PFC.xml
sed "s|srm://uct2-dc1.uchicago.edu/|dcache:/|" PFC.xml | sed "s|pfn filetype=\"\"|pfn filetype=\"ROOT_All\"|" > PoolFileCatalog.xml

Checking the output

I'm checking that the job completes successfully (exit code 0) and some of the output were also inspected.

Testbed

All tests are executed on uct2-c050 a dual core dual CPU (AMD) worker node part of the MWT2_UC cluster. The node has been removed from the batch queue.

Running the job all locally

This sets the baseline for the performance: if should be the best performance that we can expect to achieve. All tests have a running directory similar to:
$ ls -l test4P1
total 1596
-rw-r--r--  1 marco mwt2 405779 Sep 25 19:10 AnalysisSkeleton.aan.root
-rw-r--r--  1 marco mwt2   2168 Sep 25 19:08 PoolFileCatalog.xml
-rw-r--r--  1 marco mwt2   2168 Sep 25 19:10 PoolFileCatalog.xml.BAK
-rw-r--r--  1 marco mwt2 836535 Sep 25 19:10 athenaout.test4P1.txt
-rw-r--r--  1 marco mwt2    170 Sep 25 19:08 dcap_output.test4P1
-rw-r--r--  1 marco mwt2 348129 Sep 25 19:08 fdr08_run2_physics_Egamma_o3_f8_m10_1220541569_user.root
-rw-r--r--  1 marco mwt2   9082 Sep 25 19:08 test_tagOptions.py
And a script output like:
test4P1 no
test4P1, 0, 
cp: omitting directory `../srcdir/old'
Starting athena job (/scratch/marco/local-test080925/test4P1): Thu Sep 25 19:08:08 CDT 2008
No dCache RA Buffer

real    2m38.898s
user    2m30.044s
sys     0m8.598s
Job ended: Thu Sep 25 19:10:47 CDT 2008
There were 0 cache hits and 0 cache misses (miss rate: 100.00%)
There were 0 bytes read and 0 buffered bytes used; efficiency: 0.00%
The results are the same for the job that ran by itself and for the 4 jobs that ran in parallel (setting RE Buffer produces no changes). They complete in about 2'40" and it is almost all user time (>2'30"), no dead times waiting.

Tests form dCache

Files are accessed from dCache with URL in PoolFileCatalog that look like: dcache:/pnfs/uchicago.edu/atlasdatadisk/fdr08_run2/AOD/fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10/fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10._0002.1

As documented in http://t2.unl.edu/Members/brian/examining-dcache-read-ahead and http://www-dcache.desy.de/manuals/libdcap.html (or http://wiki.etp.physik.uni-muenchen.de/tiki-index.php?page=dCache#attachments) there are some variables checking if dCache is using the ReadAhead buffer and how big it is.
  • DCACHE_RAHEAD=1 1 or True activate the buffer
  • DCACHE_RA_BUFFER=32768 sets the size in bytes
  • DCACHE_DEBUG=32 adds dCache IO debug information [58] is the code preceding RA_BUFFER information (Once it was [8] according to the script)

First test, without ReadAhead buffer

Execution took a very long time, around 4h30min
$ date
Wed Sep 24 18:48:13 CDT 2008
$ time athena test_tagOptions.py > athenaout.txt 
Warning in : class pair already in TClassTable
WARNING: $POOL_CATALOG is not defined
using default `xmlcatalog_file:PoolFileCatalog.xml'
date
 
real	276m21.677s
user	3m10.077s
sys	0m12.833s
$ date
Wed Sep 24 23:24:40 CDT 2008

Single job, RA buffer

A vary similar test was repeated a couple of times after setting RA Buffer. After the job ran the dCache log file is parsed to verify how the cache was used. This job ran mach more quickly. More than 1 run, each time giving results under 10'.

$ export DCACHE_RAHEAD=1; export DCACHE_RA_BUFFER=32768; export DCACHE_DEBUG=32
$ date
Thu Sep 25 10:52:42 CDT 2008
$ time athena test_tagOptions.py 2> dcap_output > athenaout.txt
date

real    8m41.158s
user    3m50.346s
sys     0m54.975s
$ date131072
Thu Sep 25 11:01:30 CDT 2008
$ python ../analyze_output.py dcap_output
There were 130820 cache hits and 433398 cache misses (miss rate: 76.81%)
There were 11122167998 bytes read and 87024934 buffered bytes used; efficiency: 0.78%

4 Jobs with RA buffer

These are a series of tests with different buffer sizes where jobs are started 4 at the time on the machine. When running 4 jobs with the same 32K RA buffer size as before they took >20'. Probably the slow-down is due to the contemporary access to the same files.

In evaluating the buffer size keep in mind:
  • the bigger the buffer the less seeks (distinct read operations) are necessary
  • since TAG read events, not all the file there may be waste. The bigger the buffer the more may be the data that is read and not used
  • in the IO log there are lines like Sending CLOSE for fd:58 ID:20: the fd is always 58 in all tests, the ID starts from 1 and increments (1 each event) until 97 at the end of the log file. It seems that files are opened/closed 97 times (there are 3 files and >1600 events read)

Outputs are similar to the single jobs, so are the run directories. There are changes in execution times and cache efficiencies.

dCache RA Buffer: 0
Thu Sep 25 12:55:15 CDT 2008
real    84m18.989s
user    6m1.880s
sys     12m35.506s
Job ended: Thu Sep 25 14:19:34 CDT 2008
There were 168142 cache hits and 433501 cache misses (miss rate: 72.05%)
There were 279410229196 bytes read and 107268972 buffered bytes used; efficiency: 0.04%

dCache RA Buffer: 1048576
Thu Sep 25 12:55:15 CDT 2008
real    84m18.664s
user    5m38.389s
sys     12m48.059s
Job ended: Thu Sep 25 14:19:34 CDT 2008
There were 168142 cache hits and 433501 cache misses (miss rate: 72.05%)
There were 279410229196 bytes read and 107268972 buffered bytes used; efficiency: 0.04%

dCache RA Buffer: 32768
Thu Sep 25 12:53:02 CDT 2008
real    19m28.836s
user    4m11.989s
sys     1m56.881s
Job ended: Thu Sep 25 13:12:30 CDT 2008
There were 130820 cache hits and 433398 cache misses (miss rate: 76.81%)
There were 11122167998 bytes read and 87024934 buffered bytes used; efficiency: 0.78%

dCache RA Buffer: 4096    - CRASHED!
Thu Sep 25 12:53:08 CDT 2008
real    0m46.192s
user    0m40.939s
sys     0m2.617s
Job ended: Thu Sep 25 12:53:54 CDT 2008
There were 324 cache hits and 1148 cache misses (miss rate: 77.99%)
There were 16643093 bytes read and 169318 buffered bytes used; efficiency: 1.02%

After fixing the library (see below):
testC4096
dCache RA Buffer: 4096
Thu Sep 25 17:18:57 CDT 2008
real    21m1.695s
user    4m5.559s
sys     1m35.186s
Job ended: Thu Sep 25 17:39:59 CDT 2008
There were 119608 cache hits and 420595 cache misses (miss rate: 77.86%)
There were 2513337013 bytes read and 55592287 buffered bytes used; efficiency: 2.21%

testC32768
dCache RA Buffer: 32768
Thu Sep 25 17:18:57 CDT 2008
real    21m43.624s
user    4m15.301s
sys     1m51.018s
Job ended: Thu Sep 25 17:40:41 CDT 2008
There were 130747 cache hits and 433325 cache misses (miss rate: 76.82%)
There were 11122167998 bytes read and 86372081 buffered bytes used; efficiency: 0.78%

testC131072
dCache RA Buffer: 131072
Thu Sep 25 17:18:57 CDT 2008
real    33m53.036s
user    4m26.997s
sys     3m20.218s
Job ended: Thu Sep 25 17:52:50 CDT 2008
There were 136014 cache hits and 433360 cache misses (miss rate: 76.11%)
There were 40184183603 bytes read and 90121880 buffered bytes used; efficiency: 0.22%

testC1048576
dCache RA Buffer: 1048576
Thu Sep 25 17:18:57 CDT 2008
real    55m49.076s
user    5m9.624s
sys     12m1.041s
Job ended: Thu Sep 25 18:14:46 CDT 2008
There were 168127 cache hits and 433486 cache misses (miss rate: 72.05%)
There were 279410229196 bytes read and 107254921 buffered bytes used; efficiency: 0.04%

testD10000
dCache RA Buffer: 10000
Thu Sep 25 20:16:09 CDT 2008
real    8m42.898s
user    3m38.923s
sys     1m15.631s
Job ended: Thu Sep 25 20:24:52 CDT 2008
There were 127687 cache hits and 430150 cache misses (miss rate: 77.11%)
There were 4272327367 bytes read and 73825342 buffered bytes used; efficiency: 1.73%

testD20000
dCache RA Buffer: 20000
Thu Sep 25 20:16:09 CDT 2008
real    8m30.131s
user    3m37.290s
sys     1m20.336s
Job ended: Thu Sep 25 20:24:39 CDT 2008
There were 129769 cache hits and 432327 cache misses (miss rate: 76.91%)
There were 7280247541 bytes read and 79148542 buffered bytes used; efficiency: 1.09%

testD32768
dCache RA Buffer: 32768
Thu Sep 25 20:16:09 CDT 2008
real    8m47.317s
user    3m32.795s
sys     1m30.122s
Job ended: Thu Sep 25 20:24:57 CDT 2008
There were 130747 cache hits and 433325 cache misses (miss rate: 76.82%)
There were 11122167998 bytes read and 86372081 buffered bytes used; efficiency: 0.78%

testD50000
dCache RA Buffer: 50000
Thu Sep 25 20:16:09 CDT 2008
real    10m0.541s
user    3m30.422s
sys     1m37.734s
Job ended: Thu Sep 25 20:26:10 CDT 2008
There were 131093 cache hits and 433381 cache misses (miss rate: 76.78%)
There were 16320231523 bytes read and 87369593 buffered bytes used; efficiency: 0.54%

4 Jobs with RA buffer and different pools

I repeated the tests with the buffer sizes that seem to work better, this time with files on different pools:

Some more runs single job

For each run the first line is specifying the parameters:
  • job ID (e.g. testWW8000)
  • 1 if it is using the Read-Ahead buffer
  • the buffer size (ignored if buffer is not used)
  • the alternative PFC (e.g. 1 is for .. /srcdir/catalogs/PoolFileCatalog1.xml)

testWW8000, 1, 8000, 1
dCache RA Buffer: 8000
Tue Sep 30 16:32:27 CDT 2008
real    9m47.583s
user    2m48.521s
sys     0m40.986s
Job ended: Tue Sep 30 16:42:15 CDT 2008
There were 126900 cache hits and 429193 cache misses (miss rate: 77.18%)
There were 3670018416 bytes read and 70923746 buffered bytes used; efficiency: 1.93%

testWW10000, 1, 10000, 2
dCache RA Buffer: 10000
Tue Sep 30 16:32:27 CDT 2008
real    13m31.376s
user    2m50.382s
sys     0m42.183s
Job ended: Tue Sep 30 16:45:59 CDT 2008
There were 127687 cache hits and 430150 cache misses (miss rate: 77.11%)
There were 4272327367 bytes read and 73825342 buffered bytes used; efficiency: 1.73%

testWW1800, 1, 1800, 3
dCache RA Buffer: 1800
Tue Sep 30 16:32:27 CDT 2008
real    14m4.625s
user    2m52.367s
sys     0m43.171s
Job ended: Tue Sep 30 16:46:32 CDT 2008
There were 1218 cache hits and 4240 cache misses (miss rate: 77.68%)
There were 6886626 bytes read and 407976 buffered bytes used; efficiency: 5.92%

testWW25000, 1, 25000, 4
dCache RA Buffer: 25000
Tue Sep 30 16:32:27 CDT 2008
real    14m4.742s
user    2m51.891s
sys     0m46.892s
Job ended: Tue Sep 30 16:46:32 CDT 2008
There were 1766 cache hits and 5189 cache misses (miss rate: 74.61%)
There were 85787674 bytes read and 1195883 buffered bytes used; efficiency: 1.39%

Event size?

Buggy libdcap.so, Athena Crashes

I observed consistent crashes with a buffer size of 4K. These went away using the library provided by Shuwei (Shuwei YE <yesw@bnl.gov>). He signaled a known problem with libdcap.so

4096 is also the size of the buffer set for the harddrive (with ). It would be interesting to see if changing the HD buffer the RA_BUFFER causing crashes is changing as well, but this is more dCache troubleshooting and the bug is fixed.

Here is Shuwei's email:
   Yes, the dcache ReadAhead feature sometimes can improve read 
performance siginificantly (sometimtes a factor 6). The exact optimal 
value of DCACHE_RA_BUFFER depends on your application. After some tests, 
the recommended value is 128K. On default, the ReadAhead is not set. To 
use, you need set:

export DCACHE_RAHEAD=true
export DCACHE_RA_BUFFER=131072

   There was a bug in dcache ReadAhead code, which would lead to incorrect 
read under some conditions. I fixed this bug and have implemented in 

/afs/usatlas.bnl.gov/i386_redhat72/opt/dcache/vdt/dcache/dcap/lib/lib32/libdcap.so

and sym-link /opt/usatlas/lib/libdcap.so

but not in libdcap.so coming with ATLAS releases.

I decided to use LD_LIBRARY_PATH to use the new library:
$ export LD_LIBRARY_PATH=/scratch/marco/libext:$LD_LIBRARY_PATH
$ ls /scratch/marco/libext
total 272
lrwxrwxrwx  1 marco mwt2     17 Sep 25 17:14 libdcap.so -> libdcap.so-shuwei
-rw-r--r--  1 marco mwt2 271790 Sep 25 15:56 libdcap.so-shuwei

Copy of files from dCache

Multiple processes copying the same file slow down dCache: the copy of the dataset (3 files) takes an average of 2min. 4 copies going on at the same time take around 6min each 8 copies skyrocket the load to 15 and the machine becomes unresponsive (also to basic commands like top, uptime, not only fs related like ls) and the copie completed with times between 6 and 8min. dccp copy of the files takes around 3-4min

Test detail

The script used in the test, copyfile.sh, is a simple sequence of 3 dccp commands:
#!/bin/bash
# $1 is a output dir
mkdir -p $1
dccp /pnfs/uchicago.edu/atlasdatadisk/fdr08_run2/AOD/fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10/fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10._0003.1 $1 
dccp /pnfs/uchicago.edu/atlasdatadisk/fdr08_run2/AOD/fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10/fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10._0001.1 $1 
dccp /pnfs/uchicago.edu/atlasdatadisk/fdr08_run2/AOD/fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10/fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10._0002.1 $1
Single invocation:
-bash-3.00$ time ./copyfiles.sh ./infiles/s1/ >& logcopys1.out &
[1] 2090
-bash-3.00$ 
real	2m10.101s
user	0m0.559s
sys	0m24.250s
4 concurrent invocations. The invocation below is scattered few seconds around 5'45, another invocation had all 4 copies around 6'20. It seems a regular behavior to have little variability within the set.
-bash-3.00$ for i in 1 2 3 4; do time ./copyfiles.sh ./infiles/r$i/ >& logcopy$i.out &  done
[1] 29367
[2] 29368
[3] 29369
[4] 29370
-bash-3.00$ 
-bash-3.00$ 
real	5m40.627s
user	0m0.735s
sys	0m38.063s

real	5m44.070s
user	0m0.748s
sys	0m38.857s

real	5m45.549s
user	0m0.762s
sys	0m40.712s

real	5m50.606s
user	0m0.697s
sys	0m38.685s

dCache parallel copies are a little faster, under 3min each
for i in 1 2 3 4; do time ./copyfiles.sh /pnfs/uchicago.edu/test/marco/080926/rep$i/ >& logcopy_dC$i.out &  done

The copy scattered the replicas across different pools (find_file suggested by Charles Waldman):
> function get_pnfsid { cat "$(dirname $1)/.(id)($(basename $1))";}
> function find_file { dcmd.sh "ls /dcache/pool/data/$(get_pnfsid $1) 2>&1 | grep -v No" dc[3] c[034];}
> find_file 
/pnfs/uchicago.edu/test/marco/080926/rep$i/fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10._0001.1
> done
c029    /dcache/pool/data/0000000000000000001CBC78
c030    /dcache/pool/data/0000000000000000001CBC28
c033    /dcache/pool/data/0000000000000000001CBC60
c030    /dcache/pool/data/0000000000000000001CBD18
-bash-3.00$ for i in 1 2 3 4; do find_file
/pnfs/uchicago.edu/test/marco/080926/rep$i/fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10._0002.1; done
c027    /dcache/pool/data/0000000000000000001CBCF8
c026    /dcache/pool/data/0000000000000000001CBCA8
c034    /dcache/pool/data/0000000000000000001CBCD0
c031    /dcache/pool/data/0000000000000000001CBD88
-bash-3.00$ for i in 1 2 3 4; do find_file
/pnfs/uchicago.edu/test/marco/080926/rep$i/fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10._0003.1; done
c027    /dcache/pool/data/0000000000000000001CBBD0
c031    /dcache/pool/data/0000000000000000001CBBE0
c028    /dcache/pool/data/0000000000000000001CBC00
c032    /dcache/pool/data/0000000000000000001CBBF0
-bash-3.00$ for i in 1 2 3 ; do find_file
/pnfs/uchicago.edu/atlasdatadisk/fdr08_run2/AOD/fdr08_run2.0052293.physics_Egamma.merge.AOD.o3_f8_m10/fdr08_run2.0052293.physics_Egamma.merge.
AOD.o3_f8_m10._000$i.1; done
dc1     /dcache/pool/data/000600000000000000049EE8
c002    /dcache/pool/data/000600000000000000049960
c003    /dcache/pool/data/000600000000000000049928

Conclusions

Consider these only preliminary results. This is alimited test with a specific analysys job. Different jobs may behave differently. There were few repetition and even if consistent it is not statistically significant.

Anyway
  • Enabling read-ahead can cause a significant increase in performance.
  • Enabling read-ahead can cause a significant increase of I/O transfers. The efficiency of the cache (buffered bytes used / total bytes read) is horrible.
  • Enabling read-ahead can cause a significant decrease in the number of seeks performed: (one can decrease the number of seeks by 1/3?). As a network protocol is mostly latency bound, this decreases total runtime.
  • A lower miss rate does not increase cache efficiency
  • While read-ahead helps, it does not represent a very effective caching policy for this application. The usage patterns are such that neither very small nor very large caches help out the efficiency.

-- MarcoMambelli - 26 Sep 2008

Topic attachments
I Attachment Action Size Date Who Comment
analyze_output.py.txttxt analyze_output.py.txt manage 1 K 26 Sep 2008 - 20:09 MarcoMambelli  
fdr08_run2_physics_Egamma_o3_f8_m10_1220541569_user.rootroot fdr08_run2_physics_Egamma_o3_f8_m10_1220541569_user.root manage 339 K 26 Sep 2008 - 20:07 MarcoMambelli ELSSI output
run_job.shsh run_job.sh manage 838 bytes 26 Sep 2008 - 20:08 MarcoMambelli  
setup_athena.shsh setup_athena.sh manage 186 bytes 26 Sep 2008 - 20:08 MarcoMambelli rel 14.2.20 setup
test_tagOptions.py.txttxt test_tagOptions.py.txt manage 8 K 26 Sep 2008 - 20:07 MarcoMambelli jobOption
Topic revision: r5 - 07 Oct 2008, MarcoMambelli
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback