Transfer speed test
Introduction
This is a ballpark test of the transfer speed, obtained by manually measuring (with time) the time spent by dq2-get to copy the files.
What I'm using:
- du (du -sb) to evaluate the size of the transferred files (during the test I actually used du -sh)
- Units: MB=10^6 bytes, MiB=1024^2 bytes, Mb=10^6 bits, Mib=1024^2 bits; rates written as Mbs/Mibs below are Mb/s and Mib/s (sometimes MiB/s is abbreviated M/s, but it is ambiguous)
- time wrapped around the command to measure the elapsed time
- a simple script to grab status snapshots:
#!/bin/sh
# Arguments: $1 = user running the transfer, $2 = destination directory
uptime
finger
ps -flu "$1" --forest --cols=500
ls -al "$2"
echo "Current size:"
du -sh "$2"
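For example, it could be invoked periodically during a transfer; the script name, user and interval below are just illustrative:
while true; do ./snapshot.sh marco /ecache/marco/test_dq2-get/timing; sleep 300; done >> snapshots.log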
Test 1: failed test
time dq2-get -L UCT3 -s MWT2_UC fdr08_run2.0052290.physics_Express.daq.RAW.o2 >& ../0052290_Transfer080805.log &
This test took 23 min but copied no files:
In this test dq2-get decided to use lcg-cp as the copy command, which failed because the URLs are not complete. Here are alternative invocations, all failing:
[uct3-edge5] /ecache/marco/test_dq2-get/timing > lcg-cp -v --vo atlas srm://uct2-dc1.uchicago.edu:8443/pnfs/uchicago.edu/data/ddm1/fdr08_run2/RAW/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0014._sfo02._0001.data file:///ecache/marco/test_dq2-get/timing/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0014._sfo02._0001.data
Command: lcg-cp -v --vo atlas -b -T srmv1 srm://uct2-dc1.uchicago.edu:8443/pnfs/uchicago.edu/data/ddm1/fdr08_run2/RAW/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0014._sfo02._0001.data file:///ecache/marco/test_dq2-get/timing/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0014._sfo02._0001.data
httpg://uct2-dc1.uchicago.edu:8443: Unknown error
lcg_cp: Communication error on send
Source SE type: SRMv1
[uct3-edge5] /ecache/marco/test_dq2-get/timing > /share/wlcg-client/lcg/bin/lcg-cp -v --vo atlas srm://uct2-dc1.uchicago.edu:8443/pnfs/uchicago.edu/data/ddm1/fdr08_run2/RAW/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0014._sfo02._0001.data file:///ecache/marco/test_dq2-get/timing/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0014._sfo02._0001.data
LCG_GFAL_INFOSYS not set
lcg_cp: Invalid argument
[uct3-edge5] /ecache/marco/test_dq2-get/timing > /share/wlcg-client/lcg/bin/lcg-cp -v -b -T srmv2 --vo atlas srm://uct2-dc1.uchicago.edu:8443/pnfs/uchicago.edu/data/ddm1/fdr08_run2/RAW/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0014._sfo02._0001.data file:///ecache/marco/test_dq2-get/timing/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0014._sfo02._0001.data
Invalid request: When BDII checks are disabled, you must provide full endpoint
lcg_cp: Invalid argument
Setting LCG_GFAL_INFOSYS for the second attempt results in a parsing error.
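For reference, a fully-specified SURL for a dCache endpoint includes the web-service path and the SFN; an invocation along these lines might satisfy the "full endpoint" requirement, but it was not tried in this test and the exact endpoint path here is an assumption:
/share/wlcg-client/lcg/bin/lcg-cp -v -b -T srmv2 --vo atlas "srm://uct2-dc1.uchicago.edu:8443/srm/managerv2?SFN=/pnfs/uchicago.edu/data/ddm1/fdr08_run2/RAW/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0014._sfo02._0001.data" file:///ecache/marco/test_dq2-get/timing/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0014._sfo02._0001.data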
Test 2: dq2-get (srmcp)
time dq2-get -L UCT3 -s MWT2_UC -p srm fdr08_run2.0052290.physics_Express.daq.RAW.o2 >& ../0052290_Transfer080805b.log &
This test took almost 125 min and copied all the files (300 files, ~137 GB):
It took exactly:
real 124m58.523s
user 58m0.271s
sys 19m33.982s
And the total size was:
> du -b fdr08_run2.0052290.physics_Express.daq.RAW.o2/
136621451208 fdr08_run2.0052290.physics_Express.daq.RAW.o2/
Results:
- from UCT2 to uct3-edge5 local /ecache
- 137 GB, 300 files
- 7498 seconds
- 18.2 MB/s (145.7 Mbs)
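The rates above follow directly from the du -sb byte count and the elapsed (real) time; a minimal sketch of the arithmetic used for all the rates on this page (small rounding differences aside):
# 136621451208 bytes from du -sb, 7498 s from "real 124m58.523s"
bytes=136621451208; secs=7498
awk -v b="$bytes" -v t="$secs" 'BEGIN { printf "%.1f MB/s  %.1f Mbs  %.1f Mibs\n", b/t/1e6, 8*b/t/1e6, 8*b/t/1048576 }'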
Test 3: sequential srmcp
time ./copytest2.sh >& copytest2.log
A text file (bash script) contains the exact sequence of srmcp commands executed by Test 2. This time there is no dq2-get overhead and the commands are executed sequentially. The commands look like:
srmcp srm://uct2-dc1.uchicago.edu:8443/pnfs/uchicago.edu/data/ddm1/fdr08_run2/RAW/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0029._sfo04._0001.data file:////ecache/marco/test_dq2-get/timing/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0029._sfo04._0001.data
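One possible way to produce such a script, assuming the Test 2 log echoes each srmcp invocation on a line of its own (an assumption about the log format, not verified here):
# extract the srmcp command lines from the dq2-get log into an executable script
( echo '#!/bin/sh'; grep '^srmcp ' ../0052290_Transfer080805b.log ) > copytest2.sh
chmod +x copytest2.sh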
Exactly the same files were copied.
It took exactly:
real 173m16.120s
user 34m37.884s
sys 16m0.831s
Results:
- from UCT2 to uct3-edge5 local /ecache
- 137 GB, 300 files
- 10396 seconds
- 13.1 MB/s (105.1 Mbs)
Test 4: srmcp with copyjobfile
This test is similar to the previous one, except that all the source/destination file pairs are listed in a file and there is a single srmcp invocation.
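The copyjob file is expected to list one transfer per line as a source-URL / destination-URL pair; a line of copyjobf.txt would then look like this (a single long line, same files as in Test 3):
srm://uct2-dc1.uchicago.edu:8443/pnfs/uchicago.edu/data/ddm1/fdr08_run2/RAW/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0029._sfo04._0001.data file:////ecache/marco/test_dq2-get/timing/fdr08_run2.0052290.physics_Express.daq.RAW.o2/daq.fdr08_run2.0052290.physics.Express.LB0029._sfo04._0001.data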
time /share/wlcg-client/srm-client-fermi/bin/srmcp -copyjobfile=copyjobf.txt -report=copytest3.report >& copytest3.log
This completed with:
real 153m20.639s
user 6m50.329s
sys 14m25.589s
This test also has a second part where the same copy is repeated several times, changing the number of streams used by srmcp. This checks whether the current limit is due to the number of streams per transfer.
time /share/wlcg-client/srm-client-fermi/bin/srmcp -streams_num=3 -copyjobfile=copyjobf.txt -report=copytest4.report >& copytest4.log
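The runs below were launched one at a time; a scripted version could look like the following sketch (the report and log names here are illustrative, not the ones actually used):
# repeat the same copy job with 3, 5 and 7 parallel streams
for n in 3 5 7; do
  time /share/wlcg-client/srm-client-fermi/bin/srmcp -streams_num=$n -copyjobfile=copyjobf.txt -report=copytest_${n}streams.report >& copytest_${n}streams.log
done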
copytest4: 3 streams
real 126m26.115s
user 6m10.713s
sys 8m52.897s
copytest5: 5 streams
real 127m10.319s
user 6m7.940s
sys 9m19.152s
copytest7: 7 streams
real 124m50.975s
user 6m37.168s
sys 9m34.491s
Results:
- from UCT2 to uct3-edge5 local /ecache
- 137 GB, 300 files
- 1 stream: 9200 seconds, 14.9 MB/s (118.8 Mbs)
- 3 streams: 7586 seconds, 18.0 MB/s (144.1 Mbs)
- 5 streams: 7630 seconds, 17.9 MB/s (143.2 Mbs)
- 7 streams: 7491 seconds, 18.2 MB/s (145.9 Mbs)
Notes:
- uct3-edge5 crashed during the first execution of the 7 stream test. Results are from the second execution.
Test 5: 2 (or more) copies at the same time
In this test, 2 or more copies of the same dataset are performed at the same time from the same client host.
This will check whether there are limits per transaction (or process), or whether the limits are somewhere in the client or in the server.
To shorten the test and use less disk space, each of these transfers involves only 100 files (45.4 GB, 45436718584 bytes). It should still be big enough not to be affected by variations in load.
2 Instances on the same client host
SC SS (same client host, same server)
time /share/wlcg-client/srm-client-fermi/bin/srmcp -streams_num=3 -copyjobfile=copyjobf100b.txt -report=copytest8b.report >& copytest8b.log
real 79m29.572s
user 2m4.430s
sys 3m37.087s
time /share/wlcg-client/srm-client-fermi/bin/srmcp -streams_num=3 -copyjobfile=copyjobf100a.txt -report=copytest8a.report >& copytest8a.log
real 77m11.139s
user 2m1.247s
sys 3m37.087s
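One way to launch the two instances concurrently from a single shell (a sketch; the runs above may equally well have been started from two separate terminals):
/share/wlcg-client/srm-client-fermi/bin/srmcp -streams_num=3 -copyjobfile=copyjobf100a.txt -report=copytest8a.report >& copytest8a.log &
/share/wlcg-client/srm-client-fermi/bin/srmcp -streams_num=3 -copyjobfile=copyjobf100b.txt -report=copytest8b.report >& copytest8b.log &
wait   # returns when both background transfers have finished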
Results:
- 45.4GB, 100 files
- same client host, same server: 9.5 MB/s (76.2 Mbs), 9.8 MB/s (78.5 Mbs), sum: 19.3 MB/s (154.7 Mbs)
- same client host, different server:
- different client host, same server:
Some performance discussion:
Speed
- 18221052.4 byte/sec
- 18.2 MB/s (145.7 Mbs, 139.0 Mibs)
- Max theoretical speed
- Ethernet: 1 Gbs (about 7 times the measured rate)
- disk: 1.5 Gbs (max SATA)
- read test:
- dCache read from 1 pool (3 if files are on different pools):
Further tests:
- srmcp tests (loop, streams, multiple): in progress
- loop with different commands (lcg-cp, ngcp)
Other data:
- 160 Mbps from UC -> da.physics.indiana.edu (Tom and ). Probably an iperf (memory to memory) transfer rate
Fred's rates:
- 52280: 113 GB in 12240 s = 74 Mbps (Friday afternoon)
- 52290: 133 GB in 8400 s = 126 Mbps (Monday morning)
Some thoughts and comparisons:
- local performance is not high
- anyway, about the same performance is achieved with remote transfers
- remote performance (UC-IU) hits the dq2-get limit and the network limit at about the same point (the iperf test and the local dq2-get execution give about the same value). Both would have to be improved to get better performance
- there is no verification that the copy is correct (no checksum evaluation; this would slow things down further)
- the transfer rate on 8/6/08 in production FTS (UC-BNL, the highest one of the day) is even lower (1 Mbs). Probably there are problems today
- the transfer rate on the evening of 8/6/08 in production FTS (UC-BNL, the highest one of the day) is even lower (2 Mbs). This is considered a good rate
- transfer test IU-UC on 8/6/08 by Sarah: sustained 500 Mbs in Ganglia. This involves 34 machines, each starting a 3rd-party SRM copy (multiple SRM servers and gridftp doors are involved as well)
- 800 Mbs according to Cacti and other measurements (Charles' ifrate.sh script)
- the test ended after ~6 hours with iut2-dc1 crashing (seemingly TCP memory allocation problems)
- RAID card test at MWT2_UC by Charles (MWT2 meeting 8/12/08), max throughput with dd:
- LSI cards max 20 MB/s
- 3Ware cards max 34 MB/s
-- MarcoMambelli - 05 Aug 2008