Transferring Files from IU Tier 3 to Castor at CERN
Introduction
Lack of resources at CERN caused Pauline Gagnon / Yi Yang to generate 4 datasets on the IU Tier 3 center (da.physics.indiana.edu). Each dataset was text output of four-vectors from Sherpa that was tarred and compressed. The datasets were each about 0.25 TB (compressed) and each had between 300 and 600 files. The plan for the datasets was to apply a filter at the generator level to find a very small sample of background events that could be fully simulated with GEANT4 and then reconstructed. These datasets were officially sanctioned by the Higgs group and were for use in studying the so-called "Invisible Higgs" production mode.
After consulting with a number of people including Junichi Tanaka, Hiro Ito, and Alexei Klimentov, the scheme below was developed. The scheme uses globus-url-copy locally within Indiana to transfer the data files to the dCache storage at the IU component of the MWT2 (MWT2_IU), register the files into DQ2 at MWT2_IU, and then subscribe the files to CERN so that they would be replicated into Castor. After a significant amount of experimenting, this scheme was made to work with lots of help from Marco Mambelli.
This document shows how to take all of the DQ2 commands from the CERN AFS installation. An appendix at the end shows how to set up the DQ2 client from BNL.
A second appendix shows how to use routines written by Junichi Tanaka to transfer files directly to CERN from an area on a local machine. This method works but is quite slow if the files must cross from North America to Europe. It is probably much faster within Europe and especially within CERN.
Getting Grid Credentials
Note: All of these steps assume you are running on a Linux server with SLC4 or RHEL4 that has an AFS client installed. A local installation of the grid client or DQ2 client tools is not required.
Before you can use the grid you must set up the grid environment and authenticate your credentials. Assuming that you have your grid certificate in a .globus directory in $HOME, do the following:
> source /afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.[c]sh
> voms-proxy-init --voms=atlas
Enter GRID pass phrase:
< Your pass phrase is not echoed. >
Your identity: /DC=org/DC=doegrids/OU=People/CN=Frederick Luehring 621522
Cannot find file or dir: /s2/luehring/.glite/vomses
Creating temporary proxy .................................. Done
Contacting voms.cern.ch:15001 [/DC=ch/DC=cern/OU=computers/CN=voms.cern.ch] "atlas" Done
Creating proxy .................................. Done
Your proxy is valid until Fri Jan 18 04:27:14 2008
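Before starting a long transfer it is worth checking that the proxy really carries the ATLAS VOMS attributes and has enough time left:
> voms-proxy-info -all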
Note: the names of all of these commands use hyphens ("-").
Transferring the files
Now you are ready to transfer the files. The transfer process is in three steps:
- Transfer the files to the closest grid Storage Element (SE) using globus-url-copy.
- Register the files with DQ2 using dq2_put.
- Transfer the files to BNL and then CERN using dq2-register-subscription.
Transferring the files to the closest Storage Element (SE)
I have attached a shell script (CopyFromDiskToSE) that uses globus-url-copy to copy local files to a grid Storage Element (SE) such as your closest Tier 1 or Tier 2. This script is a small modification of a script called CopyFromDiskToGrid written by Junichi Tanaka. The script is a wrapper that feeds the names of all files in a directory to globus-url-copy so that the files can be transferred to a grid storage element. The script takes three arguments:
- The local directory containing the files.
- The name of the files being transferred.
- The URL of the target storage element.
There is one switch: -v, which provides verbose output as the script runs. Note that the speed of the transfer is dictated by the local network environment between the local computer holding the files and the Tier 1 or 2 containing the storage element. To use the script, place it in a working directory and execute it:
> mkdir /Working/Dir
> cd /Working/Dir
> cp /Some/Where/CopyFromDiskToSE .
> ./CopyFromDiskToSE -v [Local Directory Containing Files] [Files] [SE URL]
The local directory must start with a slash ("/"). The files must be specified with the file numbers wildcarded. The URL must point at a writable location, and in some cases the directory specified by the URL must already exist. It is probably easiest to understand using an example:
>./CopyFromDiskToSE -v /d00/yiyang/Output_SHERPA_09/Z3bjetsmumu \
"sherpa.006596.PtmissPlusLeptonsFilter_Z3bjetsmumu._[0-9][0-9][0-9][0-9][0-9].tar.gz" \
gsiftp://iut2-dc1.iu.edu/pnfs/iu.edu/luehring/Z3bjetsmumu
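For reference, the heart of such a wrapper is just a loop that feeds each matching local file to globus-url-copy. The following is a minimal sketch of the idea written for sh/bash - it is not the attached CopyFromDiskToSE script, and the variable names are made up for illustration:
#!/bin/sh
# Minimal illustration of a CopyFromDiskToSE-style wrapper (not the real script).
# $1 = local directory, $2 = (quoted) file pattern, $3 = destination SE URL.
LOCALDIR=$1
PATTERN=$2
SEURL=$3
cd "$LOCALDIR" || exit 1
for f in $PATTERN; do
  echo "Copying $f ..."
  # globus-url-copy takes a source URL and a destination URL;
  # for a local file the source URL has the form file:///path/to/file.
  globus-url-copy "file://$LOCALDIR/$f" "$SEURL/$f" || echo "FAILED: $f"
done
Redirecting the wrapper's output to a log file (as recommended below) makes it easy to grep for failed transfers afterwards.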
This transfer (even at high speeds to the closest Tier 2) can take many hours if your dataset is of order a TB. It is wise to redirect the output into a log file so that you can look for errors. It is also a good idea to calculate an md5sum for each data file - the md5sum can be used to check that the files in the dataset were transferred to the grid without corruption. Junichi has provided a script to calculate the md5sums. Junichi's routine also gives the length of each file in the dataset. To use it do the following:
> cp /afs/cern.ch/user/j/jtanaka/public/HiggsWG_MC/tools/for_inputfile/CastorLevel_MakeInputFileList.sh .
> ./CastorLevel_MakeInputFileList.sh [Local Directory Containing Files] > file.txt
Calculating the md5 sums for the input files for the above example:
> ./CastorLevel_MakeInputFileList.sh /d00/yiyang/Output_SHERPA_09/Z3bjetsmumu/ > DS006595.txt &
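If Junichi's AFS area is not reachable from your server, a similar list (file name, md5 sum, size in bytes) can be produced with standard Linux tools. This is only a rough substitute - the output file name and the exact column order here are my own choices, not necessarily what CastorLevel_MakeInputFileList.sh produces:
for f in /d00/yiyang/Output_SHERPA_09/Z3bjetsmumu/*.tar.gz; do
  # one line per file: name, md5 sum, length in bytes
  echo "`basename $f` `md5sum $f | awk '{print $1}'` `stat -c %s $f`"
done > MyLocalList.txt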
Registering the files with DQ2
It is now necessary to register the files that you just uploaded to the grid storage element using the dq2_put DQ2 client tool. The DQ2 client tools include:
- dq2_get - get a dataset from DQ2 - see https://twiki.cern.ch/twiki/bin/view/Atlas/UsingDQ2#dq2_get.
- dq2_ls - list datasets and their contents - see https://twiki.cern.ch/twiki/bin/view/Atlas/UsingDQ2#dq2_ls.
- dq2_put - register a dataset with DQ2 - see https://twiki.cern.ch/twiki/bin/view/Atlas/UsingDQ2#dq2_put.
Note: the names of these commands use underscores ("_") not hyphens ("-").
Set up these tools from CERN using AFS:
> source /afs/cern.ch/atlas/offline/external/GRID/ddm/endusers/setup.[c]sh.any
Work in another working area on the local machine, not in the local directory containing the local copy of the data files:
> mkdir /Another/Working/Dir
> cd /Another/Working/Dir
The dq2_put.sh command needs three arguments:
- The full path to the local directory containing the data files that you just transferred to the SE.
- The SRM address of the directory containing the copies of the files on the grid SE.
- The dataset name, which can be either a user dataset name or an official name.
You must still have a valid grid or VOMS proxy. You must also set an environment variable specifying what storage element/grid site the data is on; do this after setting up the DQ2 client environment and before executing dq2_put:
> export DQ2_LOCAL_ID=[SE Name]      (sh, zsh)
> setenv DQ2_LOCAL_ID [SE Name]      (csh)
> dq2_put -v -o -d [Local Directory Containing Files] \
-n srm://[SRM gatekeeper IP]:8443/[SE Directory Containing Files] [Dataset Name]
The local directory must again begin with a slash ("/"). The -v switch is optional and produces verbose debugging output from the dq2_put command. The -o switch specifies that the data is an official dataset - do not use it when registering user datasets. The SRM address is easily constructed by modifying the URL that you used to upload the files. Continuing the previous concrete example, an actual command looks like:
> mkdir ~/Copy_Sherpa_Data
> cd ~/Copy_Sherpa_Data
> setenv DQ2_LOCAL_ID MWT2_IU
> dq2_put.sh -v -o \
-d /d00/yiyang/Output_SHERPA_09/Z3bjetsmumu \
-n srm://iut2-dc3.iu.edu:8443/pnfs/iu.edu/test/Z3bjetsmumu \
sherpa.006596.PtmissPlusLeptonsFilter_Z3bjetsmumu
Note: the MWT2 uses dCache mounted on /pnfs/iu.edu/... This command took ~1 hour to execute between the IU Tier 3 and the MWT2 center, which is ~85 km away. You should redirect the output to a logfile since the dq2_put command generates an md5sum for each file on the storage element. Thus it is possible to check whether the files were received at the grid storage element without being corrupted by comparing the md5 sums calculated by CastorLevel_MakeInputFileList.sh to the md5 sums calculated by dq2_put in this step.
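A quick way to confirm that the dataset is now known to DQ2 is to list it by name:
> dq2_ls sherpa.006596.PtmissPlusLeptonsFilter_Z3bjetsmumu
As a concrete (but hedged) way of doing the md5 comparison, assuming the dq2_put output was redirected to a file called dq2_put.log (a placeholder name) and that the md5 sums appear in both files as 32-character hex strings (an assumption about the log format), the sorted lists can simply be diffed:
> grep -oE '[0-9a-f]{32}' DS006595.txt | sort > local_md5.txt
> grep -oE '[0-9a-f]{32}' dq2_put.log | sort > se_md5.txt
> diff local_md5.txt se_md5.txt && echo "All md5 sums match"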
Subscribing the dataset to BNL and CERN
To make a dataset subscription, you need to set up the full set of DQ2 client commands for registering subscriptions, erasing datasets, etc. To do so, you also need to source an additional file from CERN:
> source /afs/cern.ch/atlas/offline/external/GRID/ddm/pro03/dq2.[c]sh
Note: these command names are back to using hyphens ("-") and not underscores ("_").
If you are in the US, the fastest transfers are obtained by first subscribing the data to BNL from the grid SE and then subscribing the data from BNL to CERN:
> dq2-register-subscription -s [SE Name] [Dataset Name] BNLDISK
Dataset subscribed (archived: 0) to BNLDISK.
> dq2-register-subscription -w -s BNLDISK [Dataset Name] CERNCAF
Dataset subscribed (archived: 0) to CERNCAF.
Continuing the concrete example of the previous two sections, this is how the dataset used above was subscribed first to BNL and then to CERN.
> dq2-register-subscription -s MWT2_IU sherpa.006596.PtmissPlusLeptonsFilter_Z3bjetsmumu BNLDISK
Dataset sherpa.006596.PtmissPlusLeptonsFilter_Z3bjetsmumu subscribed (archived: 0) to BNLDISK.
> dq2-register-subscription -w -s BNLDISK sherpa.006596.PtmissPlusLeptonsFilter_Z3bjetsmumu CERNCAF
Dataset sherpa.006596.PtmissPlusLeptonsFilter_Z3bjetsmumu subscribed (archived: 0) to CERNCAF.
Appendix 1: Setting up the grid environment and DQ2 from BNL
NOT validated yet! Use at your own risk!
Note: Some command names use hyphens ("-") and some use underscores ("_").
Assuming that you have your grid certificate in a .globus directory in $HOME, do the following:
> source /afs/usatlas.bnl.gov/lcg/current/etc/profile.d/grid_env.[c]sh
> grid-proxy-init
Your identity: /DC=org/DC=doegrids/OU=People/CN=Frederick Luehring 621522
Enter GRID pass phrase for this identity:
< Your pass phrase is not echoed. >
Creating proxy .................................. Done
Your proxy is valid until: Fri Jan 18 04:14:07 2008
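As with the CERN setup, it is worth checking the proxy (identity and time left) before starting any transfers:
> grid-proxy-info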
To install the DQ2 client tools locally, follow the installation instructions at:
https://twiki.cern.ch/twiki/bin/view/Atlas/UsingDQ2#Installation
Please note that these instructions appear to be hardcoded to use a particular version of the client tools, so be careful! A local installation set up like this effectively takes its commands from BNL (to be confirmed). To use the local installation of the DQ2 client tools, enter:
> source /Local/Client/Tool/Directory/setup.[c]sh
To use DQ2 client from BNL (instead of a local installation) enter:
> source /afs/usatlas.bnl.gov/Grid/Don-Quijote/dq2_user_client/setup.[c,z]sh.any
To add the full set of DQ2 tools (from BNL), you must also source:
> source /afs/usatlas.bnl.gov/Grid/Don-Quijote/DQ2_0_3_client/dq2.csh
Appendix 2: Using Junichi Tanaka's scripts to Transfer the Files to Castor at CERN
This section records my initial attempts to use scripts written by Junichi Tanaka to transfer the data files to CERN. I do not recommend this method because of the slow transfer speed (at least if the data files must cross the Atlantic).
Following instructions from Junichi (attached as a file below), attempts were made to transfer the files directly to Castor at CERN from the IU Tier 2. Junichi's tools are available at:
/afs/cern.ch/user/j/jtanaka/public/HiggsWG_MC/tools/for_inputfile
To use them, copy the tools to a local working area on the server holding the files that are to be transferred to CERN. Note: The server must be running an AFS client and you must be able to use files in CERN's AFS area. To set this up do:
> mkdir /Working/Dir
> cd /Working/Dir
> cp /afs/cern.ch/user/j/jtanaka/public/HiggsWG_MC/tools/for_inputfile/* .
Before you can use Junichi's scripts, you must set up the CERN grid environment and obtain a valid grid proxy (see Getting Grid Credentials above). You can then use Junichi's routines to copy the files to CERN:
./CastorLevel_CopyInputFiles.sh DatasetName /Local/Directory/Containing/Data/Files
Note that the dataset name can either be an official dataset name (e.g. sherpa01011.006325.sherpaVBFH170wwll) or a user dataset name (e.g. user.FredLuehring.XXX.YYY). If you want an md5sum (and you should!) for each file, use Junichi's utility to do the calculation for each file in the dataset. Checking the md5sum and length for each file is a good way to ensure that the files in the dataset have not been corrupted by data transfer problems. To use Junichi's md5sum utility do:
./CastorLevel_MakeInputFileList.sh /Local/Directory/Containing/Data/Files > File.txt
The file File.txt will contain one line per file in the dataset with the filename, md5sum, and length in bytes. You can rerun the utility on the files on Castor to see whether the files were transferred correctly.
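A simple (hedged) way to make that comparison, assuming the two lists were saved under different names (the file names below are placeholders), is to sort and diff them; if the utility prints full paths rather than bare file names, strip the directory part first:
> sort File_local.txt > local_sorted.txt
> sort File_castor.txt > castor_sorted.txt
> diff local_sorted.txt castor_sorted.txt && echo "No differences found"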
Using these routines to transfer the files directly to CERN proved to be too slow to be practical with 4 datasets having a combined size > 1 TB. Testing showed that while transfers using the underlying grid routine (globus-url-copy) within North America achieved reasonable speeds between 100 Mbps and 1 Gbps, it was only possible to transfer to Castor at ~5 Mbps, which would have resulted in several hundred hours of data transfer time. The method works - it is just impractical without some sort of dedicated network link between North America and CERN. This led to the development of the method described above. In passing I note that Junichi Tanaka also stated that maintaining his transfer scripts was time consuming and that he would prefer to avoid this.
-- FrederickLuehring - 17 Jan 2008