Introduction
THIS IS A WORK IN PROGRESS!!!!!
This twiki page is intended to document a set of exercises performed using the
TagNavigatorTool (TNT) on the OSG site, tier2-06.uchicago.edu.
Workflow
TNT-OSG allows a user to:
- Select events of interest by running a query on the Tag database
- Specify an Athena job to be run on those selected events
- On the OSG grid, run the specified Athena job on the selected events. The job is split up to give one job for each file which contains those events (with an option to specify a minimum number of events to be processed per job - see below for details), with each sub-job then run on a different worker node.
- Any output files can then optionally be registered on the OSG grid and in a DQ2 dataset
- The output sandboxes from the jobs are returned to the user and the completed dataset can be registered at a site chosen by the user.
Installation and Running Instructions
Prerequisites
CVS
The TNT code, which is currently designed to work solely on LCG sites, is kept in the ATLAS CVS under
offline/Database/AthenaPOOL/POOLCollectionTools/tnt (browsable at http://isscvs.cern.ch/cgi-bin/viewcvs-all.cgi/offline/Database/AthenaPOOL/POOLCollectionTools/tnt/?cvsroot=atlas). This code is the starting point for the TNT-OSG extensions.
Installing TNT
- Download the file tnt.tar into the tnt working directory. At this point the working directory contained the files:
266240 Dec 13 10:17 tnt.tar
76615 Dec 13 10:17 CollSplitByGUID.exe
I then untarred tnt.tar (the unpacking command is shown after the listing below). The resultant files and directories were:
[gfg@tier2-06 tnt]$ pwd;ls -l
/local/workdir/tnt
total 536
authentication.xml
AUTHORS
CollSplitByGUID.exe
example
GenerateCatalogs.py
generateLCGJob.py
GuidExtractor.py
install_dq2_client.sh
LICENCE
monitorLCGJobs.py
notifyUser.py
pycurl.so
README
TNT.conf
TNT.py
tnt.tar
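For reference, a minimal sketch of the unpacking step that produces the listing above (assuming tnt.tar is already in the working directory):
cd /local/workdir/tnt
tar -xvf tnt.tar
ls -l    # should now show the files listed above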
- Copy the files TNT.conf and setup-tnt-env.sh into the working directory, renaming the copies to TNT.conf and setup-tnt-env.sh if they carry different names.
- Install a DQ2 client using the DQ2 install script. I chose to do this in a new sub-directory called dq2.
mkdir dq2
cd dq2
../install_dq2_client.sh
This downloads and installs the necessary components for the DQ2 client into the dq2 sub-directory of the tnt working directory. The version installed is 0.2.11.
- Set the LFC_HOST variable:
export LFC_HOST=tier2-05.uchicago.edu
This should be the LFC containing the data you require, and where your output data will be registered.
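As a quick sanity check that the catalogue answers, something like the following can be used (a sketch, assuming the LFC client tools such as lfc-ls are available from the OSG/LCG client installation; /grid/atlas/dq2 is the area referred to in the file-naming note below):
export LFC_HOST=tier2-05.uchicago.edu
lfc-ls /grid/atlas/dq2 | head    # list a few entries to confirm the LFC responds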
- Set your PATH and PYTHONPATH variables to include the sub-directory dq2. THESE LINES HAVE ALREADY BEEN INCLUDED IN THE FILE setup-tnt-env.sh.
#DQ2_CLIENT_TOOLS------------------------------------------------------
export DQ2_CLIENT_PATH="/local/workdir/tnt/dq2"
echo "Setting up paths to dq2_client tools from......${DQ2_CLIENT_PATH}"
# ADD PATHS for dq2_client
export PATH=${DQ2_CLIENT_PATH}:$PATH
export PYTHONPATH=${DQ2_CLIENT_PATH}:$PYTHONPATH
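A quick way to confirm that the paths took effect (illustration only):
echo $PATH | tr ':' '\n' | grep dq2                                  # dq2 directory should appear
python -c "import sys; print [p for p in sys.path if 'dq2' in p]"    # and be visible to Python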
- The setup file setup-tnt-env.sh is expected to be sourced each time the user logs into the system. This setup file executes the following:
echo "Setting up d-cache tools"
export ATLAS_REL="12.0.3"
echo "Setting up Atlas Release for Release ${ATLAS_REL}"
export OSG_LOCATION="/local/inst/pjs"
echo "Setting up OSG grid client tools from..${OSG_LOCATION}"
export DQ2_CLIENT_PATH="/local/workdir/tnt/dq2"
echo "Setting up paths to dq2_client tools from......${DQ2_CLIENT_PATH}"
export LFC_HOST='tier2-05.uchicago.edu'
echo "Defining the global variable LFC_HOST to be ${LFC_HOST}"
echo "Checking to see if you have a valid grid certificate"
If you do not have a valid grid certificate you will be prompted with:
->Use command "grid-proxy-init" to initialize your grid certificate.
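In practice, a login session before using TNT therefore looks something like the sketch below (grid-proxy-init and grid-proxy-info are the standard Globus proxy commands; only grid-proxy-init is mentioned by the setup script itself):
source ./setup-tnt-env.sh
grid-proxy-init    # only needed if the setup script reports no valid proxy
grid-proxy-info    # optional: shows the remaining lifetime of the proxy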
Running TNT
The main executable for TNT is the TNT.py script. It takes the following options:
Usage: TNT.py [-h | --help] [-v | --verbose] [-c | --conf configuration file] [-b | --background] [-a | --archive]
-h outputs the Usage message above
-v gives more detailed logging information while TNT is running
-c specifies a configuration file to use. The default is TNT.conf
-b runs TNT as a background process, returning the prompt to the user and writing all output to a log file. An email notification is sent when the job is complete.
-a causes TNT to archive certain data: the configuration file used, the event collection which resulted from the query, and the log file from the TNT run. These are put in a directory of the form archive/TNT-$$, where $$ is the PID associated with the job.
To run TNT, a user should modify the configuration file to suit their requirements and then simply run the script. Each of the parameters in the configuration file is described below. A 'blank' configuration file is provided in the working directory and an example in the example/ directory.
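As a concrete example, a typical background run using the options above might look like this (a sketch; whether TNT.py is directly executable or needs to be invoked as python TNT.py depends on the file permissions):
source ./setup-tnt-env.sh        # set up the environment first (see above)
./TNT.py -c TNT.conf -v -b -a    # verbose, run in background, archive the run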
The Configuration File
Parameters are included in the configuration file, TNT.conf, with the format
PARAMETER:= VALUE
It is important to include the ":=" - it acts as the separator when parsing the file (see the one-liner after the sample file below). A sample TNT.conf looks like this:
SRC_CONNECTION_STRING:= mysql://tagreader@tier2-06.uchicago.edu/tier2tagdb
MIN_EVENTS:=
QUERY:= NJet>0&&NLooseElectron>0
ATHENA_COMMAND:= athena -c "In=['myEvents']; CollType='ExplicitROOT'" EventCount.py
OUTPUT_FILES:=
GRID_TYPE:= OSG
INPUT_SANDBOX:= EventCount.py
OUTPUT_SANDBOX:=
REGISTER_OUTPUT:= NO
OUTPUT_DATASET_NAME:=
OUTPUT_DATASET_LOCATION:=
EMAIL_ADDRESS:= jerryg@anl.gov
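As noted above, ":=" is the separator; for example, the QUERY value in the sample file can be pulled out with a one-liner like this (illustration only - TNT.py does its own parsing internally):
awk -F':=' '/^QUERY/ {gsub(/^[ \t]+/, "", $2); print $2}' TNT.conf
# -> NJet>0&&NLooseElectron>0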
Each parameter is described in the following table:
| Parameter | Value | Default |
| SRC_COLLECTION_NAME | Tag collection you wish to query over. See here for a list of available collections. | testIdeal07_005711_TAG_v12000201 |
| SRC_COLLECTION_TYPE | POOL collection type for the Tag database. For anyone not at CERN, this should always be RelationalCollection. At CERN, one can also use MySQLltCollection if accessing the MySQL database. | MySQLltCollection |
| SRC_CONNECTION_STRING | Connection string for the Tag database. For Rome tags (which is all we can use right now) on the Oracle DB, this should be oracle://atlas_tags/atlas_tags_rome | mysql://tagreader@tier2-06.uchicago.edu/tier2tagdb |
| MIN_EVENTS | Minimum number of events to write into a sub-collection (for analysis by one of the sub-jobs) | null |
| QUERY | Query you wish to run on the Tags. See here for a list of query-able attributes. | NJet>0&&NLooseElectron>0 |
| ATHENA_COMMAND | The Athena command you want to run, exactly as it should appear on the command line. | athena -c "In=['myEvents']; CollType='ExplicitROOT'" EventCount.py |
| OUTPUT_FILES | The names of any output files from your job which you want registered in LCG and/or DQ2. If there is more than one file, names should be separated by spaces only. For information on file naming, see the note below. | null |
| GRID_TYPE | LCG, OSG, or NG | OSG |
| INPUT_SANDBOX | Any extra things you want in the input sandbox (the event list, file catalogues etc. get put in automatically). This includes jobOptions for your Athena job. | EventCount.py |
| OUTPUT_SANDBOX | Any extra things to return in your output sandbox. Only the standard output and error are returned by default. | null |
| REGISTER_OUTPUT | Whether or not to register the output files in a DQ2 dataset. Should be YES or NO. | NO |
| OUTPUT_DATASET_NAME | Name of the DQ2 dataset to create. Must not exist already. | null |
| OUTPUT_DATASET_LOCATION | Where to register the DQ2 dataset when it is completed. Must be one of the DQ2-recognised site names - see here for a list. | null |
| EMAIL_ADDRESS | If running in background mode, the address to which notification of job completion is mailed. | your_mailto_address |
A note on the output file naming convention
NOTE: THIS SECTION IS SPECIFIC TO AN LCG IMPLEMENTATION AND HAS NOT YET BEEN MODIFIED TO WORK FOR OSG
The name you give in the OUTPUT_FILES parameter is used as the basis for all the output file names for the grid jobs after they've been split. If there are N jobs, a suffix '._x' is inserted before the first '.' in the given filename, where x is an integer between 0 and (N-1); this gives one output file per sub-job. A new directory is created in the LFC under /grid/atlas/dq2 with a name corresponding to the first part of the OUTPUT_FILES name, and all the output files are placed in that directory.
This is repeated for every individual filename specified in the OUTPUT_FILES section of the configuration file.
Example:
Suppose you set OUTPUT_FILES as joe_bloggs_output.pool.root. Then a new directory /grid/atlas/dq2/joe_bloggs_output will be created in the LFC. If your tag query results in 10 AOD files being run over by Athena, giving 10 grid jobs, then you will end up with files joe_bloggs_output._0.pool.root to joe_bloggs_output._9.pool.root in that directory.
This was implemented in this way to suit the DQ2 file naming conventions, but comments / feedback are welcome.
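For illustration only (TNT performs this renaming internally), the transformation for three sub-jobs can be sketched in the shell as:
base=joe_bloggs_output.pool.root
for x in 0 1 2; do echo "${base%%.*}._${x}.${base#*.}"; done
# -> joe_bloggs_output._0.pool.root ... joe_bloggs_output._2.pool.root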
What happens at run-time
When TNT starts running, the following steps occur.
- First, the configuration file is parsed and the input variables stored.
- The given query is then run on the Tag database, using the connection parameters given in the configuration file. Any events which pass the query are written to a file in the working directory called myEvents.root. Any pre-existing file with that name is deleted.
- A list of the GUIDs of files which contain the events in myEvents.root is then extracted, and a POOL XML file catalogue generated which contains all the files. The physical filenames for these are extracted from the central LFC which was set with the LFC_HOST variable.
- The output collection, myEvents.root, is split into a number of sub-collections. There is by default one sub-collection for every file GUID contained in myEvents.root. One can also, however, set a minimum number of events per sub-collection using the MIN_EVENTS parameter in the configuration file. Then, when the events are being gathered into sub-collections, if there are fewer than MIN_EVENTS present for a particular file, these events will be grouped together with those from another file, and so on until the sub-collection contains at least MIN_EVENTS events.
- If the user has chosen to register output as a DQ2 dataset, a dataset with the selected name is created.
- The grid job executables and JDL files are generated and stored in the jobs/ directory. For each sub-collection generated in the previous step, there will be one grid job. Names are of the form gridJob_sub_collection_X.jdl and gridJob_sub_collection_X.sh, where X is an index over the sub-collections. Any files with these names already in the jobs/ directory are overwritten.
- The jobs are submitted to the grid. The job IDs are stored in a file in the working directory called jobIDfile-$$, where $$ is the script process ID, so you can check the status of the jobs at any time using edg-job-status -i jobIDfile-$$ (e.g. if you are getting impatient and want to know whether your jobs are running or not); see the sketch after this list.
- TNT polls the Resource Broker regularly until the jobs have all finished (either completed or failed). If a job fails because of some problem with the worker node (e.g. incorrect version of the ATLAS release, wrong python version etc), the job is resubmitted.
- As jobs finish, their output sandboxes are delivered to the output/ directory, where they can be examined at leisure.
- When all jobs have finished successfully and, if requested, the required output has been registered correctly in LCG and DQ2, the dataset is closed, frozen, and its location registered.
- If, however, the files could not all be written to the desired SE and some were instead written to the default SE, the dataset cannot be frozen. The user then needs to manually move the file(s) to the desired SE, close the DQ2 dataset and register its location.
- In the event that some output files were not registered at all, the job is considered a failure. The user should examine the output and resubmit the job manually if necessary.
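To follow a submission by hand, the pieces mentioned in the steps above can be inspected directly (a sketch; <PID> stands for the script process ID used in the jobIDfile name, and edg-job-status applies to the LCG-style submission path - use the equivalent OSG client status command on a pure OSG submission):
ls jobs/                             # one .jdl and .sh file per sub-collection
edg-job-status -i jobIDfile-<PID>    # check job status at any time
ls output/                           # output sandboxes appear here as jobs finish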
Exercises
We will run the same simple exercise that we previously ran manually as described in DssPrototype. The first thing we need to do is set up the correct environment for TNT by executing the script setup-tnt-env.sh.
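In practice this is just (paths as used earlier on this page):
cd /local/workdir/tnt
source ./setup-tnt-env.sh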
When TNT starts running, the following steps occur.
- First, the configuration file is parsed and the input variables stored.
- The configuration file used was TNT.conf.
- The given query is then run on the Tag database, using the connection parameters given in the configuration file. Any events which pass the query are written to a file in the working directory called myEvents.root. Any pre-existing file with that name is deleted.
- The attached TNT.conf file has the QUERY attribute set to NJet>0&&NLooseElectron>0. This will generate a TAG collection of 4014 events, as seen in the exercises in the DSSPrototype section. TNT will by default generate multiple sub-collections, with each job containing a minimum of MIN_EVENTS events. If we use the attached TNT.conf file with MIN_EVENTS not set, TNT will generate multiple sub-collections with each collection containing a minimum of 100 events; this would result in the creation of 4014/100 = ~40 jobs. If we set MIN_EVENTS to 4000, TNT would generate 1 sub-collection of 4014 events. If we set MIN_EVENTS to 1000, TNT would generate 3 sub-collections. To determine roughly how many sub-collections will be created by TNT, just divide the total number of events in the TAG collection by the attribute MIN_EVENTS. For now, let's set the value of MIN_EVENTS in TNT.conf to 1000, as shown in the snippet below; this will result in the creation of 3 sub-collections.
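Concretely, the only line that needs to change in TNT.conf for this exercise is (in the PARAMETER:= VALUE format described above):
MIN_EVENTS:= 1000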
- Executing TNT and aborting it after 10-20 seconds, we see the first few lines of output:
/local/workdir/tnt/dq2/lfc.py:4: RuntimeWarning: Python C API version mismatch for module _lfc: This Python has API version 1012, module _lfc has version 1011.
import _lfc
**** Welcome to TNT! ****
Using TNT.conf as configuration file
Job name is gfg-18094
Executing query 'NJet>0&&NLooseElectron>0' on tag database...
CollCopy.exe -src testIdeal07_005711_TAG_v12000201 MySQLltCollection -srcconnect mysql://tagreader@tier2-06.uchicago.edu/tier2tagdb -dst myEvents RootCollection -queryopt 'SELECT RunNumber, EventNumber' -query "NJet>0&&NLooseElectron>0"
CollCopy: Finished copying input collection(s) `testIdeal07_005711_TAG_v12000201:MySQLltCollection' to output collection(s) `myEvents:RootCollection'
./CollSplitByGUID.exe -src myEvents RootCollection -minevents 1000
Minimum number of events: 1000
We also see that the following three sub-collections have been created:
16610 Dec 18 14:36 sub_collection_1.root
16876 Dec 18 14:36 sub_collection_2.root
26009 Dec 18 14:36 sub_collection_3.root
Using root to tell us how many events are in each sub-collection (one way to do this is sketched after the counts below), we see that:
sub_collection_1.root contains 1030 events
sub_collection_2.root contains 1059 events
sub_collection_3.root contains 1925 events
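One way to obtain these counts is sketched here (the tree name CollectionTree inside a POOL ROOT collection is an assumption - run .ls inside root first if the name differs in your files):
root -b -l sub_collection_1.root <<'EOF'
// count the TAG entries in this sub-collection (CollectionTree name assumed)
CollectionTree->GetEntries()
.q
EOF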
- A list of the GUIDs of files which contain the events in myEvents.root is then extracted, and a POOL XML file catalogue generated which contains all the files. The physical filenames for these are extracted from the central LFC which was set with the LFC_HOST variable.
- The dataset which contained all of the AOD and ESD data for this exercise was testIdeal_07.005711.PythiaB_bbmu6mu4X.recon.AOD.v12000201. If we query the DQ2 dataset browser for this dataset we see that:
Registered locations for the latest dataset version
Complete replicas: None
Incomplete replicas: BNLPANDA UC_VOB
OSG sub-datasets, with modification dates:
testIdeal_07.005711.PythiaB_bbmu6mu4X.recon.AOD.v12000201_tid002968 2006-10-05 08:03:07
testIdeal_07.005711.PythiaB_bbmu6mu4X.recon.AOD.v12000201_tid002925 2006-10-02 13:35:45
- The sub-dataset testIdeal*_tid002968 is the one we are interested in, since it was registered locally at the UC_VOB server. If, however, we look at the DQ2 dataset browser for this sub-dataset, we see that the main catalog at CERN only sees this sub-dataset as INCOMPLETE and existing only at BNLPANDA and CERNCAF.
Registered locations for the latest dataset version
Complete replicas: None
Incomplete replicas: BNLPANDA CERNCAF
This causes a problem for TNT, since TNT expects to be able to find all of the GUIDs associated with the dataset testIdeal_07.005711.PythiaB_bbmu6mu4X.recon.AOD.v12000201_tid002968 at some complete replica. That is not true for the dataset we are interested in. The following line in the file GuidExtractor.py was modified to look specifically for INCOMPLETE replicas:
site_map = dq.locationClient.queryDatasetLocations(vuids, dataset_names, LocationState.INCOMPLETE)
--
JerryGieraltowski - 14 Dec 2006