SkimExtractor

Developers

Marco, Jerry

Inputs

  • all global environment variables defined in the DSS configuration file (see skimSelector)
  • a collection usable by the extractor (describing the files and the events included in the skim)

Outputs

A set of files. These files can be grouped into a dataset. There are several options for the content of these files:
  1. output contents = input contents
  2. output contents = selection from canned set of options, e.g. ESD-Calo, AOD-Electron, ...
  3. output contents = user defined
The first prototype should implement only the first of these. The second requires some definition and storage of the options that make most sense to users, i.e. we don't know them yet. The third is simple to implement but dangerous.
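For illustration only, the choice among these options could be expressed as a single variable in the DSS configuration file; DSS_OUTPUT_CONTENTS and its values are assumed names and are not part of the current configuration.

# Hypothetical sketch of an output-contents switch in the DSS configuration file.
DSS_OUTPUT_CONTENTS="input"                 # option 1: output contents = input contents
# DSS_OUTPUT_CONTENTS="ESD-Calo"            # option 2: a canned selection
# DSS_OUTPUT_CONTENTS="user:MyItemList.py"  # option 3: user defined (hypothetical file name)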

Some information about the skimming to be added to the bookkeeping (SkimPublisher)

Task Description

Extracts events from an AOD (or any other sample) using the TAG input specified by SkimSelector. This consists of retrieving the necessary input datasets (e.g. using DQ2 and the SkimUtils), running Athena with the input and the jobOptions provided by SkimSelector, and making the skimmed files/dataset available in a staging area.

There may be caching of the produced skims.

The skimming jobs could run:
  • locally
  • in a local queue (Condor, PBS, ...)
  • on the Grid (e.g. using Pathena)

In the initial run this will consist of running Athena on the prototype machine once the inputs are provided.
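As illustration only, the dispatch among these run modes could be a simple case statement in the extractor script; DSS_RUNMODE, jobfile, input_dataset and output_dataset are assumed names, and the athena.py/pathena invocations are simplified sketches rather than the commands used by the current implementation.

# Hypothetical sketch: dispatch the skimming job according to an assumed
# DSS_RUNMODE setting (not part of the current DSS configuration file).
case "${DSS_RUNMODE:-local}" in
  local) athena.py EventExtractor.py ;;       # run directly on the prototype machine
  queue) condor_submit "${jobfile}" ;;        # local Condor queue (see the submit template below)
  grid)  pathena --inDS "${input_dataset}" --outDS "${output_dataset}" EventExtractor.py ;;  # Grid submission via Pathena
esac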

The skim extractor software (currently gridjobsubmit.sh) allows a user to:
  • Select events of interest by running a query on a Tag database
  • Specify an Athena job to be run on those selected events, allowing the user to specify the number of events to skip before processing and the number of events to process
  • On the OSG grid or on the local DSS server, run the specified Athena job on the selected events. The job can be split so that one sub-job is submitted for each sub_collection which contains those events (with an option to specify a minimum number of events to be processed per job - see below for details), with each sub-job then run on a different worker node.
  • Any output files can then be registered on the OSG grid and in a DQ2 dataset (registration is optional)
  • The output sandboxes from the jobs are returned to the user and the completed dataset can be registered at a site chosen by the user.

External Dependencies

  • Local replica catalog
  • the executable script createPoolFileCatalog.sh

Current Implementation

The current workflow of the script skimExtractor.sh is as follows:
  • Split the input collection into sub-collections, on GUID boundaries, but with at least "DSS_minevents" events per collection
    • Similar to the strategy taken in the TNT software, the user is given the ability to request that multiple sub_collections be created from the original input collection. The sub_collections will be split on GUID boundaries with a minimum of "DSS_minevents" events per sub_collection. The intention is to then create and submit a separate grid/local job for each sub_collection. This would result in each job being run on a different worker node on the compute site (grid or local). The higher the value of "DSS_minevents", the fewer sub_collections will be created. Typically, DSS_minevents will be set to 500 or 1000. Values lower than that would probably cause more job execution queuing on the grid/local site than is necessary.
    • Before we perform a split of the collection on GUID boundaries, we should ensure that the minimum number of events in each sub_collection is greater than the number of events to skip. The ability to specify the number of events to skip and the number of events to process is an enhancement over the TNT software. If the number of events to skip before processing is greater than the minimum number of events in a sub_collection, we reset the minimum number of events per sub_collection to the number to skip plus 500. This is done so that we don't submit a job which would result in no events being processed.
    • The splitting of the input collection on GUID boundaries is performed using the utility CollSplitByGuid.exe. This utility is included in the distributed DSS software since it has not yet been incorporated into the distributed ATLAS Releases.
  • Determine how many sub_collections to actually use. Splitting the input collection into multiple sub_collections may (and probably will) result in more sub_collections than we need to process. The total number of events we actually need is the sum of the number of events to skip and the number of events to process. We already know the minimum number of events in each sub_collection, so we can easily determine how many sub_collections we need to actually use. Of course, if the number of events needed is greater than the number of events in all of the sub_collections, then we will need to use all of the sub_collections. (A shell sketch of this arithmetic, and of the skip-count adjustment above, is given after this workflow list.)
  • Cycle through each of the sub_collections to create and execute a grid/local job to process the events in each sub_collection.
    • For each sub_collection, determine the number of events to skip and the number of events to process.
      • If this is the first sub_collection, set the number of events to skip to be the value set from the DSS configuration file.
      • If this is not the first sub_collection, set the number of events to skip to zero.
      • The "number of events to actually process in a sub_collection" is based on the values of the "number of events to process" as defined in the DSS configuration file and the "minimum number of events per sub_collection". This value will vary per sub_collection.
    • For each sub_collection a separate sub-directory "sub_collection_N" is created and the corresponding "sub_collection_N.root" file is placed in that sub-directory.
    • The utility CollListFileGUID.exe is used to generate a list of all of the GUIDs identified in the sub_collection.
    • We know that the PFNs of all the identified AOD files are stored in a local replica catalog identified in the DSS configuration file as "DQ2_CATALOG". For the current implementation we know that the PFNs are stored using "dcache". A PoolFileCatalog.xml file is generated for each sub_collection using the executable script createPoolFileCatalog.sh, which uses the following "curl" command to generate input for the contents of the PoolFileCatalog.xml file:
curl --max-time 10 "${DQ2_CATALOG}/PoolFileCatalog/?guids=${GUIDSTRING}"
where GUIDSTRING is built by concatenating at least ten GUID names.
The resultant PoolFileCatalog.xml file is then edited to replace the target URL for each PFN with the string "dcache:". (A sketch of this step is given after this workflow list.)
    • Create a Condor job submit file template specific for this job. The name of this file is defined as "${collection_name}_${num_skip}_${num_events_to_process}.job". Globally defined variables are used to populate this job file as follows:
######################################################################
# Submit file template, from GriPhyN submit file
######################################################################
universe = ${DSS_UNIVERSE}
globusscheduler = ${DSS_TARGET}
globusrsl = (jobtype=single)(minMemory=640)
stream_output = false
stream_error  = false
transfer_output = true
transfer_error = true
output = ${collection_dir}/events_${num_skip}_${num_events_to_process}.stdout
error = ${collection_dir}/events_${num_skip}_${num_events_to_process}.stderr
log = ${collection_dir}/events_${num_skip}_${num_events_to_process}.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_output_files = jobinfo_${num_skip}_${num_events_to_process}.out,jobinfo_${num_skip}_${num_events_to_process}merge.out,ExtractedEvents.AOD.root
transfer_input_files = ${jobdir}/${DSS_PILOT},${jobdir}/${DSSconf},${jobdir}/EventExtractor.py,${collection_dir}/${collection_filename},${collection_dir}/PoolFileCatalog.xml
executable = ${jobdir}/${jobexec}
copy_to_spool = true
transfer_executable = true
environment = GTAG=JerryG;QUIET_ASSERT=i;
arguments = ${collection_name} ${num_skip} ${num_events_to_process}
notification = NEVER
periodic_release = (NumSystemHolds <= 3)
periodic_remove = (NumSystemHolds > 3) || (RemoteWallClockTime > 3600*24*3)
#Initialdir  = /scratch
#remote_initialdir = /share/tmp
submit_event_user_notes = pool:${DSS_TARGET_NAME}
+panda_pilotid = "pjobdss"
+panda_jobschedulerid = "PJS_123dss"

      • NOTES on the condor submit job:
        1. Output, error, and log files are defined to be returned to the sub_collection job directory
        2. Two jobinfo files are specifically created as output of the job on the worker node and returned to the sub_collection job directory.
        3. Input files: DSS_PILOT, DSSconf, EventExtractor.py, the sub_collection file, and PoolFileCatalog.xml are transferred by Condor to the working directory on the worker node.
        4. The executable "jobexec" (by default, defined as dssmaster.sh) is transferred by Condor to the worker node and is executed. This file (dssmaster.sh) does nothing more than call dsspilot.sh, passing to it all of its input arguments. This is done to alleviate the often-seen problem of Condor reusing an older version of the executable than the one desired by the user. The arguments to dssmaster.sh are "${collection_name} ${num_skip} ${num_events_to_process}"
    • Submit the grid/local job to Condor, saving the Condor job number for that submission. (A sketch of the dssmaster.sh wrapper and this submission step is given after this workflow list.)
    • Loop to the next sub_collection
  • After all sub_collection jobs have been submitted, report back to the skimSelector the number of sub_collection jobs submitted to Condor and the Condor job number for each sub_collection job.
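A minimal shell sketch of the sub_collection arithmetic described above. Only DSS_minevents, the number of events to skip, and the number of events to process come from the DSS configuration file; the CollSplitByGuid.exe options shown are assumptions and are not taken from the current script.

# Sketch of the splitting arithmetic (assumptions noted in the comments).
# num_skip and num_events_to_process come from the DSS configuration file.
if [ "${num_skip}" -gt "${DSS_minevents}" ]; then
    # Avoid sub-jobs that would skip past all of their events.
    DSS_minevents=$(( num_skip + 500 ))
fi
# Split on GUID boundaries (the option names here are assumptions).
CollSplitByGuid.exe -src "${collection_name}" RootCollection -minevents "${DSS_minevents}"
# Count the sub_collection_N.root files produced by the split.
num_subcollections=$(ls sub_collection_*.root 2>/dev/null | wc -l)
# Sub_collections actually needed: ceiling of (skip + process) / minimum
# events per sub_collection, capped at the number produced by the split.
total_needed=$(( num_skip + num_events_to_process ))
needed=$(( (total_needed + DSS_minevents - 1) / DSS_minevents ))
if [ "${needed}" -gt "${num_subcollections}" ]; then
    needed=${num_subcollections}
fi
# In the cycle over sub_collections, only the first one uses the configured
# skip count; the skip count is reset to zero for all later sub_collections.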
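A sketch of the per-sub_collection PoolFileCatalog.xml generation described above. The curl call is taken from createPoolFileCatalog.sh; GUIDSTRING is assumed to have been built already from the CollListFileGUID.exe output, and the sed expression is an assumption about how the "dcache:" replacement is applied.

# Sketch: build PoolFileCatalog.xml for one sub_collection.
# GUIDSTRING is the concatenation of the GUIDs listed by CollListFileGUID.exe
# for this sub_collection (built by the caller).
curl --max-time 10 "${DQ2_CATALOG}/PoolFileCatalog/?guids=${GUIDSTRING}" \
    > "${collection_dir}/PoolFileCatalog.xml"
# Point every PFN at dCache; the exact substitution pattern is an assumption
# about the PFN format returned by the catalog.
sed -i 's#name="[a-z]*:#name="dcache:#g' "${collection_dir}/PoolFileCatalog.xml"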
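Finally, a minimal sketch of the dssmaster.sh wrapper described in note 4 above and of the submission step; forwarding to dsspilot.sh is taken from the description above, while the parsing of the condor_submit output is an assumption.

#!/bin/sh
# dssmaster.sh (sketch): forward all arguments to dsspilot.sh so that Condor
# always transfers and runs a fresh copy of the real pilot script.
exec ./dsspilot.sh "$@"

# Submission step (sketch): submit the per-sub_collection job file and record
# the Condor cluster number reported by condor_submit.
jobfile="${collection_name}_${num_skip}_${num_events_to_process}.job"
cluster_id=$(condor_submit "${jobfile}" | awk '/submitted to cluster/ {sub(/\.$/, "", $NF); print $NF}')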

Questions/Comments

-- JerryGieraltowski - 24 May 2007 -- RobGardner - 06 Nov 2006