Dataset Skimming Service

The TWiki page is http://twiki.mwt2.org/bin/view/DataServices/WebHome

There are some files in CVS http://atlas-sw.cern.ch/cgi-bin/viewcvs-atlas.cgi/offline/PhysicsAnalysis/EventTag/DataSkimming/

For instructions check: http://twiki.mwt2.org/bin/view/DataServices/DssCVS

There is a prototype Web GUI: http://tier2-06.uchicago.edu:8800/dss/

Currently there are two branches of DSS.

One provides only a CLI interface. It is made of a series of shell and Python scripts, is the current content of the CVS repository, and is described in these twiki pages: http://twiki.mwt2.org/bin/view/DataServices/SkimSelector and http://twiki.mwt2.org/bin/view/DataServices/SkimExtractor It relies on the following components:
  • an existing TAG database (the TAG information of a dataset has been extracted and stored in a MySQL table)
  • an LRC catalog
  • grid tools
  • the possibility to submit Condor jobs (to a queue or to the Grid using Condor-G)
  • ATLAS releases (on the nodes performing the operations)

It allows the execution of the following use cases:
  • manual skim workflow (selection)
  • skim selection (the events identified by a query are recorded in a POOL collection, a ROOT file); see the sketch below
  • skim extraction, triggered by the selection (the events are extracted to output AOD files)
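
As an illustration of the selection step, the sketch below runs a user-supplied cut against a MySQL TAG table and collects the matching event references. The table and column names (tag_dataset, RunNumber, EventNumber, MissingET), the credentials and the select_events helper are hypothetical, not the real DSS schema; writing the resulting POOL collection is only indicated by a comment, since that goes through the ATLAS collection utilities.

  # Hedged sketch of the skim-selection step: query a MySQL TAG table and
  # collect references to the selected events.  Table, column names and
  # the cut are illustrative examples, not the real DSS schema.
  import MySQLdb

  def select_events(host, user, passwd, db, cut="MissingET > 50000"):
      """Return (run, event) pairs for the events passing the cut."""
      conn = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db)
      try:
          cur = conn.cursor()
          # One row per event in the TAG table, plus the attributes used
          # for selection; the cut string comes from the skim definition.
          cur.execute("SELECT RunNumber, EventNumber FROM tag_dataset WHERE " + cut)
          events = cur.fetchall()
      finally:
          conn.close()
      # In DSS the selected events are then written to a POOL event
      # collection (a ROOT file); that step is omitted here.
      return events

  if __name__ == "__main__":
      for run, event in select_events("localhost", "reader", "secret", "tagdb"):
          print("run %s event %s" % (run, event))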

The other branch uses a Web GUI based on the Django framework. It is currently installed on the prototype cluster as a prototype server on tier2-06 (http://tier2-06.uchicago.edu:8800/dss/), and not all of its pages are working yet. It requires:
  • an existing TAG database (the TAG information of a dataset has been extracted and stored in a MySQL table)
  • an LRC catalog or a PoolFileCatalog.xml
  • ATLAS releases (on the server node)
  • Python >= 2.3 (on the server node)
  • Python libraries (the MySQL DB API module for the TAG DB, Django 0.96) (on the server node)

It provides pages describing each of the TAG databases. It also provides pages to browse the existing skims, to define new ones, to trigger the different actions on a skim (selection, triggered automatically; extraction; registration; publication, where registration and publication are currently only stubs) and to display the results.

It allows the definition of a skim as a query to a specific TAG database and produces an event collection (the events identified by the query are recorded in a POOL collection, a ROOT file). Alternatively the user can upload their own event collection.
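
To make the skim bookkeeping concrete, here is a minimal sketch of what a Django model for a skim record could look like. The field names and status values are assumptions, not the actual DSS schema; the syntax follows Django 0.96 (maxlength, __str__), the version the server uses.

  # Hypothetical sketch of a skim record as a Django 0.96 model; the real
  # DSS schema is not documented on this page.
  from django.db import models

  SKIM_STATUSES = (
      ('defined',   'Defined (query recorded)'),
      ('selected',  'Selection done (event collection ready)'),
      ('extracted', 'Extraction done (output AODs ready)'),
      ('published', 'Registered/published'),
  )

  class Skim(models.Model):
      name    = models.CharField(maxlength=100)
      tag_db  = models.CharField(maxlength=100)            # TAG database the query runs on
      query   = models.TextField()                         # selection cut
      status  = models.CharField(maxlength=20, choices=SKIM_STATUSES)
      created = models.DateTimeField(auto_now_add=True)

      def __str__(self):
          return self.name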

The extraction is performed locally on the server.

All the information is recorded in a DB (currently file based): this provides persistency, reliability and provenance information.
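
The page does not say which file-based backend is used; assuming SQLite, which Django supports out of the box, the relevant settings fragment (Django 0.96 style) would look roughly like the following, with the path being a placeholder.

  # Hypothetical Django 0.96 settings fragment for a file-based backend;
  # the actual DSS configuration may differ.
  DATABASE_ENGINE   = 'sqlite3'              # file-based backend
  DATABASE_NAME     = '/opt/dss/var/dss.db'  # placeholder path to the database file
  DATABASE_USER     = ''                     # unused for sqlite
  DATABASE_PASSWORD = ''                     # unused for sqlite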

Plans

As of Wed, 13 Jun 2007

short (end of the week)

  • have a working example of both branches (something that a user can test following the provided instructions)
  • partially polish the web interface
  • organize the repository (CVS) structure and upload the server files

medium (end Aug)

  • provide a service to create and download POOL collections with only 'local' files (local to the server, to a specific CE, to CERN, or to a Grid), possibly including also the PoolFileCatalog (PFNs and metadata) ## .5-1 week
    • WL in OSG (LRC query)
    • WL in LCG/ATLAS_WW
    • support in the schema (new table? SE info to know what's local?)
    • improve the download option for the event collection
  • provide a distribution/packaging of the server-framework ## 1+2 days
    • use setup.py: package the application, try packaging Django and DB info (a minimal setup.py sketch follows this list)
  • add support for the new TAG DB: ## .5 week
    • Oracle 10g: driver, etc
    • remote connection to CERN
    • new schema
  • integrate the 2 branches (through incremental improvements of the server framework): ## 1.5 weeks
    • job WL (invocation, status polling, notification)
    • job WL improvements (retries, error identification, recovery)
    • support for job data (in DB in directories)
    • review skim schema
    • improve job support (dir purging, managing scripts, ...)
    • split the script, port to python
  • provide external (usable by others) unit testing ## .5 weeks
    • unit test development
    • full chain test with a decent-scale dataset (reserve 1 day to define/find data)
  • can Jerry help/review the scripts remaining in the framework? ## 1 day ANL
  • improve data movement ## .5 day
    • integrate DMU (Jerry)
    • data registration in SE, change schema/views (LRC)
  • include tag generation ## .5 weeks
    • review WL tag generation
    • schema/view changes
    • ? import/export tag information (DB or root files), load TAG files in DB
  • improve TAG DB pages (composer): ## .5 - 1.5 weeks (depending on plot generation)
    • new template and schema: add links to available DB descriptions (learn how to add links in a combo box with auto completion)
    • integrate with interactive composer (at least provide customized instructions for the DB)
    • ? add some plots
      • plot WL (generate the plots with root, at DB update... R&D)
      • integrate the plots in the schema
      • new template (if they are there, display them)
  • DS definition and registration ## 1 week
    • check ATLAS rules (naming convention, what is DS, where are records of it, IDs)
    • WL registration
    • WL browsing
    • schema update and pages with DS information/registration
  • improve local job submission (after remote jobs are supported) ## 1 day
    • multiple local jobs
    • control and feedback during execution
  • scalability improvements (some more info after the unit test is working) ## 1 week?
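
For the packaging item in the medium-term list above, a minimal distutils setup.py could look like the sketch below; the package and script names are placeholders, not the real repository layout.

  # Minimal distutils packaging sketch for the server framework; package
  # and script names are placeholders.
  from distutils.core import setup

  setup(
      name='dss',
      version='0.1',
      description='Dataset Skimming Service server framework',
      packages=['dss', 'dss.skims'],          # hypothetical package layout
      scripts=['scripts/dss-select.py',       # hypothetical CLI entry points
               'scripts/dss-extract.py'],
  )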

long

  • panda integration R&D
    • job submission through panda
    • DS available for panda
  • support root files for TAG info
  • verify ATLAS version compatibility
  • unit test from AOD to skim validation

One day a week is reserved for documentation and ordinary testing.

-- MarcoMambelli - 09 Aug 2007