Dataset Skimming Service
The TWiki page is:
http://twiki.mwt2.org/bin/view/DataServices/WebHome
Some files are in CVS:
http://atlas-sw.cern.ch/cgi-bin/viewcvs-atlas.cgi/offline/PhysicsAnalysis/EventTag/DataSkimming/
For instructions check:
http://twiki.mwt2.org/bin/view/DataServices/DssCVS
There is a prototype Web GUI:
http://tier2-06.uchicago.edu:8800/dss/
Currently there are 2 branches of DSS.
One provides only a CLI. It consists of a series of shell and Python scripts, is the current content of the CVS repository,
and is described in these TWiki pages:
http://twiki.mwt2.org/bin/view/DataServices/SkimSelector
http://twiki.mwt2.org/bin/view/DataServices/SkimExtractor
This relies on the following components:
- an existing TAG database (the TAG information of a dataset has been extracted and stored in a MySQL table)
- an LRC catalog
- grid tools
- the possibility to submit Condor jobs (to a queue or to the Grid using Condor-G)
- ATLAS releases (on the nodes performing the operations)
It supports the following use cases:
- manual skim workflow (selection)
- skim selection (the events identified by a query are recorded in a POOL collection, a ROOT file)
- skim extraction, triggered by the selection (the events are extracted to output AOD files)
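The selection step can be illustrated with a minimal sketch. It assumes a TAG table with hypothetical columns (run_number, event_number, n_electrons, missing_et) and uses an in-memory SQLite database as a stand-in for the MySQL TAG database, so that it runs anywhere. The table name, column names and function names are illustrative only and are not taken from the DSS scripts. The sketch returns the selected (run, event) pairs and does not write a POOL collection.

```python
import sqlite3


def select_events(conn, table, where_clause, params=()):
    """Return the (run, event) pairs of a TAG table that match a selection.

    The table name and the WHERE clause are interpolated into the SQL text,
    so in this sketch they are assumed to come from a trusted skim
    definition. The cut values are passed separately as bound parameters.
    """
    sql = (
        "SELECT run_number, event_number FROM %s WHERE %s "
        "ORDER BY run_number, event_number" % (table, where_clause)
    )
    return conn.execute(sql, params).fetchall()


def demo():
    # Hypothetical TAG table; a real TAG schema has many more attributes.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tag_demo ("
        "run_number INTEGER, event_number INTEGER, "
        "n_electrons INTEGER, missing_et REAL)"
    )
    conn.executemany(
        "INSERT INTO tag_demo VALUES (?, ?, ?, ?)",
        [
            (5001, 1, 0, 12.5),
            (5001, 2, 2, 48.0),
            (5001, 3, 1, 75.2),
            (5002, 1, 2, 20.1),
        ],
    )
    # Select events with at least two electrons and missing ET above 30.
    selected = select_events(
        conn, "tag_demo", "n_electrons >= ? AND missing_et > ?", (2, 30.0)
    )
    conn.close()
    return selected
```

In the real service the selected events would then be written to an event collection and passed to the extraction step. Here the list of pairs only shows what a selection query identifies.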
The other uses a Web GUI and the Django framework.
It is a prototype server, currently installed on tier2-06 in the prototype cluster, and not all of its pages work yet:
http://tier2-06.uchicago.edu:8800/dss/
Requires:
- an existing TAG database (the TAG information of a dataset has been extracted and stored in a MySQL table)
- an LRC catalog or a PoolFileCatalog.xml
- ATLAS releases (on the server node)
- Python >= 2.3 (on the server node)
- Python libraries (a DB API driver for the TAG DB, currently MySQL, and Django 0.96) (on the server node)
It provides pages describing each of the TAG databases.
It also provides pages to browse the existing skims, to define new ones, to trigger actions on a skim (selection, which is triggered
automatically; extraction; registration; publication) and to display the results. Registration and publication are currently only stubs.
A Skim is defined as a query to a specific TAG database and produces an event collection (the events identified by
the query are recorded in a POOL collection, a ROOT file).
Alternatively, users can upload their own event collection.
The extraction is performed locally on the server.
All the information is recorded in a DB (currently file-based): this provides persistency, reliability and provenance information.
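As an illustration of how recording every action gives provenance, here is a minimal sketch of a skim record with an action history, persisted to a file. The field names, the JSON format and the function names are assumptions made for this example. The actual server stores its data through its own DB schema, which is not described here.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

# Actions that can be triggered on a skim, as listed in the text above.
ACTIONS = ("selection", "extraction", "registration", "publication")


def new_skim(name, tag_db, query):
    """Create a skim record: a query against one TAG database."""
    return {"name": name, "tag_db": tag_db, "query": query, "history": []}


def record_action(skim, action, outcome):
    """Append one action with its outcome and a UTC timestamp."""
    if action not in ACTIONS:
        raise ValueError("unknown action: %s" % action)
    skim["history"].append(
        {
            "action": action,
            "outcome": outcome,
            "time": datetime.now(timezone.utc).isoformat(),
        }
    )


def save(skim, path):
    """Write the record to a file so that it survives a server restart."""
    with open(path, "w") as handle:
        json.dump(skim, handle, indent=2)


def load(path):
    """Read a record back from its file."""
    with open(path) as handle:
        return json.load(handle)


def demo():
    # Hypothetical skim: the names and the query are illustrative only.
    skim = new_skim("demo_skim", "tag_demo", "n_electrons >= 2")
    record_action(skim, "selection", "ok")
    record_action(skim, "extraction", "ok")
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "demo_skim.json")
        save(skim, path)
        return load(path)
```

Because the history is append-only, the stored record shows which query produced a skim and which actions were run on it, in order.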
Plans
As of Wed, 13 Jun 2007
short (end of the week)
- have a working example of both branches (something that a user can test by following the provided instructions)
- partially polish the web interface
- organize the repository (CVS) structure and upload the server files
medium (end Aug)
- provide a service to create and download pool collections with only 'local' files (local to the server, to a specific CE, to CERN,
to a Grid), possibly also including the PoolFileCatalog (PFNs and metadata) ## .5-1 week
- WL in OSG (LRC query)
- WL in LCG/ATLAS_WW
- support in the schema (new table? SE info to know what's local?)
- improve the download option for the event collection
- provide a distribution/packaging of the server-framework ## 1+2 days
- use setup.py: package the application, try packaging django and DB info
- add support for the new TAG DB: ## .5 week
- Oracle 10g: driver, etc
- remote connection to CERN
- new schema
- integrate the 2 branches (through incremental improvements of the server framework): ## 1.5 weeks
- job WL (invocation, status polling, notification)
- job WL improvements (retries, error identification, recovery)
- support for job data (in DB in directories)
- review skim schema
- improve job support (dir purging, managing scripts...)
- split the script, port to python
- provide external (usable by others) unit testing ## .5 weeks
- unit test development
- full chain test with a decent-scale dataset (reserve 1 day to define/find data)
- can Jerry help/review the scripts remaining in the framework? ## 1 day ANL
- improve data movement ## .5 day
- integrate DMU (Jerry)
- data registration in SE, change schema/views (LRC)
- include tag generation ## .5 weeks
- review WL tag generation
- schema/view changes
- ? import/export tag information (DB or root files), load TAG files in DB
- improve TAG DB pages (composer): ## .5 - 1.5 weeks (depending on plot generation)
- new template, schema add links to available DB descriptions (learn how to add links in a combo box with auto completion)
- integrate with interactive composer (at least provide customized instructions for the DB)
- ? add some plots
- plot WL (generate the plots with root, at DB update... R&D)
- integrate the plots in the schema
- new template (if they are there, display them)
- DS definition and registration ## 1 week
- check ATLAS rules (naming convention, what is DS, where are records of it, IDs)
- WL registration
- WL browsing
- schema update and pages with DS information/registration
- improve local job submission (after remote jobs are supported) ## 1 day
- multiple local jobs
- control and feedback during execution
- scalability improvements (some more info after the unit test is working) ## 1 week?
long
- panda integration R&D
- job submission through panda
- DS available for panda
- support root files for TAG info
- verify ATLAS version compatibility
- unit test from AOD to skim validation
1 day a week is reserved for documentation and ordinary testing
--
MarcoMambelli - 09 Aug 2007