This is a REPLICA of the internal CheckCE to allow public access to this page.
  • do not modify this page
  • if you have access to it, you may check the original CheckCE to see if there are updates

Troubleshooting a CE

Assumptions

This troubleshooting page assumes that OSG is installed correctly and that the required ATLAS software is installed as well. OSG software installation is described in https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/OverviewGuide. A short list of the required software follows.

ATLAS software includes:
  • Python (>=2.4, installed if system python <2.4)
  • ATLAS releases (Installed by ATLAS sw mgr -Xin- in OSG_APP/atlas_app): you are supposed to support at least all the installed releases. As of April 2008 releases go from 11.0.42 to 14.1.0.

Systems differ. Don't be confused by the release numbers used in the examples below. Just use different version numbers if you want to get the path of a different release.

Local submission

Check that the local queue manager (PBS, Condor, ...) is working and that you can submit jobs to it.
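For example, a minimal local-submission check could look like the following sketch (file paths are only examples; adjust queue names and paths to your site):

# PBS: submit a trivial job and check it in the queue
echo "/bin/date" | qsub
qstat -u $USER

# Condor: submit a trivial vanilla-universe job
cat > /tmp/datetest.sub <<EOF
universe   = vanilla
executable = /bin/date
output     = /tmp/datetest.out
error      = /tmp/datetest.err
log        = /tmp/datetest.log
queue
EOF
condor_submit /tmp/datetest.sub
condor_q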

Gatekeeper submission

Check that you can submit Globus jobs to the cluster.

Regular jobs to (managed) fork and PBS/Condor

globus-job-run tp-osg.ci.uchicago.edu /bin/date
globus-job-run tp-osg.ci.uchicago.edu/jobmanager-pbs /bin/date
globus-job-run tier2-osg.uchicago.edu/jobmanager-condor /bin/date

(From inside the MWT2 cluster you have to use uct2-grid6.uchicago.edu instead of uct2-grid6.mwt2.org)

Two-way submission

This requires the submit host to be contacted back from the gatekeeper, so it fails if there are firewall problems (e.g. misconfiguration of the variables describing the available ports).

globusrun -s -r uct2-grid6.mwt2.org/jobmanager-pbs '&(executable=/bin/date)(two_phase=600)(save_state=yes)'

Use globusrun -help for an explanation of the options.
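If you suspect a port/firewall misconfiguration, a quick client-side check is to look at the standard Globus port-range variables (the range in the comment is only an example):

# these must be defined and the corresponding ports must be open in the firewall
echo $GLOBUS_TCP_PORT_RANGE       # e.g. 40000,41000
echo $GLOBUS_TCP_SOURCE_RANGE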

Explanation of 2-phase submission

Two-phase submission is a safer way to interact with a GRAM server and is used by clients like Condor-G. It implements a two-phase commit protocol in GRAM: the client waits up to the timeout given in the two_phase RSL attribute (in seconds) before assuming the request has failed.

Client                                Job Manager
        job request
        ------------------------->
                 WAIT_FOR_COMMIT
        <-------------------------
        JOB_SIGNAL_COMMIT_REQUEST
        ------------------------->
                                      submit job to the local scheduler

Globusrun starts an https server on the client side to interact with the GRAM server. Here you can find a document by Jaime Frey explaining two-phase submission, the related RSL attributes and the statuses.

Condor submit file skeleton

universe = globus
globusscheduler = uct2-grid6.mwt2.org/jobmanager-pbs
stream_output = false
stream_error  = false
transfer_output = true
transfer_error = true
output = /local/workdir/wd01/marco/proj2/server/myjobs/2008-9-18-15/MWT2_UC-2008-9-18-15-44-5-220418/pilot.out
error = /local/workdir/wd01/marco/proj2/server/myjobs/2008-9-18-15/MWT2_UC-2008-9-18-15-44-5-220418/pilot.err
log = /local/workdir/wd01/marco/proj2/server/myjobs/2008-9-18-15/MWT2_UC-2008-9-18-15-44-5-220418/pilot.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output_files = 
transfer_input_files = storage_access_info.py,idfile.txt
executable = test.py
transfer_executable = true
globusrsl = (jobtype=single)(minMemory=640)(queue=prod)
environment = APP=/osg/app;GTAG=pjob1;QUIET_ASSERT=i;PANDA_JSID=PJS_315266;
arguments = -a /osg/app -q http://uct2-grid1.uchicago.edu:8000/dq2/ -s MWT2_UC -d /scratch -g /share/wn-client/ -l /osg/data 
copy_to_spool = false
notification = NEVER
periodic_release = (NumSystemHolds <= 3)
periodic_remove = (NumSystemHolds > 3) || (RemoteWallClockTime > 3600*24*3)
#Requirements = (OpSys == "LINUX" && Arch == "INTEL") && (Disk >= DiskUsage) && (Memory >= 640) 
#Memory = 640
#remote_initialdir = /scratch
submit_event_user_notes = pool:MWT2_UC
+panda_pilotid = "pjob1"
+panda_jobschedulerid = "PJS_315266"
+panda_CE = "MWT2_UC"
queue
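To use a skeleton like this, save it to a file (the name below is just an example) and submit it with Condor:

condor_submit pilot.sub
condor_q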

Submission using the Pilot Submitter

The Pilot Submitter is a Panda job submit host running on tier2-06.uchicago.edu. To use it, go to http://tier2-06.uchicago.edu:8900/pandajs/ in a web browser. Check PandaSubmitHost for more information.

Test pilots

This pilot executable checks the execution environment on the worker nodes for things like (a minimal sketch of similar checks is shown after this list):
  • python version
  • environment
  • OSG specific environment (OSG_xxx locations definition and content)
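The following is only a hypothetical sketch of that kind of check, not the actual test-pilot code; the OSG_* variable names are the standard OSG ones:

#!/bin/sh
echo "=== Python version ==="
python -V 2>&1

echo "=== Environment ==="
env | sort

echo "=== OSG locations ==="
for v in OSG_APP OSG_DATA OSG_WN_TMP OSG_GRID; do
    dir=`printenv $v`
    echo "$v = $dir"
    [ -n "$dir" ] && ls -ld "$dir"
done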

Check the results (stdout/stderr of the pilot): they should contain the correct information and no major error messages.

Regular (default) Panda Pilot

This is the pilot that is normally sent to the CE for ATLAS production using Panda. You can find a description of it at https://twiki.cern.ch/twiki/bin/view/Atlas/PandaPilot

A quick background about ATLAS job execution

In ATLAS there are 2 jobs:
  • pilot: the actual job submitted through the Grid
  • real ATLAS job: the pilot asks for a job and spawns its execution
Normally they are 1-to-1, but sometimes a pilot may execute more than one job sequentially, and some jobs may be recovered by a second pilot. All the execution happens in a unique subdirectory created by the pilot (PandaPilot_\): the base directory where this is created is OSG_WN_TMP (currently /tmp), but it can be configured to be any directory. An optional part of Panda is job recovery: this uses tar files with the execution summary, left in the base directory if the pilot thinks that the next pilot arriving at that node could make some progress in recovering/continuing the failed job (e.g. the output files are there and only the final registration is missing). This recovery takes advantage of files surviving across executions but, as said, it is optional.
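If you are logged in on a worker node, a quick way to see what pilots have left behind is something like this (the PandaPilot_* pattern and the fallback to /tmp are assumptions based on the description above):

# pilot working directories and possible job-recovery tarballs in the base directory
ls -ld ${OSG_WN_TMP:-/tmp}/PandaPilot_* 2>/dev/null
ls -l  ${OSG_WN_TMP:-/tmp}/*.tar.gz 2>/dev/null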

Once you submit it:
  • If you check the production dashboard in Panda monitoring you should see more job requests from your CE (top right section)
  • This will trigger the assignment of jobs to the CE
  • Once there are active jobs ATLAS jobs should start running

If there are assigned jobs but no active jobs, check the DQ2 server. If jobs start to run but do not complete successfully, troubleshoot the job. If there are active jobs, in the submission page you can check "Panda monitoring info about this submission": jobs in the upper window are running/on hold/transferring (execution completed), jobs in the lower window are finished/failed. If a job remains stuck in "transferring", the problem is with DQ2.

Troubleshoot ATLAS job execution

In the Panda monitoring you can get the details about the job, especially:
  • Panda ID: used in the panda monitoring system and in communications with prodsys
  • Pilot ID: PJS part (panda job submitter, e.g. PJS_24659) and pilot part (e.g. pjob11)
  • modificationHost: host (WN) where the job is/was running (e.g. tp-c040.ci.uchicago.edu)
  • homepackage: using these numbers you will know ATLAS release and Transformation (TRF) version (e.g. AtlasProduction/12.0.6.5 -> ATLAS 12.0.6, TRF 5)
  • xxxErrorCode: various error codes; OK if all are 0, otherwise they report the problem
  • log file name: in the file table above the job info there is a log file (usually the last one), e.g. log.013442._20009.job.log.tgz.2
  • if there is a "Show log file extracts" link you can click it to view some more info (sometimes useful, e.g. showing the copy command that failed)
  • the "Find and view log files" link to check log files, especially the XXX_stderr and XXX_stdout files. Sometimes the link does not work; in that case you can find the log file following the instructions in Find_a_file_by_hand

Stage-in, stage-out problems

Errors with dq2_get/dq2_put. These are functions used by the production system (pilot) to move files from/to the local SE (dCache at MWT2_UC in the case of UC_Teraport).
  • Check the exact error
  • Can you do the transfer now (repeating the same command)?
  • Is the file there (in NFS, dCache, ...)?
  • Are permissions OK? (A hypothetical check is shown after this list.)
    • Can the grid user read from there?
    • Can they write (there has to be group write permission because both usatlas3 and usatlas1 have to be able to write)?
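A hypothetical by-hand check of the permissions on an SE directory (the path is a placeholder; take the real one from the failing dq2_get/dq2_put command):

# is the directory there and what are its permissions (look for the group write bit)?
ls -ld /pnfs/uchicago.edu/data/usatlas/some/dataset/dir
# can the current user (e.g. usatlas1) create a file there? clean up afterwards
touch /pnfs/uchicago.edu/data/usatlas/some/dataset/dir/.write_test && \
    rm /pnfs/uchicago.edu/data/usatlas/some/dataset/dir/.write_test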

A known error:
  • /pnfs causes intermittent errors when under load because the file system becomes inconsistent. The problem is fixed with a remount, but some jobs may fail in the meantime. Usually the globus-url-copy fails with an authorization error.

Try to repeat the command by hand.

To troubleshoot data movement problems
you can find files that are in the catalog (see Find_a_file_by_hand below) and issue one or more transfer commands (globus-url-copy, ...); an example is shown after this list. Anyway, keep in mind that this is not an exhaustive test because:
  • some errors show up only under heavy load
  • if a specific file is having problems you may not be transferring that exact file
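For example, a by-hand transfer could look like this (the GridFTP door and paths are placeholders; take the real source URL from the catalog output or from the failing command in the pilot log):

# copy a file from the SE to local disk, with verbose/performance output
globus-url-copy -vb \
    gsiftp://some-gridftp-door.uchicago.edu/pnfs/uchicago.edu/data/usatlas/somefile.root \
    file:///tmp/somefile.root
# the same transfer can be attempted with srmcp or dccp if the SE exposes those protocols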

ATLAS releases

OS requirements are addressed at the beginning of this page.

You can check the installed releases by listing the directories under
  • OSG_APP/atlas_app/atlas_rel/ : shows the installed releases (e.g. 12.0.5, 13.0.20)
  • OSG_APP/atlas_app/atlas_rel/REL_NUMBER/AtlasProduction/ : shows the transformations (e.g. /share/app/atlas_app/atlas_rel/12.3.0/AtlasProduction/ has 12.3.0 12.3.0.1)
You can compare the installed releases with the release required by the job, found at the beginning of the Troubleshoot ATLAS job execution section. An example listing is shown below.
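A hypothetical listing (version numbers in the commented output are only illustrative; OSG_APP is defined in the site configuration):

# installed ATLAS releases
ls $OSG_APP/atlas_app/atlas_rel/
#   11.0.42  12.0.5  12.0.6  13.0.20  14.1.0 ...

# transformation caches available for one release
ls $OSG_APP/atlas_app/atlas_rel/12.0.6/AtlasProduction/
#   12.0.6.1  12.0.6.5 ...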

ATLAS software that has to be at the site but is stored as persistent input files are the DBReleases. The tar file is expanded in the run directory.

Known errors:
  • The Athena job fails quickly because the setup file (cmtsetup or setup.py) in a certain path is missing. Check: most likely the requested release is missing.
  • Another case of missing setup is the DBRelease setup. If this is missing, the DBRelease (e.g. DBRelease-3.1.1.tar.gz) may not be in the SE. DBReleases are input files residing in the SE. Check their presence like you'd do for any other file (see Find_a_file_by_hand).

Execution

Jobmanager and queue manager scripts may affect job execution.

In Teraport the PBS prologue script is one of the candidates for the lost-heartbeat failures. It may have removed files of other jobs running on the same node as a completed job.

It has been disabled.

Find a file by hand

You need to know the LFN or the GUID of the file and the LRC server of the site. Then you can look for the file using curl (or a web browser):

$ curl http://tier2-01.uchicago.edu:8000/dq2/lrc/PoolFileCatalog/?lfns=log.013442._20009.job.log.tgz.2

or

$ curl http://tier2-01.uchicago.edu:8000/dq2/lrc/PoolFileCatalog/?guids=76CBA8AE-2A40-DC11-9527-00A0D1E49F91

"Error. LFNs not found" tells you that the file was not found. Otherwise you will receive an XML output including the URL to get the file (with globus-url-copy or srmcp, dccp, ...).

Finally you can copy the file using whatever the URL requires. Or you can try to replace the method/server in the URL and use the new URL (if you know what you are doing, i.e. you know the CE configuration).

Custom tests

If anyone provides a script (Python, shell, ...) that could be useful for debugging, I could add it to the job submitter. Such a script could do some more tests or execute Athena for a few events.

Large submission

You can submit pilots in large quantities to troubleshoot a cluster at a bigger scale. Usually do not send more than 50-100 pilots per submission: big batch submissions may be difficult to debug and cause gatekeeper load. Prefer sending multiple submissions in sequence (at short intervals).

Keep in mind anyway that the number of running jobs is limited by:
  • the available slots in the local queue (availability, policy, ...)
  • the available ATLAS jobs in Panda

Directory tests

The test pilot does many of these for you; anyway, you can do them by hand.

Local scratch

You should be able to run there, and there should be ~10 GB of space per job.
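A quick by-hand check (assuming OSG_WN_TMP points to the local scratch area, /tmp in the current configuration):

# free space and permissions of the local scratch area
df -h ${OSG_WN_TMP:-/tmp}
ls -ld ${OSG_WN_TMP:-/tmp}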

SE area

Should be writable at least by both usatlas1 and usatlas3, but it is better if it is writable by all 4 users.

Home dir

The users are usatlas1, usatlas3, usatlas2, usatlas4. Their home directories include a .globus directory. They should have space available (not over quota), and the .globus subdirectories (temporary directories like the gass_cache) should not be too full (old files can be deleted).
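A rough by-hand check of home-directory usage and of the gass cache (the /home location and the $HOME/.globus/.gass_cache path are assumptions, the latter being the Globus default):

# disk usage of each ATLAS user's home directory and Globus gass cache
for u in usatlas1 usatlas2 usatlas3 usatlas4; do
    du -sh /home/$u /home/$u/.globus/.gass_cache 2>/dev/null
done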

-- MarcoMambelli - 26 Mar 2008