This is a REPLICA of the internal CheckCE page, kept to allow public access.
- do not modify this page
- if you have access to the original CheckCE, you may check it to see if there are updates
Troubleshooting a CE
This troubleshooting page assumes that OSG is installed correctly and that the required ATLAS software is installed as well.
OSG software installation is described in https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/OverviewGuide.
A short list of the required software is:
- A working queue (Condor, PBS, ...)
- OSG_CE on the gatekeeper (Globus sw, jobmanagers, monitoring, ...): https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/CEInstallGuide
- OSG Workernode Client (clients Globus, SRM, ...): https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/WorkerNodeClient
- WNs should have an OS compatible with ATLAS Sw execution:
- Linux based system, preferably RHEL based, ideally SLC (v4.5 as of Dec 2007)
- With all the compatibility libraries: https://twiki.cern.ch/twiki/bin/view/Atlas/RPMcompatSLC4
- a good test is to run kit-validation on one of those nodes for each release you have installed or need to install (see below): as of April 2008 releases go from 11.0.42 to 14.0.1
- if WNs differ in hardware and especially in OS, it is good practice to run kit-validation on one node of each different group
- You may find several discussions about compatibility libraries (some obsolete, some still useful)
- dccp if the system has to access dCache
ATLAS software includes:
- Python (>=2.4, installed if system python <2.4)
- ATLAS releases (Installed by ATLAS sw mgr -Xin- in OSG_APP/atlas_app): you are supposed to support at least all the installed releases. As of April 2008 releases go from 11.0.42 to 14.1.0.
Systems differ, so don't be confused by the release numbers used in the examples below: substitute a different version number to get the path of a different release.
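Substituting the version number into the installation layout mentioned on this page (OSG_APP/atlas_app/atlas_rel/) can be sketched as a small helper; the OSG_APP value and the helper name below are illustrative, not part of the site configuration:

```python
# Sketch: build the expected installation path for a given ATLAS release,
# following the OSG_APP/atlas_app/atlas_rel/<release> layout described
# on this page. The "/osg/app" value is only an example OSG_APP.
import os

def release_path(osg_app, release):
    """Return the directory where a given ATLAS release should be installed."""
    return os.path.join(osg_app, "atlas_app", "atlas_rel", release)

print(release_path("/osg/app", "12.0.6"))
# -> /osg/app/atlas_app/atlas_rel/12.0.6
```

Change the release argument to get the path of a different installed release.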
Check that the local queue manager (PBS, Condor, ...) is working and that you can submit jobs.
Check that you can submit Globus jobs to the cluster.
Regular jobs to (managed) fork and PBS/Condor:
globus-job-run tp-osg.ci.uchicago.edu /bin/date
globus-job-run tp-osg.ci.uchicago.edu/jobmanager-pbs /bin/date
globus-job-run tier2-osg.uchicago.edu/jobmanager-condor /bin/date
(From inside the MWT2 cluster you have to use uct2-grid6.uchicago.edu instead of uct2-grid6.mwt2.org)
This requires the submit host to be contactable from the gatekeeper, so it fails if there are firewall problems (e.g. misconfiguration of the variables describing the available ports):
globusrun -s -r uct2-grid6.mwt2.org/jobmanager-pbs '&(executable=/bin/date)(two_phase=600)(save_state=yes)'
See the explanation of 2-phase submission below for the options.
Explanation of 2-phase submission
2-phase submission is a safer way to interact with a GRAM server and is used by clients like Condor-G. It implements a 2-way commit algorithm in GRAM, where the client waits for the timeout given in the two_phase RSL attribute (in seconds) before assuming the request failed.
- Client: globusrun starts an https server on the client side to interact with the GRAM server.
- Job Manager: submits the job to the local scheduler.
You can find a document by Jaime Frey explaining two-phase submission, the related RSL attributes and the statuses.
Condor submit file skeleton
universe = globus
globusscheduler = uct2-grid6.mwt2.org/jobmanager-pbs
stream_output = false
stream_error = false
transfer_output = true
transfer_error = true
output = /local/workdir/wd01/marco/proj2/server/myjobs/2008-9-18-15/MWT2_UC-2008-9-18-15-44-5-220418/pilot.out
error = /local/workdir/wd01/marco/proj2/server/myjobs/2008-9-18-15/MWT2_UC-2008-9-18-15-44-5-220418/pilot.err
log = /local/workdir/wd01/marco/proj2/server/myjobs/2008-9-18-15/MWT2_UC-2008-9-18-15-44-5-220418/pilot.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = storage_access_info.py,idfile.txt
executable = test.py
transfer_executable = true
globusrsl = (jobtype=single)(minMemory=640)(queue=prod)
environment = APP=/osg/app;GTAG=pjob1;QUIET_ASSERT=i;PANDA_JSID=PJS_315266;
arguments = -a /osg/app -q http://uct2-grid1.uchicago.edu:8000/dq2/ -s MWT2_UC -d /scratch -g /share/wn-client/ -l /osg/data
copy_to_spool = false
notification = NEVER
periodic_release = (NumSystemHolds <= 3)
periodic_remove = (NumSystemHolds > 3) || (RemoteWallClockTime > 3600*24*3)
#Requirements = (OpSys == "LINUX" && Arch == "INTEL") && (Disk >= DiskUsage) && (Memory >= 640)
#Memory = 640
#remote_initialdir = /scratch
submit_event_user_notes = pool:MWT2_UC
+panda_pilotid = "pjob1"
+panda_jobschedulerid = "PJS_315266"
+panda_CE = "MWT2_UC"
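A submit file like the skeleton above can also be generated programmatically; a minimal sketch that emits a cut-down version of it (the make_submit helper and the trailing queue command are illustrative, and only a few of the settings above are reproduced):

```python
# Sketch: generate a minimal Condor-G submit file resembling the skeleton
# above. Host, queue and Panda identifiers are example values from this page.
def make_submit(gatekeeper, queue, pilot_id, scheduler_id, site):
    lines = [
        "universe = globus",
        "globusscheduler = %s" % gatekeeper,
        "globusrsl = (jobtype=single)(queue=%s)" % queue,
        "executable = test.py",
        "transfer_executable = true",
        "notification = NEVER",
        # custom ClassAd attributes used by the Panda monitor
        '+panda_pilotid = "%s"' % pilot_id,
        '+panda_jobschedulerid = "%s"' % scheduler_id,
        '+panda_CE = "%s"' % site,
        "queue",
    ]
    return "\n".join(lines)

print(make_submit("uct2-grid6.mwt2.org/jobmanager-pbs", "prod",
                  "pjob1", "PJS_315266", "MWT2_UC"))
```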
Submission using the Pilot Submitter
The Pilot Submitter is a Panda Job Submithost
running on tier2-06.uchicago.edu.
To use it, go to the URL http://tier2-06.uchicago.edu:8900/pandajs/ for more information.
This pilot executable checks the execution environment on the worker nodes for things like:
- python version
- OSG specific environment (OSG_xxx locations definition and content)
Check the results (the stdout/err of the pilot): that they contain the correct information and that there are no major error messages.
Regular (default) Panda Pilot
This is the pilot that is normally sent to the CE for ATLAS production using Panda. You can find a description of it at https://twiki.cern.ch/twiki/bin/view/Atlas/PandaPilot
A quick background about ATLAS job execution
In ATLAS there are 2 jobs:
- pilot: the actual job submitted through the Grid
- real ATLAS job: the pilot asks for a job and spawns its execution
Normally they are 1-to-1, but sometimes a pilot may execute more jobs
sequentially. Some jobs may be recovered by a second pilot.
All the execution happens in a unique subdirectory created by the pilot
(PandaPilot_\): the base dir where this is created is OSG_WN_TMP
(currently /tmp), but it can be configured to be any directory.
An optional part of Panda is job recovery: it uses tar files with
the execution summary, left in the base directory if the pilot
thinks that the next pilot arriving at that node could make some progress
in recovering/continuing the failed job (e.g. the output files are there and
only the final registration is missing).
This recovery takes advantage of files surviving across executions,
but, as noted, it is optional.
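The per-pilot working directory described above can be sketched as follows; the PandaPilot_<pid> naming and the helper function are assumptions for illustration, not the real pilot's scheme:

```python
# Sketch of how the pilot isolates a job: create a unique subdirectory
# under the base dir (OSG_WN_TMP, currently /tmp on this site).
# The "PandaPilot_<pid>" naming is an illustrative assumption; the real
# pilot's naming scheme is not reproduced here.
import os, tempfile

def make_pilot_dir(base_dir):
    """Create and return a unique per-pilot working directory."""
    path = os.path.join(base_dir, "PandaPilot_%d" % os.getpid())
    os.makedirs(path)
    return path

base = tempfile.mkdtemp()        # stand-in for OSG_WN_TMP
workdir = make_pilot_dir(base)
print(os.path.isdir(workdir))    # True
```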
Once you submit it:
- If there are assigned jobs but no active jobs, check the DQ2 server.
- If jobs start to run and do not complete successfully, troubleshoot the job.
- If there are active jobs in the submission page, you can check "Panda monitoring info about this submission": jobs in the upper window are running/on-hold/transferring (execution completed), jobs in the lower window are finished/failed.
- If a job remains stuck in "transferring", the problem is with DQ2.
- If you check the production dashboard in Panda monitoring you should see more job requests from your CE (top right section)
- This will trigger the assignment of jobs to the CE
- Once there are active jobs ATLAS jobs should start running
Troubleshoot ATLAS job execution
In the Panda monitoring you can get the details about the job, especially:
- Panda ID: used in the panda monitoring system and in communications with prodsys
- Pilot ID: PJS part (panda job submitter, e.g. PJS_24659) and pilot part (e.g. pjob11)
- modificationHost: host (WN) where the job is/was running (e.g. tp-c040.ci.uchicago.edu)
- homepackage: from these numbers you know the ATLAS release and the Transformation (TRF) version (e.g. AtlasProduction/12.0.6.5 -> ATLAS 12.0.6, TRF 5)
- xxxErrorCode: various error codes; OK if all are 0, otherwise they report the problem
- log file name: in the file table above the job info there is a log file (usually the last one) (e.g. log.013442._20009.job.log.tgz.2)
- if there is a "Show log file extracts" link you can click it to view some more info (sometimes useful, e.g. showing the copy command that failed)
- use the "Find and view log files" link to check log files, especially the XXX_stderr and XXX_stdout files. Sometimes the link is not working; in that case you can find the log file following the instructions in Find_a_file_by_hand
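The homepackage reading described above (e.g. release 12.0.6, TRF 5) can be sketched as a small parser; treating the last dotted field as the TRF version is an assumption about the numbering scheme:

```python
# Sketch: split a homepackage value like "AtlasProduction/12.0.6.5"
# into the ATLAS release (12.0.6) and the transformation (TRF) version (5).
# Assumption: the last dotted field, when present, is the TRF version.
def parse_homepackage(homepackage):
    project, version = homepackage.split("/")
    fields = version.split(".")
    release = ".".join(fields[:3])
    trf = fields[3] if len(fields) > 3 else None
    return project, release, trf

print(parse_homepackage("AtlasProduction/12.0.6.5"))
# -> ('AtlasProduction', '12.0.6', '5')
```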
Stage-in, stage-out problems
Errors with dq2_get/dq2_put. These are functions used by the production system (pilot) to move files from/to the local SE (dCache at MWT2_UC in the case of UC_Teraport).
For a known error:
- Check the exact error
- Can you do the transfer now (repeating the same command)?
- Is the file there (in NFS, dCache, ...)
- Are permissions OK?
- Can the grid user read from there?
- Can the grid user write (there has to be group write permission because both usatlas3 and usatlas1 have to be able to write)?
Try to repeat the command by hand.
- /pnfs causes intermittent errors when under load because the file system becomes inconsistent. The problem is fixed with a 'remount', but some jobs may fail in the meantime. Usually the globus-url-copy fails with an authorization error.
To troubleshoot data movement problems
you can find files that are in the catalog (see Find_a_file_by_hand below) and issue one or more transfer commands (globus-url-copy, ...). Keep in mind, anyway, that this is not an exhaustive test because:
- some errors show up only under heavy load
- if a specific file is having problems you may not be transferring that exact file
OS requirements are addressed at the beginning of this page.
You can check the installed releases by listing the directories under OSG_APP/atlas_app/atlas_rel/ (see below).
You can compare the installed releases with the job required release that you found at the beginning of Troubleshoot ATLAS job execution section
ATLAS software that has to be at the site but is stored as a persistent input file is the DBRelease. The tar file is expanded in the run directory.
- OSG_APP/atlas_app/atlas_rel/ : shows the installed releases (e.g. 12.0.5, 13.0.20)
- OSG_APP/atlas_app/atlas_rel/REL_NUMBER/AtlasProduction/ : shows the transformations (e.g. /share/app/atlas_app/atlas_rel/12.3.0/AtlasProduction/ contains the installed 12.3.0 transformation versions)
- The Athena job fails quickly because the setup file (cmtsetup or setup.py) in a certain path is missing. Check: most likely the requested release is missing
- Another case of missing setup is the DBRelease setup. If this is missing, the DBRelease (e.g. DBRelease-3.1.1.tar.gz) may not be in the SE. DBReleases are input files residing in the SE. Check their presence like you'd do for any other file (see Find_a_file_by_hand)
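The DBRelease handling described above (a tar file staged as an input file and expanded in the run directory) can be sketched as follows; the file names and the stand-in tarball are illustrative:

```python
# Sketch: a DBRelease tarball staged into the run directory is expanded
# there, as described on this page. The tarball built below is a local
# stand-in (in production it comes from the SE); file names are examples.
import os, tarfile, tempfile

run_dir = tempfile.mkdtemp()

# Build a stand-in DBRelease tarball containing one file.
payload_dir = tempfile.mkdtemp()
open(os.path.join(payload_dir, "setup.py"), "w").close()
tar_path = os.path.join(run_dir, "DBRelease-3.1.1.tar.gz")
with tarfile.open(tar_path, "w:gz") as tar:
    tar.add(payload_dir, arcname="DBRelease")

# Expand it in the run directory, as the pilot would.
with tarfile.open(tar_path, "r:gz") as tar:
    tar.extractall(run_dir)

print(os.path.isfile(os.path.join(run_dir, "DBRelease", "setup.py")))  # True
```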
Jobmanager and queue manager scripts may affect job execution.
In Teraport the PBS prologue script is one of the candidates for the lost-heartbeat failures.
It may have removed files of other jobs running on the same node as a completed job.
It has been disabled.
Find a file by hand
You need to know the LFN or the GUID of the file.
Then you can look for the file using curl (or a web browser):
$ curl 'http://tier2-01.uchicago.edu:8000/dq2/lrc/PoolFileCatalog/?lfns=log.013442._20009.job.log.tgz.2'
or
$ curl 'http://tier2-01.uchicago.edu:8000/dq2/lrc/PoolFileCatalog/?guids=76CBA8AE-2A40-DC11-9527-00A0D1E49F91'
"Error. LFNs not found" tells you that the file was not found. Otherwise you will receive an XML output including the URL to get the file (with globus-url-copy, srmcp, dccp, ...).
Finally you can copy the file using whatever tool the URL requires. Or you can try to replace the method/server in the URL and use the new URL (if you know what you are doing, i.e. the CE configuration).
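The catalog query and its XML reply can also be handled in a short script; a sketch, where the XML below is a simplified stand-in for a real PoolFileCatalog reply, not a verbatim sample:

```python
# Sketch: build the LRC query URL used above and parse the XML reply to
# extract the physical file name (PFN). The sample_reply string is a
# simplified illustration of a PoolFileCatalog reply, not real output.
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def lrc_query_url(lfn):
    base = "http://tier2-01.uchicago.edu:8000/dq2/lrc/PoolFileCatalog/"
    return base + "?" + urlencode({"lfns": lfn})

print(lrc_query_url("log.013442._20009.job.log.tgz.2"))

sample_reply = """<POOLFILECATALOG>
  <File ID="76CBA8AE-2A40-DC11-9527-00A0D1E49F91">
    <physical><pfn filetype="" name="gsiftp://host/path/to/file"/></physical>
    <logical><lfn name="log.013442._20009.job.log.tgz.2"/></logical>
  </File>
</POOLFILECATALOG>"""

root = ET.fromstring(sample_reply)
pfns = [pfn.get("name") for pfn in root.iter("pfn")]
print(pfns)  # ['gsiftp://host/path/to/file']
```

The PFN extracted this way is the URL you would pass to globus-url-copy, srmcp or dccp.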
If anyone provides a script (python, shell, ...) that could be useful for debugging, I could add it to the job submitter.
This script could do some more tests or execute Athena for a few events.
You can submit pilots in large quantities to troubleshoot a cluster at a bigger scale.
Usually do not send more than 50-100 pilots per submission. Big batch submissions may be difficult to debug and cause gatekeeper load. Prefer to send multiple submissions in sequence (at short intervals).
Keep in mind anyway that the amount of running jobs is limited by:
- the available slots in the local queue (availability, policy, ...)
- the available ATLAS job in Panda
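The advice above (several smaller submissions in sequence instead of one big batch) can be sketched as a simple batching helper; the function name and the sizes are illustrative:

```python
# Sketch: split a large pilot submission into smaller sequential batches,
# following the 50-100 pilots-per-submission advice above.
def batches(total, batch_size):
    """Yield batch sizes that add up to total."""
    while total > 0:
        yield min(batch_size, total)
        total -= batch_size

print(list(batches(230, 50)))  # [50, 50, 50, 50, 30]
```

Each yielded batch would be one submission, sent at short intervals.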
The test pilot does many of these checks for you; anyway, you can do them by hand.
You should be able to run there, and there should be ~10GB of space per job.
It should be writable at least by both usatlas1 and usatlas3, but better if it is writable by all 4 users.
The users are usatlas1, usatlas2, usatlas3, usatlas4.
Home directories include a
.globus directory. It should have space available (not over quota), and the .globus subdirectories (temporary directories like the GASS cache) should not be too full (old files can be deleted).
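Finding old files to delete from the GASS cache, as suggested above, can be sketched as follows; the 30-day threshold, the helper name and the stand-in directory are assumptions for the example:

```python
# Sketch: list old files in a user's GASS cache so they can be reviewed
# and deleted, per the note above. The 30-day threshold is an example
# policy, and the temp directory stands in for ~/.globus/.gass_cache.
import os, time, tempfile

def old_files(directory, max_age_days=30):
    """Return files under directory older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    old = []
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                old.append(path)
    return old

cache = tempfile.mkdtemp()           # stand-in for the GASS cache dir
stale = os.path.join(cache, "stale")
open(stale, "w").close()
os.utime(stale, (0, 0))              # pretend the file is from 1970
print(old_files(cache))              # reports the stale file
```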
-- MarcoMambelli - 26 Mar 2008