TroubleProtoCondor
Introduction
Condor incident on UC_ATLAS_MWT2 on 2/11/08. See also mail thread to MWT2 list.
Initial observations and actions
- Afternoon of 2/11/08
- Can't get ANALY_MWT2 jobs to run
- Logged into tier2-osg. Found lots of fermilab df processes running, related to /pnfs hangs caused by tier2-d1 (which had dCache 1.8 installed) being taken offline.
- Globus authentications failing
- /pnfs mounts removed and Condor restarted by Charles (a rough cleanup sketch follows this list).
- The fermilab df processes seem to have flushed; the cluster is draining, though.
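A rough sketch of the kind of check and cleanup described above, assuming a standard Condor init script on tier2-osg (the exact mount-point handling may have differed):

# look for df processes stuck in uninterruptible sleep (D state) on the hung /pnfs mount
ps -eo pid,user,stat,wchan,args | grep '[d]f'

# lazy-unmount the hung mount so the blocked processes can clear
umount -l /pnfs

# restart Condor on the gatekeeper
/etc/init.d/condor restart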
Second look (evening)
- Logged in during the evening and found all jobs in the H (held) state: CondorLog1.
- Gatekeeper not authenticating users: GlobusLog1 (actually a screen shot from a failed client authentication attempt).
- Lots of errors in the /var/log/globus/globus-gatekeeper log: GlobusLog2.
- Bad file descriptor errors in CondorLog2 (/var/log/condor/SchedLog), but none since the last restart (CondorLog3).
- Looked at one job with condor_q -l and found:
- HoldReason = "Cannot access initial working directory /home/usatlas1/gram_scratch_Arp7W3hQoC: No such file or directory"
- However, I was able to see that directory: GramScratchUsatlas1
- I wonder if there are remnant file handles still held open by Condor. How to find out? (A possible check with lsof is sketched after this list.)
- Punt: decided to do a full reboot of tier2-osg. It did not help; same symptoms.
- In the Globus gatekeeper log there is a note about an expired CRL: GlobusLog3. (A CRL expiry check is sketched after this list.)
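A hedged sketch of how the held jobs and the remnant-file-handle question could be probed; the job id is taken from the queue listing further down, and the lsof check is an assumption about where stale handles would show up, not something verified during the incident:

# list held jobs together with their hold reasons
condor_q -hold

# dump the full ClassAd of one held job and pull out HoldReason
condor_q -l 334046.0 | grep -i holdreason

# see whether any Condor daemon still holds open files under the user's scratch area
lsof -c condor | grep /home/usatlas1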
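A possible way to confirm the expired-CRL suspicion; the CRL directory and the use of fetch-crl are assumptions based on a typical VDT/OSG install:

# print the nextUpdate time of each installed CRL; any date in the past means an expired CRL
for crl in /etc/grid-security/certificates/*.r0; do
  echo -n "$crl: "; openssl crl -in "$crl" -noout -nextupdate
done

# refresh the CRLs, then re-test authentication against the gatekeeper (authenticate only)
fetch-crl
globusrun -a -r tier2-osg.uchicago.edu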
Charles Notes Feb 12
- On worker node c021, condor_q -global shows not only the schedd on the head node but also schedds on many of the worker nodes:
cgw@c021~$ condor_q -global
-- Schedd: tier2-osg.uchicago.edu : <10.255.255.253:33596>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
334046.0 usatlas1 2/10 14:40 0+20:47:38 H 0 2949.2 data -a /share/app
335652.0 usatlas1 2/10 23:53 0+16:14:42 H 0 722.7 data -a /share/app
[..many lines...]
341425.0 usatlas1 2/12 10:46 0+00:00:00 I 0 9.8 data -a /share/app
341426.0 usatlas1 2/12 10:46 0+00:00:00 I 0 9.8 data -a /share/app
163 jobs; 4 idle, 106 running, 53 held
-- Schedd: c002.local : <10.255.255.2:45026>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 usatlas1 9/20 10:43 0+00:00:00 H 0 9.8 .condor_run.3768
1 jobs; 0 idle, 0 running, 1 held
-- Schedd: c044.local : <10.255.255.44:35157>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 usatlas1 9/20 10:43 0+00:00:00 H 0 9.8 .condor_run.3768
Have seen this behavior before.
- This is due to an uncorrected error in the condor_config file being propagated via cloner: an incorrect line told Condor to start a schedd on the worker nodes. This has been fixed (see the DAEMON_LIST sketch at the end of these notes).
- However, while restarting Condor I saw:
c023 Shutting down Condor (fast-shutdown mode)
c024 Condor not running
c025 Shutting down Condor (fast-shutdown mode)
c026 Shutting down Condor (fast-shutdown mode)
c027 Shutting down Condor (fast-shutdown mode)
c028 Shutting down Condor (fast-shutdown mode)
c029 Shutting down Condor (fast-shutdown mode)
c030 Shutting down Condor (fast-shutdown mode)
c031 Condor not running
c032 Shutting down Condor (fast-shutdown mode)
Not clear why it was not running on c024 and c031.
- After removing SCHEDD from DAEMON_LIST in condor_config and restarting Condor on all nodes, a schedd is still running on the worker nodes. Why? (Verification commands are sketched below.)
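For reference, the intended condor_config difference is roughly the following (a sketch, not a copy of the actual file; the bad line pushed out by cloner presumably listed SCHEDD as well):

# worker nodes should run only the master and the startd
DAEMON_LIST = MASTER, STARTD

# the erroneous line would have looked something like
# DAEMON_LIST = MASTER, SCHEDD, STARTD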
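A hedged sketch of how to check whether a worker node actually picked up the corrected config and to stop the stray schedd; one possibility worth ruling out is that the node is still reading a stale local config file, which the first command would show:

# which config files is this node actually reading, and what DAEMON_LIST does it see?
condor_config_val -config
condor_config_val DAEMON_LIST

# is a schedd process still alive?
pgrep -l condor_schedd

# ask the local master to shut the schedd down; a full condor_restart also works
condor_off -daemon schedd

# (optional) clear the stray held job from a worker-node schedd first, e.g. on c002
condor_rm -name c002.local -all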
-- CharlesWaldman - 12 Feb 2008
-- RobGardner - 12 Feb 2008