TroubleProtoCondor
Introduction
Condor incident on UC_ATLAS_MWT2 on 2/11/08. See also mail thread to MWT2 list.
Initial observations and actions
- Afternoon of 2/11/08
- Can't get ANALY_MWT2 jobs to run
- Logged into tier2-osg. Found lots of fermilab df processes running, related to /pnfs hangs caused by tier2-d1 (which had dCache 1.8 installed) being taken offline.
- Globus authentications failing
- /pnfs mounts removed and Condor restarted by Charles (a rough cleanup sketch follows this list).
- The fermilab df processes seem to have flushed; the cluster is draining, though.
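A rough sketch of the kind of check and cleanup described above, assuming a standard Condor init script on tier2-osg (the exact mount-point handling may have differed):

# look for df processes stuck in uninterruptible sleep (D state) on the hung /pnfs mount
ps -eo pid,user,stat,wchan,args | grep '[d]f'

# lazy-unmount the hung mount so the blocked processes can clear
umount -l /pnfs

# restart Condor on the gatekeeper
/etc/init.d/condor restart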
Second look (evening)
- Logged in during the evening and found all jobs in the H (held) state: CondorLog1.
- Gatekeeper not authenticating users: GlobusLog1 (actually a screen shot from a failed client authentication attempt).
- Lots of errors in the /var/log/globus/globus-gatekeeper log: GlobusLog2.
- Bad file descriptor errors in CondorLog2 (/var/log/condor/SchedLog), but none since the last restart (CondorLog3).
- Looked at one job with condor_q -l and found:
- HoldReason = "Cannot access initial working directory /home/usatlas1/gram_scratch_Arp7W3hQoC: No such file or directory"
- However, I was able to see that directory: GramScratchUsatlas1
- I wonder if there are remnant file handles still held open by Condor. How to find out? (A possible check with lsof is sketched after this list.)
- Punt: decided to do a full reboot of tier2-osg. It did not help; same symptoms.
- In the Globus gatekeeper log there is a note about an expired CRL: GlobusLog3. (A CRL expiry check is sketched after this list.)
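A hedged sketch of how the held jobs and the remnant-file-handle question could be probed; the job id is taken from the queue listing further down, and the lsof check is an assumption about where stale handles would show up, not something verified during the incident:

# list held jobs together with their hold reasons
condor_q -hold

# dump the full ClassAd of one held job and pull out HoldReason
condor_q -l 334046.0 | grep -i holdreason

# see whether any Condor daemon still holds open files under the user's scratch area
lsof -c condor | grep /home/usatlas1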
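A possible way to confirm the expired-CRL suspicion; the CRL directory and the use of fetch-crl are assumptions based on a typical VDT/OSG install:

# print the nextUpdate time of each installed CRL; any date in the past means an expired CRL
for crl in /etc/grid-security/certificates/*.r0; do
  echo -n "$crl: "; openssl crl -in "$crl" -noout -nextupdate
done

# refresh the CRLs, then re-test authentication against the gatekeeper (authenticate only)
fetch-crl
globusrun -a -r tier2-osg.uchicago.edu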
Charles Notes Feb 12
- On worker node c021, condor_q -global shows not only the schedd on the head node but also schedds on many of the worker nodes:
cgw@c021~$ condor_q -global
-- Schedd: tier2-osg.uchicago.edu : <10.255.255.253:33596>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
334046.0 usatlas1 2/10 14:40 0+20:47:38 H 0 2949.2 data -a /share/app
335652.0 usatlas1 2/10 23:53 0+16:14:42 H 0 722.7 data -a /share/app
[..many lines...]
341425.0 usatlas1 2/12 10:46 0+00:00:00 I 0 9.8 data -a /share/app
341426.0 usatlas1 2/12 10:46 0+00:00:00 I 0 9.8 data -a /share/app
163 jobs; 4 idle, 106 running, 53 held
-- Schedd: c002.local : <10.255.255.2:45026>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 usatlas1 9/20 10:43 0+00:00:00 H 0 9.8 .condor_run.3768
1 jobs; 0 idle, 0 running, 1 held
-- Schedd: c044.local : <10.255.255.44:35157>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 usatlas1 9/20 10:43 0+00:00:00 H 0 9.8 .condor_run.3768
Have seen this behavior before.
- This is due to an uncorrected error in the condor_config file being propagated via cloner: an incorrect line told Condor to start a schedd on the worker nodes. This has been fixed (see the DAEMON_LIST sketch at the end of these notes).
- However, while restarting Condor I saw:
c023 Shutting down Condor (fast-shutdown mode)
c024 Condor not running
c025 Shutting down Condor (fast-shutdown mode)
c026 Shutting down Condor (fast-shutdown mode)
c027 Shutting down Condor (fast-shutdown mode)
c028 Shutting down Condor (fast-shutdown mode)
c029 Shutting down Condor (fast-shutdown mode)
c030 Shutting down Condor (fast-shutdown mode)
c031 Condor not running
c032 Shutting down Condor (fast-shutdown mode)
Not clear why it was not running on c024 and c031.
- After removing SCHEDD from DAEMON_LIST in condor_config and restarting Condor on all nodes, a schedd is still running on the worker nodes. Why? (Verification commands are sketched below.)
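For reference, the intended condor_config difference is roughly the following (a sketch, not a copy of the actual file; the bad line pushed out by cloner presumably listed SCHEDD as well):

# worker nodes should run only the master and the startd
DAEMON_LIST = MASTER, STARTD

# the erroneous line would have looked something like
# DAEMON_LIST = MASTER, SCHEDD, STARTD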
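A hedged sketch of how to check whether a worker node actually picked up the corrected config and to stop the stray schedd; one possibility worth ruling out is that the node is still reading a stale local config file, which the first command would show:

# which config files is this node actually reading, and what DAEMON_LIST does it see?
condor_config_val -config
condor_config_val DAEMON_LIST

# is a schedd process still alive?
pgrep -l condor_schedd

# ask the local master to shut the schedd down; a full condor_restart also works
condor_off -daemon schedd

# (optional) clear the stray held job from a worker-node schedd first, e.g. on c002
condor_rm -name c002.local -all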
-- CharlesWaldman - 12 Feb 2008
-- RobGardner - 12 Feb 2008