TroubleProtoCondor

Introduction

Condor incident on UC_ATLAS_MWT2 on 2/11/08. See also mail thread to MWT2 list.

Initial observations and actions

  • Afternoon of 2/11/08
  • Can't get ANALY_MWT2 jobs to run
  • Logged into tier2-osg and found lots of fermilab df processes running, related to /pnfs hangs caused by tier2-d1 (which had dCache 1.8 installed) being taken offline. A quick check for the hung processes is sketched after this list.
  • Globus authentications failing
  • /pnfs mounts removed; Condor restarted by Charles.
  • fermilab "df" jobs seem to have flushed; cluster is draining though.
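
For reference, a quick way to confirm the hung fermilab df processes; this is only a sketch, assuming stock procps tools and the /pnfs mount point named above:

# list any df processes and their state; STAT "D" (uninterruptible sleep)
# usually means they are stuck in I/O on the dead /pnfs mount
ps -eo pid,user,stat,etime,args | grep '[d]f'

# confirm whether /pnfs is still in the mount table
grep pnfs /proc/mounts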

Second look (evening)

  • Logged in evening and found all jobs in H state: CondorLog1.
  • Gatekeeper not authenticating users: GlobusLog1 (actually a screen shot from a failed client authentication attempt).
  • Lots of errors in the /var/log/globus/globus-gatekeeper log: GlobusLog2.
  • Bad file descriptors in CondorLog2 (/var/log/condor/SchedLog) but none since last restart (CondorLog3).
  • Looked at one job with condor_q -l and found:
    • HoldReason = "Cannot access initial working directory /home/usatlas1/gram_scratch_Arp7W3hQoC: No such file or directory"
    • However, I was able to see that directory: GramScratchUsatlas1
    • I wonder if there are remnant file handles still in use by Condor. How to find out? (One way is sketched after this list.)
  • Punt. I decided to do a full reboot of tier2-osg. It did not help; same symptoms.
  • In the Globus gatekeeper log there is a note about an expired CRL: GlobusLog3. A quick way to check CRL expiry is sketched below.
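
A sketch of how to follow up on the held jobs and the remnant-file-handle question, using standard Condor and lsof commands; the /home/usatlas1 path is taken from the hold reason above:

# list all held jobs together with their hold reasons
condor_q -hold

# check whether any condor daemon still holds open file handles under the
# submit user's home directory (requires lsof on tier2-osg)
lsof -c condor | grep /home/usatlas1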
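
The expired-CRL note can be checked directly against the installed CA files; a sketch, assuming the usual /etc/grid-security/certificates layout:

# print the nextUpdate (expiry) time of every installed CRL; any date in the
# past would explain the failed authentications
for crl in /etc/grid-security/certificates/*.r0; do
    echo -n "$crl: "
    openssl crl -in "$crl" -noout -nextupdate
done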

Charles Notes Feb 12

  • On worker node c021, condor_q -global shows not only the schedd on the head node but also many schedds on the worker nodes:

cgw@c021~$ condor_q -global


-- Schedd: tier2-osg.uchicago.edu : <10.255.255.253:33596>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
334046.0   usatlas1        2/10 14:40   0+20:47:38 H  0   2949.2 data -a /share/app
335652.0   usatlas1        2/10 23:53   0+16:14:42 H  0   722.7 data -a /share/app
[..many lines...]
341425.0   usatlas1        2/12 10:46   0+00:00:00 I  0   9.8  data -a /share/app
341426.0   usatlas1        2/12 10:46   0+00:00:00 I  0   9.8  data -a /share/app

163 jobs; 4 idle, 106 running, 53 held


-- Schedd: c002.local : <10.255.255.2:45026>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   usatlas1        9/20 10:43   0+00:00:00 H  0   9.8  .condor_run.3768  

1 jobs; 0 idle, 0 running, 1 held


-- Schedd: c044.local : <10.255.255.44:35157>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   usatlas1        9/20 10:43   0+00:00:00 H  0   9.8  .condor_run.3768  

Have seen this behavior before.

  • This is due to an uncorrected error in the condor_config file being propagated via cloner: there was an incorrect line telling Condor to start a schedd on the worker nodes. This has been fixed.
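
For the record, the relevant knob is DAEMON_LIST in condor_config. Roughly, the lines would look like this (the head-node list is an assumption about the local setup, not taken from the actual config):

# worker nodes (execute-only) -- no SCHEDD here
DAEMON_LIST = MASTER, STARTD

# head node (tier2-osg) -- submit plus central-manager daemons
# DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD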

  • However, while restarting Condor I saw:

c023    Shutting down Condor (fast-shutdown mode)
c024    Condor not running
c025    Shutting down Condor (fast-shutdown mode)
c026    Shutting down Condor (fast-shutdown mode)
c027    Shutting down Condor (fast-shutdown mode)
c028    Shutting down Condor (fast-shutdown mode)
c029    Shutting down Condor (fast-shutdown mode)
c030    Shutting down Condor (fast-shutdown mode)
c031    Condor not running
c032    Shutting down Condor (fast-shutdown mode)

Not clear why Condor was not running on c024 and c031; a first check is sketched below.
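
To answer the c024/c031 question, a first step would be to see whether condor_master is up at all and what its log says. A sketch only; the MasterLog path assumes the same /var/log/condor layout as the SchedLog above:

for node in c024 c031; do
    echo "== $node =="
    # is condor_master running on the node?
    ssh $node 'pgrep -l condor_master || echo condor_master not running'
    # the tail of the master log usually records why it exited
    ssh $node 'tail -n 5 /var/log/condor/MasterLog'
done
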
  • After removing SCHEDD from DAEMON_LIST in condor_config and restarting Condor on all nodes, the schedd is still running on the worker nodes. Why? Some checks are sketched below.
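
Some checks that might explain the lingering schedd; these use standard Condor tools, and c002 is just one of the worker nodes from the condor_q output above:

# did the DAEMON_LIST change actually reach the worker node (e.g. via cloner)?
ssh c002 condor_config_val DAEMON_LIST

# which config files is Condor actually reading? a local config file could be
# re-adding SCHEDD
ssh c002 condor_config_val -config

# as a fallback, ask the master to shut the schedd down explicitly
ssh c002 condor_off -daemon schedd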

-- CharlesWaldman - 12 Feb 2008

-- RobGardner - 12 Feb 2008