CondorUc3Cloud

Intro

This page captures some admin notes for the uc3-cloud.uchicago.edu host.

Status

Done:

  • Condor installed via puppet (Suchandra)
  • custom configurations in /etc/condor/condor_config.local.marco (uc3-sub different from uc3-cloud)
  • uc3-pool would be a better name (Condor pool, not Condor cloud)
  • Collector is: uc3-cloud.uchicago.edu
  • Monitoring pool is: uc3-cloud.uchicago.edu:39618 with CondorView server history directory (POOL_HISTORY_DIR) /opt/condorhistory
  • Schedd (submit host) is: uc3-sub.uchicago.edu
  • Configured to accept jobs from uc3-sub
  • started monitoring collector
  • moved the custom configurations in /etc/condor/condor_config.local.marco to Puppet

Todo:
  • getting an error on collector status
  • Test submission from uc3-sub with flocking

Resources are showing in the monitoring collector!
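
The quickest check of the point above is to query both collectors directly; a hedged sketch, with host names and the 39618 port taken from the configuration below:

# Machines registered in the main pool collector
condor_status -pool uc3-cloud.uchicago.edu
# Ads forwarded to the monitoring (CondorView) collector
condor_status -any -pool uc3-cloud.uchicago.edu:39618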

Install

Requirements:
  • one host for the collector, negotiator and CondorView
  • one host for the submission (schedd)
  • other clusters can join the pool
    • startd configured to join the pool on uc3-cloud(:9618) (see the sketch after this list)
    • open question: do they have to have the same UID_DOMAIN?
  • other clusters can allow flocking
    • collector allows flocking from the submit host uc3-sub
    • collector reports to the monitoring collector uc3-cloud:39618
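
A minimal sketch of what a worker node on another cluster would set in its local configuration to join this pool; the values are assumptions based on the notes above, not taken from an actual worker configuration:

# Point this execute node at the UC3 central manager
CONDOR_HOST = uc3-cloud.uchicago.edu
# Run only the master and the startd on this host
DAEMON_LIST = MASTER, STARTD
# Match the pool domain if required (see the open question above about UID_DOMAIN)
UID_DOMAIN = uc3.org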

Information about having an aggregate collector: http://research.cs.wisc.edu/condor/manual/v7.7/3_13Setting_Up.html#sec:Contrib-CondorView-Install

Setting POOL_HISTORY_DIR and KEEP_POOL_HISTORY is optional. If the machine running your aggregate collector isn't already running a collector, you can just start a collector running on the default port and skip the VIEW_SERVER parameters.

uc3-cloud
## UC3 configuration
# Domain
UID_DOMAIN = uc3.org
# Restoring defaults
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
CONDOR_ADMIN = root@$(FULL_HOSTNAME)
#ALLOW_WRITE = *.$(UID_DOMAIN)

#  What machine is your central manager?
#CONDOR_HOST = $(FULL_HOSTNAME)
#CONDOR_HOST = uc3-cloud.mwt2.org
CONDOR_HOST = uc3-cloud.uchicago.edu

# Pool's short description
COLLECTOR_NAME = UC3 Condor pool

## It will not actually run jobs
# Job configuration
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE

# Condor Collector for monitoring (CondorView)
VIEW_SERVER = $(COLLECTOR)
VIEW_SERVER_ARGS = -f -p 39618
VIEW_SERVER_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/ViewServerLog"
# CondorView parameters
POOL_HISTORY_DIR = /opt/condorhistory
#POOL_HISTORY_MAX_STORAGE =
KEEP_POOL_HISTORY = TRUE

# This is the CondorView collector
CONDOR_VIEW_HOST = uc3-cloud.uchicago.edu:39618

##  This macro determines what daemons the condor_master will start and keep its watchful eyes on.
##  The list is a comma or space separated list of subsystem names
#  startd only for test
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, VIEW_SERVER, STARTD
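
After restarting Condor with this configuration, the master should be running two condor_collector processes (the pool collector and the view server) plus the negotiator and the test startd; a rough check:

# Expect condor_master, two condor_collector instances, condor_negotiator and condor_startd
ps ax | grep condor_ | grep -v grep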

uc3-sub

To register in the uc3-cloud collector:
CONDOR_HOST = uc3-cloud.uchicago.edu

To allow flocking and forward reporting to the monitoring collector:
# Hosts that can flock jobs to this cluster.
FLOCK_FROM = uc3-cloud.uchicago.edu, uc3-cloud.mwt2.org, uc3-sub.uchicago.edu, uc3-sub.mwt2.org, ui-cr.uchicago.edu, ui-cr.mwt2.org
# Internal IP addresses of the cluster. ADD YOUR WORKER NODES!
INTERNAL_IPS = 10.1.3.* 10.1.4.* 10.1.5.* 128.135.158.241 uct2-6509.uchicago.edu

# This is the CondorView collector
CONDOR_VIEW_HOST = uc3-cloud.uchicago.edu:39618
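
Once uc3-sub points at the uc3-cloud collector its schedd should show up there; a quick hedged check from any host with the Condor tools installed:

# The uc3-sub schedd should appear in the pool collector
condor_status -schedd -pool uc3-cloud.uchicago.edu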

Install view client

To install the CondorView client I followed the instructions documented in this email:

There are three pieces to Condor View that you need to understand.
  1. The piece that collects the statistics, the condor_collector
  2. The piece that queries the statistics, condor_stats
  3. The piece that displays the statistics as a web page, the Condor View client

To have a Condor View collector, you will add a second collector to your existing Condor pool. Things will be set up like this: 
Normal Collector ----> View Collector <---> condor_stats

The collector and condor_stats tool come directly from your Condor installation. If you've installed Condor, you've already installed 2/3 of the binaries you need, and they are the correct version. Confusingly, the Condor contrib downloads web page lets you download a view collector for an earlier version. That is okay--you already have a perfectly good view collector for Condor.

You need one more piece, and it's the Condor View Client. This is a contrib module. It looks old, but it's the most recent one that we've released. It works fine with the latest versions of Condor. We know this because we use it ourselves.
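
For the query piece, condor_stats is pointed at the view collector port; the commands below are a sketch assuming the standard condor_stats query options and the monitoring collector configured on this page:

# List the resources the view collector has history for
condor_stats -pool uc3-cloud.uchicago.edu:39618 -resourcelist
# History for a single machine (the hostname is only an example)
condor_stats -pool uc3-cloud.uchicago.edu:39618 -resourcequery uc3-sub.uchicago.edu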

Annotated Condor configuration

There are three configuration files plus a config.d directory, all in /etc/condor:
  • condor_config should be the stock configuration as it comes from the RPM install. It is actually the one from the MWT2 setup.
  • config.d, a directory with additional separate setups
  • condor_config.local, the customization explained below
  • condor_config.override, a temporary override for testing, not in Puppet. It may contain some of the entries below or be an empty file.

The Condor host is uc3-cloud.uchicago.edu to make it visible from the public network. Both uc3-sub and uc3-cloud should always advertise the public IP even if they are listening on all the NICs.
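
To check which of these files Condor actually loads, and what it will advertise, something like the following (standard condor_config_val options):

# Configuration files in use, in the order they are read
condor_config_val -config
# Advertised interface, central manager and domain
condor_config_val NETWORK_INTERFACE CONDOR_HOST UID_DOMAIN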

Below is the full content of condor_config.local, with notes explaining the sections.

uc3-sub

To load the override file with variable overrides:
# Override file used for testing and temporary overrides.
# It is local, not in puppet.
# It should be empty once tests are over
LOCAL_CONFIG_FILE       = /etc/condor/condor_config.override

UC3 cluster host settings (should be the same on all nodes)
  • note the use of the uchicago.edu FQDN
## UC3 configuration
# Domain
UID_DOMAIN = uc3-cloud
# Restoring defaults
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
CONDOR_ADMIN = root@$(FULL_HOSTNAME)
#ALLOW_WRITE = *.$(UID_DOMAIN)
EMAIL_DOMAIN = $(FULL_HOSTNAME)

#  What machine is your central manager?
CONDOR_HOST = uc3-cloud.uchicago.edu

# Pool's short description
COLLECTOR_NAME = UC3 Condor pool

Network configuration:
  • daemons listen on all network interfaces
  • the public IP/name is the one advertised
  • using the shared port daemon, all incoming connections arrive on port 9618
  • the private network may be set so that communication with all hosts on it goes over a different NIC; we prefer to leave it unset
## Network interfaces
# default BIND_ALL_INTERFACES = True
NETWORK_INTERFACE = 128.135.158.243
# PRIVATE_NETWORK_INTERFACE = 10.1.3.94
# PRIVATE_NETWORK_NAME = mwt2.org

# Shared port to allow ITS connections
# Added also to the DAEMON_LIST
USE_SHARED_PORT = True
SHARED_PORT_ARGS = -p 9618
#COLLECTOR_HOST  = $(CONDOR_HOST)?sock=collector
#UPDATE_COLLECTOR_WITH_TCP = TRUE
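
To verify that the shared port daemon is the only listener on 9618 (a hedged check with standard Linux tools; the daemon shows up as condor_shared_port):

# Only condor_shared_port should be bound to port 9618
ss -ltnp | grep 9618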

Security settings (will move to ssh and condor mapfile - 3.6.4)
  • sec_default_authentication: REQUIRED, PREFERRED, OPTIONAL, NEVER
  • sec_default_authentication_methods: GSI, SSL, KERBEROS, PASSWORD, FS, FS_REMOTE, NTSSPI, CLAIMTOBE, ANONYMOUS
  • SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION - used in the schedd and startd, especially in glidein systems or systems with high latency
SEC_DEFAULT_AUTHENTICATION = PREFERRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS,CLAIMTOBE
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True
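
A hedged way to see which authentication method is actually negotiated is condor_ping, run for a given authorization level (WRITE here); the option spelling below is an assumption based on the standard tool:

# Run on uc3-sub: reports the authentication method and the authorization decision for WRITE against the schedd
condor_ping -verbose -type SCHEDD WRITE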

A Condor schedd can flock to one or more collectors to submit jobs outside its pool. They are contacted in the order they are listed:
#FLOCK_TO = itb2.uchicago.edu:39618, uc3-mgt.uchicago.edu, siraf-login.bsd.uchicago.edu, condor.mwt2.org, itbv-condor.mwt2.org, itb2.uchicago.edu:39618
#FLOCK_TO = uc3-mgt.uchicago.edu, itb2.uchicago.edu:39618, appcloud01.uchicago.edu?sock=collector, siraf-login.bsd.uchicago.edu, condor.mwt2.org, itbv-condor.mwt2.org
FLOCK_TO = uc3-mgt.uchicago.edu, itb2.uchicago.edu:39618, appcloud01.uchicago.edu?sock=collector, siraf-login.bsd.uchicago.edu, condor.mwt2.org

#ALLOW_WRITE = itb2.uchicago.edu, itb2.mwt2.org, uc3-sub.uchicago.edu, uc3-sub.mwt2.org, uc3-cloud.uchicago.edu, uc3-cloud.mwt2.org
ALLOW_NEGOTIATOR = $(ALLOW_NEGOTIATOR), 128.135.158.225, itb2.uchicago.edu, itb2.mwt2.org, appcloud01.uchicago.edu, condor.mwt2.org
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR_SCHEDD), 128.135.158.225, itb2.uchicago.edu, itb2.mwt2.org, appcloud01.uchicago.edu, condor.mwt2.org
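
To exercise flocking (one of the Todo items above) a minimal vanilla-universe test can be submitted from uc3-sub; the file name and paths below are only illustrative:

# test_flock.sub - hypothetical minimal test job
universe   = vanilla
executable = /bin/hostname
output     = test_flock.$(Cluster).$(Process).out
error      = test_flock.$(Cluster).$(Process).err
log        = test_flock.log
queue 5

Submit it with condor_submit test_flock.sub and watch where the jobs run with condor_q -run; jobs only flock to the FLOCK_TO pools when they cannot be matched locally.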

Daemons started on uc3-sub:
##  This macro determines what daemons the condor_master will start and keep its watchful eyes on.
##  The list is a comma or space separated list of subsystem names
#  startd only for test
DAEMON_LIST = MASTER, SCHEDD, SHARED_PORT

Necessary only if you temporarily add a startd to the list above in order to test jobs locally:
## It will not actually run jobs
# Job configuration
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE

uc3-cloud

Sections marked SAA (Same As Above) are the same as the ones used on the uc3-sub host.

SAA - To load the override file with variable overrides:
# Override file used for testing and temporary overrides.
# It is local, not in puppet.
# It should be empty once tests are over
LOCAL_CONFIG_FILE       = /etc/condor/condor_config.override

SAA - UC3 cluster host settings (should be the same on all nodes)
  • note the use of the uchicago.edu FQDN
## UC3 configuration
# Domain
#UID_DOMAIN = uc3.org
UID_DOMAIN = uc3-cloud
# Restoring defaults
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
CONDOR_ADMIN = root@$(FULL_HOSTNAME)
EMAIL_DOMAIN = $(FULL_HOSTNAME)
#  What machine is your central manager?
CONDOR_HOST = uc3-cloud.uchicago.edu

# Pool's short description
COLLECTOR_NAME = UC3 Condor pool

Network configuration:
  • daemons listen on all network interfaces
  • the public IP/name is the one advertised
  • the private network may be set so that communication with all hosts on it goes over a different NIC; we prefer to leave it unset
## Network interfaces
# default BIND_ALL_INTERFACES = True
NETWORK_INTERFACE = 128.135.158.205
# PRIVATE_NETWORK_INTERFACE = 10.1.3.93
# PRIVATE_NETWORK_NAME = mwt2.org

SAA - Security settings (will move to ssh and condor mapfile), no SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION for the collector/negotiator
SEC_DEFAULT_AUTHENTICATION = PREFERRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS,CLAIMTOBE
#SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True

Flocking:
  • schedds allowed to flock to this collector
  • note that flocked jobs are matched only with resources in this pool; the collector does not forward them to other collectors
# To allow flocking

# Hosts that can flock jobs to this cluster.
FLOCK_FROM = dcs-mjd.uchicago.edu

ALLOW_WRITE = *.mwt2.org, *.uchicago.edu, iut2-*.iu.edu, 128.135.158.*, dcs-mjd.uchicago.edu, condortst0.uchicago.edu, 10.1.3.*, 10.1.4.*, 10.1.5.*

# To allow flocking and monitoring
# Internal IP addresses of the cluster. ADD YOUR WORKER NODES!
INTERNAL_IPS = dcs-mjd.uchicago.edu condortst0.uchicago.edu 10.1.3.* 10.1.4.* 10.1.5.* 128.135.158.241 uct2-6509.uchicago.edu 128.135.158.235
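
When a remote schedd flocks in it shows up as a submitter in this pool; a quick hedged check:

# Submitter ads known to the pool collector, including flocked schedds
condor_status -submitters -pool uc3-cloud.uchicago.edu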

The view server is a second collector that receives ads from other daemons (collectors, schedds, ...) and logs them to files
  • runs on port 39618
# Condor Collector for monitoring (CondorView)
VIEW_SERVER = $(COLLECTOR)
VIEW_SERVER_ARGS = -f -p 39618 -local-name VIEW_SERVER
VIEW_SERVER_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/ViewServerLog"
# CondorView parameters
VIEW_SERVER.POOL_HISTORY_DIR = /opt/condorhistory
#POOL_HISTORY_MAX_STORAGE =
VIEW_SERVER.KEEP_POOL_HISTORY = TRUE
VIEW_SERVER.CONDOR_VIEW_HOST =

The main collector on this host reports to the CondorView collector:
# This is the CondorView collector
CONDOR_VIEW_HOST = uc3-cloud.uchicago.edu:39618
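
To confirm the forwarding works end to end, the view server should accumulate history files in POOL_HISTORY_DIR and write its own log; the log path below assumes the RPM default LOG directory /var/log/condor:

# History files written by the view server
ls -l /opt/condorhistory
tail /var/log/condor/ViewServerLog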

Daemons started on uc3-cloud:
  • note VIEW_SERVER, the second collector that collects the CondorView data
##  This macro determines what daemons the condor_master will start and keep its watchful eyes on.
##  The list is a comma or space separated list of subsystem names
#  startd only for test
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, VIEW_SERVER, STARTD

For now the startd on this host also runs jobs:
## It will not actually run jobs
# Job configuration
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE


-- MarcoMambelli - 08 Mar 2012