CondorUc3Cloud
Intro
This page captures some admin notes for the
uc3-cloud.uchicago.edu host.
Status
Done:
- Condor installed via puppet (Suchandra)
- custom configurations in /etc/condor/condor_config.local.marco (uc3-sub is different from uc3-cloud)
- uc3-pool would be a better name (Condor pool, not Condor cloud)
- Collector is: uc3-cloud.uchicago.edu
- Monitoring pool is: uc3-cloud.uchicago.edu:39618, with the CondorView server history directory (POOL_HISTORY_DIR) in /opt/condorhistory
- Schedd (submit host) is: uc3-sub.uchicago.edu
- Configured to accept jobs from uc3-sub
- started the monitoring collector
- moved the custom configurations in /etc/condor/condor_config.local.marco to puppet
Todo:
- getting an error on collector status
- Test submission from uc3-sub with flocking (see the sketch below)
Resources are showing in the monitoring collector!
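A minimal flocking test from uc3-sub could look like the sketch below; the sleep job and file names are illustrative, not part of the actual setup.
# On uc3-sub, as a regular user (hypothetical test job)
cat > sleep.sub <<'EOF'
universe   = vanilla
executable = /bin/sleep
arguments  = 300
log        = sleep.log
output     = sleep.out
error      = sleep.err
queue 1
EOF
condor_submit sleep.sub
# Watch the job; if it cannot match locally it should flock to the FLOCK_TO pools
condor_q
condor_q -run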
Install
Requirements:
- one host for the collector, negotiator and condorview
- one host for the submission (schedd)
- other clusters can join the pool
- startd configured to join the pool on uc3-cloud(:9618); a configuration sketch follows this list
- to verify: do they have to have the same UID_DOMAIN?
- other cluster can allow flocking
- collector allows flocking from the submit host uc3-sub
- collector reports to the monitoring collector uc3-cloud:39618
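A startd on another cluster that wants to join the pool would need roughly the local configuration sketched below. Only the hostnames come from this page; the UID_DOMAIN value and the ALLOW_WRITE entries are assumptions to adapt to the actual setup.
# Sketch for a worker node joining the UC3 pool (adapt before use)
CONDOR_HOST = uc3-cloud.uchicago.edu
UID_DOMAIN  = uc3-cloud            # assumed; must match the pool if a common UID_DOMAIN is required
ALLOW_WRITE = $(ALLOW_WRITE), uc3-cloud.uchicago.edu, uc3-sub.uchicago.edu
DAEMON_LIST = MASTER, STARTD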
Information about having an aggregate collector:
http://research.cs.wisc.edu/condor/manual/v7.7/3_13Setting_Up.html#sec:Contrib-CondorView-Install
Setting POOL_HISTORY_DIR and KEEP_POOL_HISTORY is optional. If the machine running your aggregate collector isn't already running a collector, you can just start a collector running on the default port and skip the VIEW_SERVER parameters.
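Once the aggregate (monitoring) collector is running, it can be queried directly with condor_status; the port is the one used on this page.
# Query the monitoring collector on port 39618
condor_status -pool uc3-cloud.uchicago.edu:39618          # startd ads
condor_status -pool uc3-cloud.uchicago.edu:39618 -schedd  # schedd ads
condor_status -pool uc3-cloud.uchicago.edu:39618 -any     # all the ads it has received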
uc3-cloud
## UC3 configuration
# Domain
UID_DOMAIN = uc3.org
# Restoring defaults
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
CONDOR_ADMIN = root@$(FULL_HOSTNAME)
#ALLOW_WRITE = *.$(UID_DOMAIN)
# What machine is your central manager?
#CONDOR_HOST = $(FULL_HOSTNAME)
#CONDOR_HOST = uc3-cloud.mwt2.org
CONDOR_HOST = uc3-cloud.uchicago.edu
# Pool's short description
COLLECTOR_NAME = UC3 Condor pool
## It will not actually run jobs
# Job configuration
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
# Condor Collector for monitoring (CondorView)
VIEW_SERVER = $(COLLECTOR)
VIEW_SERVER_ARGS = -f -p 39618
VIEW_SERVER_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/ViewServerLog"
# CondorView parameters
POOL_HISTORY_DIR = /opt/condorhistory
#POOL_HISTORY_MAX_STORAGE =
KEEP_POOL_HISTORY = TRUE
# This is the CondorView collector
CONDOR_VIEW_HOST = uc3-cloud.uchicago.edu:39618
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
# startd only for test
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, VIEW_SERVER, STARTD
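After changing the configuration the daemons have to pick it up; a full restart is the safe choice when DAEMON_LIST changes. The service name below assumes the stock RPM init script.
# On uc3-cloud, as root
service condor restart   # needed when DAEMON_LIST changes
condor_reconfig          # enough for most other parameter changes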
uc3-sub
To register in the uc3-cloud collector:
CONDOR_HOST = uc3-cloud.uchicago.edu
To allow flocking and forward reporting to monitoring collector:
# What hosts can submit (flock) jobs to this cluster.
FLOCK_FROM = uc3-cloud.uchicago.edu, uc3-cloud.mwt2.org, uc3-sub.uchicago.edu, uc3-sub.mwt2.org, ui-cr.uchicago.edu, ui-cr.mwt2.org
# Internal IP addresses of the cluster. ADD YOUR WORKER NODES!
INTERNAL_IPS = 10.1.3.* 10.1.4.* 10.1.5.* 128.135.158.241 uct2-6509.uchicago.edu
# This is the CondorView collector
CONDOR_VIEW_HOST = uc3-cloud.uchicago.edu:39618
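To verify that uc3-sub registered with the uc3-cloud collector, something like the following should show its schedd ad (and the copy forwarded to the monitoring collector):
condor_status -pool uc3-cloud.uchicago.edu -schedd          # uc3-sub should be listed
condor_status -pool uc3-cloud.uchicago.edu:39618 -schedd    # same ad forwarded to the monitoring collector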
Install view client
To install it I followed the instructions documented in the email below:
There are three pieces to Condor View that you need to understand.
- The piece that collects the statistics, the condor_collector
- The piece that queries the statistics, condor_stats
- The piece that displays the statistics as a web page, the Condor View client
To have a Condor View collector, you will add a second collector to your existing Condor pool. Things will be set up like this:
Normal Collector ----> View Collector <---> condor_stats
The collector and condor_stats tool come directly from your Condor installation. If you've installed Condor, you've already installed 2/3 of the binaries you need, and they are the correct version. Confusingly, the Condor contrib downloads web page lets you download a view collector for an earlier version. That is okay--you already have a perfectly good view collector for Condor.
You need one more piece, and it's the Condor View Client. This is a contrib module. It looks old, but it's the most recent one that we've released. It works fine with the latest versions of Condor. We know this because we use it ourselves.
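The condor_stats tool can be pointed at the monitoring collector to check that history is being recorded; the query options below should be double-checked against the condor_stats man page of the installed version.
# Query the CondorView collector for recorded statistics
condor_stats -pool uc3-cloud.uchicago.edu:39618 -resourcelist
condor_stats -pool uc3-cloud.uchicago.edu:39618 -userlist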
Annotated Condor configuration
There are three files and one directory, all in /etc/condor (the command after this list shows which files Condor actually reads):
- condor_config : should be the stock configuration coming from the RPM install. It is actually the one from the MWT2 setup.
- config.d : a directory with additional separate setup files
- condor_config.local : the customization explained below
- condor_config.override : a temporary override for testing, not in Puppet. It may contain some of the entries or be an empty file.
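To see which configuration files Condor actually reads, and in which order, condor_config_val can list them:
condor_config_val -config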
The Condor host is uc3-cloud.uchicago.edu, so that it is visible from the public network. Both uc3-sub and uc3-cloud should always advertise the public IP even if they are listening on all the NICs.
Below is the full content of condor_config.local, with notes explaining the sections.
uc3-sub
To load the override file with variable overrides:
# Override file used for testing and temporary overrides.
# It is local, not in puppet.
# It should be empty once tests are over
LOCAL_CONFIG_FILE = /etc/condor/condor_config.override
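To check whether an override is in effect, condor_config_val with -v prints the value of a parameter and where it was defined; CONDOR_HOST is just an example parameter.
condor_config_val -v CONDOR_HOST
condor_config_val -v LOCAL_CONFIG_FILE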
UC3 cluster host settings (should be the same on all nodes)
- note the use of the uchicago.edu FQDN
## UC3 configuration
# Domain
UID_DOMAIN = uc3-cloud
# Restoring defaults
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
CONDOR_ADMIN = root@$(FULL_HOSTNAME)
#ALLOW_WRITE = *.$(UID_DOMAIN)
EMAIL_DOMAIN = $(FULL_HOSTNAME)
# What machine is your central manager?
CONDOR_HOST = uc3-cloud.uchicago.edu
# Pool's short description
COLLECTOR_NAME = UC3 Condor pool
Network configuration:
- daemons listen on all network interfaces
- the public IP/name is the one advertised
- using the shared port daemon, all incoming connections arrive on port 9618
- the private network may be set so that communication with all hosts on it goes over a different NIC. We prefer not to set it and leave the default behavior
## Network interfaces
# default BIND_ALL_INTERFACES = True
NETWORK_INTERFACE = 128.135.158.243
# PRIVATE_NETWORK_INTERFACE = 10.1.3.94
# PRIVATE_NETWORK_NAME = mwt2.org
# Shared port to allow ITS connections
# Added also to the DAEMON_LIST
USE_SHARED_PORT = True
SHARED_PORT_ARGS = -p 9618
#COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
#UPDATE_COLLECTOR_WITH_TCP = TRUE
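With the shared port daemon enabled, all Condor daemons on uc3-sub should be reachable through the single port 9618; a rough check (run on uc3-sub; netstat needs root to show process names):
condor_config_val USE_SHARED_PORT SHARED_PORT_ARGS
netstat -lnp 2>/dev/null | grep ':9618'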
Security settings (will move to ssh and condor mapfile - 3.6.4)
- sec_default_authentication: REQUIRED, PREFERRED, OPTIONAL, NEVER
- sec_default_authentication_methods: GSI, SSL, KERBEROS, PASSWORD, FS, FS_REMOTE, NTSSPI, CLAIMTOBE, ANONYMOUS
- SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION - used in schedd and startd, especially in glidein systems or systems with high latency
SEC_DEFAULT_AUTHENTICATION = PREFERRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS,CLAIMTOBE
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True
A Condor schedd can flock to one or more collectors to submit jobs outside its pool. They are contacted in the order in which they are listed:
#FLOCK_TO = itb2.uchicago.edu:39618, uc3-mgt.uchicago.edu, siraf-login.bsd.uchicago.edu, condor.mwt2.org, itbv-condor.mwt2.org, itb2.uchicago.edu:39618
#FLOCK_TO = uc3-mgt.uchicago.edu, itb2.uchicago.edu:39618, appcloud01.uchicago.edu?sock=collector, siraf-login.bsd.uchicago.edu, condor.mwt2.org, itbv-condor.mwt2.org
FLOCK_TO = uc3-mgt.uchicago.edu, itb2.uchicago.edu:39618, appcloud01.uchicago.edu?sock=collector, siraf-login.bsd.uchicago.edu, condor.mwt2.org
#ALLOW_WRITE = itb2.uchicago.edu, itb2.mwt2.org, uc3-sub.uchicago.edu, uc3-sub.mwt2.org, uc3-cloud.uchicago.edu, uc3-cloud.mwt2.org
ALLOW_NEGOTIATOR = $(ALLOW_NEGOTIATOR), 128.135.158.225, itb2.uchicago.edu, itb2.mwt2.org, appcloud01.uchicago.edu, condor.mwt2.org
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR_SCHEDD), 128.135.158.225, itb2.uchicago.edu, itb2.mwt2.org, appcloud01.uchicago.edu, condor.mwt2.org
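Flocking activity can be followed in the schedd log on uc3-sub; the log path below assumes the RPM default, otherwise check condor_config_val LOG.
grep -i flock /var/log/condor/SchedLog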
Daemons started on uc3-sub:
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
# startd only for test
DAEMON_LIST = MASTER, SCHEDD, SHARED_PORT
Necessary only if you temporarily add a startd to the list above in order to test jobs locally:
## It will not actually run jobs
# Job configuration
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
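A quick way to confirm which daemons the master actually started on uc3-sub (plain process listing, nothing Condor-specific in the egrep pattern):
condor_config_val DAEMON_LIST
ps -ef | egrep 'condor_(master|schedd|shared_port)'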
uc3-cloud
Sections marked SAA (Same As Above) are the same as the ones used on the uc3-sub host.
SAA - To load the override file with variable overrides
# Override file used for testing and temporary overrides.
# It is local, not in puppet.
# It should be empty once tests are over
LOCAL_CONFIG_FILE = /etc/condor/condor_config.override
SAA - UC3 cluster host settings (should be the same on all nodes)
- note the use of the uchicago.edu FQDN
## UC3 configuration
# Domain
#UID_DOMAIN = uc3.org
UID_DOMAIN = uc3-cloud
# Restoring defaults
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
CONDOR_ADMIN = root@$(FULL_HOSTNAME)
EMAIL_DOMAIN = $(FULL_HOSTNAME)
# What machine is your central manager?
CONDOR_HOST = uc3-cloud.uchicago.edu
# Pool's short description
COLLECTOR_NAME = UC3 Condor pool
Network configuration:
- daemons listen on all network interfaces
- the public IP/name is the one advertised
- the private network may be set so that communication with all hosts on it goes over a different NIC. We prefer not to set it and leave the default behavior
## Network interfaces
# default BIND_ALL_INTERFACES = True
NETWORK_INTERFACE = 128.135.158.205
# PRIVATE_NETWORK_INTERFACE = 10.1.3.93
# PRIVATE_NETWORK_NAME = mwt2.org
SAA - Security settings (will move to ssh and condor mapfile), no SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION for the collector/negotiator
SEC_DEFAULT_AUTHENTICATION = PREFERRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS,CLAIMTOBE
#SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True
Flocking:
- schedd allowed to flock to this collector
- note that the collector sends jobs only to its own pool, not to other collectors
# To allow flocking
# What hosts can submit (flock) jobs to this cluster.
FLOCK_FROM = dcs-mjd.uchicago.edu
ALLOW_WRITE = *.mwt2.org, *.uchicago.edu, iut2-*.iu.edu, 128.135.158.*, dcs-mjd.uchicago.edu, condortst0.uchicago.edu, 10.1.3.*, 10.1.4.*, 10.1.5.*
# To allow flocking and monitoring
# Internal IP addresses of the cluster. ADD YOUR WORKER NODES!
INTERNAL_IPS = dcs-mjd.uchicago.edu condortst0.uchicago.edu 10.1.3.* 10.1.4.* 10.1.5.* 128.135.158.241 uct2-6509.uchicago.edu 128.135.158.235
The view server is a second collector that receives ads from other daemons (collectors, schedds, ...) and logs them to file:
# Condor Collector for monitoring (CondorView)
VIEW_SERVER = $(COLLECTOR)
VIEW_SERVER_ARGS = -f -p 39618 -local-name VIEW_SERVER
VIEW_SERVER_ENVIRONMENT = "_CONDOR_COLLECTOR_LOG=$(LOG)/ViewServerLog"
# CondorView parameters
VIEW_SERVER.POOL_HISTORY_DIR = /opt/condorhistory
#POOL_HISTORY_MAX_STORAGE =
VIEW_SERVER.KEEP_POOL_HISTORY = TRUE
VIEW_SERVER.CONDOR_VIEW_HOST =
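The VIEW_SERVER.* values can be inspected with condor_config_val; the -local-name option should be verified against the man page of the installed version.
condor_config_val -local-name VIEW_SERVER POOL_HISTORY_DIR
condor_config_val -local-name VIEW_SERVER KEEP_POOL_HISTORY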
The first collector on this host reports to the CondorView collector:
# This is the CondorView collector
CONDOR_VIEW_HOST = uc3-cloud.uchicago.edu:39618
Daemons started on uc3-cloud:
- note VIEW_SERVER, the second collector that gathers the CondorView data
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
# startd only for test
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, VIEW_SERVER, STARTD
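On uc3-cloud both the main collector and the view server run as condor_collector processes; a rough check that both are up and listening on the expected ports (netstat needs root to show process names):
ps -ef | grep condor_collector
netstat -lnp 2>/dev/null | egrep ':(9618|39618)'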
Job configuration for the test startd; for now jobs are allowed to start:
## It will not actually run jobs
# Job configuration
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
--
MarcoMambelli - 08 Mar 2012