Tier3 Cluster Flocking into ATLAS Connect
Overview
There are only two steps needed to allow your cluster to flock jobs into
ATLAS Connect. Jobs are routed by the Remote Cluster Connect Factory (RCCF), which handles HTCondor submissions into the various connected clusters participating in ATLAS Connect.
- MWT2 Administrators must enable access for your Local SCHEDD node.
- You must enable GSI security and add the MWT2 Remote Cluster Connect Server name to your Local SCHEDD FLOCK_TO HTCondor variable.
Note:
It is expected that your Local Site Administrator acts as a registration agent on behalf of the institution's users and assumes responsibility for the actions of this user community.
Ask MWT2 Administrators to enable access
Send an email to the MWT2 Administrators, support@mwt2.org, with the following information
- Full name for the organization such as "University of Illinois at Urbana-Champaign", "University of Chicago", "Argonne National Lab" or "Duke University"
- Your site's nickname, which should be taken from the institution's domain name: "uiuc", "uchicago", "duke", "anl"
- Your site's Administrator/Contact name(s) and email address(es)
- Your Local Site SCHEDD host fully qualified domain name (FQDN)
- Your Local Site SCHEDD host Distinguished Name (DN)
The MWT2 Administrators will respond with a Remote Cluster Connect (RCC) Factory Port used by the RCC Factory on the RCC Factory Server.
This port number is needed when setting up the RCC Flocking.
Certificate Authority and Host Certificate are required
Flocking to an RCC Factory Server requires GSI security to be used by the Local Site HTCondor installation.
GSI requires that your Local Site SCHEDD host have a functioning Certificate Authority (CA) (/etc/grid-security/certificates).
This SCHEDD host must also have a valid host certificate (/etc/grid-security/host[cert,key].pem) which provides the Distinguished Name (DN) of the host.
If the SCHEDD host does not have a functional CA, directions on how to install a CA are located at
Installing Certificate Authorities Certificates and related RPMs
If the SCHEDD host does not have a host certificate, one can be requested in two ways
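Once the CA and host certificate are in place, the DN requested above can be read directly from the certificate. A quick check with openssl, assuming the standard /etc/grid-security paths used throughout this page:
# Print the Distinguished Name (DN) of the host certificate
openssl x509 -in /etc/grid-security/hostcert.pem -noout -subject
# Confirm the certificate chains to a CA installed in the trusted CA directory
openssl verify -CApath /etc/grid-security/certificates /etc/grid-security/hostcert.pem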
Additions to your Local Site HTCondor
The following lines need to be added to /etc/condor/condor_config.local, or added as a drop-in file in /etc/condor/config.d, on the Local Site SCHEDD host.
After adding these lines you need to issue a condor_reconfig command for the changes to take effect.
The following is an example of a drop-in file
# cat /etc/condor/config.d/rcc-flock.conf
# Setup the FLOCK_TO the RCC Factory
FLOCK_TO = $(FLOCK_TO), uct2-bosco.uchicago.edu:<RCC_Factory_Port>?sock=collector
# Allow the RCC Factory server access to our SCHEDD
ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), uct2-bosco.uchicago.edu
# Who do you trust?
GSI_DAEMON_NAME = $(GSI_DAEMON_NAME), /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=uct2-bosco.uchicago.edu
GSI_DAEMON_CERT = /etc/grid-security/hostcert.pem
GSI_DAEMON_KEY = /etc/grid-security/hostkey.pem
GSI_DAEMON_TRUSTED_CA_DIR = /etc/grid-security/certificates
# Enable authentication from the Negotiator (This is required for jobs that run on glideins)
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
<RCC_Factory_Port> should be replaced with the port number assigned by the MWT2 Administrators.
The above GSI_ and SEC_ values are known to work with MWT2. It is possible that your Local Site HTCondor has other requirements which may require different values.
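After the condor_reconfig, it is worth confirming that HTCondor picked up the new values, for example with condor_config_val:
# Verify the flocking target and the trusted factory DN were applied
condor_config_val FLOCK_TO
condor_config_val GSI_DAEMON_NAME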
Firewalls
By default, the MWT2 Administrators will assign RCC Factory Port numbers from a range starting at 11010, incrementing as sites are brought online.
If your site is protected by a firewall, a port from this range may not work. If another port is more appropriate for use at your site, it is possible to use an alternative. The port number must not already be in use by another site flocking into the RCC Factory Server.
For example, if your site currently has the range of ports 9000-9700 open in the firewall,
it is possible for a port in this range to be used as the RCC Factory Port.
If there are no existing holes in the firewall, it is the responsibility of your Local Site Administrator to ask your Local Network Administrators to open one. This hole must allow the node "uct2-bosco.uchicago.edu" to reach, at minimum, the single port MWT2 approves for use in the flocking.
If your Local Site HTCondor configuration uses the SHARED_PORT Daemon, the SHARED_PORT and RCC Factory Port may be the same value.
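A simple outbound connectivity check from the SCHEDD host can catch firewall problems early. If the nc utility is installed, something like the following works; the port shown is only an example, so use the RCC Factory Port assigned by MWT2:
# Test TCP connectivity to the RCC Factory Server (example port 11010)
nc -zv uct2-bosco.uchicago.edu 11010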
Set up a Local Site SCHEDD server for Remote User Computing
If a site does not have a working HTCondor installation, the following procedure can be used to easily set up a SCHEDD-only node enabled for RCC Flocking.
The following requirements must first be met
- An operational Linux node running EL5 or EL6
- The node must be on a public network - it cannot be on a private network behind a NAT router.
- Proper forward/reverse DNS registration of the public network IP (a quick check is shown after this list)
- The host must have a CA
- The host must have a host certificate
- Root access to the node
- Any local firewall must have at least one port completely open in both directions (at a minimum open to all MWT2 subnets)
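The DNS requirement above can be verified with a pair of lookups; the IP address shown is the example submit host from the test output later on this page, so substitute your node's public address:
# Forward lookup of this node's FQDN
host $(hostname -f)
# Reverse lookup of the public IP (replace with your node's address)
host 128.174.118.140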
An attached script, RCCcondor.sh, is provided which will install an RCC-enabled HTCondor SCHEDD on this node.
Do not use this script blindly. Read it carefully and make certain the script will not make undesired changes to your Linux node.
It would be best to use a cleanly installed Linux node which can be rebuilt easily should the end result not be what was expected.
At least one change should be made to this script prior to execution: the variable RCC_Factory_Port must be changed to the value assigned to the Local Site by MWT2 Administrators.
This installation of HTCondor uses the SHARED_PORT feature, which allows HTCondor to use a single port in its communication with remote services.
This port must be open to all nodes which will need to contact this HTCondor installation. It is safe and easiest to completely open this port to the world.
If a more restrictive setting must be used, the port, at a minimum, must be fully open to all MWT2 subnets. The subnet values can be provided upon request.
The variable RCC_Shared_Port controls which port is used by SHARED_PORT. Its default value is the value of RCC_Factory_Port, but it can be changed based on local site requirements. The default setting requires that only a single port be open within the firewall.
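For example, assuming MWT2 assigned port 11010 (a placeholder value), the variable could be set with a one-line edit:
# Set the assigned RCC Factory Port in the script (11010 is a placeholder)
sed -i 's/^RCC_Factory_Port=.*/RCC_Factory_Port=11010/' RCCcondor.sh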
This script will perform numerous actions on your behalf
- Removes any currently installed HTCondor, logs, configuration files, etc
- Downloads and installs the HTCondor yum repository
- Downloads and installs the current stable release of HTCondor
- Disables libvirtd services (only needed if the node is a hypervisor host)
- Installs a local condor configuration file which enables RCC access
- Increases the default system limits for maximum file descriptors, memory, etc
Once this script has completed successfully, HTCondor can be started with
/etc/init.d/condor start
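If desired, the service can also be enabled at boot on EL5/EL6:
# Start HTCondor automatically at boot
chkconfig condor on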
All HTCondor log files will be stored in the standard location
/var/log/condor
Inspect these logs for any errors.
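Since the HTCondor daemon log file names all end in "Log", a quick scan is:
# Look for problems across all HTCondor daemon logs
grep -iE 'error|denied|fail' /var/log/condor/*Log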
Once HTCondor has started successfully, you should be able to issue any standard HTCondor command.
Test the installation by submitting a test job as described in
Submit Test Jobs.
Download and submit the hostname.cmd file
condor_submit hostname.cmd
You can then check on the status of the job with
condor_q
Once the job has executed, you can check the output in the log file.
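The hostname.cmd file itself comes from the Submit Test Jobs page. For reference, a minimal equivalent submit file would look roughly like the following sketch (not the exact distributed file):
# hostname.cmd - minimal HTCondor submit description (a sketch)
universe   = vanilla
executable = /bin/hostname
output     = hostname.out
error      = hostname.err
log        = hostname.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue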
Test Job Example
The following is an example of how to run a test job, check on its status and display any output
Submitting a test job
[ddl@lx0 hostname]$ condor_submit hostname.cmd
Submitting job(s).
1 job(s) submitted to cluster 20870.
The job is queued but in the "Idle" state (I)
[ddl@lx0 hostname]$ condor_q
-- Submitter: lx0.hep.uiuc.edu : <128.174.118.140:42710> : lx0.hep.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
20870.0 ddl 12/4 15:15 0+00:00:00 I 0 0.0 hostname
1 jobs; 1 idle, 0 running, 0 held
The job has completed and is no longer in the queue
[ddl@lx0 hostname]$ condor_q
-- Submitter: lx0.hep.uiuc.edu : <128.174.118.140:42710> : lx0.hep.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
The contents of the output log
[ddl@lx0 hostname]$ cat hostname.out
################################################################################
##### #####
##### Job is running within a Remote Cluster Connect Factory #####
##### #####
##### Date: Wed Dec 4 15:16:29 CST 2013 #####
##### User: ruc.uiuc #####
##### Host: uct2-c275.mwt2.org #####
##### #####
################################################################################
uct2-c275.mwt2.org
Installation script RCCcondor.sh
The following is an explanation of the workings of the script RCCcondor.sh.
Variables
There are four variables near the top of the script
- RCC_Factory_Port=
- RCC_Shared_Port=${RCC_Factory_Port}
- RCC_Factory_Server="uct2-bosco.uchicago.edu"
- RCC_Factory_DN="/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=uct2-bosco.uchicago.edu"
RCC_Factory_Port
The Port used to designate a specific RCC Factory on the RCC Factory Server.
This port is assigned by the RCC Factory Server Administrator.
If the local site has firewall restrictions in place, a mutually agreed upon port number can be used.
RCC_Shared_Port
The Shared Port used by the local HTCondor SHARED_PORT daemon, which by default is the same as the RCC Factory Port.
This port must be open in any local firewalls to the node specified by ${RCC_Factory_Server}.
This port number might need to be assigned by the local network administrator.
If there are no firewalls between this node and the RCC Factory Server, the default value will work.
RCC_Factory_Server
This is the RCC Factory Server. You should not need to change this value.
For MWT2, it is "uct2-bosco.uchicago.edu".
RCC_Factory_DN
This is the DN of the RCC Factory Server. You should not need to change this value.
For MWT2, it is "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=uct2-bosco.uchicago.edu"
Remove HTCondor
To avoid any confusion, if a current version of HTCondor is installed, it is removed and all support files are deleted
# Stop any running condor
/etc/init.d/condor stop
# Remove it completely
yum -y remove condor
rm -rf /var/lib/condor
rm -rf /var/log/condor
rm -rf /etc/condor
rm -rf /etc/sysconfig/condor
rm -rf /etc/yum.repos.d/htcondor-stable-rh${myEL}.repo
Install HTCondor
The current HTCondor yum repositories are downloaded and installed.
The current stable release of HTCondor is then downloaded and installed.
The libvirtd services are turned off as they are not needed.
# Fetch the latest repository from HTCondor
(cd /etc/yum.repos.d; wget http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rh${myEL}.repo)
# Reset the yum repo cache
yum clean all
# Install the new condor
yum -y install condor.${myArch}
# Condor enables these but we want them off
chkconfig libvirtd off
chkconfig libvirt-guests off
HTCondor configuration file
A new /etc/condor/condor_config.local is created to set up HTCondor to participate in the RCC
rm -rf /etc/condor/condor_config.local
cat <<EOF>>/etc/condor/condor_config.local
# Condor configuration to allow a node to participate as a SCHEDD for a RCC
.
.
.
EOF
Several key components of this local configuration are described here
DAEMON_LIST
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, SHARED_PORT
The SCHEDD daemon is needed to "submit" jobs.
To flock jobs to the RCC Factory Server, the NEGOTIATOR and COLLECTOR daemons are required.
The SHARED_PORT daemon is needed so that only one port is used in the flocking communication.
Shared Port
USE_SHARED_PORT = TRUE
SHARED_PORT_ARGS = -p ${RCC_Shared_Port}
We enable the SHARED_PORT daemon and indicate which port to use.
COLLECTOR_HOST
COLLECTOR_HOST = \$(CONDOR_HOST):${RCC_Shared_Port}?sock=collector
The COLLECTOR must be told to also use the SHARED_PORT
FLOCK_TO
FLOCK_TO = \$(FLOCK_TO), ${RCC_Factory_Server}:${RCC_Factory_Port}?sock=collector
Enables flocking to the RCC Factory Server on the assigned port
Security changes
ALLOW_NEGOTIATOR_SCHEDD = \$(CONDOR_HOST), ${RCC_Factory_Server}
GSI_DAEMON_NAME = \$(GSI_DAEMON_NAME), ${RCC_Factory_DN}
GSI_DAEMON_CERT = /etc/grid-security/hostcert.pem
GSI_DAEMON_KEY = /etc/grid-security/hostkey.pem
GSI_DAEMON_TRUSTED_CA_DIR = /etc/grid-security/certificates
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
These changes allow the RCC Factory Server's Negotiator access to the local SCHEDD
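If flocked jobs sit idle, authentication is a common culprit. In recent HTCondor releases, condor_ping can test the security configuration; a basic local check (the exact options may vary by HTCondor version):
# Verify WRITE-level authorization to the local SCHEDD
condor_ping -verbose -type schedd WRITE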
MAX_FILE_DESCRIPTORS
MAX_FILE_DESCRIPTORS = 20000
When using the SHARED_PORT daemon, all connections are via TCP.
Each submitted job uses 3 or more file descriptors to create the appropriate connections.
At 3 or more descriptors per job, a limit of 20000 allows this HTCondor installation to handle over 5000 submitted jobs, with headroom for the daemons' own connections.
Increase system wide limits
To handle the large demands which can be placed on HTCondor, various system-wide limits must be increased. The script places a file into /etc/security/limits.d to modify these values. The important value is nofile, which must be larger than the value given by MAX_FILE_DESCRIPTORS
rm -rf /etc/security/limits.d/rcc.conf
cat <<EOF>>/etc/security/limits.d/rcc.conf
# Remove all the limits so we avoid trouble
* - nofile 1000000
* - nproc unlimited
* - memlock unlimited
* - locks unlimited
* - core unlimited
EOF
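After the next login session, the raised limit can be confirmed with ulimit:
# Confirm the per-process open file limit (should report 1000000)
ulimit -n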