Tier3 Cluster Flocking into ATLAS Connect

Overview

There are only two steps needed to allow your cluster to flock jobs into ATLAS Connect. Jobs are routed by the Remote Cluster Connect Factory (RCCF), which handles HTCondor submissions into the various connected clusters participating in ATLAS Connect.

  1. MWT2 Administrators must enable access for your Local SCHEDD node.
  2. You must enable GSI security and add the MWT2 Remote Cluster Connect Server name to your Local SCHEDD FLOCK_TO HTCondor variable.

Note:

It is expected that your Local Site Administrator acts as a registration agent on behalf of the institution's users and assumes responsibility for the actions of this user community.

Ask MWT2 Administrators to enable access

Send an email to the MWT2 Administrators, support@mwt2.org, with the following information

  • Full name for the organization such as "University of Illinois at Urbana-Champaign", "University of Chicago", "Argonne National Lab" or "Duke University"
  • Your site's nickname, which should be taken from the institution's domain name: "uiuc", "uchicago", "duke", "anl"
  • Your site's Administrator/Contact name(s) and email address(es)
  • Your Local Site SCHEDD host's fully qualified domain name (FQDN)
  • Your Local Site SCHEDD host Distinguished Name (DN)

The MWT2 Administrators will respond with a Remote Cluster Connect (RCC) Factory Port used by the RCC Factory on the RCC Factory Server. This port number is needed when setting up the RCC Flocking.

Certificate Authority and Host Certificate are required

Flocking to an RCC Factory Server requires GSI security to be used by the Local Site HTCondor installation. GSI requires that your Local Site SCHEDD host have a functioning set of Certificate Authority (CA) certificates installed (/etc/grid-security/certificates). This SCHEDD host must also have a valid host certificate (/etc/grid-security/host[cert,key].pem) which provides the Distinguished Name (DN) of the host.
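
The DN requested in the registration email can be read directly from the host certificate. A quick way to display it, assuming the certificate is in the standard location shown above

# Display the Distinguished Name (DN) of the host certificate
openssl x509 -in /etc/grid-security/hostcert.pem -noout -subject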

If the SCHEDD host does not have functional CA certificates, directions on how to install them are located at Installing Certificate Authorities Certificates and related RPMs.

If the SCHEDD host does not have a host certificate, one can be requested in two ways.

Additions to your Local Site HTCondor

The following lines need to be added to /etc/condor/condor_config.local or added as a drop-in module at /etc/condor/config.d on the Local Site SCHEDD host. After adding these lines you need to issue a condor_reconfig command for the changes to take effect.

The following is an example of a drop-in module

# cat /etc/condor/config.d/rcc-flock.conf

# Setup the FLOCK_TO the RCC Factory
FLOCK_TO                                 = $(FLOCK_TO), uct2-bosco.uchicago.edu:<RCC_Factory_Port>?sock=collector


# Allow the RCC Factory server access to our SCHEDD
ALLOW_NEGOTIATOR_SCHEDD                  = $(CONDOR_HOST), uct2-bosco.uchicago.edu


# Who do you trust?
GSI_DAEMON_NAME                          = $(GSI_DAEMON_NAME), /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=uct2-bosco.uchicago.edu
GSI_DAEMON_CERT                          = /etc/grid-security/hostcert.pem
GSI_DAEMON_KEY                           = /etc/grid-security/hostkey.pem
GSI_DAEMON_TRUSTED_CA_DIR                = /etc/grid-security/certificates

# Enable authentication from the Negotiator (required for jobs running on glideins)
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE

<RCC_Factory_Port> should be replaced with the port number assigned by the MWT2 Administrators.

The above GSI_ and SEC_ values are known to work with MWT2. It is possible that your Local Site HTCondor has other requirements which may require different values.
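
After issuing condor_reconfig, condor_config_val can be used to confirm the new values are active; the 11010 below is a hypothetical assigned port

# Reload the configuration on the SCHEDD host
condor_reconfig

# Confirm the flocking target, e.g. ", uct2-bosco.uchicago.edu:11010?sock=collector"
condor_config_val FLOCK_TO

# Confirm the trusted DN list includes the RCC Factory Server
condor_config_val GSI_DAEMON_NAME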

Firewalls

By default, the MWT2 Administrators will assign a RCC Factory Port number in a range starting at 11010 and incrementally increasing as sites are brought online and assigned numbers.

If your site is protected by a firewall, a port from this range may not work. If another port is more appropriate for use at your site, it is possible to use an alternative. This port number must not already be in use by another site currently flocking into the RCC Factory Server.

For example, if your site currently has the range of ports 9000-9700 open in the firewall, it is possible for a port in this range to be used as the RCC Factory Port.

If there are no existing holes in the firewall, it will be the responsibility of your Local Site Administrator to request that a hole be opened by your Local Network Administrators. This hole must be for the node "uct2-bosco.uchicago.edu" with, at a minimum, the single port MWT2 approves for use in the flocking.

If your Local Site HTCondor configuration uses the SHARED_PORT Daemon, the SHARED_PORT and RCC Factory Port may be the same value.
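
Once a port has been assigned, a simple reachability test from the SCHEDD host can confirm the outbound direction; 11010 is a placeholder for your assigned port

# Verify the RCC Factory Server answers on the assigned port
nc -vz uct2-bosco.uchicago.edu 11010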

Setup a Local Site SCHEDD server for Remote User Computing

If a site does not have a working HTCondor installation, the following procedure can be used to easily set up a SCHEDD-only node enabled for RCC Flocking.

The following requirements must first be met

  • An operational Linux node running EL5 or EL6
  • The node must be on a public network - it cannot be on a private network behind a NAT router.
  • Proper forward/reverse DNS registration of the public network IP (a quick check is sketched after this list).
  • The host must have a CA
  • The host must have a host certificate
  • Root access to the node
  • Any local firewall must have at least one port completely open in both directions (at a minimum open to all MWT2 subnets)
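
A quick sanity check of the forward/reverse DNS requirement, run on the node itself (the IP shown is only a placeholder)

# Forward lookup: the FQDN should resolve to the public IP
host $(hostname -f)

# Reverse lookup: the public IP should map back to the same FQDN
host 128.174.118.140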

An attached script, RCCcondor.sh, is provided which will install an RCC-enabled HTCondor SCHEDD on this node.

Do not use this script blindly. Read it carefully and make certain the script will not make undesired changes to your Linux node. It would be best to use a cleanly installed Linux node which can be rebuilt easily should the end result not be what was expected.

At least one change should be made to this script prior to execution. The variable RCC_Factory_Port must be changed to the value assigned to the Local Site by MWT2 Administrators.

This installation of HTCondor uses the SHARED_PORT feature. This allows HTCondor to use a single port in its communication with remote services. This port must be open to all nodes which will need to contact this HTCondor installation. It is safe and easiest to completely open this port to the world. If a more restrictive setting must be used, the port, at a minimum, must be fully open to all MWT2 subnets. The subnet values can be provided upon request.
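
On EL5/EL6 nodes that manage their own firewall with iptables, completely opening the shared port might look like the following sketch; 11010 is again a placeholder for your assigned port

# Open the SHARED_PORT to the world and make the rule persistent
iptables -I INPUT -p tcp --dport 11010 -j ACCEPT
service iptables save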

The variable RCC_Shared_Port controls which port is used by SHARED_PORT. The default value is the value of RCC_Factory_Port but can be changed based on local site requirements. With the default setting, only a single port needs to be open in the firewall.

This script will perform numerous actions on your behalf

  1. Removes any currently installed HTCondor, including logs, configuration files, etc.
  2. Downloads and installs the HTCondor yum repository
  3. Downloads and installs the current stable release of HTCondor
  4. Disables the libvirtd services (only needed if the node is a hypervisor)
  5. Installs a local condor configuration file which enables RCC access
  6. Increases the default system limits for maximum file descriptors, memory, etc.

Once this script has completed successfully, HTCondor can be started with

/etc/init.d/condor start

All HTCondor log files will be stored in the standard location

/var/log/condor

Inspect these logs for any errors.
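
One quick way to scan all of the daemon logs at once

# Look for problems across all HTCondor logs
grep -i error /var/log/condor/*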

Once HTCondor has started successfully, you should be able to issue any standard HTCondor command. Test the installation by submitting a test job as described in Submit Test Jobs.

Download and submit the hostname.cmd
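
The hostname.cmd file is provided by MWT2, and its wrapper produces the banner shown in the sample output below. A minimal stand-in submit file, if you only want to verify that jobs run, might look like

# hostname.cmd - minimal vanilla universe test job
universe   = vanilla
executable = /bin/hostname
output     = hostname.out
error      = hostname.err
log        = hostname.log
queue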

condor_submit hostname.cmd

You can then check on the status of the job with

condor_q

Once the job has executed, you can check the output in the log file.

Test Job Example

The following is an example of how to run a test job, check on its status and display any output

Submitting a test job

[ddl@lx0 hostname]$ condor_submit hostname.cmd
Submitting job(s).
1 job(s) submitted to cluster 20870.

The job is queued but in the "Idle" state (I)

[ddl@lx0 hostname]$ condor_q


-- Submitter: lx0.hep.uiuc.edu : <128.174.118.140:42710> : lx0.hep.uiuc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
20870.0   ddl            12/4  15:15   0+00:00:00 I  0   0.0  hostname          

1 jobs; 1 idle, 0 running, 0 held

The job has completed and is no longer in the queue

[ddl@lx0 hostname]$ condor_q

-- Submitter: lx0.hep.uiuc.edu : <128.174.118.140:42710> : lx0.hep.uiuc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

The contents of the output log

[ddl@lx0 hostname]$ cat hostname.out

################################################################################
#####                                                                      #####
#####        Job is running within a Remote Cluster Connect Factory        #####
#####                                                                      #####
##### Date: Wed Dec 4 15:16:29 CST 2013                                    #####
##### User: ruc.uiuc                                                       #####
##### Host: uct2-c275.mwt2.org                                             #####
#####                                                                      #####
################################################################################

uct2-c275.mwt2.org

Installation script RCCcondor.sh

The following is an explanation of the workings of the script RCCcondor.sh.

Variables

There are four variables near the top of the script

  1. RCC_Factory_Port=
  2. RCC_Shared_Port=${RCC_Factory_Port}
  3. RCC_Factory_Server="uct2-bosco.uchicago.edu"
  4. RCC_Factory_DN="/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=uct2-bosco.uchicago.edu"

RCC_Factory_Port

The Port used to designate a specific RCC Factory on the RCC Factory Server. This port is assigned by the RCC Factory Server Administrator. If the local site has firewall restrictions in place, a mutually agreed upon port number can be used.

RCC_Shared_Port

The Shared Port used by the local HTCondor SHARED_PORT daemon, which by default will be the same as the RCC Factory Port. This port must be open in any local firewalls to the node specified by ${RCC_Factory_Server}. This port number might need to be assigned by the local network administrator. If there are no firewalls between this node and the RCC Factory Server, the default value will work.

RCC_Factory_Server

This is the RCC Factory Server. You should not need to change this value. For MWT2, it is "uct2-bosco.uchicago.edu".

RCC_Factory_DN

This is the DN of the RCC Factory Server. You should not need to change this value. For MWT2, it is "/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=uct2-bosco.uchicago.edu".

Remove HTCondor

To avoid any confusion, if a current version of HTCondor is installed, it is removed and all support files are deleted

# Stop any running condor
/etc/init.d/condor stop


# Remove it completely
yum -y remove condor

rm -rf /var/lib/condor
rm -rf /var/log/condor
rm -rf /etc/condor
rm -rf /etc/sysconfig/condor
rm -rf /etc/yum.repos.d/htcondor-stable-rh${myEL}.repo

Install HTCondor

The current HTCondor yum repositories are downloaded and installed. The current stable release of HTCondor is then downloaded and installed. The libvirtd services are turned off as they are not needed.

# Fetch the latest repository from HTCondor
(cd /etc/yum.repos.d; wget http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rh${myEL}.repo)



# Reset the yum repo cache
yum clean all

# Install the new condor 
yum -y install condor.${myArch}


# Condor enables these but we want them off
chkconfig libvirtd off
chkconfig libvirt-guests off

HTCondor configuration file

A new /etc/condor/condor_config.local is created to set up HTCondor to participate in the RCC

rm -rf /etc/condor/condor_config.local
cat <<EOF>>/etc/condor/condor_config.local
# Condor configuration to allow a node to participate as a SCHEDD for a RCC
.
.
.
EOF

Several key components of this local configuration are described here

DAEMON_LIST

DAEMON_LIST                              = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, SHARED_PORT

The SCHEDD daemon is needed to "submit" jobs. To flock jobs to the RCC Factory Server, the NEGOTIATOR and COLLECTOR daemons are required. The SHARED_PORT daemon is needed so that only one port is used in the flocking communication.

Shared Port

USE_SHARED_PORT                          = TRUE
SHARED_PORT_ARGS                         = -p ${RCC_Shared_Port}

We enable the SHARED_PORT daemon and indicate which port to use.

COLLECTOR_HOST

COLLECTOR_HOST                           = \$(CONDOR_HOST):${RCC_Shared_Port}?sock=collector

The COLLECTOR must be told to also use the SHARED_PORT.

FLOCK_TO

FLOCK_TO                                 = \$(FLOCK_TO), ${RCC_Factory_Server}:${RCC_Factory_Port}?sock=collector

Enables flocking to the RCC Factory Server on the assigned port.

Security changes

ALLOW_NEGOTIATOR_SCHEDD                  = \$(CONDOR_HOST), ${RCC_Factory_Server}
GSI_DAEMON_NAME                          = \$(GSI_DAEMON_NAME), ${RCC_Factory_DN}
GSI_DAEMON_CERT                          = /etc/grid-security/hostcert.pem
GSI_DAEMON_KEY                           = /etc/grid-security/hostkey.pem
GSI_DAEMON_TRUSTED_CA_DIR                = /etc/grid-security/certificates
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE

These changes allow the negotiator on the RCC Factory Server to contact the local SCHEDD and establish GSI trust of the RCC Factory Server.

MAX_FILE_DESCRIPTORS

MAX_FILE_DESCRIPTORS                     = 20000

When using the SHARED_PORT daemon, all connections are via TCP. Each submitted job uses three or more file descriptors to create the appropriate connections, so a value of 20000 (20000 / 3 ≈ 6600) allows this HTCondor installation to handle over 5000 submitted jobs.

Increase system wide limits

To handle the large demands which can be placed on HTCondor, various system-wide limits must be increased. The script places a file into /etc/security/limits.d to modify these values. The important value is nofile, which must be larger than the value given by MAX_FILE_DESCRIPTORS.

rm -rf /etc/security/limits.d/rcc.conf
cat <<EOF>>/etc/security/limits.d/rcc.conf

# Remove all the limits so we avoid trouble

* - nofile  1000000
* - nproc   unlimited
* - memlock unlimited
* - locks   unlimited
* - core    unlimited

EOF
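
These limits apply to sessions started after the file is installed; an easy check from a fresh login

# Confirm the raised file descriptor limit (expect 1000000)
ulimit -n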

