CampusFactoryUcItbPbs

Intro

This page captures admin notes on setting up a Campus Factory gateway to the OSG Integration Testbed (ITB) cluster.

Status

Done:
  • Campus Factory has been installed on itb2.uchicago.edu under the user uc3 and starts correctly.
  • Collector is: itb2.uchicago.edu:39618
  • Configured to accept jobs from uc3-sub
  • Submission from itb2
  • Configured to report to the monitoring pool
  • Test submission from uc3-sub

Todo:

Install

Requirements:
  • host that can submit to local PBS cluster
  • user with login access
  • users must be authorized to submit to PBS

Host itb2.uchicago.edu can submit jobs to itbv-pbs.uchicago.edu. Jobs will run as user uc3.

Installation done following https://twiki.grid.iu.edu/bin/view/Documentation/CampusFactoryInstall

NOTE: Initial problems were solved by changing the port of the Campus Factory collector. On its original port it was interfering with the local Condor installation.
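
The port check behind this note can be sketched in shell. This is only an illustration: collector_port and uses_private_port are hypothetical helper names, not Condor commands, and 9618 is assumed to be the port used by the system-wide Condor collector (it is the usual Condor default).

```shell
#!/bin/sh
# Assumption: the system-wide Condor collector listens on the usual default port.
DEFAULT_COLLECTOR_PORT=9618

# Print the port part of a COLLECTOR_HOST value ("host" or "host:port").
# Hypothetical helper, not a Condor command.
collector_port() {
    case "$1" in
        *:*) echo "${1##*:}" ;;                  # explicit port after the last colon
        *)   echo "$DEFAULT_COLLECTOR_PORT" ;;   # no port given: default applies
    esac
}

# Succeed only if the value points away from the default port.
uses_private_port() {
    [ "$(collector_port "$1")" != "$DEFAULT_COLLECTOR_PORT" ]
}
```

With the Campus Factory environment loaded, something like uses_private_port "$(condor_config_val collector_host)" || echo "port clash" would flag a collector that still shares the default port.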

Install dump

[uc3@itb2 ~]$ ls -al
total 32
drwx------  2 uc3  osgvo 4096 Mar  2 16:10 .
drwxr-xr-x 55 root root  4096 Mar  2 16:10 ..
-rw-r--r--  1 uc3  osgvo   18 May 26  2011 .bash_logout
-rw-r--r--  1 uc3  osgvo  176 May 26  2011 .bash_profile
-rw-r--r--  1 uc3  osgvo  124 May 26  2011 .bashrc
-rw-r--r--  1 uc3  osgvo  500 Jan 23  2007 .emacs
-rw-r--r--  1 uc3  osgvo  121 Jan  6 10:04 .kshrc
-rw-r--r--  1 uc3  osgvo  658 Nov 24  2010 .zshrc
[uc3@itb2 ~]$ mkdir cf
[uc3@itb2 ~]$ mkdir ~/cf/condor
[uc3@itb2 ~]$ wget http://parrot.cs.wisc.edu//symlink/20120306161501/7/7.7/7.7.5/70ee28800249c9d3b3bb4bd88eedb5ad/condor-7.7.5-x86_64_rhap_5-stripped.tar.gz
--2012-03-06 18:58:25--  http://parrot.cs.wisc.edu//symlink/20120306161501/7/7.7/7.7.5/70ee28800249c9d3b3bb4bd88eedb5ad/condor-7.7.5-x86_64_rhap_5-stripped.tar.gz
Resolving parrot.cs.wisc.edu... 128.105.121.59
Connecting to parrot.cs.wisc.edu|128.105.121.59|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49489882 (47M) [application/x-gzip]
Saving to: “condor-7.7.5-x86_64_rhap_5-stripped.tar.gz”

100%[===============================================================================>] 49,489,882  73.9M/s   in 0.6s    

2012-03-06 18:58:26 (73.9 MB/s) - “condor-7.7.5-x86_64_rhap_5-stripped.tar.gz” saved [49489882/49489882]

[uc3@itb2 ~]$ mkdir /tmp/condor-src
[uc3@itb2 ~]$ cd /tmp/condor-src
[uc3@itb2 condor-src]$ tar xzf ~/condor-7.7.5-x86_64_rhap_5-stripped.tar.gz 
[uc3@itb2 condor-src]$ cd condor-7.7.5-x86_64_rhap_5-stripped/
[uc3@itb2 condor-7.7.5-x86_64_rhap_5-stripped]$ ./condor_install --prefix=$HOME/cf/condor 
Installing Condor from /tmp/condor-src/condor-7.7.5-x86_64_rhap_5-stripped to /share/home/osgvo/uc3/cf/condor

Unable to find a valid Java installation 
Java Universe will not work properly until the JAVA 
(and JAVA_MAXHEAP_ARGUMENT) parameters are set in the configuration file!

Condor has been installed into:
    /share/home/osgvo/uc3/cf/condor

Configured condor using these configuration files:
  global: /share/home/osgvo/uc3/cf/condor/etc/condor_config
  local:  /share/home/osgvo/uc3/cf/condor/local.itb2/condor_config.local

In order for Condor to work properly you must set your CONDOR_CONFIG
environment variable to point to your Condor configuration file:
/share/home/osgvo/uc3/cf/condor/etc/condor_config before running Condor
commands/daemons.
Created scripts which can be sourced by users to setup their
Condor environment variables.  These are:
   sh: /share/home/osgvo/uc3/cf/condor/condor.sh
  csh: /share/home/osgvo/uc3/cf/condor/condor.csh

[uc3@itb2 condor-7.7.5-x86_64_rhap_5-stripped]$ cd ~/cf/
[uc3@itb2 cf]$ wget http://sourceforge.net/projects/campusfactory/files/CampusFactory-0.4.3/CampusFactory-0.4.3.tar.gz/download
--2012-03-06 19:02:41--  http://sourceforge.net/projects/campusfactory/files/CampusFactory-0.4.3/CampusFactory-0.4.3.tar.gz/download
Resolving sourceforge.net... 216.34.181.60
Connecting to sourceforge.net|216.34.181.60|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://downloads.sourceforge.net/project/campusfactory/CampusFactory-0.4.3/CampusFactory-0.4.3.tar.gz?r=&ts=1331082161&use_mirror=iweb [following]
--2012-03-06 19:02:41--  http://downloads.sourceforge.net/project/campusfactory/CampusFactory-0.4.3/CampusFactory-0.4.3.tar.gz?r=&ts=1331082161&use_mirror=iweb
Resolving downloads.sourceforge.net... 216.34.181.59
Connecting to downloads.sourceforge.net|216.34.181.59|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://iweb.dl.sourceforge.net/project/campusfactory/CampusFactory-0.4.3/CampusFactory-0.4.3.tar.gz [following]
--2012-03-06 19:02:42--  http://iweb.dl.sourceforge.net/project/campusfactory/CampusFactory-0.4.3/CampusFactory-0.4.3.tar.gz
Resolving iweb.dl.sourceforge.net... 70.38.0.134, 2607:f748:10:12::5f:2
Connecting to iweb.dl.sourceforge.net|70.38.0.134|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8199915 (7.8M) [application/x-gzip]
Saving to: “CampusFactory-0.4.3.tar.gz”

100%[===============================================================================>] 8,199,915    914K/s   in 9.4s    

2012-03-06 19:02:52 (854 KB/s) - “CampusFactory-0.4.3.tar.gz” saved [8199915/8199915]

[uc3@itb2 cf]$ tar xzf CampusFactory-0.4.3.tar.gz 
[uc3@itb2 cf]$ export CONDOR_LOCATION=~/cf/condor
[uc3@itb2 cf]$ export FACTORY_LOCATION=~/cf/CampusFactory-0.4.3
[uc3@itb2 cf]$ mkdir $CONDOR_LOCATION/etc/config.d
[uc3@itb2 cf]$ pwd
/home/osgvo/uc3/cf
[uc3@itb2 cf]$ vi condor/etc/condor_config 
[uc3@itb2 cf]$ . condor/condor.sh      
[uc3@itb2 cf]$ vi condor/etc/condor_config 
[uc3@itb2 cf]$ condor_config_val local_config_dir
/share/home/osgvo/uc3/cf/condor/etc/config.d
[uc3@itb2 cf]$ cp $FACTORY_LOCATION/share/condor/condor_config.factory $CONDOR_LOCATION/etc/config.d/condor_config.factory
[uc3@itb2 cf]$ cp $FACTORY_LOCATION/share/condor/condor_mapfile $CONDOR_LOCATION/etc/condor_mapfile
[uc3@itb2 cf]$ vi /home/osgvo/uc3/cf/condor/etc/config.d/condor_config.factory            
[uc3@itb2 cf]$ vi /home/osgvo/uc3/cf/condor/local.itb2/condor_config.local 
[uc3@itb2 cf]$ vi /home/osgvo/uc3/cf/condor/etc/config.d/condor_config.factory 
[uc3@itb2 cf]$ vi ~/.forward
[uc3@itb2 cf]$ vi  $CONDOR_LOCATION/libexec/glite/etc/batch_gahp.config
[uc3@itb2 cf]$ source $CONDOR_LOCATION/condor.sh
[uc3@itb2 cf]$ condor_master
[uc3@itb2 cf]$ condor_q


-- Submitter: itb2.uchicago.edu : <10.1.5.107:42403> : itb2.uchicago.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
[uc3@itb2 cf]$ vi condor/etc/config.d/condor_config.factory 
[uc3@itb2 cf]$ mkdir  /tmp/cf-logs
[uc3@itb2 cf]$ ps aux | grep condor
condor    2438  0.0  0.0  32640  5344 ?        Ss   Feb22   2:17 /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
condor    2450  0.0  0.0  33208  5040 ?        Ss   Feb22   0:12 condor_schedd -f
root      2458  0.0  0.0  25212  3100 ?        S    Feb22   2:05 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 19002
uc3      23912  0.0  0.0  99420  6284 ?        Ss   19:56   0:00 condor_master
uc3      23913  0.0  0.0  99368  7004 ?        Ss   19:56   0:00 condor_collector -f
uc3      23914  0.0  0.0 101528  7732 ?        Ss   19:56   0:00 condor_schedd -f
uc3      23915  0.0  0.0 101032  6464 ?        Ss   19:56   0:00 condor_negotiator -f
uc3      23916  0.0  0.0  27152  2592 ?        S    19:56   0:00 condor_procd -A /tmp/condor-lock.itb20.765848645315604/procd_pipe.SCHEDD -L /share/home/osgvo/uc3/cf/condor/local.itb2/log/ProcLog.SCHEDD -R 10000000 -S 60 -C 21064
uc3      23971  0.0  0.0 103304   848 pts/1    S+   20:06   0:00 grep condor
[uc3@itb2 cf]$ export PYTHONPATH=$PYTHONPATH:$FACTORY_LOCATION/python-lib
[uc3@itb2 cf]$ export PATH=$PATH:$FACTORY_LOCATION/bin 
[uc3@itb2 cf]$ cd $FACTORY_LOCATION
Campus Factory was interfering with the local Condor. Moved its collector to a different port.

New session:
[uc3@itb2 ~]$ vi cf/condor/local.itb2/condor_config.local 
[uc3@itb2 ~]$ which condor_q
/share/home/osgvo/uc3/cf/condor/bin/condor_q
[uc3@itb2 ~]$ condor_restart 
Can't find address for local master
Perhaps you need to query another pool.
[uc3@itb2 ~]$ condor_master
[uc3@itb2 ~]$ condor_config_val collector_host
itb2.uchicago.edu:39618

[uc3@itb2 cf]$ vi cf-setup.sh
[uc3@itb2 cf]$ mv cf-setup.sh variables.sh       
[uc3@itb2 cf]$ cat variables.sh condor/condor.sh > cf-setup.sh

[uc3@itb2 cf]$  . ./cf-setup.sh 
[uc3@itb2 CampusFactory-0.4.3]$ condor_submit ./share/factory.job
Submitting job(s).
1 job(s) submitted to cluster 3.
[uc3@itb2 CampusFactory-0.4.3]$ condor_q


-- Submitter: itb2.uchicago.edu : <10.1.5.107:52593> : itb2.uchicago.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

[uc3@itb2 cf]$ vi ./cf-setup.sh 
[uc3@itb2 cf]$ . ./cf-setup.sh 
[uc3@itb2 cf]$ cd $FACTORY_LOCATION
[uc3@itb2 CampusFactory-0.4.3]$ cf start
Starting Campus Factory:                                   [  OK  ]
[uc3@itb2 CampusFactory-0.4.3]$ condor_q


-- Submitter: itb2.uchicago.edu : <10.1.5.107:52593> : itb2.uchicago.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   4.0   uc3             3/7  00:23   0+00:00:33 R  0   0.0  runfactory -c etc/

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
[uc3@itb2 CampusFactory-0.4.3]$ ls
bin  etc  python-lib  share
[uc3@itb2 CampusFactory-0.4.3]$ condor_status -any

MyType               TargetType           Name                          

Collector            None                 UcItbForUc3@itb2.uchicago.edu 
Scheduler            None                 itb2.uchicago.edu             
DaemonMaster         None                 itb2.uchicago.edu             
Negotiator           None                 itb2.uchicago.edu             
Submitter            None                 uc3@uc-itb.uc3.org            
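
The contents of variables.sh are not recorded in the dump above. A plausible sketch, assuming it simply collects the exports typed by hand earlier in the session (CONDOR_LOCATION, FACTORY_LOCATION, PYTHONPATH, PATH), would be:

```shell
# variables.sh (assumed contents; the real file is not shown on this page)
# Locations match the mkdir/tar steps in the install dump.
export CONDOR_LOCATION="$HOME/cf/condor"
export FACTORY_LOCATION="$HOME/cf/CampusFactory-0.4.3"

# Make the factory's Python modules and its "cf" command available.
# The ${VAR:+...} form avoids a leading ":" when PYTHONPATH is empty.
export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$FACTORY_LOCATION/python-lib"
export PATH="$PATH:$FACTORY_LOCATION/bin"
```

Concatenating this with condor/condor.sh, as done above, gives a single cf-setup.sh that can be sourced at the start of each new session.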

Files

condor config from factory:
# Initial part removed because overridden

# These have to stay here (and not in the local config) because they are parsed by the campusfactory
# What hosts can run jobs to this cluster.
FLOCK_FROM = uc3-cloud.uchicago.edu, uc3-cloud.mwt2.org, uc3-sub.uchicago.edu, uc3-sub.mwt2.org, ui-cr.uchicago.edu, ui-cr.mwt2.org

# Jobs submitted here can run at.
FLOCK_TO = itb.mwt2.org

# Internal ip addresses of the cluster
INTERNAL_IPS = 10.1.3.* 10.1.4.* 10.1.5 itb-c*.mwt2.org 128.135.158.241 uct2-6509.uchicago.edu

##############################################
# Things that are 'safe' to leave
#

# Where the certificate mapfile is located.
CERTIFICATE_MAPFILE=$(RELEASE_DIR)/etc/condor_mapfile

# What daemons should I run?
DAEMON_LIST = COLLECTOR, SCHEDD, NEGOTIATOR, MASTER

# Location of the PBS_GAHP to be used to submit the glideins.
GLITE_LOCATION = $(LIBEXEC)/glite
PBS_GAHP       = $(GLITE_LOCATION)/bin/batch_gahp

# Remove glidein jobs that get put on hold for over 24 hours.
SYSTEM_PERIODIC_REMOVE = (GlideinJob == TRUE && JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*24*1)

#
# Security definitions
#
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE

SEC_DEFAULT_NEGOTIATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION = PREFERRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS,CLAIMTOBE

ALLOW_WRITE = $(FLOCK_FROM) $(FLOCK_TO) execute-side@matchsession $(INTERNAL_IPS) $(HOSTNAME)
ALLOW_READ = $(ALLOW_WRITE)

SEC_DEFAULT_ENCRYPTION = OPTIONAL
SEC_DEFAULT_INTEGRITY = REQUIRED

ALLOW_ADMINISTRATOR = $(FULL_HOSTNAME)

#DENY_WRITE = anonymous@*
#ALLOW_WRITE = *@ff.unl.edu *@prairiefire.unl.edu *@glidein.unl.edu execute-side@matchsession anonymous@claimtobe/10.158.*
#DENY_ADVERTISE_SCHEDD = *
#DENY_ADMINISTRATOR = anonymous@*
#ALLOW_ADMINISTRATOR = dweitzel@ff.unl.edu/ff-grid.unl.edu anonymous@claimtobe/ff.unl.edu
#DENY_DAEMON = anonymous@*
#ALLOW_DAEMON = $(ALLOW_ADMINISTRATOR) *@claimtobe/prairiefire.unl.edu *@claimtobe/glidein.unl.edu *@claimtobe/ff.unl.edu execute-side@matchsession *@claimtobe/129.93.227.* anonymous@claimtobe/10.158.*
#DENY_NEGOTIATOR = anonymous@glidein.unl.edu
#ALLOW_NEGOTIATOR = */ff-grid.unl.edu */prairiefire.unl.edu 

#INTERNAL_FF = */10.158.* */ff-grid.unl.edu */129.93.227.* 
#ALLOW_ADVERTISE_STARTD = $(INTERNAL_FF)
#ALLOW_ADVERTISE_MASTER = $(INTERNAL_FF)
#ALLOW_DAEMON = $(ALLOW_DAEMON) $(INTERNAL_FF)

#ALLOW_ADVERTISE_SCHEDD = */ff-grid.unl.edu $(FLOCK_FROM) */10.158.50.2

#SEC_READ_INTEGRITY = OPTIONAL
#SEC_CLIENT_INTEGRITY = OPTIONAL
#SEC_READ_ENCRYPTION = OPTIONAL
#SEC_CLIENT_ENCRYPTION = OPTIONAL

condor_config.local
##  What machine is your central manager?

CONDOR_HOST = itb2.uchicago.edu

##  Pathnames:
##  Where have you installed the bin, sbin and lib condor directories?   

RELEASE_DIR = /share/home/osgvo/uc3/cf/condor


##  Where is the local condor directory for each host?  
##  This is where the local config file(s), logs and
##  spool/execute directories are located

LOCAL_DIR = /share/home/osgvo/uc3/cf/condor/local.$(HOSTNAME)


##  Mail parameters:
##  When something goes wrong with condor at your site, who should get
##  the email?

CONDOR_ADMIN = uc3@itb2.uchicago.edu


##  Full path to a mail delivery program that understands that "-s"
##  means you want to specify a subject:

MAIL = /bin/mailx


##  Network domain parameters:
##  Internet domain of machines sharing a common UID space.  If your
##  machines don't share a common UID space, set it to 
##  UID_DOMAIN = $(FULL_HOSTNAME)
##  to specify that each machine has its own UID space.

UID_DOMAIN = uc-itb.uc3.org


##  Internet domain of machines sharing a common file system.
##  If your machines don't use a network file system, set it to
##  FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
##  to specify that each machine has its own file system. 

FILESYSTEM_DOMAIN = $(UID_DOMAIN)


##  This macro is used to specify a short description of your pool. 
##  It should be about 20 characters long. For example, the name of 
##  the UW-Madison Computer Science Condor Pool is ``UW-Madison CS''.

COLLECTOR_NAME = UcItbForUc3


##  The user/group ID . of the "Condor" user. 
##  (this can also be specified in the environment)
##  Note: the CONDOR_IDS setting is ignored on Win32 platforms

CONDOR_IDS = 21064.21000


##  Condor needs to create a few lock files to synchronize access to
##  various log files.  Because of problems we've had with network
##  filesystems and file locking over the years, we HIGHLY recommend
##  that you put these lock files on a local partition on each
##  machine.  If you don't have your LOCAL_DIR on a local partition,
##  be sure to change this entry.  Whatever user (or group) condor is
##  running as needs to have write access to this directory.  If
##  you're not running as root, this is whatever user you started up
##  the condor_master as.  If you are running as root, and there's a
##  condor account, it's probably condor.  Otherwise, it's whatever
##  you've set in the CONDOR_IDS environment variable.  See the Admin
##  manual for details on this.

LOCK = /tmp/condor-lock.$(HOSTNAME)0.765848645315604


##  When is this machine willing to start a job? 

START = TRUE


##  When to suspend a job?

SUSPEND = FALSE


##  When to nicely stop a job?
##  (as opposed to killing it instantaneously)

PREEMPT = FALSE


##  When to instantaneously kill a preempting job
##  (e.g. if a job is in the pre-empting stage for too long)

KILL = FALSE

#DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

## MM
SEC_DEFAULT_AUTHENTICATION_METHODS = FS,CLAIMTOBE
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True
#FLOCK_NEGOTIATOR_HOSTS = 
#FLOCK_COLLECTOR_HOSTS = 
#ALLOW_WRITE = itb4.uchicago.edu, itb4.mwt2.org, ui-gwms.uchicago.edu, ui-gwms.mwt2.org, uct3-edge5.uchicago.edu, login2.pads.ci.uchicago.edu pads.ci.uchicago.edu execute-side@matchsession 192.5.86.* 172.5.86.* pads.ci.uchicago.edu 128.135.125.142 login2
#ALLOW_WRITE = itbv-pbs.mwt2.org, itb4.uchicago.edu, itb4.mwt2.org, ui-gwms.uchicago.edu, ui-gwms.mwt2.org
#ALLOW_WRITE = $(CONDOR_HOST), 128.135.158.241, itbv-pbs.mwt2.org, itb4.uchicago.edu, itb4.mwt2.org, ui-gwms.uchicago.edu, ui-gwms.mwt2.org
## To be CCB for PADS must allow write from PADS as well (all WN):
#ALLOW_WRITE = $(CONDOR_HOST), 128.135.158.241, itbv-pbs.mwt2.org, itb4.uchicago.edu, itb4.mwt2.org, ui-gwms.uchicago.edu, ui-gwms.mwt2.org, *.pads.ci.uchicago.edu, 192.5.86.*
#
###
#BIND_ALL_INTERFACES = True
#NETWORK_INTERFACE = ip to use
COLLECTOR_HOST = $(CONDOR_HOST):39618

CONDOR_VIEW_HOST = uc3-cloud.uchicago.edu:39618

Problems

uc3 could not submit jobs

The cause is difficult to understand until you submit directly to PBS as user uc3.

Solution: add uc3 to the list of authorized users
[root@itbv-pbs ~]# qmgr -c 'p s' | grep uc3
[root@itbv-pbs ~]# qmgr -c 'set server authorized_users+=uc3@itb*.uchicago.edu'
[root@itbv-pbs ~]# qmgr -c 'set server authorized_users+=uc3@itb*.mwt2.org'
[root@itbv-pbs ~]# qmgr -c 'set server authorized_users+=uc3@vtb*.uchicago.edu'
[root@itbv-pbs ~]# qmgr -c 'set server authorized_users+=uc3@vtb*.mwt2.org'
[root@itbv-pbs ~]# qmgr -c 'p s' | grep uc3
set server authorized_users += uc3@vtb*.uchicago.edu
set server authorized_users += uc3@itb*.uchicago.edu
set server authorized_users += uc3@vtb*.mwt2.org
set server authorized_users += uc3@itb*.mwt2.org

# and to make it persistent in Puppet:
vi ./modules/pbs/files/pbs_user_authorizations
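
The four qmgr commands above differ only in the host pattern, so they can be generated for any user. A minimal sketch follows; print_authorize_cmds is a hypothetical helper that only prints the commands and does not run qmgr itself.

```shell
# Print the qmgr commands that authorize one user on each host pattern
# used on this page. Hypothetical helper; it prints, it does not execute.
print_authorize_cmds() {
    user="$1"
    for pattern in 'itb*.uchicago.edu' 'itb*.mwt2.org' \
                   'vtb*.uchicago.edu' 'vtb*.mwt2.org'; do
        printf "qmgr -c 'set server authorized_users+=%s@%s'\n" "$user" "$pattern"
    done
}
```

Running print_authorize_cmds uc3 reproduces the commands shown above. After reviewing the output, it can be run as root on the PBS server.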

Collector on a non-standard port

Since there is another Condor running on the host, the collector for the Campus Factory runs on a non-standard port: itb2.uchicago.edu:39618. Adding "itb2.uchicago.edu:39618" to FLOCK_TO in the Schedd is not sufficient.

Solution: add only the hostname (itb2.uchicago.edu) to the ALLOW_NEGOTIATOR_SCHEDD list:
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR_SCHEDD), 128.135.158.225, itb2.uchicago.edu, itb2.mwt2.org
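
To confirm that the bare hostname (without the port) is really in the list, a small check can be sketched as follows. list_has_entry is a hypothetical helper, not part of Condor; it treats commas and spaces as separators and requires an exact match.

```shell
# Succeed if the comma- or space-separated list in $1 contains exactly
# the entry $2. Hypothetical helper, not part of Condor.
list_has_entry() {
    printf '%s\n' "$1" \
        | tr -s ', ' '\n\n' \
        | grep -Fxq -- "$2"      # -F fixed string, -x whole line, -q quiet
}
```

On the submit host this could be used as list_has_entry "$(condor_config_val ALLOW_NEGOTIATOR_SCHEDD)" itb2.uchicago.edu. An entry written as itb2.uchicago.edu:39618 does not match, which is the situation described above.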

-- MarcoMambelli - 08 Mar 2012
Topic revision: r4 - 05 Apr 2012, MarcoMambelli