CampusFactoryUcItbPbs
Intro
This page captures admin notes on setting up a Campus Factory gateway to the OSG Integration Testbed cluster (ITB).
Status
Done:
- Campus Factory has been installed on itb2.uchicago.edu under the user uc3 and starts correctly.
- Collector is: itb2.uchicago.edu:39618
- Configured to accept jobs from uc3-sub
- Submission from itb2
- Configured to report to the monitoring pool
- Test submission from uc3-sub
Todo:
Install
Requirements:
- host that can submit to local PBS cluster
- user with login access
- users must be authorized to submit to PBS
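A quick sanity check for the first requirement might look like the sketch below (hypothetical helper; it only checks that the PBS client tools are on the PATH of the submit host, not that the PBS server will accept the jobs):

```shell
#!/bin/sh
# Return 0 if the named command is found on PATH, 1 otherwise.
check_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# The PBS client tools must be available on the submit host.
if check_cmd qsub && check_cmd qstat; then
    echo "PBS client tools found"
else
    echo "PBS client tools missing" >&2
fi
```

Authorization on the PBS server side still has to be verified separately (see the Problems section on this page).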
Host itb2.uchicago.edu can submit jobs to itbv-pbs.uchicago.edu. Jobs will run as user uc3.
Installation was done following
https://twiki.grid.iu.edu/bin/view/Documentation/CampusFactoryInstall
NOTE: Initial problems were solved by changing the collector port, which was interfering with the local Condor installation.
Install dump
[uc3@itb2 ~]$ ls -al
total 32
drwx------ 2 uc3 osgvo 4096 Mar 2 16:10 .
drwxr-xr-x 55 root root 4096 Mar 2 16:10 ..
-rw-r--r-- 1 uc3 osgvo 18 May 26 2011 .bash_logout
-rw-r--r-- 1 uc3 osgvo 176 May 26 2011 .bash_profile
-rw-r--r-- 1 uc3 osgvo 124 May 26 2011 .bashrc
-rw-r--r-- 1 uc3 osgvo 500 Jan 23 2007 .emacs
-rw-r--r-- 1 uc3 osgvo 121 Jan 6 10:04 .kshrc
-rw-r--r-- 1 uc3 osgvo 658 Nov 24 2010 .zshrc
[uc3@itb2 ~]$ mkdir cf
[uc3@itb2 ~]$ mkdir ~/cf/condor
[uc3@itb2 ~]$ wget http://parrot.cs.wisc.edu//symlink/20120306161501/7/7.7/7.7.5/70ee28800249c9d3b3bb4bd88eedb5ad/condor-7.7.5-x86_64_rhap_5-stripped.tar.gz
--2012-03-06 18:58:25-- http://parrot.cs.wisc.edu//symlink/20120306161501/7/7.7/7.7.5/70ee28800249c9d3b3bb4bd88eedb5ad/condor-7.7.5-x86_64_rhap_5-stripped.tar.gz
Resolving parrot.cs.wisc.edu... 128.105.121.59
Connecting to parrot.cs.wisc.edu|128.105.121.59|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49489882 (47M) [application/x-gzip]
Saving to: “condor-7.7.5-x86_64_rhap_5-stripped.tar.gz”
100%[===============================================================================>] 49,489,882 73.9M/s in 0.6s
2012-03-06 18:58:26 (73.9 MB/s) - “condor-7.7.5-x86_64_rhap_5-stripped.tar.gz” saved [49489882/49489882]
[uc3@itb2 ~]$ mkdir /tmp/condor-src
[uc3@itb2 ~]$ cd /tmp/condor-src
[uc3@itb2 condor-src]$ tar xzf ~/condor-7.7.5-x86_64_rhap_5-stripped.tar.gz
[uc3@itb2 condor-src]$ cd condor-7.7.5-x86_64_rhap_5-stripped/
[uc3@itb2 condor-7.7.5-x86_64_rhap_5-stripped]$ ./condor_install --prefix=$HOME/cf/condor
Installing Condor from /tmp/condor-src/condor-7.7.5-x86_64_rhap_5-stripped to /share/home/osgvo/uc3/cf/condor
Unable to find a valid Java installation
Java Universe will not work properly until the JAVA
(and JAVA_MAXHEAP_ARGUMENT) parameters are set in the configuration file!
Condor has been installed into:
/share/home/osgvo/uc3/cf/condor
Configured condor using these configuration files:
global: /share/home/osgvo/uc3/cf/condor/etc/condor_config
local: /share/home/osgvo/uc3/cf/condor/local.itb2/condor_config.local
In order for Condor to work properly you must set your CONDOR_CONFIG
environment variable to point to your Condor configuration file:
/share/home/osgvo/uc3/cf/condor/etc/condor_config before running Condor
commands/daemons.
Created scripts which can be sourced by users to setup their
Condor environment variables. These are:
sh: /share/home/osgvo/uc3/cf/condor/condor.sh
csh: /share/home/osgvo/uc3/cf/condor/condor.csh
[uc3@itb2 condor-7.7.5-x86_64_rhap_5-stripped]$ cd ~/cf/
[uc3@itb2 cf]$ wget http://sourceforge.net/projects/campusfactory/files/CampusFactory-0.4.3/CampusFactory-0.4.3.tar.gz/download
--2012-03-06 19:02:41-- http://sourceforge.net/projects/campusfactory/files/CampusFactory-0.4.3/CampusFactory-0.4.3.tar.gz/download
Resolving sourceforge.net... 216.34.181.60
Connecting to sourceforge.net|216.34.181.60|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://downloads.sourceforge.net/project/campusfactory/CampusFactory-0.4.3/CampusFactory-0.4.3.tar.gz?r=&ts=1331082161&use_mirror=iweb [following]
--2012-03-06 19:02:41-- http://downloads.sourceforge.net/project/campusfactory/CampusFactory-0.4.3/CampusFactory-0.4.3.tar.gz?r=&ts=1331082161&use_mirror=iweb
Resolving downloads.sourceforge.net... 216.34.181.59
Connecting to downloads.sourceforge.net|216.34.181.59|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://iweb.dl.sourceforge.net/project/campusfactory/CampusFactory-0.4.3/CampusFactory-0.4.3.tar.gz [following]
--2012-03-06 19:02:42-- http://iweb.dl.sourceforge.net/project/campusfactory/CampusFactory-0.4.3/CampusFactory-0.4.3.tar.gz
Resolving iweb.dl.sourceforge.net... 70.38.0.134, 2607:f748:10:12::5f:2
Connecting to iweb.dl.sourceforge.net|70.38.0.134|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8199915 (7.8M) [application/x-gzip]
Saving to: “CampusFactory-0.4.3.tar.gz”
100%[===============================================================================>] 8,199,915 914K/s in 9.4s
2012-03-06 19:02:52 (854 KB/s) - “CampusFactory-0.4.3.tar.gz” saved [8199915/8199915]
[uc3@itb2 cf]$ tar xzf CampusFactory-0.4.3.tar.gz
[uc3@itb2 cf]$ export CONDOR_LOCATION=~/cf/condor
[uc3@itb2 cf]$ export FACTORY_LOCATION=~/cf/CampusFactory-0.4.3
[uc3@itb2 cf]$ mkdir $CONDOR_LOCATION/etc/config.d
[uc3@itb2 cf]$ pwd
/home/osgvo/uc3/cf
[uc3@itb2 cf]$ vi condor/etc/condor_config
[uc3@itb2 cf]$ . condor/condor.sh
[uc3@itb2 cf]$ vi condor/etc/condor_config
[uc3@itb2 cf]$ condor_config_val local_config_dir
/share/home/osgvo/uc3/cf/condor/etc/config.d
[uc3@itb2 cf]$ cp $FACTORY_LOCATION/share/condor/condor_config.factory $CONDOR_LOCATION/etc/config.d/condor_config.factory
[uc3@itb2 cf]$ cp $FACTORY_LOCATION/share/condor/condor_mapfile $CONDOR_LOCATION/etc/condor_mapfile
[uc3@itb2 cf]$ vi /home/osgvo/uc3/cf/condor/etc/config.d/condor_config.factory
[uc3@itb2 cf]$ vi /home/osgvo/uc3/cf/condor/local.itb2/condor_config.local
[uc3@itb2 cf]$ vi /home/osgvo/uc3/cf/condor/etc/config.d/condor_config.factory
[uc3@itb2 cf]$ vi ~/.forward
[uc3@itb2 cf]$ vi $CONDOR_LOCATION/libexec/glite/etc/batch_gahp.config
[uc3@itb2 cf]$ source $CONDOR_LOCATION/condor.sh
[uc3@itb2 cf]$ condor_master
[uc3@itb2 cf]$ condor_q
-- Submitter: itb2.uchicago.edu : <10.1.5.107:42403> : itb2.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
[uc3@itb2 cf]$ vi condor/etc/config.d/condor_config.factory
[uc3@itb2 cf]$ mkdir /tmp/cf-logs
[uc3@itb2 cf]$ ps aux | grep condor
condor 2438 0.0 0.0 32640 5344 ? Ss Feb22 2:17 /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
condor 2450 0.0 0.0 33208 5040 ? Ss Feb22 0:12 condor_schedd -f
root 2458 0.0 0.0 25212 3100 ? S Feb22 2:05 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 19002
uc3 23912 0.0 0.0 99420 6284 ? Ss 19:56 0:00 condor_master
uc3 23913 0.0 0.0 99368 7004 ? Ss 19:56 0:00 condor_collector -f
uc3 23914 0.0 0.0 101528 7732 ? Ss 19:56 0:00 condor_schedd -f
uc3 23915 0.0 0.0 101032 6464 ? Ss 19:56 0:00 condor_negotiator -f
uc3 23916 0.0 0.0 27152 2592 ? S 19:56 0:00 condor_procd -A /tmp/condor-lock.itb20.765848645315604/procd_pipe.SCHEDD -L /share/home/osgvo/uc3/cf/condor/local.itb2/log/ProcLog.SCHEDD -R 10000000 -S 60 -C 21064
uc3 23971 0.0 0.0 103304 848 pts/1 S+ 20:06 0:00 grep condor
[uc3@itb2 cf]$ export PYTHONPATH=$PYTHONPATH:$FACTORY_LOCATION/python-lib
[uc3@itb2 cf]$ export PATH=$PATH:$FACTORY_LOCATION/bin
[uc3@itb2 cf]$ cd $FACTORY_LOCATION
The Campus Factory collector was interfering with the local Condor installation, so it was moved to a different port.
New session:
[uc3@itb2 ~]$ vi cf/condor/local.itb2/condor_config.local
[uc3@itb2 ~]$ which condor_q
/share/home/osgvo/uc3/cf/condor/bin/condor_q
[uc3@itb2 ~]$ condor_restart
Can't find address for local master
Perhaps you need to query another pool.
[uc3@itb2 ~]$ condor_master
[uc3@itb2 ~]$ condor_config_val collector_host
itb2.uchicago.edu:39618
[uc3@itb2 cf]$ vi cf-setup.sh
[uc3@itb2 cf]$ mv cf-setup.sh variables.sh
[uc3@itb2 cf]$ cat variables.sh condor/condor.sh > cf-setup.sh
[uc3@itb2 cf]$ . ./cf-setup.sh
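The contents of variables.sh are not captured in the transcript; based on the exports issued earlier in the session, it presumably looks something like this sketch (paths are the ones used in this setup):

```shell
# variables.sh -- environment for the Campus Factory (reconstructed sketch;
# the actual file contents were not captured in the transcript).
# $HOME is used instead of ~ for portability across shells.
export CONDOR_LOCATION=$HOME/cf/condor
export FACTORY_LOCATION=$HOME/cf/CampusFactory-0.4.3
export PYTHONPATH=$PYTHONPATH:$FACTORY_LOCATION/python-lib
export PATH=$PATH:$FACTORY_LOCATION/bin
```

Concatenating this with condor/condor.sh (as done above) gives a single cf-setup.sh that sets up both the Condor and the Campus Factory environment.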
[uc3@itb2 CampusFactory-0.4.3]$ condor_submit ./share/factory.job
Submitting job(s).
1 job(s) submitted to cluster 3.
[uc3@itb2 CampusFactory-0.4.3]$ condor_q
-- Submitter: itb2.uchicago.edu : <10.1.5.107:52593> : itb2.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
[uc3@itb2 cf]$ vi ./cf-setup.sh
[uc3@itb2 cf]$ . ./cf-setup.sh
[uc3@itb2 cf]$ cd $FACTORY_LOCATION
[uc3@itb2 CampusFactory-0.4.3]$ cf start
Starting Campus Factory: [ OK ]
[uc3@itb2 CampusFactory-0.4.3]$ condor_q
-- Submitter: itb2.uchicago.edu : <10.1.5.107:52593> : itb2.uchicago.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
4.0 uc3 3/7 00:23 0+00:00:33 R 0 0.0 runfactory -c etc/
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
[uc3@itb2 CampusFactory-0.4.3]$ ls
bin etc python-lib share
[uc3@itb2 CampusFactory-0.4.3]$ condor_status -any
MyType TargetType Name
Collector None UcItbForUc3@itb2.uchicago.edu
Scheduler None itb2.uchicago.edu
DaemonMaster None itb2.uchicago.edu
Negotiator None itb2.uchicago.edu
Submitter None uc3@uc-itb.uc3.org
Files
condor config from factory:
# Initial part removed because overridden
# These have to stay here (and not in the local config) because they are parsed by the campus factory
# What hosts can flock jobs to this cluster.
FLOCK_FROM = uc3-cloud.uchicago.edu, uc3-cloud.mwt2.org, uc3-sub.uchicago.edu, uc3-sub.mwt2.org, ui-cr.uchicago.edu, ui-cr.mwt2.org
# Where jobs submitted here can run.
FLOCK_TO = itb.mwt2.org
# Internal ip addresses of the cluster
INTERNAL_IPS = 10.1.3.* 10.1.4.* 10.1.5 itb-c*.mwt2.org 128.135.158.241 uct2-6509.uchicago.edu
##############################################
# Things that are 'safe' to leave
#
# Where the certificate mapfile is located.
CERTIFICATE_MAPFILE=$(RELEASE_DIR)/etc/condor_mapfile
# What daemons should I run?
DAEMON_LIST = COLLECTOR, SCHEDD, NEGOTIATOR, MASTER
# Location of the PBS_GAHP to be used to submit the glideins.
GLITE_LOCATION = $(LIBEXEC)/glite
PBS_GAHP = $(GLITE_LOCATION)/bin/batch_gahp
# Remove glidein jobs that get put on hold for over 24 hours.
SYSTEM_PERIODIC_REMOVE = (GlideinJob == TRUE && JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*24*1)
#
# Security definitions
#
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = TRUE
SEC_DEFAULT_NEGOTIATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION = PREFERRED
SEC_DEFAULT_AUTHENTICATION_METHODS = FS,CLAIMTOBE
ALLOW_WRITE = $(FLOCK_FROM) $(FLOCK_TO) execute-side@matchsession $(INTERNAL_IPS) $(HOSTNAME)
ALLOW_READ = $(ALLOW_WRITE)
SEC_DEFAULT_ENCRYPTION = OPTIONAL
SEC_DEFAULT_INTEGRITY = REQUIRED
ALLOW_ADMINISTRATOR = $(FULL_HOSTNAME)
#DENY_WRITE = anonymous@*
#ALLOW_WRITE = *@ff.unl.edu *@prairiefire.unl.edu *@glidein.unl.edu execute-side@matchsession anonymous@claimtobe/10.158.*
#DENY_ADVERTISE_SCHEDD = *
#DENY_ADMINISTRATOR = anonymous@*
#ALLOW_ADMINISTRATOR = dweitzel@ff.unl.edu/ff-grid.unl.edu anonymous@claimtobe/ff.unl.edu
#DENY_DAEMON = anonymous@*
#ALLOW_DAEMON = $(ALLOW_ADMINISTRATOR) *@claimtobe/prairiefire.unl.edu *@claimtobe/glidein.unl.edu *@claimtobe/ff.unl.edu execute-side@matchsession *@claimtobe/129.93.227.* anonymous@claimtobe/10.158.*
#DENY_NEGOTIATOR = anonymous@glidein.unl.edu
#ALLOW_NEGOTIATOR = */ff-grid.unl.edu */prairiefire.unl.edu
#INTERNAL_FF = */10.158.* */ff-grid.unl.edu */129.93.227.*
#ALLOW_ADVERTISE_STARTD = $(INTERNAL_FF)
#ALLOW_ADVERTISE_MASTER = $(INTERNAL_FF)
#ALLOW_DAEMON = $(ALLOW_DAEMON) $(INTERNAL_FF)
#ALLOW_ADVERTISE_SCHEDD = */ff-grid.unl.edu $(FLOCK_FROM) */10.158.50.2
#SEC_READ_INTEGRITY = OPTIONAL
#SEC_CLIENT_INTEGRITY = OPTIONAL
#SEC_READ_ENCRYPTION = OPTIONAL
#SEC_CLIENT_ENCRYPTION = OPTIONAL
condor_config.local
## What machine is your central manager?
CONDOR_HOST = itb2.uchicago.edu
## Pathnames:
## Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = /share/home/osgvo/uc3/cf/condor
## Where is the local condor directory for each host?
## This is where the local config file(s), logs and
## spool/execute directories are located
LOCAL_DIR = /share/home/osgvo/uc3/cf/condor/local.$(HOSTNAME)
## Mail parameters:
## When something goes wrong with condor at your site, who should get
## the email?
CONDOR_ADMIN = uc3@itb2.uchicago.edu
## Full path to a mail delivery program that understands that "-s"
## means you want to specify a subject:
MAIL = /bin/mailx
## Network domain parameters:
## Internet domain of machines sharing a common UID space. If your
## machines don't share a common UID space, set it to
## UID_DOMAIN = $(FULL_HOSTNAME)
## to specify that each machine has its own UID space.
UID_DOMAIN = uc-itb.uc3.org
## Internet domain of machines sharing a common file system.
## If your machines don't use a network file system, set it to
## FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
## to specify that each machine has its own file system.
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
## This macro is used to specify a short description of your pool.
## It should be about 20 characters long. For example, the name of
## the UW-Madison Computer Science Condor Pool is ``UW-Madison CS''.
COLLECTOR_NAME = UcItbForUc3
## The user/group ID <uid>.<gid> of the "Condor" user.
## (this can also be specified in the environment)
## Note: the CONDOR_IDS setting is ignored on Win32 platforms
CONDOR_IDS = 21064.21000
## Condor needs to create a few lock files to synchronize access to
## various log files. Because of problems we've had with network
## filesystems and file locking over the years, we HIGHLY recommend
## that you put these lock files on a local partition on each
## machine. If you don't have your LOCAL_DIR on a local partition,
## be sure to change this entry. Whatever user (or group) condor is
## running as needs to have write access to this directory. If
## you're not running as root, this is whatever user you started up
## the condor_master as. If you are running as root, and there's a
## condor account, it's probably condor. Otherwise, it's whatever
## you've set in the CONDOR_IDS environment variable. See the Admin
## manual for details on this.
LOCK = /tmp/condor-lock.$(HOSTNAME)0.765848645315604
## When is this machine willing to start a job?
START = TRUE
## When to suspend a job?
SUSPEND = FALSE
## When to nicely stop a job?
## (as opposed to killing it instantaneously)
PREEMPT = FALSE
## When to instantaneously kill a preempting job
## (e.g. if a job is in the pre-empting stage for too long)
KILL = FALSE
#DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
## MM
SEC_DEFAULT_AUTHENTICATION_METHODS = FS,CLAIMTOBE
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True
#FLOCK_NEGOTIATOR_HOSTS =
#FLOCK_COLLECTOR_HOSTS =
#ALLOW_WRITE = itb4.uchicago.edu, itb4.mwt2.org, ui-gwms.uchicago.edu, ui-gwms.mwt2.org, uct3-edge5.uchicago.edu, login2.pads.ci.uchicago.edu pads.ci.uchicago.edu execute-side@matchsession 192.5.86.* 172.5.86.* pads.ci.uchicago.edu 128.135.125.142 login2
#ALLOW_WRITE = itbv-pbs.mwt2.org, itb4.uchicago.edu, itb4.mwt2.org, ui-gwms.uchicago.edu, ui-gwms.mwt2.org
#ALLOW_WRITE = $(CONDOR_HOST), 128.135.158.241, itbv-pbs.mwt2.org, itb4.uchicago.edu, itb4.mwt2.org, ui-gwms.uchicago.edu, ui-gwms.mwt2.org
## To be CCB for PADS must allow write from PADS as well (all WN):
#ALLOW_WRITE = $(CONDOR_HOST), 128.135.158.241, itbv-pbs.mwt2.org, itb4.uchicago.edu, itb4.mwt2.org, ui-gwms.uchicago.edu, ui-gwms.mwt2.org, *.pads.ci.uchicago.edu, 192.5.86.*
#
###
#BIND_ALL_INTERFACES = True
#NETWORK_INTERFACE = ip to use
COLLECTOR_HOST = $(CONDOR_HOST):39618
CONDOR_VIEW_HOST = uc3-cloud.uchicago.edu:39618
Problems
uc3 could not submit jobs
The problem was difficult to understand until submitting directly to PBS as user uc3.
Solution: add uc3 to the list of authorized users
[root@itbv-pbs ~]# qmgr -c 'p s' | grep uc3
[root@itbv-pbs ~]# qmgr -c 'set server authorized_users+=uc3@itb*.uchicago.edu'
[root@itbv-pbs ~]# qmgr -c 'set server authorized_users+=uc3@itb*.mwt2.org'
[root@itbv-pbs ~]# qmgr -c 'set server authorized_users+=uc3@vtb*.uchicago.edu'
[root@itbv-pbs ~]# qmgr -c 'set server authorized_users+=uc3@vtb*.mwt2.org'
[root@itbv-pbs ~]# qmgr -c 'p s' | grep uc3
set server authorized_users += uc3@vtb*.uchicago.edu
set server authorized_users += uc3@itb*.uchicago.edu
set server authorized_users += uc3@vtb*.mwt2.org
set server authorized_users += uc3@itb*.mwt2.org
# and to make it persistent in Puppet:
vi ./modules/pbs/files/pbs_user_authorizations
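The repetitive qmgr invocations above can be generated from a list of host patterns. A hypothetical helper (it only prints the commands; running them requires root on the PBS server):

```shell
#!/bin/sh
# Print the qmgr commands that authorize user $1 from each host
# pattern given as the remaining arguments (does not run them).
print_auth_cmds() {
    user="$1"; shift
    for pattern in "$@"; do
        echo "qmgr -c 'set server authorized_users+=${user}@${pattern}'"
    done
}

print_auth_cmds uc3 'itb*.uchicago.edu' 'itb*.mwt2.org' 'vtb*.uchicago.edu' 'vtb*.mwt2.org'
```

Piping the output to sh on itbv-pbs would apply the same four authorizations shown above.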
Collector on a non-standard port
Since another Condor instance is already running on the host, the Campus Factory collector runs on a non-standard port:
itb2.uchicago.edu:39618
Adding "itb2.uchicago.edu:39618" to FLOCK_TO in the Schedd is not sufficient.
Solution: add the hostname only, without the port (itb2.uchicago.edu), to the ALLOW_NEGOTIATOR_SCHEDD list:
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR_SCHEDD), 128.135.158.225, itb2.uchicago.edu, itb2.mwt2.org
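Putting the two pieces together, the submit-host (uc3-sub) side of the configuration presumably ends up with something like the following sketch (not the verified file; only the ALLOW_NEGOTIATOR_SCHEDD line above is confirmed):

```
# On the submit host: flock to the Campus Factory collector on its port...
FLOCK_TO = $(FLOCK_TO), itb2.uchicago.edu:39618
# ...but authorize the negotiator by hostname only, without the port.
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR_SCHEDD), 128.135.158.225, itb2.uchicago.edu, itb2.mwt2.org
```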
--
MarcoMambelli - 08 Mar 2012