SL6 Migration
USAtlas is scheduled to begin migrating all Tier2 sites to SL6 on June 1, with completion by June 30 (the WLCG schedule is August).
After some initial testing, a plan has been developed for migrating a site such as MWT2.
Initial testing has shown that it is not possible to reliably run a site in a mixed SL5/SL6 configuration.
The procedure recommended by Atlas at SLC6Readiness is to convert an entire site either in a "Big Bang" or by a "Rolling Transition".
The plan below moves MWT2 via a "Rolling Transition".
Rolling Transition one regional site at a time
The easiest and safest way to upgrade MWT2 is via a rolling transition. Each regional site (UChicago, Indiana, Illinois) can be upgraded
separately from the others. In this way only part of the MWT2 site will ever be down for an extended period while the worker nodes, etc.
are upgraded to SL6. Also, should a problem develop with the SL6 deployment, part of the site will remain with SL5 capabilities.
Since each regional site has its own gatekeeper, Condor head node and Condor pool of worker nodes,
upgrading each site individually is an easier, less stressful procedure.
To perform a rolling transition, we need to take the following steps.
Create two new Panda Qs associated only with an SL6 GK
Clone MWT2 as MWT2_SL6 and ANALY_MWT2 as ANALY_MWT2_SL6
These clones would then associate only with the SL6 gatekeeper.
Create an "SL6" enabled gatekeeper
This gatekeeper will advertise to the BDII only SL6 validated releases.
The $APP (and grid3-locations) would be different from those on the SL5 nodes.
Pilots which glide into this node will then be run only on SL6 compute nodes.
Setup a new validation in LJSFi to the SL6 GK
The validations for SL6 releases would be sent to this GK and run on SL6 compute nodes.
Initially the BDII for the GK is empty, and thus no jobs will be submitted to the two Panda Qs.
As the validations succeed and the BDII becomes populated, jobs will be submitted to the GK.
Gatekeeper submits jobs to SL6 nodes
The gatekeeper must participate in a Condor pool with only SL6 deployed nodes.
A Condor head node (collector/negotiator) separate from the SL5 pools is needed for this functionality
OSG WN-Client 3.1
The OSG WN-Client 3.1 is needed for SL6 support. This product will be deployed via CVMFS from the
OSG WN-Client 3.1 Tarball project.
Create SL6 Panda Queues
This procedure is done with AGIS.
It involves three steps: create a PANDA resource, create a PANDA queue, and associate a CE (gatekeeper) with the PANDA queue.
Lastly, the queues need to be changed to "manual" and "offline" and have APF enabled.
Create a PANDA Resource
Two new resources need to be created for MWT2 SL6 Queues. They are called MWT2_SL6 and ANALY_MWT2_SL6.
To create a new PANDA resource, select "Define PANDA resource" on the AGIS home page.
- In the "PANDA Site:" box, specify the site "MidwestT2". A popup of possibilities appears once you begin typing.
- Enter the "Name of PANDA resource" in that box; MWT2_SL6 or ANALY_MWT2_SL6.
- Select GRID as the "Resource type" from the pull-down list.
- Click the "Check input data" button.
- If all is well, a new "Save PANDA Resource" button will appear. Click it and the resource will be created.
Create the PANDA Queue
The two new queues can now be created. AGIS has a clone function. Since most of the values we want to use for the new SL6 queues
will be the same as for the current SL5 queues, we can clone the existing queues and then make the appropriate changes.
To create a new PANDA Queue, select "Define PANDA queue" on the AGIS home page.
- In the "Specify PANDA Queue" box, enter the name of the queue to clone (MWT2-condor or ANALY-MWT2-condor), then click on "Clone"
- Change the PANDA Resource Name to the appropriate SL6 resource created above (MWT2_SL6 or ANALY_MWT2_SL6)
- Change the PANDA Queue Name. Use the same name as the resource for consistency (MWT2_SL6 or ANALY_MWT2_SL6)
- Specify the Type of the queue via the pull-down choices (production or analysis)
- Change the value of the jdl field to the same name as the resource (MWT2_SL6 or ANALY_MWT2_SL6)
- Change the gatekeeper to "mwt2-gk.campuscluster.illinois.edu"
- Click the "Save and continue" button at the bottom of the form
Associate a CE to the queue
The new queue has to be associated with a gatekeeper.
To find and modify a PANDA Queue, select "PANDA Queue" at the top of the AGIS home page.
- In the box above "Panda Site", enter "MidwestT2". This will filter the list to only MidwestT2 queues.
- Another filter option is to enter "MWT2" in the "Atlas Site", "Panda Queue" or "Panda Resource" fields
- Select the appropriate PANDA Queue to modify (MWT2_SL6 or ANALY_MWT2_SL6)
- Click on "Find and associate another CE/Queue"
- In the "Search CE queues" box, type in the name of your gatekeeper (e.g. mwt2-gk) and click the "Search" button
- Check the box with the "default" entry
- Click on "Save". The gatekeeper/queue should now show up in the "Associated CE queues" section
Wait for the Queues to be created
The process of Queue creation by AGIS can take up to 20 minutes. If the new queues do not show up
after 30 minutes in the Clouds, Production or Analysis pages, you will need to email Alden, as the update system might be hung.
Modify the Panda "Status" settings
There are two settings on a Panda Queue that are not controlled by AGIS and need to be changed: "Status" and "Status Control".
Status Control should be "manual"
curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setmanual&queue=MWT2_SL6'
curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setmanual&queue=ANALY_MWT2_SL6'
Status should be "offline" until we are ready to test
curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=MWT2_SL6'
curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=ANALY_MWT2_SL6'
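The four calls above differ only in the queue name and the command, so they could be generated by one helper. The following is a minimal sketch, assuming the same controller URL and proxy location used in the commands above; the function names panda_status_url and panda_set_status are hypothetical and not part of any Panda tool.

```shell
#!/bin/sh
# Hypothetical helpers; the URL and proxy path are copied from the commands above.

# Print the controller URL for a queue ($1) and a command ($2),
# where the command is setmanual or setoffline.
panda_status_url() {
    queue=$1
    cmd=$2
    printf 'https://panda.cern.ch:25943/server/controller/query?tpmes=%s&queue=%s\n' "$cmd" "$queue"
}

# Send one command for one queue, authenticating with the user's grid proxy.
panda_set_status() {
    curl -k --cert "/tmp/x509up_u$(id -u)" "$(panda_status_url "$1" "$2")"
}

# Example (requires a valid proxy and network access):
#   for q in MWT2_SL6 ANALY_MWT2_SL6; do
#       panda_set_status "$q" setmanual
#       panda_set_status "$q" setoffline
#   done
```

Keeping the URL construction separate from the curl call makes it possible to print and inspect the URLs before anything is sent.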
Create an SL6 gatekeeper
MWT2 currently has four gatekeepers, one of which can be used to start the rolling upgrade.
At Illinois, many worker nodes are already deployed with EL6 (used for testing the OSG WN-Client 3.1 Tarball deployment).
The gatekeeper at the Illinois regional site, mwt2-gk.campuscluster.illinois.edu, is a good candidate for the first step of a rolling deployment.
SL6 $APP
The SL6 cluster must use a $APP area that is completely independent of the SL5 cluster. On MWT2, the $APP area is referenced either by the path
/share/osg/mwt2/app or /osg/mwt2/app (softlink to /share/osg/mwt2/app). This area is mounted either via NFS on a node used for validation jobs
(where $APP is writeable) or via CVMFS in read-only mode.
For the SL6 cluster at Illinois, the $APP directory was created on the GPFS file system using the SL5 $APP as a template. The grid3-locations file
must be cleared of all tags as they will be re-validated with SL6 compatible releases.
cd /share/osg/mwt2
mv app app.sl5
cp -pr app.sl5 app
cd /share/osg/mwt2/app/etc
rm grid3-locations.txt
touch grid3-locations.txt
chown usatlas2:usatlas grid3-locations.txt
chmod 666 grid3-locations.txt
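After running the commands above, it may be worth confirming that the tag file really is empty and world-writeable before any validation jobs are sent. A minimal sketch, assuming GNU stat (as shipped with SL5/SL6); the function name check_locations_file is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: succeeds only if the file exists, is empty, and has mode 666.
check_locations_file() {
    f=$1
    [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
    [ ! -s "$f" ] || { echo "not empty: $f" >&2; return 1; }
    mode=$(stat -c '%a' "$f") || return 1   # GNU stat syntax
    [ "$mode" = "666" ] || { echo "unexpected mode $mode: $f" >&2; return 1; }
}

# Example:
#   check_locations_file /share/osg/mwt2/app/etc/grid3-locations.txt
```

The check does not look at file ownership; that could be added if the usatlas2:usatlas ownership matters on a given node.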
At a later date, this area will be moved to an NFS exportable file system on uct2-grid11.uchicago.edu. It will be replicated into the CVMFS repository at
/cvmfs/osg.mwt2.org/mwt2/app. All nodes will use the CVMFS distribution (via a soft-link /share/osg/mwt2/app --> /cvmfs/osg.mwt2.org/mwt2/app)
except for those nodes needing write access. Those nodes (normally only a validation node) will mount the file system directly.
$APP moved to uct2-grid11.uchicago.edu
The above directory was migrated to uct2-grid11.uchicago.edu:/exports/sas/osg/mwt2/app.sl6.
This area was then replicated via rsync on the CVMFS stratum-0 server uct2-cvmfs.mwt2.org into /cvmfs/osg.mwt2.org/mwt2/app.
On the validation nodes (those requiring write access), this file system was mounted in a writeable fashion from uct2-grid11.
On all other nodes, where only read access to $APP is needed, the soft-link /share/osg/mwt2/app --> /cvmfs/osg.mwt2.org/mwt2/app was created.
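On the read-only nodes, the soft-link can be verified with a small check. This is a sketch only; the function name check_app_link is hypothetical, and the paths in the example are the ones given earlier in this section for the CVMFS distribution.

```shell
#!/bin/sh
# Hypothetical check: succeeds only if $1 is a symlink whose target is exactly $2.
check_app_link() {
    link=$1
    want=$2
    [ -L "$link" ] || { echo "not a symlink: $link" >&2; return 1; }
    have=$(readlink "$link")
    [ "$have" = "$want" ] || { echo "$link -> $have (expected $want)" >&2; return 1; }
}

# Example:
#   check_app_link /share/osg/mwt2/app /cvmfs/osg.mwt2.org/mwt2/app
```

The check compares the link text itself, so it also works on a node where CVMFS has not yet mounted the repository.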
Remove FLOCKing
Flocking to/from the UChicago and Indiana gatekeepers/condors is removed to avoid cross contamination of SL5/SL6 jobs.
This isolates the gatekeeper from the other two regional sites. To do this, the condor configuration files in puppet for 30-flocking.conf
were modified to remove references to mwt2-gk.campuscluster.illinois.edu and mwt2-condor.campuscluster.illinois.edu.
These changes were then pushed to the other gatekeepers and condor head nodes. The same file was modified on mwt2-gk and mwt2-condor
to remove the FLOCK_TO and FLOCK_FROM definitions. As other regional sites are migrated to SL6, their gatekeeper/condor pair will be moved
from the SL5 cluster and placed in the FLOCK on the SL6 cluster.
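Once the puppet changes have been pushed, each SL5 head node can be checked for leftover references to the Illinois hosts. The sketch below assumes the flocking lists are comma- or whitespace-separated host names; the function name list_has_host is hypothetical, while condor_config_val is the standard HTCondor tool for querying a configuration value.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if host $2 appears as a whole entry in list $1.
# Only exact names are matched; wildcard entries in the list are not expanded.
list_has_host() {
    list=$1
    host=$2
    # Normalise commas, tabs and newlines to spaces, then match whole entries.
    case " $(printf '%s' "$list" | tr ',\t\n' '   ') " in
        *" $host "*) return 0 ;;
    esac
    return 1
}

# Example, run on an SL5 gatekeeper or head node:
#   if list_has_host "$(condor_config_val FLOCK_TO)" mwt2-gk.campuscluster.illinois.edu; then
#       echo "FLOCK_TO still references the SL6 gatekeeper" >&2
#   fi
```

The same check can be repeated for FLOCK_FROM and for mwt2-condor.campuscluster.illinois.edu.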
Change "cluster" and "resource_group"
The SL6 gatekeeper must not participate in the same cluster_name or resource_group as the SL5 gatekeepers. This is to prevent the publication of
SL5-only validated releases. On mwt2-gk, the following changes were made:
/etc/osg/config.d/30-gip.ini
cluster_name = MWT2-Condor
changed to
cluster_name = MWT2-SL6-Cluster
/etc/osg/config.d/40-siteinfo.ini
resource_group = MWT2
changed to
resource_group = MWT2-SL6
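To confirm that the new values are in place, the settings can be read back from the configuration files. This is a minimal sketch that ignores ini sections and simply prints the last "key = value" line it finds for the given key; the function name ini_value is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the last "key = value" line for key $2 in file $1.
# Section headers are ignored, and commented-out lines do not match.
ini_value() {
    file=$1
    key=$2
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" |
        sed 's/[[:space:]]*$//' |
        tail -n 1
}

# Example:
#   [ "$(ini_value /etc/osg/config.d/30-gip.ini cluster_name)" = "MWT2-SL6-Cluster" ] \
#       || echo "cluster_name has not been changed" >&2
```

Running the same helper on an SL5 gatekeeper and comparing the output is a quick way to see that the two clusters publish different values.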
Currently it is unclear whether the "resource_group" must be registered in the OIM for the site. The documentation states that it must; however, it appears that this is not the case. It is only important that the SL6 gatekeepers have different values from the SL5 gatekeepers.
The following can be used to verify that a gatekeeper is publishing the correct tags:
lcg-info --vo atlas --list-ce --query 'CE=mwt2-gk.campuscluster.illinois.edu*' --attr Tag
Only the tags contained in the grid3-locations file accessible to the CE should be returned. If the tags from the SL5 gatekeepers or the value UNDEF are returned, then one or more of the above values is the same as that of another gatekeeper.
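The UNDEF case described above can be caught automatically by filtering the query output. A minimal sketch, assuming only that the literal string UNDEF appears somewhere in the output when the problem occurs; the function name tags_look_valid is hypothetical, and the exact output layout of lcg-info is not relied upon.

```shell
#!/bin/sh
# Hypothetical filter: reads query output on stdin and fails if any line contains UNDEF.
# Empty input passes, since a freshly created gatekeeper publishes no tags yet.
tags_look_valid() {
    if grep -q 'UNDEF'; then
        echo "UNDEF published; check cluster_name and resource_group" >&2
        return 1
    fi
    return 0
}

# Example:
#   lcg-info --vo atlas --list-ce --query 'CE=mwt2-gk.campuscluster.illinois.edu*' --attr Tag \
#       | tags_look_valid
```

This does not detect tags leaking from the SL5 gatekeepers; that still requires comparing the returned tags with the contents of the SL6 grid3-locations file.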
AutoPyFactory
The APF needs to be reconfigured so that it does not submit pilots for the existing SL5 Panda Queues, MWT2 and ANALY_MWT2, to the SL6 gatekeeper.
Jose removed the factories for the SL5 queues from mwt2-gk.campuscluster.illinois.edu. He then created new factories for the new SL6 queues, pointing to this gatekeeper. At this point the SL6 cluster should only receive pilots from the SL6 queues.
Worker node
The worker nodes associated with the SL6 cluster must be reinstalled with SL6, the