
SL6Migration

SL6 Migration
Rolling Transition one regional site at a time
Create SL6 Panda Queues
Create an SL6 gatekeeper
Worker node

SL6 Migration

USAtlas is scheduled to begin migrating all Tier2 sites to SL6 on June 1, with completion by June 30 (the WLCG schedule is August). After some initial testing, a plan has been developed for migrating a site such as MWT2.

Initial testing has shown that it is not possible to reliably run a site in a mixed SL5/SL6 configuration. The procedure recommended by Atlas at SLC6Readiness is to convert an entire site either in a "Big Bang" or by a "Rolling Transition".

The following plans describe how to move MWT2 via a "Rolling Transition".

Rolling Transition one regional site at a time

The easiest and safest way to upgrade MWT2 is via a rolling transition. Each regional site (UChicago, Indiana, Illinois) can be upgraded separately from the others. In this way only part of the MWT2 site will ever be down for an extended period while the worker nodes, etc. are upgraded to SL6. Also, should a problem develop with the SL6 deployment, part of the site will remain with SL5 capabilities.

Since each regional site has its own gatekeeper, condor head node and condor pool of worker nodes, upgrading each site individually is an easy, less stressful procedure.

To perform a rolling transition, we need to take the following steps.

Create two new Panda Qs associated only with an SL6 GK

Clone MWT2 as MWT2_SL6 and ANALY_MWT2 as ANALY_MWT2_SL6. These clones would then associate only with the SL6 gatekeeper.

Create an "SL6" enabled gatekeeper

This gatekeeper will advertise to the BDII only SL6 validated releases. The $APP (and grid3-locations) would be different from those on the SL5 nodes. Pilots which glide into this node will then be run only on SL6 compute nodes.

Set up a new validation in LJSFi for the SL6 GK

The validations for SL6 releases would be sent to this GK and run on SL6 compute nodes. Initially the BDII for the GK is empty and thus no jobs will be submitted to the two Panda Qs. As the validations succeed and the BDII becomes populated, jobs will be submitted to the GK.

Gatekeeper submits jobs to SL6 nodes

The gatekeeper must participate in a Condor pool with only SL6 deployed nodes. A Condor head node (collector/negotiator) separate from the SL5 pools is needed for this functionality.

OSG WN-Client 3.1

The OSG WN-Client 3.1 is needed for SL6 support. This product will be deployed via CVMFS from the OSG WN-Client 3.1 Tarball project.

Create SL6 Panda Queues

This procedure is done with AGIS and involves three steps: create a PANDA resource; create a PANDA queue; associate a CE (gatekeeper) with the Panda Queue. Lastly, the queues need to be changed to "manual" and "offline" and have APF enabled.

Create a PANDA Resource

Two new resources need to be created for MWT2 SL6 Queues. They are called MWT2_SL6 and ANALY_MWT2_SL6.

To create a new PANDA resource, select "Define PANDA resource" on the AGIS home page

In the "PANDA Site:" box, specify the site "MidwestT2". A popup of possibilities appears once you begin typing.
Enter the "Name of PANDA resource" in that box; MWT2_SL6 or ANALY_MWT2_SL6.
Select GRID as the "Resource type" from the pull-up list.
Click the "Check input data" button.
If all is well, a new "Save PANDA Resource" button will appear. Click it and the resource will be created.

Create the PANDA Queue

The two new queues can now be created. AGIS has a nice clone function. Since most of the values we want to use for the new SL6 queues will be the same as for the current SL5 queues, we can just clone the existing queues and then make the appropriate changes.

To create a new PANDA Queue, select "Define PANDA queue" on the AGIS home page

In the "Specify PANDA Queue" box, enter the name of the queue to clone (MWT2-condor or ANALY-MWT2-condor), then click on "Clone"
Change the PANDA Resource Name to the appropriate SL6 resource created above (MWT2_SL6 or ANALY_MWT2_SL6)
Change the PANDA Queue Name. Use the same name as the resource for consistency (MWT2_SL6 or ANALY_MWT2_SL6)
Specify the Type of the queue via the pull-down choices (production or analysis)
Change the value of the jdl field to the same name as the resource (MWT2_SL6 or ANALY_MWT2_SL6)
Change the gatekeeper to "mwt2-gk.campuscluster.illinois.edu"
Click the "Save and continue" button at the bottom of the form

Associate a CE to the queue

The new queue has to be associated with a gatekeeper.

To find and modify a PANDA Queue, select "PANDA Queue" at the top of the AGIS home page

In the box above "Panda Site", enter "MidwestT2". This will filter the list to only MidwestT2 queues.
Another filter option would be "MWT2" in the "Atlas Site", "Panda Queue" or "Panda Resource" fields
Select the appropriate PANDA Queue to modify (MWT2_SL6 or ANALY_MWT2_SL6)
Click on the "Find and associate another CE/Queue"
In the "Search CE queues" box, type in the name of your gatekeeper (e.g. mwt2-gk) and click the "Search" button
Check the box with the "default" entry
Click on "Save". The gatekeeper/queue should now show up in the "Associated CE queues" section

Wait for the Queues to be created

The process of Queue creation by AGIS can take up to 20 minutes. If the new queues do not show up after 30 minutes in the Clouds, Production or Analysis pages, you will need to email Alden as the update system might be hung.

Modify the Panda "Status" settings

There are two settings on a Panda Queue that are not controlled by AGIS and need to be changed: "Status" and "Status Control".

Status Control should be "manual"

curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setmanual&queue=MWT2_SL6'
curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setmanual&queue=ANALY_MWT2_SL6'

Status should be "offline" until we are ready to test

curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=MWT2_SL6'
curl -k --cert /tmp/x509up_u`id -u` 'https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=ANALY_MWT2_SL6'
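
The four calls above differ only in the action and the queue name, so they could be generated by a small helper. The following is a minimal sketch, assuming the same endpoint and proxy location as above. The function names panda_url and set_queue_status and the DRY_RUN variable are illustrative names introduced here, not part of any Panda tool.

```shell
#!/bin/sh
# Sketch only: panda_url, set_queue_status and DRY_RUN are hypothetical names.
PANDA_BASE='https://panda.cern.ch:25943/server/controller/query'

# Build the controller URL for an action (setmanual, setoffline) and a queue.
panda_url() {
    printf '%s?tpmes=%s&queue=%s\n' "$PANDA_BASE" "$1" "$2"
}

# Apply one action to one queue. With DRY_RUN=1 only print the URL.
set_queue_status() {
    url=$(panda_url "$1" "$2")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$url"
    else
        curl -k --cert "/tmp/x509up_u$(id -u)" "$url"
    fi
}

# Example (left commented out so that sourcing this file contacts nothing):
# for q in MWT2_SL6 ANALY_MWT2_SL6; do
#     set_queue_status setmanual  "$q"
#     set_queue_status setoffline "$q"
# done
```

Running first with DRY_RUN=1 shows the URLs that would be requested before anything is sent to the server.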

Create an SL6 gatekeeper

MWT2 currently has four gatekeepers, one of which can be used to start the rolling upgrade. At Illinois, many worker nodes are already deployed with EL6 (used for testing the OSG WN-Client 3.1 Tarball deployment). The gatekeeper at the Illinois regional site, mwt2-gk.campuscluster.illinois.edu, is a good candidate for the first step of a rolling deployment.

SL6 $APP

The SL6 cluster must use a $APP area that is completely independent of the SL5 cluster. On MWT2, the $APP area is referenced either by the path /share/osg/mwt2/app or /osg/mwt2/app (softlink to /share/osg/mwt2/app). This area is either mounted via NFS on nodes used for validation jobs (where $APP is writeable), or via CVMFS in read-only mode.

For the SL6 cluster at Illinois, the $APP directory was created on the GPFS file system using the SL5 $APP as a template. The grid3-locations file must be cleared of all tags as they will be re-validated with SL6 compatible releases.

cd /share/osg/mwt2
mv app app.sl5
cp -pr app.sl5 app
cd /share/osg/mwt2/app/etc
rm grid3-locations.txt
touch grid3-locations.txt
chown usatlas2:usatlas grid3-locations.txt
chmod 666 grid3-locations.txt
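
After these commands it may be worth confirming that the tag file really is empty and writeable before any validation jobs run. A minimal sketch, where check_locations is a hypothetical helper name introduced here:

```shell
#!/bin/sh
# Sketch only: check_locations is an illustrative helper, not an OSG tool.
# check_locations FILE: succeed only if FILE exists, is empty and is writeable.
check_locations() {
    f=$1
    [ -f "$f" ]   || { echo "missing: $f"; return 1; }
    [ ! -s "$f" ] || { echo "not empty: $f"; return 1; }
    [ -w "$f" ]   || { echo "not writeable: $f"; return 1; }
    echo "ok: $f"
}
```

For the layout described above this would be called as check_locations /share/osg/mwt2/app/etc/grid3-locations.txt.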

At a later date, this area will be moved to an NFS exportable file system on uct2-grid11.uchicago.edu. It will be replicated into the CVMFS repository at /cvmfs/osg.mwt2.org/mwt2/app. All nodes will use the CVMFS distribution (via a soft-link /share/osg/mwt2/app --> /cvmfs/osg.mwt2.org/mwt2/app) except for those nodes needing write access. Those nodes (normally only a validation node) will mount the file system directly.

$APP Moved to uct2-grid11.uchicago.edu

The above directory was migrated to uct2-grid11.uchicago.edu:/exports/sas/osg/mwt2/app.sl6. This area was then replicated via rsync on the CVMFS stratum-0 server uct2-cvmfs.mwt2.org into /cvmfs/osg.mwt2.org/mwt2/app. On the validation nodes (those requiring write access), this file system was mounted in a writeable fashion from uct2-grid11. On all other nodes, where only read access to $APP is needed, the soft-link /share/osg/mwt2/app --> /cvmfs/osg.mwt2.org/mwt2/app was created.
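
On a read-only node, the soft-link can be checked with a short test. This is a sketch under the assumption that the link and target paths are the ones named in the plan above. The function name app_links_to is hypothetical.

```shell
#!/bin/sh
# Sketch only: app_links_to is an illustrative helper name.
# app_links_to LINK TARGET: succeed if LINK is a symlink whose target is TARGET.
app_links_to() {
    [ -L "$1" ] && [ "$(readlink "$1")" = "$2" ]
}
```

For example, app_links_to /share/osg/mwt2/app /cvmfs/osg.mwt2.org/mwt2/app should succeed on a worker node and fail on a validation node, where the area is mounted directly.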

Remove FLOCKing

Flocking to/from the UChicago and Indiana gatekeepers/condors is removed to avoid cross contamination of SL5/SL6 jobs. This isolates the gatekeeper from the other two regional sites. To do this, the condor configuration files in puppet for 30-flocking.conf were modified to remove references to mwt2-gk.campuscluster.illinois.edu and mwt2-condor.campuscluster.illinois.edu. These changes were then pushed to the other gatekeepers and condor head nodes. The same file was modified on mwt2-gk and mwt2-condor to remove the FLOCK_TO and FLOCK_FROM definitions. As other regional sites are migrated to SL6, their gatekeeper/condor pairs will be moved from the SL5 cluster and placed in the FLOCK on the SL6 cluster.
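
To confirm that no stale references remain on an SL5 gatekeeper or head node, the configuration directory can be searched for the two Illinois host names. A minimal sketch follows. The function name flock_refs is hypothetical, and the directory passed to it is an assumption: /etc/condor/config.d is a common location, but the actual path depends on how puppet lays out the files.

```shell
#!/bin/sh
# Sketch only: flock_refs is an illustrative helper name.
# flock_refs DIR HOST...: list files under DIR that still mention any HOST.
flock_refs() {
    dir=$1
    shift
    for host in "$@"; do
        # -F: fixed string, -l: print file names only, -r: recurse into DIR
        grep -rlF -- "$host" "$dir" 2>/dev/null
    done | sort -u
}
```

An empty result from flock_refs /etc/condor/config.d mwt2-gk.campuscluster.illinois.edu mwt2-condor.campuscluster.illinois.edu means no file still names either host. On a running node, condor_config_val FLOCK_TO and condor_config_val FLOCK_FROM report the values Condor actually uses.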

Change "cluster" and "resource_group"

The SL6 gatekeeper must not participate in the same cluster_name or resource_group as the SL5 gatekeepers. This is to prevent the publication of SL5 only validated releases. On mwt2-gk, the following changes were made:

/etc/osg/config.d/30-gip.ini


   cluster_name = MWT2-Condor

changed to

   cluster_name = MWT2-SL6-Cluster

/etc/osg/config.d/40-siteinfo.ini


   resource_group = MWT2

changed to

   resource_group = MWT2-SL6

Currently it is unclear if the "resource_group" must be registered in the OIM for the site. The documentation states that it must, however, it appears that this is not the case. It is only important that the SL6 gatekeepers have different values from the SL5.
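
Since the requirement is only that the two values differ between the SL5 and SL6 gatekeepers, it can help to print them on each host and compare. A minimal sketch, assuming simple "key = value" lines as shown above. The function name ini_value is hypothetical, and the sketch ignores ini section headers.

```shell
#!/bin/sh
# Sketch only: ini_value is an illustrative helper, not part of osg-configure.
# ini_value FILE KEY: print the value of the first "KEY = value" line in FILE.
ini_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | head -n 1
}
```

On mwt2-gk, ini_value /etc/osg/config.d/30-gip.ini cluster_name and ini_value /etc/osg/config.d/40-siteinfo.ini resource_group should then print the new SL6 values rather than the ones still used on the SL5 gatekeepers.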

The following can be used to verify that a gatekeeper is publishing the correct tags

lcg-info --vo atlas --list-ce  --query 'CE=mwt2-gk.campuscluster.illinois.edu*' --attr Tag

Only the tags contained in the grid3-locations file accessible to the CE should be returned. If the tags from the SL5 gatekeepers or the value UNDEF are returned, then one or more of the above values is the same as on another gatekeeper.

AutoPyFactory

The APF needs to be reconfigured so that it does not submit pilots for the existing SL5 Panda Queues, MWT2 and ANALY_MWT2, to the SL6 gatekeeper. Jose removed the factories for the SL5 queues from mwt2-gk.campuscluster.illinois.edu. He then created new factories for the new SL6 queues pointing to this gatekeeper. At this point the SL6 cluster should only receive pilots from the SL6 queues.

Worker node

The worker nodes associated with the SL6 cluster must be reinstalled with SL6, the

Topic revision: r6 - 17 Jun 2013, DaveLesny
