
Ceph-based SE Deployment Instructions

Introduction

Ceph is a highly scalable, open-source distributed storage platform supported by Red Hat. This document provides instructions for deploying a CephFS-based OSG Storage Element (SE).

Requirements

This document assumes the following:
Component | If not...
Ceph cluster running | http://ceph.com/docs/master/start/quick-ceph-deploy/
Host certificates in place | https://twiki.grid.iu.edu/bin/view/Documentation/Release3/InstallCertAuth
Site installation of GUMS | https://twiki.opensciencegrid.org/bin/view/ReleaseDocumentation/InstallConfigureAndManageGUMS

Deploying CephFS with erasure coding

Create the data pool

In our setup, we've chosen an erasure coding profile with 10 data chunks (k=10) and 3 parity chunks (m=3). Since we have 14 servers (at the time of writing), this allows us to tolerate 3 simultaneous disk failures or one node failure. As Ceph (by default) does not place chunks of the same object on the same server, the upper limit of k+m is the number of servers in your configuration (for us, 14).

N.b. as the distribution of placement groups (PGs) is pseudorandom and not perfectly homogeneous, it is best practice to keep k + m strictly less than the number of nodes. Otherwise some placement groups may not find enough distinct nodes to map onto and will be stuck in a degraded state.

To create the erasure coded pool, you'll need to create an erasure coding profile, then create a pool that uses said profile:
# ceph osd erasure-code-profile set ec-k10-m3 k=10 m=3
# ceph osd pool create fs-data-ec 1024 1024 erasure ec-k10-m3
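To confirm that the profile and pool were created as intended, you can read the settings back:
# ceph osd erasure-code-profile get ec-k10-m3
# ceph osd pool get fs-data-ec erasure_code_profile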

Create metadata and cache pools

CephFS cannot use an erasure coded pool directly, as operations like in-place updates are not supported by erasure coded pools. You'll need to front the erasure coded pool with a cache pool which will then flush objects down to the EC pool over time or as cache pressure requires.

CephFS also requires a pool for POSIX metadata (ownership, permissions, etc.). It's typically very small compared to the data itself, so there's no need to erasure code it; instead, you should increase its replication to a higher number (perhaps m+1).

# ceph osd pool create fs-hotpool 2048 2048 replicated
# ceph osd pool create fs-metadata 2048 2048 replicated
# ceph osd pool set fs-hotpool size 4
# ceph osd pool set fs-metadata size 4
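You can verify the replication settings afterwards:
# ceph osd pool get fs-hotpool size
# ceph osd pool get fs-metadata size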

Configuring the cache tier

Refer to the Ceph documentation for more information about the cache configuration.

# ceph osd tier add fs-data-ec fs-hotpool
# ceph osd tier cache-mode fs-hotpool writeback
# ceph osd tier set-overlay fs-data-ec fs-hotpool
# ceph osd pool set fs-hotpool target_max_bytes 5000000000000
# ceph osd pool set fs-hotpool cache_target_dirty_ratio 0.4
# ceph osd pool set fs-hotpool cache_target_full_ratio 0.8
# ceph osd pool set fs-hotpool cache_min_flush_age 600
# ceph osd pool set fs-hotpool cache_min_evict_age 1800
# ceph osd pool set fs-hotpool hit_set_type bloom
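The resulting settings can be read back per pool, or inspected all at once in the OSD map:
# ceph osd pool get fs-hotpool cache_target_dirty_ratio
# ceph osd dump | grep fs-hotpool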

Stand up the metadata service

The Ceph filesystem requires an additional metadata server (MDS) to be running. Before creating the filesystem, you'll need to look up the integer pool IDs of your new pools.

# ceph osd lspools
26 rbd,35 fs-data-ec,36 fs-hotpool,37 fs-metadata,

Then, create a new filesystem for the MDS to serve. The first parameter is the metadata pool, followed by the data pool (NOT the cache pool):
# ceph mds newfs 37 35 --yes-i-really-mean-it
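If an MDS daemon isn't already running somewhere, one can be created with ceph-deploy (the hostname below is a placeholder for one of your Ceph nodes) and the filesystem checked afterwards:
# ceph-deploy mds create ceph03
# ceph mds stat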

Monitoring Ceph
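For basic health and utilization checks, the standard status commands are a minimal starting point:
# ceph -s
# ceph health detail
# ceph df
# ceph osd tree
ceph df is particularly useful here for watching how full the cache pool is relative to its target_max_bytes.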

Installing BeStMan

First, make sure that you've installed EPEL and the OSG repositories. If not:
# yum install epel-release
# yum localinstall http://repo.grid.iu.edu/osg/3.2/osg-3.2-el6-release-latest.rpm

Once done, you'll also need to install the BeStMan RPM.
# yum install osg-se-bestman-xrootd

Configuring BeStMan

Edit the file /etc/bestman2/conf/bestman2.rc and modify/include the following lines:
supportedProtocolList=gsiftp://ceph03.grid.uchicago.edu;gsiftp://ceph04.grid.uchicago.edu;gsiftp://ceph05.grid.uchicago.edu;gsiftp://ceph06.grid.uchicago.edu;gsiftp://ceph07.grid.uchicago.edu;gsiftp://ceph08.grid.uchicago.edu;gsiftp://ceph09.grid.uchicago.edu;gsiftp://ceph10.grid.uchicago.edu;gsiftp://ceph11.grid.uchicago.edu;gsiftp://ceph12.grid.uchicago.edu;gsiftp://ceph13.grid.uchicago.edu;gsiftp://ceph14.grid.uchicago.edu;gsiftp://ceph15.grid.uchicago.edu;gsiftp://ceph16.grid.uchicago.edu
localPathListAllowed=/cephfs
securePort=8443
CertFileName=/etc/grid-security/bestman/bestmancert.pem
KeyFileName=/etc/grid-security/bestman/bestmankey.pem
staticTokenList=ATLASSCRATCHDISK[desc:ATLASSCRATCHDISK][20000][owner:usatlas1][retention:REPLICA][usedBytesCommand:/usr/local/bin/used-bytes.sh]
## change the checksum algorithm for ATLAS
defaultChecksumType=adler32
showChecksumWhenListingFile=true
hexChecksumCommand=sudo /usr/local/bin/adler32.py

A brief description of the more interesting parameters:
Config variable | Description
supportedProtocolList | List of GridFTP servers
localPathListAllowed | Paths allowed to be accessed by transfer requests
CertFileName | Certificate file, a copy of your hostcert.pem owned by the BeStMan user
KeyFileName | Likewise for the hostkey.pem
staticTokenList | Space tokens as defined in AGIS
hexChecksumCommand | External plugin for returning the adler32 checksum of a file
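The staticTokenList entry also references a usedBytesCommand helper (/usr/local/bin/used-bytes.sh) that BeStMan calls to report space usage for the token. That script isn't shown here; a minimal sketch, assuming the token's data lives under /cephfs/atlas (adjust the path and method to your layout), might be:

#!/bin/bash
# Report used bytes for the space token. du can be slow on large trees;
# CephFS's recursive ceph.dir.rbytes xattr is a faster alternative.
du -sb /cephfs/atlas | awk '{print $1}'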

Next, you'll need to modify /etc/sudoers to allow the BeStMan user to perform privileged operations:

Cmnd_Alias SRM_CMD = /bin/rm, /bin/mkdir, /bin/rmdir, /bin/mv, /bin/cp, /bin/ls
Runas_Alias SRM_USR = ALL, !root
bestman   ALL=(SRM_USR) NOPASSWD: SRM_CMD
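Because hexChecksumCommand in the BeStMan configuration invokes the checksum plugin via sudo, the bestman user will also need permission to run it. One possible approach, assuming the plugin lives at /usr/local/bin/adler32.py as configured above, is an additional command alias:

Cmnd_Alias SRM_CHECKSUM = /usr/local/bin/adler32.py
bestman   ALL=(ALL) NOPASSWD: SRM_CHECKSUM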

Finally, start the service:
# service bestman2 start
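To have the service come back after a reboot, also enable it at boot (assuming the standard SysV init script):
# chkconfig bestman2 on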

Configuring GridFTP

Assuming that you are running GridFTP on a server separate from your BeStMan server, first install GridFTP:
# yum install osg-gridftp
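The GridFTP servers also need the Ceph filesystem mounted at the exported path (/cephfs). One way to do this is with the kernel client, where the monitor hostname and secret file below are placeholders for your cluster's values:
# mkdir -p /cephfs
# mount -t ceph ceph-mon1.example.org:6789:/ /cephfs -o name=admin,secretfile=/etc/ceph/admin.secret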

After installation, X509-based authorization will need to be configured. There are two methods for this: either a static gridmap file or authorizing against a GUMS server. This document assumes the latter.
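For reference, the static alternative is an /etc/grid-security/grid-mapfile mapping certificate DNs to local accounts; the DN below is purely illustrative:
"/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=People/CN=Some User" usatlas1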

In order to configure GUMS auth, you'll need to edit /etc/grid-security/gsi-authz.conf and uncomment
globus_mapping liblcas_lcmaps_gt4_mapping.so lcmaps_callout

Then, edit /etc/lcmaps.db:
# Change this URL to your GUMS server
             "--endpoint https://uct2-gums.mwt2.org:8443/gums/services/GUMSXACMLAuthorizationServicePort"

Note that any additional GridFTP servers will also need this configuration.
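Once the mapping is in place, a quick end-to-end check is to push a file through one of the doors with a valid grid proxy (the hostname and destination path here are examples from this setup):
$ voms-proxy-init -voms atlas
$ globus-url-copy file:///tmp/testfile gsiftp://ceph03.grid.uchicago.edu/cephfs/atlas/testfile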

Configuring checksumming for ATLAS DDM

A few components are needed here. ATLAS DDM requires that files be checksummed with adler32. The most efficient way to do this is to:
  1. Store the checksum within the extended attributes of the file as it is being written
  2. Read it upon request with an external script

To checksum on the fly, install a GridFTP DSI that I've repackaged from INFN and restart GridFTP. This is not strictly required, but it is highly recommended.

# yum localinstall http://bootstrap.mwt2.org/repo/MWT2-SL6/gridftp-dsi-storm-adler32-1.2.0-1.el6.x86_64.rpm
# /etc/init.d/globus-gridftp-server restart

Recall that we have set the following in the BeStMan config:
defaultChecksumType=adler32
showChecksumWhenListingFile=true
hexChecksumCommand=sudo /usr/local/bin/adler32.py

The adler32.py plugin and its helper script calc_adler32.py are attached to this document. You'll need these to live in /usr/local/bin or modify hexChecksumCommand as appropriate.
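If you need to adapt the plugin, the core of it is just reading the stored checksum back and falling back to computing it when the attribute is missing. A rough shell equivalent, assuming the DSI stores the checksum in the user.storm.checksum.adler32 extended attribute (verify the attribute name against your DSI version; the attached scripts handle the fallback in the real setup):

#!/bin/bash
# Usage: adler32-sketch.sh <file>
FILE="$1"
# Prefer the checksum the GridFTP DSI stored at write time
SUM=$(getfattr --only-values -n user.storm.checksum.adler32 "$FILE" 2>/dev/null)
if [ -z "$SUM" ]; then
    # Fallback: compute adler32 directly (reads the whole file)
    SUM=$(python -c 'import sys,zlib; print("%08x" % (zlib.adler32(open(sys.argv[1],"rb").read()) & 0xffffffff))' "$FILE")
fi
echo "$SUM"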

Configuring the SE within AGIS

In order to configure a new SE in AGIS, you'll need to create a new DDM endpoint or find someone who can (I did not have permission to).

The settings should be something like the following:
Config var | Value | Description
DDM Endpoint | MWT2_UC_CEPH | The endpoint name as you would like it to appear in the list of Rucio SEs
Type | Test | I initially used "test" for my endpoint so users did not move files to it.
SRM | token:ATLASSCRATCHDISK:srm://ceph-se.osgconnect.net:8443/srm/v2/server?SFN= | This token needs to match whatever is in the BeStMan configuration
Token | ATLASSCRATCHDISK | Same as above
Domain | .*ceph-se.osgconnect.net.*/.* | Needs to match the hostname of the BeStMan server
Site | MWT2 |
ATLAS Site | MWT2 |
Storage element | MWT2-SRM-ceph (srm://ceph-se.osgconnect.net:8443/srm/v2/server?SFN=) | The name seems to be arbitrary, but the SRM URL needs to match.
Endpoint | /cephfs/atlas | Needs to match whatever is mounted and exported on the GridFTP servers

-- LincolnBryant - 11 Aug 2015