Ceph-based SE Deployment Instructions
Introduction
Ceph is a highly scalable, open-source distributed storage platform maintained by Red Hat. This document provides instructions for deploying a CephFS-based OSG Storage Element (SE).
Requirements
This document assumes the following:
Deploying CephFS with erasure coding
Create the data pool
In our setup, we've chosen an erasure coding profile with 10 data chunks (k) and 3 parity chunks (m). Since we have 14 servers (at the time of writing), this allows us to tolerate 3 simultaneous disk failures or one node failure. As Ceph does not (by default) place more than one chunk of the same object on a single server, the upper limit of k+m is the number of servers in your configuration (for us, 14).
n.b. As the distribution of placement groups (pgs) is pseudorandom and not entirely homogeneous, it's a best practice to make k + m < # of nodes. Otherwise some placement groups may not be able to find a distinct node for every chunk and will therefore be stuck in a degraded state.
To create the erasure coded pool, you'll need to create an erasure coding profile, then create a pool that uses said profile:
# ceph osd erasure-code-profile set ec-k10-m3 k=10 m=3
# ceph osd pool create fs-data-ec 1024 1024 erasure ec-k10-m3
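To confirm that the profile and the pool are wired together as intended, the settings can be read back:
# ceph osd erasure-code-profile get ec-k10-m3
# ceph osd pool get fs-data-ec erasure_code_profile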
CephFS cannot use an erasure coded pool directly, as operations like in-place updates are not supported by erasure coded pools. You'll need to front the erasure coded pool with a cache pool which will then flush objects down to the EC pool over time or as cache pressure requires.
CephFS also requires a pool for POSIX metadata (ownership, permissions, etc). It's typically very small compared to the data itself, so there's no need to erasure code it. You could easily (and probably should) increase the replication on this pool to some higher number (perhaps m+1).
# ceph osd pool create fs-hotpool 2048 2048 replicated
# ceph osd pool create fs-metadata 2048 2048 replicated
# ceph osd pool set fs-hotpool size 4
# ceph osd pool set fs-metadata size 4
Configuring the cache tier
Refer to the Ceph documentation for more information about the cache configuration.
# ceph osd tier add fs-data-ec fs-hotpool
# ceph osd tier cache-mode fs-hotpool writeback
# ceph osd tier set-overlay fs-data-ec fs-hotpool
# ceph osd pool set fs-hotpool target_max_bytes 5000000000000
# ceph osd pool set fs-hotpool cache_target_dirty_ratio 0.4
# ceph osd pool set fs-hotpool cache_target_full_ratio 0.8
# ceph osd pool set fs-hotpool cache_min_flush_age 600
# ceph osd pool set fs-hotpool cache_min_evict_age 1800
# ceph osd pool set fs-hotpool hit_set_type bloom
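To verify that the tier relationship and the cache parameters took effect, inspect the pool entries in the OSD map (pool names as created above):
# ceph osd dump | grep -E 'fs-hotpool|fs-data-ec'
The fs-hotpool entry should show cache_mode writeback and the target/ratio values set above, and fs-data-ec should reference it as a tier.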
The Ceph filesystem requires an additional metadata service (MDS) to be running. To deploy the filesystem on top of it, you'll need to figure out the integer pool numbers for your new pools.
# ceph osd lspools
26 rbd,35 fs-data-ec,36 fs-hotpool,37 fs-metadata,
Then, create a new filesystem under the MDS. The first parameter is the metadata pool, followed by the data pool (NOT the cache pool):
# ceph mds newfs 37 35 --yes-i-really-mean-it
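Once the filesystem exists it can be mounted with the kernel client. A minimal sketch, assuming a monitor host named ceph-mon1.grid.uchicago.edu and an admin secret in /etc/ceph/admin.secret (both placeholders; adjust for your site). The GridFTP servers described later will need a mount like this at /cephfs:
# mkdir -p /cephfs
# mount -t ceph ceph-mon1.grid.uchicago.edu:6789:/ /cephfs -o name=admin,secretfile=/etc/ceph/admin.secret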
Monitoring Ceph
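A few generic Ceph commands are useful for keeping an eye on cluster health and on how the cache tier is behaving (nothing here is specific to this deployment):
# ceph -s
# ceph health detail
# ceph df
# ceph osd tree
ceph df in particular shows per-pool usage, which makes it easy to watch objects being flushed from fs-hotpool down to fs-data-ec.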
Installing BeStMan
First, make sure that you've installed the OSG repositories and EPEL. If not:
# yum install epel-release
# yum localinstall http://repo.grid.iu.edu/osg/3.2/osg-3.2-el6-release-latest.rpm
Once done, you'll also need to install the BeStMan RPM:
# yum install osg-se-bestman-xrootd
Edit the file /etc/bestman2/conf/bestman2.rc and modify/include the following lines:
supportedProtocolList=gsiftp://ceph03.grid.uchicago.edu;gsiftp://ceph04.grid.uchicago.edu;gsiftp://ceph05.grid.uchicago.edu;gsiftp://ceph06.grid.uchicago.edu;gsiftp://ceph07.grid.uchicago.edu;gsiftp://ceph08.grid.uchicago.edu;gsiftp://ceph09.grid.uchicago.edu;gsiftp://ceph10.grid.uchicago.edu;gsiftp://ceph11.grid.uchicago.edu;gsiftp://ceph12.grid.uchicago.edu;gsiftp://ceph13.grid.uchicago.edu;gsiftp://ceph14.grid.uchicago.edu;gsiftp://ceph15.grid.uchicago.edu;gsiftp://ceph16.grid.uchicago.edu
localPathListAllowed=/cephfs
securePort=8443
CertFileName=/etc/grid-security/bestman/bestmancert.pem
KeyFileName=/etc/grid-security/bestman/bestmankey.pem
staticTokenList=ATLASSCRATCHDISK[desc:ATLASSCRATCHDISK][20000][owner:usatlas1][retention:REPLICA][usedBytesCommand:/usr/local/bin/used-bytes.sh]
## change the checksum algorithm for ATLAS
defaultChecksumType=adler32
showChecksumWhenListingFile=true
hexChecksumCommand=sudo /usr/local/bin/adler32.py
A brief description of the more interesting parameters:
Config variable       | Description
supportedProtocolList | List of GridFTP servers
localPathListAllowed  | Paths allowed to be accessed by transfer requests
CertFileName          | Certificate file, a copy of your hostcert.pem owned by the BeStMan user
KeyFileName           | Likewise for the hostkey.pem
staticTokenList       | Space tokens as defined in AGIS
hexChecksumCommand    | External plugin for returning the adler32 checksum of a file
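The usedBytesCommand above points at /usr/local/bin/used-bytes.sh, which is not reproduced here. Purely as an illustration, a minimal version might just report the bytes used under the token's directory, assuming the token maps to /cephfs/atlas and that a bare integer on stdout is what BeStMan expects (both assumptions):
#!/bin/bash
# Hypothetical used-bytes.sh: print the number of bytes used under the
# ATLASSCRATCHDISK path. Assumes the token maps to /cephfs/atlas.
du -sb /cephfs/atlas | awk '{print $1}'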
Next you'll need to modify /etc/sudoers to include the BeStMan user for privileged operations:
Cmnd_Alias SRM_CMD = /bin/rm, /bin/mkdir, /bin/rmdir, /bin/mv, /bin/cp, /bin/ls
Runas_Alias SRM_USR = ALL, !root
bestman ALL=(SRM_USR) NOPASSWD: SRM_CMD
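To confirm the rule is in place, list what the bestman user is allowed to run:
# sudo -l -U bestman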
Finally, start the service:
# service bestman2 start
Assuming that you are running GridFTP on a server separate from your BeStMan server, first install GridFTP:
# yum install osg-gridftp
After installation, X509-based authorization will need to be configured. There are two methods for this: a static gridmap file, or authorization against a GUMS server. This document assumes the latter.
In order to configure GUMS auth, you'll need to edit /etc/grid-security/gsi-authz.conf and uncomment:
globus_mapping liblcas_lcmaps_gt4_mapping.so lcmaps_callout
Then, edit /etc/lcmaps.db:
# Change this URL to your GUMS server
"--endpoint https://uct2-gums.mwt2.org:8443/gums/services/GUMSXACMLAuthorizationServicePort"
Note that any additional GridFTP servers will also need this configuration.
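A quick way to confirm the GUMS mapping works is to attempt a transfer from any machine with the Globus clients and a valid ATLAS proxy; the destination path under /cephfs is only an example:
$ voms-proxy-init -voms atlas
$ globus-url-copy file:///tmp/testfile gsiftp://ceph03.grid.uchicago.edu/cephfs/test/testfile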
Configuring checksumming for ATLAS DDM
A few components are needed here. ATLAS DDM requires that files be checksummed with adler32. The most efficient way to do this is to:
- Store the checksum within the extended attributes of the file as it is being written
- Read it upon request with an external script
In order to checksum on the fly, you'll need to install a GridFTP DSI that I've repackaged from INFN and simply restart GridFTP. It's not required, but highly recommended.
# yum localinstall http://bootstrap.mwt2.org/repo/MWT2-SL6/gridftp-dsi-storm-adler32-1.2.0-1.el6.x86_64.rpm
# /etc/init.d/globus-gridftp-server restart
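To check that the DSI is actually storing checksums, dump the extended attributes of a freshly written file (the path is an example, and the attribute name depends on the DSI, so this simply lists whatever user attributes are present):
# getfattr -d /cephfs/atlas/path/to/newly-written-file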
Recall that we have set the following in the BeStMan config:
defaultChecksumType=adler32
showChecksumWhenListingFile=true
hexChecksumCommand=sudo /usr/local/bin/adler32.py
The adler32.py plugin and its helper script calc_adler32.py are attached to this document. You'll need these to live in /usr/local/bin or modify hexChecksumCommand as appropriate.
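Those attached scripts are the ones actually in use. Purely as an illustration, a minimal stand-in would print the bare hex digest for the file passed as its first argument, which is what BeStMan expects from hexChecksumCommand; the sketch below computes it with xrdadler32 from the xrootd clients (an assumption about what is installed, and unlike the real plugin it does not read the value back from the extended attributes):
#!/bin/bash
# Hypothetical hexChecksumCommand: print the adler32 hex digest of "$1".
xrdadler32 "$1" | awk '{print $1}'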
Configuring the SE within AGIS
In order to configure a new SE in AGIS, you'll need to create a new DDM endpoint or find someone who can (I did not have permission to).
The settings should be something like the following:
Config var | Value | Description
DDM Endpoint | MWT2_UC_CEPH | The endpoint name as you would like it to appear in the list of Rucio SEs
Type | Test | I initially used test for my endpoint so users did not move files to it.
SRM | token:ATLASSCRATCHDISK:srm://ceph-se.osgconnect.net:8443/srm/v2/server?SFN= | This token needs to match whatever is in the BeStMan configuration
Token | ATLASSCRATCHDISK | Same as above
Domain | .*ceph-se.osgconnect.net.*/.* | Needs to match the hostname of the BeStMan server
Site | MWT2 |
ATLAS Site | MWT2 |
Storage element | MWT2-SRM-ceph (srm://ceph-se.osgconnect.net:8443/srm/v2/server?SFN=) | The name seems to be arbitrary, but the SRM URL needs to match.
Endpoint | /cephfs/atlas | Needs to match whatever is mounted and exported on the GridFTP servers
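Once things are registered (or even beforehand, directly against the SRM endpoint), an end-to-end test can be run from any machine with the gfal2 clients and an ATLAS proxy; the file name under /cephfs/atlas is only an example:
$ voms-proxy-init -voms atlas
$ gfal-copy file:///tmp/testfile "srm://ceph-se.osgconnect.net:8443/srm/v2/server?SFN=/cephfs/atlas/test/testfile"
$ gfal-ls -l "srm://ceph-se.osgconnect.net:8443/srm/v2/server?SFN=/cephfs/atlas/test/testfile"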
--
LincolnBryant - 11 Aug 2015