Ceph Install
This guide walks you through setting up a fault-tolerant Ceph cluster for 'volatile' storage. When completed, the cluster will have:
- One metadata node (recommended for production)
- 2x replication (i.e., half usable capacity; effectively mirroring, but distributed across as many nodes as you have available)
- Horizontal expansion by adding more nodes/disks
- Full POSIX compliance!
Considerations
There are a number of parameters we ought to consider.
- Kernel. The stock kernel provided with Scientific Linux 6.4 is 2.6.32. This is rather old for newer filesystems, especially BTRFS.
- I have a custom kernel (3.11.0-UL2) available that supports the latest CephFS patches.
- Filesystem. Somewhat dependent on the kernel, and on Ceph's development state. From most stable to least stable (as far as Ceph is concerned): XFS/EXT4, BTRFS, ZFS.
- Replication/RAID levels. Using RAID underneath Ceph is generally discouraged, as RADOS already provides much of the redundancy normally ascribed to RAID.
Configuring the disks
The Ceph team recommends using XFS for production servers, although upstream Ceph development is centered around BTRFS.
This handy script will find all of the non-root disks in the machine (up to /dev/sdz), give each a GPT label, and format it as XFS:
DISKS=$(grep "[hsv]d[b-z]$" /proc/partitions | awk '{print $4}' | sort | xargs)
for disk in $DISKS; do
    parted -s /dev/${disk} mklabel gpt
    mkfs.xfs -f /dev/${disk}
done
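Since the script destroys data, it's worth previewing which disks the pattern will match first:
grep "[hsv]d[b-z]$" /proc/partitions | awk '{print $4}' | sort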
Installing the Ceph repo
We are currently installing Ceph from Puppet, using the Ceph repos:
rpm -Uvh http://ceph.com/rpm-cuttlefish/el6/x86_64/ceph-release-1-0.el6.noarch.rpm
There's also a separate, optional repo for the Ceph deployment tool, 'ceph-deploy'. I personally don't bother.
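With the repo in place, the Ceph packages themselves can be installed with yum:
yum install -y ceph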
Bootstrapping the Ceph Monitor
In lieu of using the ceph-deploy tool, we'll bootstrap the Ceph monitor from scratch. Here's what a minimal /etc/ceph/ceph.conf looks like:
[global]
auth cluster required = none
auth service required = none
auth client required = none
[mon.a]
host = dfs-m1
mon addr = 10.1.6.254
For production, you'll probably want to have 3 monitors and authentication turned on.
You'll first need to create a keyring for the Ceph monitors:
# ceph-authtool --create-keyring /etc/ceph/keyring --gen-key -n mon.
Once that's created, you'll need to create the initial map for Ceph's monitors. The address must match the 'mon addr' in ceph.conf:
# monmaptool --create --add a 10.1.6.254 --clobber /etc/ceph/monmap
Finally, instantiate the new monitor with the initial keyring and initial map. This monitor will be called "mon.a":
ceph-mon --mkfs -i a --monmap /etc/ceph/monmap --keyring /etc/ceph/keyring
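Assuming the default mon data directory for mon.a (/var/lib/ceph/mon/ceph-a, matching the layout used for the other monitors later in this guide), you can then start it with the same sysvinit script used for the other daemons:
/etc/init.d/ceph start mon.a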
Configuring the Metadata Server (MDS)
The Ceph metadata server is required for CephFS deployments, i.e., if you want a distributed filesystem with POSIX semantics.
When Ceph is deployed, it should have created three pools: 'data', 'metadata', and 'rbd'. Since the 'ceph mds' tool only takes pool ID numbers rather than pool names, we'll need to grab the ID numbers for 'data' and 'metadata' like so:
# ceph osd lspools
0 metadata,1 data,2 rbd,
You can then feed these IDs to 'ceph mds newfs' (metadata pool ID first, then data pool ID):
# ceph mds newfs 0 1 --yes-i-really-mean-it
Finally, add this entry to ceph.conf:
[mds.0]
host = faxbox2
addr = 10.1.5.67
You can then start the metadata daemon:
/etc/init.d/ceph start mds.0
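You can confirm the metadata server reaches the active state with:
# ceph mds stat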
Adding additional Monitors
Basically, follow this guide on the new monitor servers (m2, m3):
http://ceph.com/docs/master/rados/operations/add-or-rm-mons/
Configuring the Object Storage Daemons (OSDs)
The basic element of Ceph storage is the Object Storage Daemon (OSD). Generally, one OSD runs per disk available on the node. Alternatively, RAID can be used to present a single block device to an OSD, as in cases where disk-level redundancy is desired.
We'll need to create a directory for each disk we plan to mount on the storage node.
[root@dfs-s001 ~]# mkdir -p /var/lib/ceph/osd/ceph-{0,1,2,3,4,5}
And add the appropriate entries to /etc/fstab. Keep in mind that these need to increment for each successive server.
/dev/sdc /var/lib/ceph/osd/ceph-0 xfs defaults 0 0
/dev/sde /var/lib/ceph/osd/ceph-1 xfs defaults 0 0
/dev/sdd /var/lib/ceph/osd/ceph-2 xfs defaults 0 0
/dev/sdf /var/lib/ceph/osd/ceph-3 xfs defaults 0 0
/dev/sdg /var/lib/ceph/osd/ceph-4 xfs defaults 0 0
/dev/sdh /var/lib/ceph/osd/ceph-5 xfs defaults 0 0
For our second server, it would be:
[root@dfs-s002 osd]# mkdir -p /var/lib/ceph/osd/ceph-{6,7,8,9,10,11}
And the respective fstab:
/dev/sdc /var/lib/ceph/osd/ceph-6 xfs defaults 0 0
/dev/sde /var/lib/ceph/osd/ceph-7 xfs defaults 0 0
/dev/sdd /var/lib/ceph/osd/ceph-8 xfs defaults 0 0
/dev/sdf /var/lib/ceph/osd/ceph-9 xfs defaults 0 0
/dev/sdg /var/lib/ceph/osd/ceph-10 xfs defaults 0 0
/dev/sdh /var/lib/ceph/osd/ceph-11 xfs defaults 0 0
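After populating fstab, mount everything and confirm the mounts are in place:
mount -a
mount | grep ceph-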
If we're adding these after the Ceph cluster has been created, we need to ask the monitor to allocate IDs for the new OSDs:
[root@dfs-m1 ceph]# ceph osd create
6
[root@dfs-m1 ceph]# ceph osd create
7
[root@dfs-m1 ceph]# ceph osd create
8
[root@dfs-m1 ceph]# ceph osd create
9
[root@dfs-m1 ceph]# ceph osd create
10
[root@dfs-m1 ceph]# ceph osd create
11
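Equivalently, since each invocation of 'ceph osd create' just allocates (and prints) the next free OSD ID, you can run it in a loop, once per new disk:
for i in {1..6}; do ceph osd create; done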
We'll also need to create the OSD filesystem on the node.
[root@dfs-s002 ceph]# ceph-osd -i 6 --mkfs
[root@dfs-s002 ceph]# ceph-osd -i 7 --mkfs
[root@dfs-s002 ceph]# ceph-osd -i 8 --mkfs
[root@dfs-s002 ceph]# ceph-osd -i 9 --mkfs
[root@dfs-s002 ceph]# ceph-osd -i 10 --mkfs
[root@dfs-s002 ceph]# ceph-osd -i 11 --mkfs
or, simply
[root@dfs-s002 ceph]# for i in {6..11}; do ceph-osd -i $i --mkfs; done
Finally, start the service via
[root@dfs-s002 ~]# /etc/init.d/ceph start
See more:
http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#adding-an-osd-manual
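Once the daemons are running, a quick sanity check from the monitor node should show all OSDs as up and in:
# ceph osd stat
# ceph osd tree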
Creating a dedicated journal device
You'll need to create a journal device, mount it at a directory such as /ceph/journal, and then have your OSDs put their journals on that disk. Here's what needs to be in your ceph.conf:
[osd]
osd journal size = 1000
filestore xattr use omap = true
osd journal = /ceph/journal/osd.$id.journal ; only needed if you want to put the journal on a dedicated disk, like an SSD
journal dio = true
journal aio = true
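As a sketch of the journal-device setup itself, assuming a hypothetical SSD at /dev/sdb (adjust for your hardware):
mkfs.xfs -f /dev/sdb
mkdir -p /ceph/journal
echo "/dev/sdb /ceph/journal xfs defaults 0 0" >> /etc/fstab
mount /ceph/journal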
Configuring the Ceph Monitors
Each node also runs a monitor. The appropriate directory will need to be created on each node. The order is determined by the ceph.conf file.
[root@dfs-s001 ~]# mkdir -p /var/lib/ceph/mon/ceph-b
[root@dfs-s002 ~]# mkdir -p /var/lib/ceph/mon/ceph-c
[root@dfs-s003 ~]# mkdir -p /var/lib/ceph/mon/ceph-d
[root@dfs-s004 ~]# mkdir -p /var/lib/ceph/mon/ceph-e
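Each of these monitors also needs a matching section in ceph.conf, following the same format as mon.a (the address below is a placeholder):
[mon.b]
host = dfs-s001
mon addr = <address of dfs-s001>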
Configuring the Ceph filesystem (kernel module)
First you'll need to install the 3.7.8-UL2 kernel and load the Ceph module:
modprobe ceph
Then, mount the ceph filesystem:
[root@dfs-m1 ~]# mount -t ceph 10.1.6.254:6789:/ /mnt
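To make the mount persistent, a hypothetical fstab entry (no secret is needed here since cephx is disabled in this guide):
10.1.6.254:6789:/ /mnt ceph noatime 0 0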
Rough Ceph benchmarks
| date | write (MB/s) | read (MB/s) | kernel | underlying filesystem | topology | notes |
to be updated
Troubleshooting Ceph
http://eu.ceph.com/docs/v0.47.2/ops/manage/failures/osd/
https://wikitech.wikimedia.org/wiki/Ceph
Identifying slow nodes
for id in {0..59}; do ceph osd tell $id bench; done
Then grep for "bench" in /var/log/ceph/ceph.log on the monitor node.
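For example:
grep bench /var/log/ceph/ceph.log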
If the reported speeds are around 6 MB/s, the disks' write caches are likely disabled; enable them (WCE = write cache enable):
sdparm --set=WCE /dev/sd{c,d,e,f,g,h} --sav
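You can check the current write-cache state before and after with:
sdparm --get=WCE /dev/sd{c,d,e,f,g,h}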
PG errors
If you see errors like this:
[root@dfs-m1 ~]# ceph -s
health HEALTH_WARN 3 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; recovery recovering 1 o/s, 2402B/s
monmap e1: 1 mons at {a=10.1.6.254:6789/0}, election epoch 2, quorum 0 a
osdmap e84: 24 osds: 24 up, 24 in
pgmap v201: 192 pgs: 189 active+clean, 3 peering; 9518 bytes data, 21718 MB used, 14639 GB / 14660 GB avail; recovering 1 o/s, 2402B/s
mdsmap e4: 1/1/1 up {0=a=up:active}
This is probably because you added new OSDs and Ceph is rebalancing.
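The usual fix is simply to wait; you can watch progress until all PGs report active+clean:
# ceph -w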
Likewise, if you see this:
[root@dfs-m1 ~]# ceph -s
health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean
You've likely just set up Ceph and need to add some OSDs.
Possibly bad disks:
[root@dfs-m1 distribution]# cat /var/log/ceph/ceph.log | grep "osd.[0-9][0-9]" | awk '{print $3}' | ./distribution
Val |Ct (Pct) Histogram
osd.34|203552 (69.69%) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
osd.23|14208 (4.86%) ++++
osd.12|10675 (3.65%) +++
osd.20|8803 (3.01%) +++
osd.14|8120 (2.78%) +++
osd.57|5276 (1.81%) ++
osd.35|4716 (1.61%) ++
osd.58|3823 (1.31%) ++
osd.54|3619 (1.24%) +
osd.33|3083 (1.06%) +
osd.32|2573 (0.88%) +
osd.11|2215 (0.76%) +
mon.0 |2169 (0.74%) +
osd.13|1898 (0.65%) +
osd.30|1770 (0.61%) +
Too few PGs
[root@dfs-s010 ceph-59]# ceph -w
cluster 996b1c9b-ea75-4d39-a30a-dc212de4b8ae
health HEALTH_WARN too few pgs per osd (16 < min 20)
monmap e1: 1 mons at {a=10.1.6.253:6789/0}, election epoch 2, quorum 0 a
osdmap e96: 60 osds: 12 up, 12 in
pgmap v145: 192 pgs, 3 pools, 0 bytes data, 0 objects
- MB used, 8365 GB / 8377 GB avail
192 active+clean
You'll need to increase both the pg_num and pgp_num for each pool. To list pools:
# rados lspools
data
metadata
rbd
To increase the number of placement groups:
# ceph osd pool set rbd pg_num 1200
# ceph osd pool set rbd pgp_num 1200
# ceph osd pool set data pg_num 1200
# ceph osd pool set data pgp_num 1200
# ceph osd pool set metadata pg_num 1200
# ceph osd pool set metadata pgp_num 1200
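You can verify the new values took effect with:
# ceph osd pool get rbd pg_num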
Mon unresponsive
I ran into a problem where the Ceph mon became unresponsive after trying to add a second mon. Here's a bit of surgery that you can use to fix the problem.
First, set the cluster to "no out" mode, so that it won't start rebalancing operations when we bring OSDs down.
# ceph osd set noout
Next, bring down the Ceph services
# /etc/init.d/ceph stop
Once Ceph is down and the monitor service is stopped, extract the last known good monitor map. This assumes the monitor's ID is "a":
ceph-mon -i a --extract-monmap /tmp/monmap
Now you'll need to use the monmaptool to print the contents of the map:
# monmaptool --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 2
fsid 35f60a8c-1a56-47f4-abb1-69e80b57dd5d
last_changed 2013-12-27 13:42:00.400911
created 2013-10-03 12:43:16.858425
0: 10.1.5.67:6789/0 mon.a
1: 192.170.227.122:6789/0 mon.b
Back up the monmap, and then remove the faulty monitor (in this case, mon.b) with monmaptool:
# cp /tmp/monmap /tmp/monmap-backup
# monmaptool --rm b --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 2
fsid 35f60a8c-1a56-47f4-abb1-69e80b57dd5d
last_changed 2013-12-27 13:42:00.400911
created 2013-10-03 12:43:16.858425
0: 10.1.5.67:6789/0 mon.a
The faulty monitor has been removed. Go ahead and re-inject the monitor map into the mon daemon and restart Ceph:
# ceph-mon -i a --inject-monmap /tmp/monmap
# /etc/init.d/ceph start
Finally, check the cluster health:
# ceph -s
cluster 35f60a8c-1a56-47f4-abb1-69e80b57dd5d
health HEALTH_OK
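Once the cluster reports healthy, remember to clear the noout flag set at the start of this procedure:
# ceph osd unset noout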
Example config
[global]
auth cluster required = none
auth service required = none
auth client required = none
[osd]
osd journal size = 1000
filestore xattr use omap = true
[mon.a]
host = dfs-m1
mon addr = 10.1.6.254
[osd.0]
host = dfs-s001
[osd.1]
host = dfs-s001
[osd.2]
host = dfs-s001
[osd.3]
host = dfs-s001
[osd.4]
host = dfs-s001
[osd.5]
host = dfs-s001
[osd.6]
host = dfs-s002
[osd.7]
host = dfs-s002
[osd.8]
host = dfs-s002
[osd.9]
host = dfs-s002
[osd.10]
host = dfs-s002
[osd.11]
host = dfs-s002
[mds.a]
host = dfs-m1
Wiping out Ceph and starting over
These directories will need to be wiped out if you plan to restart fresh with Ceph:
- /var/lib/ceph/osd/ceph-{0..N} <-- OSD filesystem mounts on workers
- /var/lib/ceph/mon/ceph-{a..N} <-- monitor data directories
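A rough sketch of a full wipe, assuming the layout used in this guide (this irreversibly destroys all Ceph data on the node):
/etc/init.d/ceph stop
for d in /var/lib/ceph/osd/ceph-*; do rm -rf ${d:?}/*; done
rm -rf /var/lib/ceph/mon/ceph-*/*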
Benchmarking Ceph
Here's a small script I wrote to benchmark Ceph's writes for a variety of thread counts:
#!/bin/bash
# Small benchmarking script for Ceph.
# Optional first argument: duration of each write test in seconds (default 300).

# We'll want to drop caches any time we start a new benchmark.
function drop_caches {
    echo 3 > /proc/sys/vm/drop_caches
}

function benchy_write {
    echo "======================================================================"
    echo "WRITES for ${1-300} sec"
    echo "threads,sec,Cur_ops,started,finished,avg_MB/s,cur_MB/s,last_lat,avg_lat"
    echo "======================================================================"
    # Bench the 'test' pool at increasing thread counts, converting the
    # whitespace-separated output of rados bench into CSV. The grep drops
    # timestamp and header lines.
    for i in 1 2 4 8 16 32; do
        drop_caches
        rados bench -p test ${1-300} write -t $i --no-cleanup | tail -n +5 | head -n ${1-300} | sed -e 's/^[ \t]*//' | tr -s ' ' | sed -e 's/ /,/g' | grep -v "^20[0-9][0-9]\|^sec" | while read line; do echo "$i,$line"; done
    done
    echo "======================================================================"
}

benchy_write "$@"
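The script benches a pool named 'test', which must already exist (e.g. created via 'ceph osd pool create test 128'). Hypothetical usage, saving the CSV output of 60-second runs to a file:
./ceph-bench.sh 60 > write-results.csv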
Other scripts
Recreate the disks and re-add them to Ceph after a new install:
for i in $(mount | grep ceph- | cut -d'-' -f2 | awk '{print $1}'); do ceph osd create && ceph-osd -i $i --mkfs; done
Don't recreate the disks, but reinstall Ceph OSD:
/etc/init.d/ceph stop; for i in $(ls /var/lib/ceph/osd/) ; do cd /var/lib/ceph/osd/$i && rm -rf *; done; for i in $(ls /var/lib/ceph/osd | grep ceph | cut -d'-' -f2); do ceph-osd -i $i --mkfs; done
--
LincolnBryant - 22 Jul 2013