Ceph Install

This guide should help you set up a fault-tolerant Ceph cluster for 'volatile' storage. When completed, the cluster will have:
  • One metadata node (recommended for production)
  • 2x replication (i.e., half the raw capacity; effectively mirroring, but distributed across as many nodes as you have available; see the example after this list)
  • Horizontal expansion by adding more nodes/disks
  • Full POSIX compliance!
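
For reference, replication is configured per pool. A minimal sketch of forcing 2x replication on the stock pools (which are created later in this guide):
# set 2 replicas on each of the default pools
for pool in data metadata rbd; do ceph osd pool set ${pool} size 2; done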

Considerations

There are a number of parameters we ought to consider.
  • Kernel. The stock kernel provided for Scientific Linux 6.4 is 2.6.32. This is rather old for newer filesystems, especially BTRFS.
    • I have a custom kernel (3.11.0-UL2) available that supports the latest CephFS patches.
  • Filesystem. Somewhat dependent on kernel and also Ceph's development state. From most stable to least stable (as far as Ceph is concerned): XFS/EXT4, BTRFS, ZFS
  • Replication/RAID levels. Using RAID underneath Ceph is not recommended, as RADOS already provides much of the redundancy that RAID is normally used for.

Configuring the disks

The Ceph team recommends using XFS for production servers, although upstream Ceph development is centered around BTRFS.

This handy script will find all of the non-root disks in the machine (up to /dev/sdz), label them GPT, and format them with XFS:
DISKS=$(grep "[hsv]d[b-z]$" /proc/partitions | awk '{print $4}' | sort | xargs)
for disk in $DISKS; do
    parted -s /dev/${disk} mklabel gpt
    mkfs.xfs -f /dev/${disk}
done

Installing the Ceph repo

We are currently installing Ceph from Puppet, using the Ceph repos:
rpm -Uvh http://ceph.com/rpm-cuttlefish/el6/x86_64/ceph-release-1-0.el6.noarch.rpm
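
With the repo in place, the Ceph packages themselves come straight from yum (we drive this through Puppet, but by hand it's simply):
# install Ceph and its dependencies from the ceph.com repo added above
yum install -y ceph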

There's also a separate, optional repo for the Ceph deployment tool, 'ceph-deploy'. I personally don't bother.

Bootstrapping the Ceph Monitor

In lieu of using the ceph-deploy tool, we'll bootstrap the Ceph monitor from scratch. Here's what a minimal /etc/ceph/ceph.conf looks like:
[global]
        auth cluster required = none
        auth service required = none
        auth client required = none

[mon.a]
        host = dfs-m1
        mon addr = 10.1.6.254
For production, you'll probably want to have 3 monitors and authentication turned on.

You'll first need to create a keyring for the Ceph monitors:
# ceph-authtool --create-keyring /etc/ceph/keyring --gen-key -n mon.

Once that's created, you'll need to create the initial map for Ceph's monitors.
monmaptool --create --add a 10.1.5.67 --clobber /etc/ceph/monmap

Finally, instantiate the new monitor with the initial keyring and initial map. This monitor will be called "mon.a":
ceph-mon --mkfs -i a --monmap /etc/ceph/monmap --keyring /etc/ceph/keyring
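
Once the monitor has been created, start it via the init script and check that the cluster responds:
# start mon.a and confirm the monitor is answering
/etc/init.d/ceph start mon.a
ceph -s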

Adding a metadata server (optional)

The Ceph metadata server is required for CephFS deployments, i.e., if you want a distributed filesystem with POSIX semantics.

When Ceph is deployed, it should have created three pools: 'data', 'metadata', and 'rbd'. Since the ceph mds tool only takes pool ID numbers rather than pool names, we'll need to grab the ID numbers for 'data' and 'metadata' like so:
# ceph osd lspools
0 metadata,1 data,2 rbd,
You can then take these values and input them into ceph mds:
# ceph mds newfs 0 1 --yes-i-really-mean-it
Finally, add this entry to ceph.conf:
[mds.0]
        host = faxbox2
        addr = 10.1.5.67

You can then start the metadata daemon:
/etc/init.d/ceph start mds.0
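
Once it's running, the metadata server should show as active in the cluster status:
# verify the MDS is up and active
ceph mds stat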

Adding additional Monitors

Basically, follow this guide on the new monitor servers (m2, m3): http://ceph.com/docs/master/rados/operations/add-or-rm-mons/
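
In short, the procedure looks roughly like this (a sketch only; the hostname dfs-m2, monitor name 'b', and address 10.1.6.253 are placeholders, and this assumes cephx is still disabled as above):
# on dfs-m2: copy /etc/ceph/ceph.conf and /etc/ceph/keyring from dfs-m1 first,
# then fetch the current monitor map from the running cluster
ceph mon getmap -o /tmp/monmap
# create the new monitor's data directory from that map
mkdir -p /var/lib/ceph/mon/ceph-b
ceph-mon --mkfs -i b --monmap /tmp/monmap --keyring /etc/ceph/keyring
# register the new monitor, add a [mon.b] section to ceph.conf, then start it
ceph mon add b 10.1.6.253
/etc/init.d/ceph start mon.b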

Configuring the Object Storage Daemons (OSDs)

The basic element of Ceph storage is the Object Storage Daemon (OSD). Generally, one OSD runs per disk available on the node. Alternatively, RAID can be used to present a single block device to an OSD, as in the case where disk-level redundancy is desired. We'll need to create a directory for each disk we plan to mount on the storage node.
[root@dfs-s001 ~]# mkdir -p /var/lib/ceph/osd/ceph-{0,1,2,3,4,5}
And add the appropriate entries to /etc/fstab. Keep in mind that these need to increment for each successive server.
/dev/sdc /var/lib/ceph/osd/ceph-0 xfs defaults 0 0
/dev/sde /var/lib/ceph/osd/ceph-1 xfs defaults 0 0
/dev/sdd /var/lib/ceph/osd/ceph-2 xfs defaults 0 0
/dev/sdf /var/lib/ceph/osd/ceph-3 xfs defaults 0 0
/dev/sdg /var/lib/ceph/osd/ceph-4 xfs defaults 0 0
/dev/sdh /var/lib/ceph/osd/ceph-5 xfs defaults 0 0
For our second server, it would be:
[root@dfs-s002 osd]# mkdir -p /var/lib/ceph/osd/ceph-{6,7,8,9,10,11}
And the respective fstab:
/dev/sdc /var/lib/ceph/osd/ceph-6 xfs defaults 0 0
/dev/sde /var/lib/ceph/osd/ceph-7 xfs defaults 0 0
/dev/sdd /var/lib/ceph/osd/ceph-8 xfs defaults 0 0
/dev/sdf /var/lib/ceph/osd/ceph-9 xfs defaults 0 0
/dev/sdg /var/lib/ceph/osd/ceph-10 xfs defaults 0 0
/dev/sdh /var/lib/ceph/osd/ceph-11 xfs defaults 0 0
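
After the fstab entries are in place, mount the new filesystems on each storage node:
# mount all of the OSD filesystems defined in /etc/fstab
mount -a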

If we're adding these after the Ceph cluster has been created, we need to register the new OSDs with the cluster; each 'ceph osd create' call allocates the next free OSD ID:
[root@dfs-m1 ceph]# ceph osd create 
6
[root@dfs-m1 ceph]# ceph osd create
7
[root@dfs-m1 ceph]# ceph osd create
8
[root@dfs-m1 ceph]# ceph osd create
9
[root@dfs-m1 ceph]# ceph osd create
10
[root@dfs-m1 ceph]# ceph osd create
11

We'll also need to create the OSD filesystem on the node.
[root@dfs-s002 ceph]# ceph-osd -i 6 --mkfs
[root@dfs-s002 ceph]# ceph-osd -i 7 --mkfs
[root@dfs-s002 ceph]# ceph-osd -i 8 --mkfs
[root@dfs-s002 ceph]# ceph-osd -i 9 --mkfs
[root@dfs-s002 ceph]# ceph-osd -i 10 --mkfs
[root@dfs-s002 ceph]# ceph-osd -i 11 --mkfs

or, simply
[root@dfs-s002 ceph]# for i in {6..11}; do ceph-osd -i $i --mkfs; done

Finally, start the service via
[root@dfs-001 ~]# /etc/init.d/ceph start
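
At this point the new OSDs should report as up and in; you can verify with:
# confirm the OSDs have joined the cluster
ceph osd tree
ceph -s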

See more: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#adding-an-osd-manual

Creating a dedicated journal device

You'll need to create a journal device, mount it at some directory such as /ceph/journal, and then have your OSDs put their journals on that disk. Here's what needs to be in your ceph.conf:
[osd]
        osd journal size = 1000
        filestore xattr use omap = true
        osd journal = /ceph/journal/osd.$id.journal  ; only if you want to put the journal on a dedicated disk, such as an SSD
        journal dio = true
        journal aio = true
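
Setting up the journal disk itself is an ordinary format-and-mount; a sketch, assuming the SSD appears as /dev/sdb (adjust for your hardware):
# format the dedicated journal disk and mount it where the OSDs expect it
mkfs.xfs -f /dev/sdb
mkdir -p /ceph/journal
echo "/dev/sdb /ceph/journal xfs defaults 0 0" >> /etc/fstab
mount -a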

Configuring the Ceph Monitors

Each node also runs a monitor. The appropriate directory will need to be created on each node. The order is determined by the ceph.conf file.
[root@dfs-s001 ~]# mkdir -p /var/lib/ceph/mon/ceph-b
[root@dfs-s002 ~]# mkdir -p /var/lib/ceph/mon/ceph-c
[root@dfs-s003 ~]# mkdir -p /var/lib/ceph/mon/ceph-d
[root@dfs-s004 ~]# mkdir -p /var/lib/ceph/mon/ceph-e
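
Each directory corresponds to a [mon.X] section in ceph.conf on every node, for example (the address below is a placeholder; use each node's real IP):
[mon.b]
        host = dfs-s001
        mon addr = 10.1.6.1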

Configuring the Ceph filesystem (kernel module)

First, you'll need to install the 3.7.8-UL2 kernel and load the Ceph module:
modprobe ceph
Then, mount the ceph filesystem:
[root@dfs-m1 ~]# mount -t ceph 10.1.6.254:6789:/ /mnt
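
To mount it automatically at boot, an fstab entry along these lines should work (a sketch; since cephx is disabled here no credentials are needed, otherwise add name= and secretfile= options):
10.1.6.254:6789:/  /mnt  ceph  noatime  0  0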

Rough Ceph benchmarks

| date | write (MB/s) | read (MB/s) | kernel | underlying filesystem | Topology | Notes |
| (to be updated) | | | | | | |

Troubleshooting Ceph

  • http://eu.ceph.com/docs/v0.47.2/ops/manage/failures/osd/
  • https://wikitech.wikimedia.org/wiki/Ceph

Identifying slow nodes

for id in {0..59}; do ceph osd tell $id bench; done
then grep "bench" in /var/log/ceph/ceph.log

If any OSDs come back at around ~6 MB/s, their disks' write cache is likely disabled; enable it with sdparm:
sdparm --set=WCE /dev/sd{c,d,e,f,g,h} --sav

PG errors

If you see errors like this:
[root@dfs-m1 ~]# ceph -s
   health HEALTH_WARN 3 pgs peering; 3 pgs stuck inactive; 3 pgs stuck unclean; recovery  recovering 1 o/s, 2402B/s
   monmap e1: 1 mons at {a=10.1.6.254:6789/0}, election epoch 2, quorum 0 a
   osdmap e84: 24 osds: 24 up, 24 in
    pgmap v201: 192 pgs: 189 active+clean, 3 peering; 9518 bytes data, 21718 MB used, 14639 GB / 14660 GB avail;  recovering 1 o/s, 2402B/s
   mdsmap e4: 1/1/1 up {0=a=up:active}

This is probably because you added new OSDs and Ceph is rebalancing.

Likewise, if you see this:
[root@dfs-m1 ~]# ceph -s
   health HEALTH_WARN  192 pgs stuck inactive; 192 pgs stuck unclean

You've likely just set up Ceph and need to add some OSDs.

Possibly bad disks: an OSD that accounts for a wildly disproportionate share of the cluster log messages (here osd.34 at ~70%) is worth checking:
[root@dfs-m1 distribution]# cat /var/log/ceph/ceph.log | grep "osd.[0-9][0-9]" | awk '{print $3}' | ./distribution
Val   |Ct (Pct)        Histogram
osd.34|203552 (69.69%) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
osd.23|14208 (4.86%)   ++++
osd.12|10675 (3.65%)   +++
osd.20|8803 (3.01%)    +++
osd.14|8120 (2.78%)    +++
osd.57|5276 (1.81%)    ++
osd.35|4716 (1.61%)    ++
osd.58|3823 (1.31%)    ++
osd.54|3619 (1.24%)    +
osd.33|3083 (1.06%)    +
osd.32|2573 (0.88%)    +
osd.11|2215 (0.76%)    +
mon.0 |2169 (0.74%)    +
osd.13|1898 (0.65%)    +
osd.30|1770 (0.61%)    +

Too few PGs

[root@dfs-s010 ceph-59]# ceph -w
    cluster 996b1c9b-ea75-4d39-a30a-dc212de4b8ae
     health HEALTH_WARN too few pgs per osd (16 < min 20)
     monmap e1: 1 mons at {a=10.1.6.253:6789/0}, election epoch 2, quorum 0 a
     osdmap e96: 60 osds: 12 up, 12 in
      pgmap v145: 192 pgs, 3 pools, 0 bytes data, 0 objects
            1. MB used, 8365 GB / 8377 GB avail
            192 active+clean

You'll need to increase both the pg_num and pgp_num for each pool. To list pools:
# rados lspools
data
metadata
rbd

To increase the placement group counts:
# ceph osd pool set rbd pg_num 1200
# ceph osd pool set rbd pgp_num 1200
# ceph osd pool set data pg_num 1200
# ceph osd pool set data pgp_num 1200
# ceph osd pool set metadata pg_num 1200
# ceph osd pool set metadata pgp_num 1200
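
As a rough rule of thumb from the Ceph documentation, aim for about (number of OSDs × 100) / replica count placement groups per pool, rounded to a nearby power of two; with 60 OSDs and 2x replication that works out to roughly 3000, so the 1200 above is on the conservative side. Note that pg_num can be increased but not decreased, so don't wildly overshoot.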

Mon unresponsive

I ran into a problem where the Ceph mon became unresponsive after trying to add a second mon. Here's a bit of surgery that you can use to fix the problem.

First, set the cluster to "noout" mode, so that it won't start rebalancing operations when we bring OSDs down.
# ceph osd set noout

Next, bring down the Ceph services
# /etc/init.d/ceph stop

Once Ceph is down and the monitor service is stopped, extract the last known good monitor map. This assumes the monitor's ID is "a":
ceph-mon -i a --extract-monmap /tmp/monmap

Now you'll need to use the monmaptool to print the contents of the map:
# monmaptool --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 2
fsid 35f60a8c-1a56-47f4-abb1-69e80b57dd5d
last_changed 2013-12-27 13:42:00.400911
created 2013-10-03 12:43:16.858425
0: 10.1.5.67:6789/0 mon.a
1: 192.170.227.122:6789/0 mon.b

Back up the monmap, and then remove the faulty monitor (in this case, mon.b) with monmaptool:
# cp /tmp/monmap /tmp/monmap-backup
# monmaptool --print /tmp/monmap --rm b
monmaptool: monmap file /tmp/monmap
epoch 2
fsid 35f60a8c-1a56-47f4-abb1-69e80b57dd5d
last_changed 2013-12-27 13:42:00.400911
created 2013-10-03 12:43:16.858425
0: 10.1.5.67:6789/0 mon.a

The faulty monitor has been removed. Go ahead and re-inject the monitor map into the mon daemon and restart Ceph:
# ceph-mon -i a --inject-monmap /tmp/monmap
# /etc/init.d/ceph start

Finally, check the cluster health:
# ceph -s
    cluster 35f60a8c-1a56-47f4-abb1-69e80b57dd5d
     health HEALTH_OK

Example config

[global]
        auth cluster required = none
        auth service required = none
        auth client required = none

[osd]
        osd journal size = 1000
        filestore xattr use omap = true

[mon.a]
        host = dfs-m1
        mon addr = 10.1.6.254

[osd.0]
        host = dfs-s001

[osd.1]
        host = dfs-s001

[osd.2]
        host = dfs-s001

[osd.3]
        host = dfs-s001

[osd.4]
        host = dfs-s001

[osd.5]
        host = dfs-s001

[osd.6]
        host = dfs-s002

[osd.7]
        host = dfs-s002

[osd.8]
        host = dfs-s002

[osd.9]
        host = dfs-s002

[osd.10]
        host = dfs-s002

[osd.11]
        host = dfs-s002

[mds.a]
        host = dfs-m1

Wiping out Ceph and starting over

These directories will need to be wiped out if you plan to restart fresh with Ceph:
  • /var/lib/ceph/osd/ceph-{0..N} <-- OSD filesystem mounts on workers
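
Something along these lines works (a sketch; stop Ceph first and double-check the paths before running it on each storage node):
# stop ceph, then clear out every OSD data directory on this node
/etc/init.d/ceph stop
for dir in /var/lib/ceph/osd/ceph-*; do rm -rf ${dir:?}/*; done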

Benchmarking Ceph

Here's a small script I wrote to benchmark Ceph's writes for a variety of thread counts:

#!/bin/bash
# Small benchmarking script for Ceph

# We'll want to drop caches any time we start a new benchmark.
function drop_caches {
  echo 3 > /proc/sys/vm/drop_caches
}

# Run 'rados bench' writes against the 'test' pool for a range of thread
# counts, emitting CSV with the thread count prepended to each sample line.
function benchy_write {
  echo "======================================================================"
  echo "WRITES for ${1:-300} sec"
  echo "threads,sec,Cur_ops,started,finished,avg_MB/s,cur_MB/s,last_lat,avg_lat"
  echo "======================================================================"
  for i in 1 2 4 8 16 32; do
    drop_caches
    rados bench -p test ${1:-300} write -t $i --no-cleanup \
      | tail -n +5 | head -n ${1:-300} \
      | sed -e 's/^[ \t]*//' | tr -s ' ' | sed -e 's/ /,/g' \
      | grep -v "^2013\|^sec" \
      | while read line; do echo "$i,$line"; done
  done
  echo "======================================================================"
}

benchy_write
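
The script assumes a pool named 'test' exists; if it doesn't, create one first (128 placement groups here is just an example value):
# create a throwaway pool for benchmarking
ceph osd pool create test 128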

Other scripts

Recreate the disks and re-add them to Ceph after a new install:
for i in $(mount | grep ceph- | cut -d'-' -f2 | awk '{print $1}'); do ceph osd create && ceph-osd -i $i --mkfs; done

Don't recreate the disks, but reinstall the Ceph OSDs:
/etc/init.d/ceph stop
for i in $(ls /var/lib/ceph/osd/); do cd /var/lib/ceph/osd/$i && rm -rf *; done
for i in $(ls /var/lib/ceph/osd | grep ceph | cut -d'-' -f2); do ceph-osd -i $i --mkfs; done

-- LincolnBryant - 22 Jul 2013