How we use Ceph at SWITCHengines
Virtual machine storage in the OpenStack public cloud SWITCHengines is provided by Ceph. We run a Ceph cluster in each OpenStack region. The compute nodes have no local storage; the virtual machines access their disks directly over the network, because libvirt can act as a Ceph client.
Using Ceph as the default storage for glance images, nova ephemeral disks, and cinder volumes is a very convenient choice. We can scale the storage capacity as needed, regardless of the disk capacity of the compute nodes. It is also easier to live migrate nova instances between compute nodes, because the virtual machine disks are not local to a specific compute node and do not need to be migrated.
The performance problem
The load on our Ceph cluster increases constantly, because more virtual machines are running every day. In October 2015 we noticed that deleting cinder volumes had become a very slow operation: the bigger the volume, the longer the wait. Moreover, users orchestrating heat stacks faced real performance problems when deleting several disks at once.
To identify where the bottleneck originated, we measured how long it took to create and delete rbd volumes directly with the rbd command line client, bypassing the cinder code completely.
The commands to do this test are simple:
time rbd -p volumes create testname --size 1024 --image-format 2
rbd -p volumes info testname
time rbd -p volumes rm testname
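To see how the timings grow with the image size, the same commands can be wrapped in a small script. The following is only a minimal sketch: it assumes the rbd client is installed and that a pool named volumes exists, and the helper names and the list of sizes are hypothetical.

```python
import subprocess
import time


def timed(cmd):
    """Run a command and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def benchmark(pool="volumes", name="testname", sizes_mb=(1024, 10240, 102400)):
    """Create and delete one image per size; return (size_mb, create_s, delete_s)."""
    results = []
    for size in sizes_mb:
        create_s = timed(["rbd", "-p", pool, "create", name,
                          "--size", str(size), "--image-format", "2"])
        delete_s = timed(["rbd", "-p", pool, "rm", name])
        results.append((size, create_s, delete_s))
    return results
```

Calling benchmark() on a host with access to the cluster returns one tuple per size, which makes it easy to compare the delete times before and after a configuration change.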
We quickly figured out that Ceph itself was slow to delete the rbd volumes. The problem was well known and had already been addressed in the Ceph Hammer release, which introduced a new feature: the object map.
When the object map feature is enabled on an image, limiting the diff to the object extents will dramatically improve performance since the differences can be computed by examining the in-memory object map instead of querying RADOS for each object within the image.
In our practical experience the time to delete an image decreased from several minutes to a few seconds.
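A rough calculation shows why the object map matters so much for deletion. Without it, the client cannot know which objects of an image actually exist, so it has to try to remove every possible object; with it, only the objects recorded as existing need to be touched. The sketch below is a simplified model, not Ceph code: the function names are hypothetical, and the default of 4 MiB objects (order 22) is an assumption.

```python
def object_count(image_size_bytes, order=22):
    """Number of RADOS objects an image of this size can span."""
    object_size = 1 << order  # order 22 means 4 MiB objects
    return -(-image_size_bytes // object_size)  # ceiling division


def removals_without_map(image_size_bytes, order=22):
    """Every possible object has to be tried, whether it exists or not."""
    return object_count(image_size_bytes, order)


def removals_with_map(existing_objects):
    """Only objects that the map records as existing have to be removed."""
    return len(existing_objects)


one_tib = 1 << 40
print(removals_without_map(one_tib))    # 262144
print(removals_with_map({0, 1, 2, 3}))  # 4
```

In this model a mostly empty 1 TiB volume costs over 260,000 removal attempts without the map, and only a handful with it.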
How to fix your OpenStack Juno installation
We changed ceph.conf to enable the object map feature, as described very well in the blog post by Sébastien Han.
It worked right away once ceph.conf contained the following two lines:
rbd default format = 2
rbd default features = 13
We could immediately create new images with the object map enabled, as the following output shows:
rbd image 'volume-<uuid>':
    size 20480 MB in 2560 objects
    order 23 (8192 kB objects)
    block_name_prefix: rbd_data.<prefix>
    format: 2
    features: layering, exclusive, object map
    flags:
    parent: images/<uuid>@snap
    overlap: 1549 MB
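The value 13 in rbd default features is a bitmask: it is the sum of layering (1), exclusive lock (4) and object map (8), which matches the features line in the output. A small sketch to decode such a value follows; the helper name is hypothetical and the table lists only the feature bits relevant to Hammer.

```python
# Feature bits as defined by librbd (subset relevant to Hammer).
RBD_FEATURES = {
    1: "layering",
    2: "striping",
    4: "exclusive-lock",
    8: "object-map",
}


def decode_features(value):
    """Return the names of the feature bits set in value, lowest bit first."""
    return [name for bit, name in sorted(RBD_FEATURES.items()) if value & bit]


print(decode_features(13))  # ['layering', 'exclusive-lock', 'object-map']
```

The object map depends on the exclusive lock, which is why the two bits are enabled together.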
We were happy that the fix was so easy. However, we soon realized that everything worked with the rbd command line, but all the OpenStack components were ignoring the new options in the ceph.conf file.
We started our investigation with Cinder. We found that Cinder does not call the rbd command line client at all; it relies on the rbd Python library. The Cinder implementation in Juno did not know about these extra features, so it simply ignored our changes in ceph.conf. Support for the object map feature was introduced only in Kilo, in commit 6211d8.
To quickly fix the performance problem before upgrading to Kilo, we decided to backport this patch to Juno. We already carry other small local patches in our infrastructure, so adding yet another patch and building a new .deb package was part of our standard procedure. After backporting the patch, Cinder started to create volumes that correctly honor the options in ceph.conf.
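To illustrate the idea, here is a minimal sketch of how a client of the rbd Python library could pick up the configured features instead of hardcoding them. It is not the actual Cinder patch: the function names, the fallback to layering only, and the default conffile path are assumptions made for the example.

```python
RBD_FEATURE_LAYERING = 1  # same value as rbd.RBD_FEATURE_LAYERING


def resolve_features(conf_get):
    """Return the feature bitmask to use for new images.

    conf_get is any callable that looks up a Ceph option by name, for
    example the conf_get method of a connected rados.Rados object.
    """
    try:
        features = int(conf_get("rbd_default_features"))
    except (TypeError, ValueError):
        features = 0
    # Fall back to layering only when the option is unset, zero,
    # or not a plain integer.
    return features or RBD_FEATURE_LAYERING


def create_volume(pool, name, size_bytes, conffile="/etc/ceph/ceph.conf"):
    """Create a format 2 image with the features configured in ceph.conf."""
    import rados  # Python bindings shipped with Ceph
    import rbd

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            rbd.RBD().create(ioctx, name, size_bytes, old_format=False,
                             features=resolve_features(cluster.conf_get))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

Keeping the lookup in its own function means it can be checked without a running cluster, for example with resolve_features(lambda key: "13").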
Patching Cinder fixed the problem only for Cinder volumes. Virtual machines started from ephemeral disks run on Ceph rbd images created by Nova. Likewise, the glance images uploaded by users are stored in Ceph rbd volumes by Glance, which relies on the glance_store library.
In the end we had to patch three OpenStack projects to completely backport to Juno the ability to use the Ceph object map feature. Here we publish the links to the git branches and packages for nova, glance_store and cinder:
- nova: https://github.com/zioproto/nova/tree/rbd_default_features
- glance_store: https://github.com/zioproto/glance_store/tree/rbd_features_0.1.8
- cinder: https://github.com/zioproto/cinder/tree/2014.2.4.backport-ceph-object-map
- Deb packages: http://ubuntu.mirror.cloud.switch.ch/engines/packages/
Upgrading every six months to keep the production infrastructure on the current OpenStack release is challenging. Upgrading without downtime needs a lot of testing, and it is easy to fall behind schedule. For this reason most OpenStack installations today run Juno or Kilo.
We are releasing these patches for everyone still running Juno, because the performance benefit is stunning. However, we strongly advise planning an upgrade to Kilo as soon as possible.