SWITCH Cloud Blog


Backporting the Ceph RBD object map feature to OpenStack Juno

How we use Ceph at SWITCHengines

Virtual machine storage in the OpenStack public cloud SWITCHengines is provided by Ceph. We run a Ceph cluster in each OpenStack region. The compute nodes do not have any local storage resource; the virtual machines access their disks directly over the network, because libvirt can act as a Ceph client.

Using Ceph as the default storage for Glance images, Nova ephemeral disks, and Cinder volumes is a very convenient choice. We are able to scale the storage capacity as needed, regardless of the disk capacity on the compute nodes. It is also easier to live-migrate Nova instances between compute nodes, because the virtual machine disks are not local to a specific compute node and do not need to be migrated along with the instance.
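
Wiring the three services to Ceph follows the standard RBD backend configuration. Here is a rough sketch of the relevant options; the pool names (images, vms, volumes), the client users and the libvirt secret UUID reflect our setup and are only illustrative:

# cinder.conf: Cinder volumes go to the "volumes" pool
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_user = cinder
rbd_secret_uuid = <libvirt secret uuid>

# nova.conf, [libvirt] section: ephemeral disks go to the "vms" pool
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = <libvirt secret uuid>

# glance-api.conf: images go to the "images" pool
default_store = rbd
rbd_store_pool = images
rbd_store_user = glance
rbd_store_ceph_conf = /etc/ceph/ceph.conf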

The performance problem

The load on our Ceph cluster increases constantly, because more and more virtual machines are running every day. In October 2015 we noticed that deleting Cinder volumes had become a very slow operation, and the bigger the volume, the longer you had to wait. Moreover, users orchestrating Heat stacks faced real performance problems when deleting several disks at once.

To identify the origin of the bottleneck, we measured how long it took to create and delete RBD volumes directly with the rbd command-line client, bypassing the Cinder code completely.

The commands to do this test are simple:

time rbd -p volumes create testname --size 1024 --image-format 2
rbd -p volumes info testname
time rbd -p volumes rm testname

We quickly figured out that it was Ceph itself that was slow to delete the RBD volumes. The problem was well known and had already been fixed in the Ceph Hammer release, which introduced a new feature: the object map.

When the object map feature is enabled on an image, limiting the diff to the object extents will dramatically improve performance since the differences can be computed by examining the in-memory object map instead of querying RADOS for each object within the image.

http://docs.ceph.com/docs/master/man/8/rbd/

In our practical experience the time to delete an image decreased from several minutes to a few seconds.

How to fix your OpenStack Juno installation

We changed ceph.conf to enable the object map feature, as described very well in the blog post by Sébastien Han.

Enabling it was as easy as adding the following two lines to ceph.conf:

rbd default format = 2
rbd default features = 13
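
The value of rbd default features is a bitmask; 13 is simply the sum of the three feature bits we want:

13 = 1 (layering) + 4 (exclusive-lock) + 8 (object-map)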

We could then immediately create new images with the object map enabled, as you can see in the following output:

rbd image 'volume-<uuid>':
    size 20480 MB in 2560 objects
    order 23 (8192 kB objects)
    block_name_prefix: rbd_data.<prefix>
    format: 2
    features: layering, exclusive, object map
    flags:
    parent: images/<uuid>@snap
    overlap: 1549 MB
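
With the object map in place, repeating the timing test from above confirms the speedup, and the feature list can be checked at the same time; for example:

time rbd -p volumes create testname --size 1024 --image-format 2
rbd -p volumes info testname | grep features
time rbd -p volumes rm testname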

We were happy that the fix seemed so easy. However, we soon realized that while everything worked with the rbd command line, all the OpenStack components were ignoring the new options in the ceph.conf file.

We started our investigation with Cinder. We learned that Cinder does not call the rbd command-line client at all; it relies on the rbd Python library. The Juno implementation of Cinder did not know about these extra features, so it simply ignored our changes to ceph.conf. Support for the object map feature was introduced only in Kilo, in commit 6211d8.

To quickly fix the performance problem before upgrading to Kilo, we decided to backport this patch to Juno. We already carry other small local patches in our infrastructure, so adding yet another patch and building a new .deb package was part of our standard procedure. After backporting the patch, Cinder started to create volumes that correctly honor the options in ceph.conf.
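
A quick end-to-end check of the backport is to create a volume through Cinder and inspect the backing RBD image; the volume name and size below are arbitrary, and "volumes" is the pool name in our setup:

cinder create --display-name objmap-test 1
cinder list
rbd -p volumes info volume-<uuid shown by cinder list> | grep features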

Patching Cinder fixed the problem only for Cinder volumes. Virtual machines started from ephemeral disks run on Ceph RBD images created by Nova, and the Glance images uploaded by users are stored in Ceph RBD volumes by Glance, which relies on the glance_store library.
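
The same kind of check works for the other two services. In a typical setup Nova names the ephemeral root disk <instance uuid>_disk in its pool ("vms" in our case), and Glance stores each image under its image UUID ("images" in our case):

rbd -p vms info <instance uuid>_disk | grep features
rbd -p images info <glance image uuid> | grep features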

In the end we had to patch three OpenStack projects to completely backport the ability to use the Ceph object map feature to Juno. Here we publish the links to the git branches and packages for nova, glance_store, and cinder.

Conclusion

Upgrading every six months to keep the production infrastructure on the current OpenStack release is challenging. Upgrading without downtime requires a lot of testing, and it is easy to fall behind schedule. For this reason most OpenStack installations today run Juno or Kilo.

We release these patches for everyone who is running Juno, because the performance benefit is stunning. However, we strongly advise planning an upgrade to Kilo as soon as possible.



Adding 60 Terabytes to a Ceph Cluster

[Note: This post was republished from the now-defunct “petablog”]

BCC – an Experiment that “Escaped the Lab”

Starting in Fall 2012, we built a small prototype “cloud” consisting of about ten commodity servers running OpenStack and Ceph.  That project was called Building Cloud Competence (BCC), and the primary intended purpose was to acquire experience running such systems.  We worked with some “pilot” users, both external (mostly researchers) and internal (“experimental” services).  As these things go, experiments become beta tests, and people start relying on them… so this old BCC cluster now supports several visible applications such as SWITCH’s SourceForge mirror, the SWITCHdrive sync & share service, as well as SWITCHtube, our new video distribution platform.  In particular, SWITCHtube uses our “RadosGW” service (similar to Amazon’s S3) to stream HTML5 video directly from our Ceph storage system.

Our colleagues from the Interaction Enabling team would like to enhance SWITCHtube by adding more of the content that has been produced by users of the SWITCHcast lecture recording system.  This could amount to 20-30TB of new data on our Ceph cluster.  Until last week, the cluster consisted of fifty-three 3TB disks distributed across eight hosts, for a total raw storage capacity of 159TB, corresponding to 53TB of usable storage given the three-way replication that Ceph uses by default.  Around 40% of that capacity was already in use.  In order to accommodate the expected influx of data, we decided to upgrade capacity by adding new disks.

Since we bought the original disks less than two years ago, the maximum capacity for this type of disk – the low-power/low-cost/low-performance disks intended for “home NAS” applications – has increased from 3TB to 6TB.  So by adding just ten new disks, we could increase total capacity by almost 38%.  We found that for a modest investment, we could significantly extend the usable lifetime of the cluster.  Our friendly neighborhood hardware store had 14 items in stock, so we quickly ordered ten and built them into our servers the next morning.
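
For the record, the arithmetic behind those numbers:

53 disks × 3 TB = 159 TB raw, i.e. 53 TB usable with 3-way replication
10 disks × 6 TB = 60 TB raw, and 60 / 159 ≈ 38% additional capacity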

Rebalancing the Cluster

Now, the Ceph cluster adapts to changes such as disks being added or lost (e.g. by failing) by redistributing data across the cluster.  This is a great feature because it makes the infrastructure very easy to grow and very robust to failures.  It is also quite impressive to watch, because redistribution makes use of the entire capacity of the cluster.  Unfortunately it tends to have a noticeable impact on the performance experienced by other users of the storage system, in particular when writing to storage.  So in order to minimize annoyance to users, we scheduled the integration of the new disks for late Friday afternoon.
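
Besides picking a quiet time slot, Ceph offers knobs to throttle backfill and recovery so that rebalancing competes less with client I/O. A sketch of the usual settings (the values are illustrative, not necessarily what we used):

# limit concurrent backfill and recovery work per OSD
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
# lower the priority of recovery traffic relative to client I/O
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'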

[Graph: ceph-insert-6TB-48h]

We use Graphite for performance monitoring of various aspects of the BCC cluster.  Here is one of the Ceph-related graphs showing what happened when the disks were added to the cluster as new “OSDs” (Object Storage Daemons).  The graph shows, for each physical disk server in the Ceph cluster, the rate of data written to disk, summed up across all disks for a given server.  The grey, turquoise, and yellow graphs correspond to servers h4, h1s, and h0s, respectively.  These servers are the ones that got the new 6TB disks: h4 got 5, h1s got 3, and h0s got 2.

We can see that the process of reshuffling data took about 20 hours, starting shortly after 17:00 on Friday and finishing around 13:30 on Saturday.  The rate of writing to the new disks exceeded a gigabyte per second for several hours.
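
Besides Graphite, the rebalancing can also be followed from Ceph itself, e.g. by watching the percentage of degraded objects shrink:

ceph -w
ceph status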

Throughput is limited by the speed at which the local filesystem can write to the disks (in particular the new ones) over the 6 Gb/s SATA channels, and by how fast the data copies can be retrieved from the old disks.  As most of the replication is done across the network, that could also become a bottleneck.  But each of our servers has dual 10GE connections, so the network supports more throughput per server than the disks can handle.  Why does the process get slower over time? One reason may be that writing to a fresh filesystem is faster than writing to one that already contains data, but I’m not sure whether that is a sufficient explanation.

Outlook

Based upon the experiences from the BCC project, SWITCH decided to start building cloud infrastructure in earnest.  We secured extensible space in two university data center locations in Lausanne and Zurich, and started deploying new clusters in Spring 2014.  We are now in the process of fine-tuning the configuration in order to ensure reliable operation, scalability, and fast and future-proof networking.  These activities are supported by the CUS program P-2 “information scientifique” as project “SCALE”.  We hope to make OpenStack-based self-service VMs available to first external users this Fall.