SWITCH Cloud Blog


Backporting the Ceph RBD object map feature to OpenStack Juno

How we use Ceph at SWITCHengines

Storage for virtual machines in the OpenStack public cloud SWITCHengines is provided by Ceph. We run a Ceph cluster in each OpenStack region. The compute nodes do not have any local storage resources; the virtual machines access their disks directly over the network, because libvirt can act as a Ceph client.

Using Ceph as the default storage for glance images, nova ephemeral disks, and cinder volumes is a very convenient choice. We can scale the storage capacity as needed, regardless of the disk capacity on the compute nodes. It is also easier to live-migrate nova instances between compute nodes, because the virtual machine disks are not local to a specific compute node and do not need to be migrated.

The performance problem

The load on our Ceph cluster increases constantly, because more virtual machines are running every day. In October 2015 we noticed that deleting cinder volumes had become a very slow operation: the bigger the volume, the longer you had to wait. Moreover, users orchestrating heat stacks faced real performance problems when deleting several disks at once.

To identify where the bottleneck originated, we measured how long it took to create and delete rbd volumes directly with the rbd command line client, bypassing the cinder code completely.

The commands to do this test are simple:

time rbd -p volumes create testname --size 1024 --image-format 2
rbd -p volumes info testname
time rbd -p volumes rm testname

We quickly figured out that it was Ceph itself that was slow to delete the rbd volumes. The problem was well known and had already been fixed in the Ceph Hammer release, which introduced a new feature: the object map.

When the object map feature is enabled on an image, limiting the diff to the object extents will dramatically improve performance since the differences can be computed by examining the in-memory object map instead of querying RADOS for each object within the image.

http://docs.ceph.com/docs/master/man/8/rbd/

In our practical experience, the time to delete an image decreased from several minutes to a few seconds.
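
To give a feeling for why the object map matters for deletion, here is a rough back-of-the-envelope sketch in Python. It assumes, as we understand the behaviour, that without an object map the client has to issue a delete for every object the image could contain, while with an object map it only needs to touch the objects that actually exist. The image size, the 4 MiB object size (order 22) and the number of allocated objects are illustrative values, not measurements from our cluster, and the function names are ours.

```python
def object_count(size_bytes, order):
    """Number of RADOS objects an image of size_bytes can span (2**order bytes each)."""
    object_bytes = 1 << order
    return (size_bytes + object_bytes - 1) // object_bytes  # ceiling division


def delete_requests(size_bytes, order, allocated_objects, object_map):
    """Rough count of per-object delete requests needed to remove an image."""
    total = object_count(size_bytes, order)
    if object_map:
        # Only objects recorded as existing need to be removed
        return min(allocated_objects, total)
    # No map: every possible object has to be tried
    return total


TIB = 1024 ** 4
# Hypothetical 1 TiB volume with 4 MiB objects, of which 1000 were ever written
print(delete_requests(TIB, 22, 1000, object_map=False))  # 262144
print(delete_requests(TIB, 22, 1000, object_map=True))   # 1000
```

The gap between the two numbers grows with the size of the volume and with how sparsely it was written, which matches our observation that bigger volumes took longer to delete.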

How to fix your OpenStack Juno installation

We changed the ceph.conf to enable the object map feature as described very well in the blog post from Sébastien Han.

It worked well once ceph.conf had the following two lines:

rbd default format = 2
rbd default features = 13

We could immediately create new images with the object map enabled, as the following output shows:

rbd image 'volume-<uuid>':
    size 20480 MB in 2560 objects
    order 23 (8192 kB objects)
    block_name_prefix: rbd_data.<prefix>
    format: 2
    features: layering, exclusive, object map
    flags:
    parent: images/<uuid>@snap
    overlap: 1549 MB
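
The value 13 set for rbd default features is a bitmask. As a small sketch, the Python snippet below decodes it using the feature bit values defined by Ceph (layering = 1, striping = 2, exclusive lock = 4, object map = 8). The helper name decode_features and the short labels are ours, chosen for illustration to match the rbd output.

```python
# Feature bits as defined by Ceph; labels shortened to match the rbd output
RBD_FEATURES = {
    1: "layering",
    2: "striping",
    4: "exclusive",
    8: "object map",
}


def decode_features(value):
    """Return the names of the features whose bit is set in value."""
    return [name for bit, name in sorted(RBD_FEATURES.items()) if value & bit]


print(", ".join(decode_features(13)))  # layering, exclusive, object map
print(1 | 4 | 8)                       # 13
```

Note that the object map requires the exclusive lock feature, which is why both bits are set together.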

We were so happy that it was so easy to fix. However, we soon realized that everything worked with the rbd command line, but all the OpenStack components were ignoring the new options in the ceph.conf file.

We started our investigation with Cinder. We found that Cinder does not call the rbd command line client at all, but relies on the rbd Python library. The Juno implementation of Cinder did not know about these extra features, so it simply ignored our changes in ceph.conf. Support for the object map feature was introduced only with Kilo, in commit 6211d8.
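
To illustrate the idea (this is a minimal sketch, not the actual Cinder patch), a program using the Ceph Python bindings has to read the rbd_default_features option itself and pass it explicitly when creating an image. The function names resolve_features and create_image, the fallback to plain layering, and the default path to ceph.conf are our assumptions for this example; the rados and rbd modules are only imported inside create_image, so the helper can be tried without a Ceph installation.

```python
RBD_FEATURE_LAYERING = 1  # same value as rbd.RBD_FEATURE_LAYERING


def resolve_features(conf_value, fallback=RBD_FEATURE_LAYERING):
    """Turn the raw 'rbd_default_features' setting into an integer bitmask."""
    if conf_value is None:
        return fallback
    try:
        features = int(conf_value)
    except ValueError:
        # Not a plain number: fall back to a safe default (assumption)
        return fallback
    return features or fallback


def create_image(pool, name, size_bytes, conffile="/etc/ceph/ceph.conf"):
    """Create a format 2 image honoring the features set in ceph.conf."""
    # Imported here so resolve_features works without the Ceph bindings
    import rados
    import rbd

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            features = resolve_features(cluster.conf_get("rbd_default_features"))
            rbd.RBD().create(ioctx, name, size_bytes,
                             old_format=False, features=features)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

With rbd default features = 13 in ceph.conf, resolve_features("13") returns 13, so the image would be created with layering, exclusive lock and object map. A client that never reads the option would keep creating images without the object map.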

To quickly fix the performance problem before upgrading to Kilo, we decided to backport this patch to Juno. We already carry other small local patches in our infrastructure, so adding yet another patch and building a new .deb package fit our standard procedure. After backporting the patch, Cinder started to create volumes that correctly honored the options in ceph.conf.

Patching Cinder fixed the problem only for Cinder volumes. Virtual machines started from ephemeral disks run on Ceph rbd images created by Nova. Likewise, the glance images uploaded by users are stored in Ceph rbd volumes by Glance, which relies on the glance_store library.

In the end we had to patch three OpenStack projects to completely backport to Juno the ability to use the Ceph object map feature. Here we publish the links to the git branches and packages for nova, glance_store and cinder.

Conclusion

Upgrading every six months to keep the production infrastructure on the current OpenStack release is challenging. Upgrading without downtime needs a lot of testing, and it is easy to fall behind schedule. For this reason most OpenStack installations today run on Juno or Kilo.

We are releasing these patches for everyone who is running Juno, because the performance benefit is stunning. However, we strongly advise planning an upgrade to Kilo as soon as possible.

 


Doing the right thing

I am returning from GridKA school, held annually at the KIT in Karlsruhe, where I co-hosted a two-day workshop on installing OpenStack with Antonio Messina and Tyanko Alekseiev from the University of Zurich. (You can find the course notes and tutorials over on GitHub.) I don’t want to talk so much about the workshop (it was fun, our attendees were enthusiastic and we ended up with 8 complete OpenStack Grizzly clouds) as about the things that I experienced in the plenary sessions.
A bit of background on me: I joined SWITCH in April 2013 to work on the cloud. Before that, I had been self-employed, running my own companies and working in a number of startups. I left academia in 1987 (without a degree) and returned to it in 2010 when I started (and finished) a Master’s in Science. Early on, friends and family told me that I should pursue an academic career, but I always wanted to prove myself in the commercial world… Well, being a bit closer to academia was one of the reasons I joined SWITCH.
Back to GridKA: presenting at the workshop, teaching and helping people with complex technical software is something I have done quite a bit over the last 20 years, and something I’m quite good at (or so my students tell me). Nothing special, business as usual so to speak.
There also was a plenary program with presentations from various people attending GridKA school. And although I only got to see a few of those due to my schedule, I was absolutely blown away by what I heard. Dr. Urban Liebel talked about microscopes in Life Sciences – the ability to automatically take pictures of thousands of samples and use image recognition algorithms to classify them. He told us about some of the results they have discovered (ibuprofen damages kidneys in children and increases the risk of kidney cancer, something science didn’t know until recently) now that they can investigate more samples faster.
José Luis Vázquez-Poletti, in his talk “Cloud Computing: Expanding Humanity’s Limits to Planet Mars”, talked about installing meteorological sensors on Mars and how to use cloud computing resources to help pinpoint the location of those sensors once they had been deployed on Mars (basically by just dropping them down on the surface – ballistic entry). By looking at the transits of Phobos, a moon of Mars, they are able to determine the location of the landed sensor.
Benedikt Hegener from CERN talked about “Effective Programming and Multicore Computing”, in which he described the trials and tribulations the CERN programmers have to go through to parallelize 5 million lines of code so that it can take advantage of multi-core computers.
There were several other talks that I unfortunately didn’t have a chance to attend. The point of all this?
During those talks it hit me that the work these scientists are doing is creating value on a much deeper level than what most startups are creating. By working on methods to automatically take microscopic pictures and analyse them, and by increasing the throughput, these people work directly on the improvement of our living conditions. While the Mars and CERN experiments don’t seem to have immediate benefits, both space research and high energy physics have greatly contributed to our lives as well. A startup that is creating yet another social network or yet another photo sharing site, all with the intent of making investors happy (by generating loads of money), just doesn’t have the same impact on society.
My work here at SWITCH doesn’t really have the same impact, but I think that building Cloud infrastructure can help some researchers out there in Switzerland do their work more easily, faster or cheaper. In that case, my work at least contributes as a “supporting act”. What more could one want?


The PetaSolutions Blog

Welcome, dear reader, to the Peta Solutions Blog. “Another blog?”, you ask – yes very much so…

Let me start by providing a bit of background on who we are and what we are doing; this might help set the context for the diversity of things you are going to read here.

The Peta Solutions team is located in the “Researchers and Lecturers” Division of SWITCH. Peta (of course) means big (bigger than Tera, anyway) and gives an indication of what we are working with:
Big things… We are here to help researchers with, shall we say, specialised needs in their ICT infrastructure. This started several years ago with Grid activities (several of our team members have been working on Grid-related projects in recent years) and now includes Cloud (we have been busy building our own cloud over the last months), SDN (Software Defined Networking), network performance (our PERT, the Performance Emergency Response Team, stands by in case of performance problems) and more.

We work directly with researchers and help them get up to speed on these issues.

So what should you expect from this blog? We have a couple of ideas. Some of us have blogged for quite a while, some are taking a wait-and-see attitude: the normal mix, in other words.

We plan to talk about our experiences building, maintaining and operating infrastructure, maybe providing you with the crucial nugget of information that helps you solve a problem. We will invite researchers we are working with to share their experiences. We will sometimes wax philosophical about things that are on our collective minds.

In any case, we are happy if all of this turns into a discourse: you are most welcome to respond.

Yours
Alessandra, Alessandro, Jens-Christian, Kurt, Placi, Rüdiger, Sam, Simon, Valery