SWITCH Cloud Blog


Tuning Virtualized Network Node: multi-queue virtio-net

The infrastructure used by SWITCHengines is composed of about 100 servers. Each one uses two 10 Gb/s network ports. Ideally, a given instance (virtual machine) on SWITCHengines would be able to achieve 20 Gb/s of throughput when communicating with the rest of the Internet. In the real world, however, several bottlenecks limit this rate. We are working hard to address these bottlenecks and bring actual performance closer to the theoretical optimum.

An important bottleneck in our infrastructure is the network node for each region. All logical routers are implemented on this node, using Linux network namespaces and Open vSwitch (OVS). That means that all packets between the Internet and all the instances of the region need to pass through the node.

In our architecture, the OpenStack services run inside various virtual machines (“service VMs” or “pets”) on a dedicated set of redundant “provisioning” (or “prov”) servers. This is good for serviceability and reliability, but has some overhead—especially for I/O-intensive tasks such as network packet forwarding. Our network node is one of those service VMs.

In the original configuration, a single instance would never get more than about 2 Gb/s of throughput to the Internet when measured with iperf. What’s worse, the aggregate Internet throughput for multiple VMs was not much higher, which meant that a single high-traffic VM could easily “starve” all other VMs.

We had investigated many options for improving this situation: DVR, multiple network nodes, SR-IOV, DPDK, moving work to switches etc. But each of these methods has its drawbacks such as additional complexity (and thus potential for new and exciting bugs and hard-to-debug failure modes), lock-in, and in some cases, loss of features like IPv6 support. So we stayed with our inefficient but simple configuration that has worked very reliably for us so far.

Multithreading to the Rescue!

Our network node is a VM with multiple logical CPUs. But when running e.g. “top” during high network load, we noticed that only one (virtual) core was busy forwarding packets. So we started looking for a way to distribute the work over several cores. We found that we could achieve this by enabling three things:

Multi-queue virtio-net interfaces

Our service nodes run under libvirt/Qemu/KVM and use virtio-net network devices. These interfaces can be configured to expose multiple queues. Here is an example of an interface definition in libvirt XML syntax which has been configured for eight queues:

 <interface type='bridge'>
   <mac address='52:54:00:e0:e1:15'/>
   <source bridge='br11'/>
   <model type='virtio'/>
   <driver name='vhost' queues='8'/>
   <virtualport type='openvswitch'/>
 </interface>

A good rule of thumb is to set the number of queues to the number of (virtual) CPU cores of the system.
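To double-check this, you can compare the guest’s vCPU count with the queues attribute in its interface definition; a quick sketch, assuming the libvirt domain is called “network-node” (adjust the name to your setup):

virsh vcpucount network-node --current
virsh dumpxml network-node | grep queues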

Multi-threaded forwarding in the network node VM

Within the VM, kernel threads need to be allocated to the interface queues. This can be achieved using ethtool -L:

ethtool -L eth3 combined 8

This should be done during interface initialization, for example in a “pre-up” action in /etc/network/interfaces. But it seems to be possible to change this configuration on a running interface without disruption.
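For illustration, such a “pre-up” action could look roughly like this (a sketch only: the interface name and queue count are examples, and ifupdown provides $IFACE to the hook):

auto eth3
iface eth3 inet manual
    pre-up ethtool -L $IFACE combined 8 || true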

Recent version of the Open vSwitch datapath

Much of the packet forwarding on the network node is performed by OVS. Its “datapath” portion is integrated into the Linux kernel. Our systems normally run Ubuntu 14.04, which includes the Linux 3.13 kernel. With that kernel, the OVS kernel module is installed separately from the openvswitch-datapath-dkms package, which corresponds to the relatively old OVS version 2.0.2. Although the OVS kernel datapath is supposed to have been multi-threaded since forever, we found that in our setup, upgrading to a newer kernel is vital for getting good (OVS) network performance.

The current Ubuntu 16.04.1 LTS release includes a fairly new Linux kernel based on 4.4. That kernel also has the OVS datapath module included by default, so that the separate DKMS package is no longer necessary. Unfortunately we cannot upgrade to Ubuntu 16.04 because that would imply upgrading all OpenStack packages to OpenStack “Mitaka”, and we aren’t quite ready for that. But thankfully, Canonical makes newer kernel packages available for older Ubuntu releases as part of their “hardware enablement” effort, so it turns out to be very easy to upgrade 14.04 to the same new kernel:

sudo apt-get install -y --install-recommends linux-generic-lts-xenial

And after a reboot, the network node should be running a fresh Linux 4.4 kernel with the OVS 2.5 datapath code inside.
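A couple of quick post-reboot checks (nothing here is specific to our setup):

uname -r                             # should now report a 4.4.x kernel
modinfo openvswitch | grep filename  # the OVS datapath module shipped with that kernel
ovs-vsctl --version                  # the userspace OVS version, unchanged by the kernel upgrade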


A simple test is to run multiple netperf TCP_STREAM tests in parallel from a single bare-metal host to six VMs running on separate nova-compute nodes behind the network node.

Each run consists of six netperf TCP_STREAM measurements started in parallel, whose throughput values are added together. Each figure is the average over ten consecutive runs with identical configuration.

The network node VM is set up with 8 vCPUs, and the two interfaces that carry traffic are configured with 8 queues each. We vary the number of queues that are actually used with ethtool -L iface combined n. (Note that even the 1-queue case does not exactly correspond to the original situation; but it’s the closest approximation that we had time to test.)
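For reference, one run looks roughly like the following sketch (the target addresses are placeholders; netperf -P 0 prints one result line per stream with the throughput in 10^6 bits/s as the last field, which we sum up):

TARGETS="10.0.0.11 10.0.0.12 10.0.0.13 10.0.0.14 10.0.0.15 10.0.0.16"
for t in $TARGETS; do
    netperf -H "$t" -t TCP_STREAM -l 30 -P 0 &
done | awk '{ sum += $NF } END { printf "aggregate: %.2f Gb/s\n", sum / 1000 }'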

Network node running 3.13.0-95-generic kernel

1: 3.28 Gb/s
2: 3.41 Gb/s
4: 3.51 Gb/s
8: 3.57 Gb/s

Making use of multiple queues gives very little benefit.

Network node running 4.4.0-36-generic kernel

1: 3.23 Gb/s
2: 6.00 Gb/s
4: 8.02 Gb/s
8: 8.42 Gb/s (8.75 Gb/s with 12 target VMs)

Here we see that performance scales up nicely with multiple queues.

The maximum possible throughput in our setup is lower than 10 Gb/s, because the network node VM uses a single physical 10GE interface for both sides of traffic. And traffic between the network node and the hypervisors is sent encapsulated in VXLAN, which has some overhead.


Now we know how to enable multi-core networking for hand-configured service VMs (“pets”) such as our network node. But what about the VMs under OpenStack’s control?

Starting in Liberty, Nova supports multi-queue virtio-net. Our benchmarking cluster was still running Kilo, so we could not test that yet. But stay tuned!
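From what we understand of the Liberty feature (untested on our Kilo cluster), multi-queue is requested per image via an image property, roughly like this:

glance image-update --property hw_vif_multiqueue_enabled=true <image-uuid>

Inside the guest, the queues still need to be activated with ethtool -L, just as for our hand-configured network node.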



SWITCHengines Under the Hood: Basic IPv6 Configuration

My last post, IPv6 Finally Arriving on SWITCHengines, described what users of our IaaS offering can expect from our newly introduced IPv6 support: Instances using the shared default network (“private”) will get publicly routable IPv6 addresses.

This post explains how we set this up, and why we decided to go this route.  We hope that this is interesting to curious SWITCHengines users, and useful for other operators of OpenStack infrastructure.

Before IPv6: Neutron, Tenant Networks and Floating IP

[Feel free to skip this section if you are familiar with Tenant Networks and Floating IPs.]

SWITCHengines uses Neutron, the current generation of OpenStack Networking.  Neutron supports user-definable networks, routers and additional “service” functions such as Load Balancers or VPN gateways.  In principle, every user can build her own network (or several) isolated from the other tenants.  There is a default network accessible to all tenants.  It is called private, which I find quite confusing because it is totally not private, but shared between all the tenants.  But it has a range of private (in the sense of RFC 1918) IPv4 addresses—a subnet in OpenStack terminology—that is used to assign “fixed” addresses to instances.

There is another network called public, which provides global connectivity.  Users cannot connect instances to it directly, but they can use Neutron routers (which include NAT functionality) to route between the private (RFC 1918) addresses of their instances on tenant networks (whether the shared “private” or their own) and the public network, and by extension, the Internet.  By default, they get “outbound-only” connectivity using 1:N NAT, like users behind a typical broadband router.  But they can also request a Floating IP, which can be associated with a particular instance port.  In this case, a 1:1 NAT provides both outbound and inbound connectivity to the instance.

The router between the shared private network and the external public network was provisioned by us; it is called private-router.  Users who build their own tenant networks and want to connect them with the outside world need to set up their own routers.
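For reference, a minimal sketch of that per-tenant workflow with the Neutron CLI of that era (all names and IDs below are examples):

neutron net-create mynet
neutron subnet-create --name mysubnet mynet 192.168.10.0/24
neutron router-create myrouter
neutron router-gateway-set myrouter public
neutron router-interface-add myrouter mysubnet
# optional 1:1 NAT for inbound connectivity:
neutron floatingip-create public
neutron floatingip-associate <floatingip-id> <instance-port-id>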

This is a fairly standard setup for OpenStack installations, although some operators, especially in the “public cloud” business, forgo private addresses and NAT, and let customers connect their VMs directly to a network with publicly routable addresses.  (Sometimes I wish we’d have done that when we built SWITCHengines—but IPv4 address conservation arguments were strong in our minds at the time.  Now it seems hard to move to such a model for IPv4.  But let’s assume that IPv6 will eventually displace IPv4, so this will become moot.)

Adding IPv6: Subnet, router port, return route—that’s it

So at the outset, we have

  • a shared internal network called private
  • a provider network with Internet connectivity called public
  • a shared router between private and public called private-router

We use the “Kilo” (2015.1) version of OpenStack.

As another requirement, the “real” network underlying the public network (in our case a VLAN) needs connectivity to the IPv6 Internet.

Create an IPv6 Subnet with the appropriate options

And of course we need a fresh range of IPv6 addresses that we can route on the Internet.  A single /64 will be sufficient.  We use this to define a new Subnet in Neutron:

neutron subnet-create --ip-version 6 --name private-ipv6 \
  --ipv6-ra-mode dhcpv6-stateless --ipv6-address-mode dhcpv6-stateless \
  private 2001:620:5ca1:80f0::/64

Note that we use dhcpv6-stateless for both ra-mode and address-mode.  This will actually use SLAAC (stateless address autoconfiguration) and router advertisements to configure IPv6 on the instance.  Stateless DHCPv6 could be used to convey information such as name server addresses, but I don’t think we’re actively using that now.

We should now see a radvd process running in an appropriate namespace on the network node.  And instances—both new and pre-existing!—will start to get IPv6 addresses if they are configured to use SLAAC, as is the default for most modern OSes.
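Two quick ways to check this, assuming the instance’s interface is eth0:

ps aux | grep [r]advd                           # on the network node
ip -6 addr show dev eth0 | grep 2001:620:5ca1   # inside an instance using SLAAC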

Create a new port on the shared router to connect the IPv6 Subnet

Next, we need to add a port to the shared private-router that connects this new subnet with the outside world via the public network:

neutron router-interface-add private-router private-ipv6

Configure a return route on each upstream router

Now the outside world also needs a route back to our IPv6 subnet.  The subnet is already part of a larger aggregate that is routed toward our two upstream routers.  It is sufficient to add a static route for our subnet on each of them.  But where do we point that route to, i.e. what should be the “next hop”? We use the link-local address of the external (gateway) port of our Neutron router, which we can find out by looking inside the namespace for the router on the network node.  Our router private-router has UUID 2b8d1b4f-1df1-476a-ab77-f69bb0db3a59.  So we can run the following command on the network node:

$ sudo ip netns exec qrouter-2b8d1b4f-1df1-476a-ab77-f69bb0db3a59 ip -6 addr list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536
 inet6 ::1/128 scope host
 valid_lft forever preferred_lft forever
55: qg-2d73d3fb-f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
 inet6 2001:620:5ca1:80fd:f816:3eff:fe00:30d7/64 scope global dynamic
 valid_lft 2591876sec preferred_lft 604676sec
 inet6 fe80::f816:3eff:fe00:30d7/64 scope link
 valid_lft forever preferred_lft forever
93: qr-02b9a67d-24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
 inet6 2001:620:5ca1:80f0::1/64 scope global
 valid_lft forever preferred_lft forever
 inet6 fe80::f816:3eff:fe7d:755b/64 scope link
 valid_lft forever preferred_lft forever
98: qr-6aaf629f-19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
 inet6 fe80::f816:3eff:feb6:85f4/64 scope link
 valid_lft forever preferred_lft forever

The port we’re looking for is the one whose name starts with qg-, the gateway port towards the external network.  The address we’re looking for is the one starting with fe80:, its link-local address.
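One way to pull out just that address, reusing the namespace command from above:

sudo ip netns exec qrouter-2b8d1b4f-1df1-476a-ab77-f69bb0db3a59 \
  ip -6 addr show dev qg-2d73d3fb-f3 scope link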

The “internal” subnet has the prefix 2001:620:5ca1:80f0::/64, and VLAN 908 (ve908 in router-ese) is the VLAN that connects our network node to the upstream router.  So this is what we configure on each of our routers using the “industry-standard CLI”:

ipv6 route 2001:620:5ca1:80f0::/64 ve 908 fe80::f816:3eff:fe00:30d7

And we’re done! IPv6 packets can flow between instances on our private network and the Internet.

Coming up

Of course this is not the end of the story.  While our customers were mostly happy that they suddenly got IPv6, there are a few surprises that came up.  In a future episode, we’ll tell you more about them and how they can be addressed.


SDN/NFV Paradigm: Where the Industry meets the Academia

On Thursday, 16 June 2016, the Software Defined Networking (SDN) Switzerland community met for its SDN Workshop, co-located with the Open Cloud Day at the ZHAW School of Engineering in Winterthur. It was the first time that the SDN workshop ran as a separate, dedicated track at a larger event, and we expected synergy between cloud computing and SDN topics, especially from an industry point of view. Participants were free to attend both the Open Cloud Day main track and the SDN event.

The objectives of the SDN workshop were basically the same as last time: to share knowledge, run hands-on sessions, push best practices and implementations that support operations, and present current research topics and prototypes in SDN. This time, I would say, the focus was on SDNization, sometimes called “Softwarization”, which will have a deep impact on techno-economic aspects. Examples of Softwarization are reducing costs by digitalizing and automating processes, optimizing the usage of resources, and creating new forms of coordination as well as competition along the value chain. The consequence is the emergence of new business models.

Up to 30 experts, mostly from industry and academia, took part in the SDN session (Slides). Lively discussions came up on SDN/NFV infrastructure deployments, forms of service presentation and implementation, establishing microservice architectures, making cloud-native applications more agile, and how to open the field to innovation.

Furthermore, two open source projects, ONOS and CORD, were introduced and discussed. These two architectural approaches are built modularly and address scalability, high availability, performance, and NB/SB abstractions, allowing communities that run research and production networks to build their infrastructure on open source software and white boxes. The first SDN deployments powered by ONOS started at the beginning of 2015 with GEANT and GARR in Europe, Internet2 in the USA, and AmLight and FIU in the Americas. The target audience of such a global SDN network deployment are RENs, network operators, and users. The motivations for a global SDN network are manifold, but can be summarized as (1) enabling network and service innovation and (2) learning and improving through an agile collaboration model. Furthermore, new apps enabling network innovation can be offered: Castor (providing L2/L3 connectivity for SDX), SDN-IP (transforming an SDN into a transit IP network, i.e. an SDN AS that uses BGP to communicate with its neighbors and provides L3 connectivity without legacy routers), SDX L2/L3, and VPLS.

Another interesting trend to follow was the combination of open source tools and technologies: for example, Snabb Switch technology meets vMX, a full-featured carrier-grade router running within a Docker container, to build a high-performance, lightweight carrier-grade 4over6 solution. In a demonstration, a service directory licensing model that delivers a vADC (virtual Application Delivery Controller) as a service was presented.

The contribution by Cumulus Networks answered the question “What can the network do for clouds?”. With Cumulus VX and OpenStack, routing to the host allows server admins to use multipath capabilities on the server through multiple uplinks, taking an active role without being bound to L2 networking. Various stages of Cumulus network implementations are possible: from a full MLAG (Multi-Chassis Link Aggregation) fabric with MLAG in the backbone, LACP (Link Aggregation Control Protocol) from the servers, and L2 connectivity with limited scalability, to a full Layer 3 fabric with highly capable, scalable networking and an IP fabric to the hosts based on Cumulus’ Quagga improvements.

In the context of Big Data and applications like Hadoop and MapReduce in a 10-100 Gb/s data centre network, there are many monitoring challenges. One of them is the need for faster and more scalable network monitoring methods. A deep understanding of workloads and their communication patterns is essential for designing future data centres. Since the existing monitoring tools are built on outdated software, high-resolution and non-intrusive monitoring in the data plane has to be addressed. Thus a high-resolution network monitoring architecture called zMon, which targets large-scale data centre and IXP networks, was presented and discussed.

The statement “A bird in the hand is worth two in the bush” here means boosting existing networks with SDN, and implies the question: wouldn’t it be nice to apply the SDN principles on top of existing network architectures? The answer is yes, but how does one move a legacy network to an SDN-enabled environment? Three issues were discussed: SDN requires upgrading (1) network devices, (2) management systems and (3), in a sense, network operators themselves. Using SDN should therefore mean (a) a small investment that provides benefits under partial deployment (often a single switch), (b) low risk, i.e. minimal impact on operational practices and compatibility with existing technologies, and, last but not least, (c) high return, i.e. solving a timely problem. Two approaches were presented: (A) Fibbing (intra-domain routing), an architecture that allows central control of the routers’ forwarding tables on top of distributed routing, and (B) SDX (inter-domain routing), an approach that highlights flexible Internet policies and open, flexible APIs. Fibbing (“lying to the network”) received special attention, as it combines flexibility, expressivity and manageability, the advantages of SDN. Technically speaking, fake nodes and links are introduced into an underlying link-state routing protocol (e.g. OSPF), so that routers compute their forwarding tables based on the extended (fake plus real) network topology. Flexible load balancing, traffic engineering and backup routes can all be realized with Fibbing.

Under the title “Supercharging SDN security”, two projects were presented: (1) a secure SDN architecture that confines the damage caused by a compromised controller or switch by isolating their processes, and (2) an SDN security extension for path enforcement and path validation. The motivation for (1) came from the OpenDaylight Netconf vulnerabilities (DoS, information disclosure, topology spoofing), the ONOS deserialization bug (DoS), and privilege escalation on the Cisco Application Policy Infrastructure Controller and the Cisco Nexus 9000 Series ACI Mode Switch. The question came up of what a secure SDN architecture should look like. Its building blocks are isolated virtual controllers and switches, connected to each other by an SB API (the OpenFlow channel) secured with TLS (Transport Layer Security); since the isolation is per tenant, it is not possible to communicate across isolated environments. A prototype implementation was demonstrated. The driver for (2) is the lack of data-plane accountability: no policy enforcement or data-plane policy validation mechanisms are in place. An SDN security extension should therefore support the enforcement of network paths and reactively inspect data-plane behaviour. Path enforcement mechanisms and path validation procedures were deployed and evaluated in terms of network latency and throughput.

In conclusion, SDN/NFV infrastructure deployments are coming closer to the vendors’ and ISPs’ understanding of SDN, e.g. focusing on vDC and DC use cases. The question “What can the network do for clouds?” became important for vendor-specific implementations using OpenDaylight, Docker, OpenStack, etc., where services are orchestrated and provided across a cloud network topology. Furthermore, there was a long discussion on NB and SB APIs supporting control and data plane programmability. OpenFlow, the traditional SB API, provides only a simple “match-action” paradigm and lacks stateful processing in the SDN data plane. More flexibility is needed and will appear in future SDN approaches, as pointed out for P4: protocol independence (P4 is a programming language that defines how a switch processes packets), target independence (P4 can describe everything from high-performance ASICs to virtual switches), and field reconfigurability (P4 allows reconfiguring how switches process packets). Finally, combining open source tools and software with closed-source or proprietary protocols speeds up the SDNization process and brings together researchers, DevOps teams and industry.

The SDN Switzerland Group is an independent initiative started by SWITCH and the ICCLab (ZHAW) in 2013. Its aim is to organize SDN workshops addressing topics from research, academic ICT operations, and industry (implementation forms of SDN). This setup allows us to bring together knowledge and use the synergy of an interdisciplinary group for future steps, also in collaboration.

IPv6 Finally Arriving on SWITCHengines

As you may have heard or noticed, the Internet is running out of addresses. It’s time to upgrade from the 35-year-old IPv4 protocol, which doesn’t even have a single public address per human on earth, to the brand-new (?) IPv6, which offers enough addresses for every grain of sand in the known universe, or something like that.

SWITCH is a pioneer in IPv6 adoption, and has been supporting IPv6 on all network connections and most services in parallel with IPv4 (“dual stack”) for many years.

To our embarrassment, we hadn’t been able to integrate IPv6 support into SWITCHengines from the start. While OpenStack had some IPv6 support, the implementation wasn’t mature, and we didn’t know how to fit it into our network model in a user-friendly way.

IPv6: “On by default” and globally routable

About a month ago we took a big step to change this: IPv6 is now enabled by default for all instances on the shared internal network (“private”).  So if you have an instance running on SWITCHengines, and it isn’t connected to a tenant network of your own, then the instance probably has an IPv6 address right now, in addition to the IPv4 address(es) it always had.  Note that this is true even for instances that were created or last rebooted before we turned on IPv6. On Linux-derived systems you can check using ifconfig eth0 or ip -6 addr list dev eth0; if you see an address that starts with 2001:620:5ca1:, then your instance can speak IPv6.

Note that these IPv6 addresses are “globally unique” and routable, i.e. they are recognized by the general Internet.  In contrast, the IPv4 addresses on the default network are “private” and can only be used locally inside the cloud; communication with the general Internet requires Network Address Translation (NAT).

What you can do with an IPv6 address

Your instance will now be able to talk to other Internet hosts over IPv6. For example, try ping6 mirror.switch.ch or traceroute6 www.facebook.com. This works just like IPv4, except that so far only a subset of hosts on the Internet speaks IPv6. Fortunately, this subset already includes important services and is growing.  Because IPv6 doesn’t need NAT, routing between your instances and the Internet is less resource-intensive and a tiny bit faster than with IPv4.

But you will also be able to accept connections from other Internet hosts over IPv6. This is different from before: To accept connections over IPv4, you need(ed) a separate public address, a Floating IP in OpenStack terminology.  So if you can get by with IPv6, for example because you only need (SSH or other) access from hosts that have IPv6, then you don’t need to reserve a Floating IP anymore.  This saves you not just work but also money—public IPv4 addresses are scarce, so we need to charge a small “rent” for each Floating IP reserved.  IPv6 addresses are plentiful, so we don’t charge for them.

But isn’t this dangerous?

Instances are now globally reachable by default, but they are still protected by OpenStack’s Security Groups (corresponding to packet filters or access control lists).  The default Security Group only allows outbound connections: Your instance can connect to servers elsewhere, but attempts to connect to your instance will be blocked.  You have probably opened some ports such as TCP port 22 (for SSH) or 80 or 443 (for HTTP/HTTPS) by adding corresponding rules to your own Security Groups.  In these rules, you need to specify address “prefixes” that define where you want to accept traffic from.  These prefixes can be IPv4 or IPv6—if you want to accept both, you need two rules.

If you want to accept traffic from anywhere, your IPv4 rules will contain 0.0.0.0/0 as the prefix. To accept IPv6 traffic as well, simply add identical rules with ::/0 as the prefix instead—this is the IPv6 version of the “global” prefix.
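With the Neutron CLI, such a pair of rules for SSH might look like this (using the “default” Security Group as an example):

neutron security-group-rule-create --direction ingress --protocol tcp \
  --port-range-min 22 --port-range-max 22 --remote-ip-prefix 0.0.0.0/0 default
neutron security-group-rule-create --direction ingress --ethertype IPv6 --protocol tcp \
  --port-range-min 22 --port-range-max 22 --remote-ip-prefix ::/0 default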

What about domain names?

These IPv6 addresses can be entered in the DNS using “AAAA” records. For Floating IPs, we provided pre-registered hostnames of the form fl-34-56.zhdk.cloud.switch.ch. We cannot do that in IPv6, because there are just too many possible addresses. If you require your IPv6 address to map back to a hostname, please let us know and we can add it manually.
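For illustration, such an entry in a zone file would look something like this (name and address are made up):

myserver.example.org.  IN  AAAA  2001:620:5ca1:80f0:f816:3eff:fe00:1234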

OpenStack will learn how to (optionally) register such hostnames in the DNS automatically; but that feature was only added to the latest release (“Mitaka”), and it will be several months before we can deploy this in SWITCHengines.


We would like to also offer IPv6 connectivity to user-created “tenant networks”. Our version of OpenStack almost supports this, but it cannot be fully automated yet. If you need IPv6 on your non-shared network right now, please let us know via the normal support channel, and we’ll set something up manually. But eventually (hopefully soon), getting a globally routable IPv6 prefix for your network should be (almost) as easy as getting a globally routable Floating IP is now.

You can also expect services running on SWITCHengines (SWITCHdrive, SWITCHfilesender and more) to become exposed over IPv6 over the next couple of months. Stay tuned!

New version, new features

We are constantly working on SWITCHengines, updating, tweaking stuff. Most of the time, little of this process is visible to users, but sometimes we release features that make a difference in the user experience.

A major change was the upgrade to OpenStack Kilo that we did in mid-March. OpenStack is the software that powers our cloud, and it gets an update every 6 months. The releases are named alphabetically. Our cloud’s history started with the “Icehouse” release, moved to “Juno” and now we are on “Kilo”. Yesterday “Mitaka” was released, so we are 2 releases (or 12 months) behind.

Upgrading the cloud infrastructure is major work. Our goal is to upgrade “in place”, with all virtual machines running uninterrupted during the upgrade. Other cloud operators install a new version on minimal hardware, then migrate the customer machines to it and convert the old hypervisors one by one. This is certainly feasible, but it causes downtime – something we’d like to avoid.

Therefore we spend a lot of time testing the upgrade path. The upgrade from “Icehouse” to “Juno” took over 6 months (first we needed to figure out how to do the upgrade in the first place, then we had to implement and test it). The upgrade from “Juno” to “Kilo” then took only 4 months (with Christmas and New Year in between). Now we are working on the upgrade to “Liberty”, which is planned to happen before June/July. This time we plan to be even faster, because we are going to upgrade the many components of OpenStack individually. The upgrade to the just-released “Mitaka” should be done before “Newton” is released in October. Our plan is to be at most 6 months behind the official release schedule.

So what does Kilo bring you, the end user? A slightly different user interface, loads of internal changes and a few new major features:

There is also stuff coming in the next few weeks:

  • Access to the SWIFT object store
  • Backup of Volumes (that is something we are testing right now)
  • IPv6 addresses for virtual machines

We have streamlined the deployment process of changes – while we did releases once a week during the last year, we now can deploy new features as soon as they are finished and tested.


Backporting the Ceph rbd object map feature to OpenStack Juno

How we use Ceph at SWITCHengines

Virtual machine storage in the OpenStack public cloud SWITCHengines is provided by Ceph. We run a Ceph cluster in each OpenStack region. The compute nodes do not have any local storage resource; the virtual machines access their disks directly over the network, because libvirt can act as a Ceph client.

Using Ceph as the default storage for Glance images, Nova ephemeral disks, and Cinder volumes is a very convenient choice. We are able to scale the storage capacity as needed, regardless of the disk capacity on the compute nodes. It is also easier to live-migrate Nova instances between compute nodes, because the virtual machine disks are not local to a specific compute node and don’t need to be migrated.

The performance problem

The load on our Ceph cluster increases constantly, because more virtual machines are running every day. In October 2015 we noticed that deleting Cinder volumes had become a very slow operation, and the bigger the volumes, the longer you had to wait. Moreover, users orchestrating Heat stacks faced real performance problems when deleting several disks at once.

To identify where the bottleneck originated, we measured how long it took to create and delete rbd volumes directly with the rbd command-line client, excluding the Cinder code completely.

The commands to do this test are simple:

time rbd -p volumes create testname --size 1024 --image-format 2
rbd -p volumes info testname
time rbd -p volumes rm testname

We quickly figured out that it was Ceph itself that was slow to delete the rbd volumes. The problem was well known and had already been fixed in the Ceph Hammer release, which introduced a new feature: the object map.

When the object map feature is enabled on an image, limiting the diff to the object extents will dramatically improve performance since the differences can be computed by examining the in-memory object map instead of querying RADOS for each object within the image.


In our practical experience the time to delete an image decreased from several minutes to a few seconds.

How to fix your OpenStack Juno installation

We changed ceph.conf to enable the object map feature, as described very well in the blog post by Sébastien Han.

Enabling it was easy: we added the following two lines to ceph.conf (13 is the rbd feature bitmask: layering (1) + exclusive-lock (4) + object map (8)):

rbd default format = 2
rbd default features = 13
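As a quick sanity check, you can repeat the timing test from above and look at the features of a freshly created image:

rbd -p volumes create featuretest --size 1024 --image-format 2
rbd -p volumes info featuretest     # the features line should now include the object map
rbd -p volumes rm featuretest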

With these settings in place, we could immediately create new images with the object map enabled, as you can see in the following output:

rbd image 'volume-<uuid>':
    size 20480 MB in 2560 objects
    order 23 (8192 kB objects)
    block_name_prefix: rbd_data.<prefix>
    format: 2
    features: layering, exclusive, object map
    parent: images/<uuid>@snap
    overlap: 1549 MB

We were so happy that it was so easy to fix. However, we soon realized that while everything worked with the rbd command line, all the OpenStack components were ignoring the new options in the ceph.conf file.

We started our investigation with Cinder. We learned that Cinder does not call the rbd command-line client at all, but relies on the rbd Python library. The Juno implementation of Cinder did not know about these extra features, so it simply ignored our changes in ceph.conf. Support for the object map feature was introduced only in Kilo, with commit 6211d8.

To quickly fix the performance problem before upgrading to Kilo, we decided to backport this patch to Juno. We already carry other small local patches in our infrastructure, so it was part of our standard procedure to add yet another patch and create a new .deb package. After backporting the patch, Cinder started to create volumes correctly, honoring the options in ceph.conf.

By patching Cinder we fixed the problem only for Cinder volumes. However, virtual machines started from ephemeral disks run on Ceph rbd images created by Nova, and the Glance images uploaded by users are stored in Ceph rbd volumes by Glance, which relies on the glance_store library.

In the end we had to patch three OpenStack projects to completely backport the ability to use the Ceph object map feature to Juno. Here we publish the links to the git branches and packages for nova, glance_store and cinder.


Upgrading every six months to keep the production infrastructure on the current OpenStack release is challenging. Upgrading without downtime needs a lot of testing, and it is easy to fall behind schedule. For this reason most OpenStack installations today run Juno or Kilo.

We release these patches for all those who are running Juno, because the performance benefit is stunning. However, we strongly advise planning an upgrade to Kilo as soon as possible.