SWITCH Cloud Blog

IPv6 Address Assignment in OpenStack

In an inquiry “IPv6 and Liberty (or Mitaka)” on the openstack mailing list, Ken D’Ambrosio writes:
> Hey, all. I have a Liberty cloud, and decided for the heck of it to
> start dipping my toe into IPv6. I do have some confusion, however. I
> can choose between SLAAC, DHCPv6 stateful and DHCPv6 stateless — and
> I see some writeups on what they do, but I don’t understand what
> differentiates them. As far as I can tell, they all do pretty much
> the same thing, just with different pieces doing different things.
> E.g., the chart, found here
> (http://docs.openstack.org/liberty/networking-guide/adv-config-ipv6.html
> — page down a little) shows those three options, but it isn’t clear:
> * How to configure the elements involved
> * What they exactly do (e.g., “optional info”? What’s that?)
> * Why there even *are* different choices. Do they offer functionally
> different results?

SLAAC and DHCPv6-stateless use the same mechanism (SLAAC) to provide instances with IPv6 addresses. The only difference between them is that with DHCPv6-stateless, the instance can also use DHCPv6 requests to get other (than its own address) information such as nameserver addresses etc. So between SLAAC and DHCPv6-stateless, I would always prefer DHCPv6-stateless—it’s a strict superset in terms of functionality, and I don’t see any particular risks associated with it.

DHCPv6-stateful is a different beast: It will use DHCPv6 to give an instance its IPv6 address. DHCPv6 actually fits OpenStack’s model better than SLAAC.

Why DHCPv6-Stateful Fits OpenStack Better

OpenStack (Nova) sees it as part of its job to control the IP address(es) that an instance uses. In IPv4 it uses DHCP (and always has). DHCP assigns complete addresses, which are under the control of OpenStack. In IPv6, stateful DHCPv6 is the equivalent.

SLAAC is different in that the node (instance) actually chooses its address based on information it gets from the router. The most common method is that the node uses an “EUI-64” address as the local part (host ID) of the address. The EUI-64 is derived from the MAC address by a fixed algorithm. This can work with OpenStack because OpenStack controls the MAC addresses too, and can thus “guess” what IPv6 address an instance will auto-configure on a given network. You see how this is a little less straightforward than OpenStack simply telling the instance what IPv6 address it should use.
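As an illustration of the algorithm, take the MAC address 52:54:00:e0:e1:15 that appears in a libvirt example later in this blog: split the MAC in the middle, insert ff:fe, and flip the universal/local bit (0x02) of the first octet, which gives the interface ID 5054:00ff:fee0:e115. Appended to the /64 prefix advertised by the router, this becomes the instance’s SLAAC address.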

In practice, OpenStack’s guessing fails when an instance uses other methods to get the local part, for example “privacy addresses” according to RFC 4941. These will lead to conflicts with OpenStack’s built-in anti-spoofing filters. So such mechanisms need to be disabled when SLAAC is used under OpenStack (including under “DHCPv6-stateless”).
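How exactly privacy addresses are disabled depends on the distribution and the network manager in use; as a rough sketch, on many Linux guests the relevant sysctls are:

# disable RFC 4941 privacy (temporary) addresses on all interfaces
sysctl -w net.ipv6.conf.all.use_tempaddr=0
sysctl -w net.ipv6.conf.default.use_tempaddr=0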

Why we Use SLAAC/DHCPv6-Stateless Anyway

Unfortunately, most GNU/Linux distributions don’t support stateful DHCPv6 “out of the box” today.

Because we want our users to use unmodified operating system images and still get usable IPv6, we have grudgingly decided to use DHCPv6-stateless. For configuration information, see SWITCHengines Under the Hood: Basic IPv6 Configuration.

If you decide to go for DHCPv6-stateful, then there’s a Web page that explains how to enable it client-side for a variety of GNU/Linux distributions.
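As a rough sketch (details vary by distribution; consult that page for specifics), on an ifupdown-based system such as Ubuntu the interface stanza would request the address via DHCPv6 instead of relying on SLAAC:

# /etc/network/interfaces: obtain the IPv6 address via stateful DHCPv6
iface eth0 inet6 dhcp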

It would be nice if all systems honored the “M” (Managed) flag in Router Advertisements and would use DHCPv6 if it is set, otherwise SLAAC.
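For reference, the M and O flags are set by the router in its Router Advertisements; with radvd, for example, a configuration along these lines (a sketch using a documentation prefix, not our actual setup) advertises both flags:

interface eth0
{
    AdvSendAdvert on;
    AdvManagedFlag on;      # M: obtain addresses via DHCPv6
    AdvOtherConfigFlag on;  # O: obtain other configuration via DHCPv6
    prefix 2001:db8:1::/64
    {
    };
};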

[This is an edited version of my response, which I wasn’t sure I was allowed to post because I use GMANE to read the list. –SL]

Tuning Virtualized Network Node: multi-queue virtio-net

The infrastructure used by SWITCHengines is composed of about 100 servers. Each one uses two 10 Gb/s network ports. Ideally, a given instance (virtual machine) on SWITCHengines would be able to achieve 20 Gb/s of throughput when communicating with the rest of the Internet. In the real world, however, several bottlenecks limit this rate. We are working hard to address these bottlenecks and bring actual performance closer to the theoretical optimum.

An important bottleneck in our infrastructure is the network node for each region. All logical routers are implemented on this node, using Linux network namespaces and Open vSwitch (OVS). That means that all packets between the Internet and all the instances of the region need to pass through the node.

In our architecture, the OpenStack services run inside various virtual machines (“service VMs” or “pets”) on a dedicated set of redundant “provisioning” (or “prov”) servers. This is good for serviceability and reliability, but has some overhead—especially for I/O-intensive tasks such as network packet forwarding. Our network node is one of those service VMs.

In the original configuration, a single instance would never get more than about 2 Gb/s of throughput to the Internet when measured with iperf. What’s worse, the aggregate Internet throughput for multiple VMs was not much higher, which meant that a single high-traffic VM could easily “starve” all other VMs.

We investigated many options for improving this situation: DVR, multiple network nodes, SR-IOV, DPDK, moving work to switches etc. But each of these methods has its drawbacks, such as additional complexity (and thus potential for new and exciting bugs and hard-to-debug failure modes), lock-in, and in some cases, loss of features like IPv6 support. So we stayed with our inefficient but simple configuration, which has worked very reliably for us so far.

Multithreading to the Rescue!

Our network node is a VM with multiple logical CPUs. But when running e.g. “top” during high network load, we noticed that only one (virtual) core was busy forwarding packets. So we started looking for a way to distribute the work over several cores. We found that we could achieve this by enabling three things:
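One informal way to observe this (assuming the sysstat tools are installed in the VM) is to watch per-CPU utilization, and how the virtio interrupts are spread over cores, while traffic is flowing:

# per-CPU utilization, refreshed every second; look for a single core pegged in %soft/%sys
mpstat -P ALL 1
# how the virtio-net queue interrupts are distributed over CPUs
grep virtio /proc/interrupts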

Multi-queue virtio-net interfaces

Our service nodes run under libvirt/Qemu/KVM and use virtio-net network devices. These interfaces can be configured to expose multiple queues. Here is an example of an interface definition in libvirt XML syntax which has been configured for eight queues:

 <interface type='bridge'>
   <mac address='52:54:00:e0:e1:15'/>
   <source bridge='br11'/>
   <model type='virtio'/>
   <driver name='vhost' queues='8'/>
   <virtualport type='openvswitch'/>
 </interface>

A good rule of thumb is to set the number of queues to the number of (virtual) CPU cores of the system.
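One informal way to verify on the hypervisor that the queues are really being served is to look for the vhost kernel threads that Qemu spawns (roughly one per active queue):

# vhost worker threads serving the virtio queues of running guests
ps -eLf | grep vhost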

Multi-threaded forwarding in the network node VM

Within the VM, kernel threads need to be allocated to the interface queues. This can be achieved using ethtool -L:

ethtool -L eth3 combined 8

This should be done during interface initialization, for example in a “pre-up” action in /etc/network/interfaces. But it seems to be possible to change this configuration on a running interface without disruption.
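A minimal sketch of such an ifupdown stanza, using the interface name and queue count from this example, might look like:

# /etc/network/interfaces
auto eth3
iface eth3 inet manual
    pre-up ethtool -L eth3 combined 8

The current and maximum queue counts of an interface can be inspected at any time with ethtool -l eth3 (lowercase -l).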

A recent version of the Open vSwitch datapath

Much of the packet forwarding on the network node is performed by OVS. Its “datapath” portion is integrated into the Linux kernel. Our systems normally run Ubuntu 14.04, which includes the Linux 3.13 kernel. The OVS kernel module isn’t included with that kernel package, but is installed separately from the openvswitch-datapath-dkms package, which corresponds to the relatively old OVS version 2.0.2. Although the OVS kernel datapath is supposed to have been multi-threaded since forever, we found that in our setup, upgrading to a newer kernel is vital for getting good OVS network performance.

The current Ubuntu 16.04.1 LTS release includes a fairly new Linux kernel based on 4.4. That kernel also has the OVS datapath module included by default, so that the separate DKMS package is no longer necessary. Unfortunately we cannot upgrade to Ubuntu 16.04 because that would imply upgrading all OpenStack packages to OpenStack “Mitaka”, and we aren’t quite ready for that. But thankfully, Canonical makes newer kernel packages available for older Ubuntu releases as part of their “hardware enablement” effort, so it turns out to be very easy to upgrade 14.04 to the same new kernel:

sudo apt-get install -y --install-recommends linux-generic-lts-xenial

And after a reboot, the network node should be running a fresh Linux 4.4 kernel with the OVS 2.5 datapath code inside.
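A quick sanity check after the reboot (the module name assumes the in-tree datapath):

# confirm the running kernel and that the OVS datapath module is loaded
uname -r
lsmod | grep openvswitch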


A simple test is to run multiple netperf TCP_STREAM tests in parallel from a single bare-metal host to six VMs running on separate nova-compute nodes behind the network node.

Each run consists of six netperf TCP_STREAM measurements started in parallel, whose throughput values are added together. Each figure is the average over ten consecutive runs with identical configuration.
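Each individual measurement is of this general form (a sketch; the target address is a placeholder and the exact options we used are not reproduced here), with six of them started in parallel:

netperf -H 10.0.0.11 -t TCP_STREAM -l 30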

The network node VM is set up with 8 vCPUs, and the two interfaces that carry traffic are configured with 8 queues each. We vary the number of queues that are actually used using ethtool -L iface combined n. (Note that even the 1-queue case does not exactly correspond to the original situation; but it’s the closest approximation that we had time to test.)

Network node running 3.13.0-95-generic kernel

1 queue:  3.28 Gb/s
2 queues: 3.41 Gb/s
4 queues: 3.51 Gb/s
8 queues: 3.57 Gb/s

Making use of multiple queues gives very little benefit.

Network node running 4.4.0-36-generic kernel

1 queue:  3.23 Gb/s
2 queues: 6.00 Gb/s
4 queues: 8.02 Gb/s
8 queues: 8.42 Gb/s (8.75 Gb/s with 12 target VMs)

Here we see that performance scales up nicely with multiple queues.

The maximum possible throughput in our setup is lower than 10 Gb/s, because the network node VM uses a single physical 10GE interface for both sides of traffic. And traffic between the network node and the hypervisors is sent encapsulated in VXLAN, which has some overhead.


Now we know how to enable multi-core networking for hand-configured service VMs (“pets”) such as our network node. But what about the VMs under OpenStack’s control?

Starting in Liberty, Nova supports multi-queue virtio-net. Our benchmarking cluster was still running Kilo, so we could not test that yet. But stay tuned!



SWITCHengines Under the Hood: Basic IPv6 Configuration

My last post, IPv6 Finally Arriving on SWITCHengines, described what users of our IaaS offering can expect from our newly introduced IPv6 support: Instances using the shared default network (“private”) will get publicly routable IPv6 addresses.

This post explains how we set this up, and why we decided to go this route.  We hope that this is interesting to curious SWITCHengines users, and useful for other operators of OpenStack infrastructure.

Before IPv6: Neutron, Tenant Networks and Floating IP

[Feel free to skip this section if you are familiar with Tenant Networks and Floating IPs.]

SWITCHengines uses Neutron, the current generation of OpenStack Networking.  Neutron supports user-definable networks, routers and additional “service” functions such as Load Balancers or VPN gateways.  In principle, every user can build her own network (or several) isolated from the other tenants.  There is a default network accessible to all tenants.  It is called private, which I find quite confusing because it is totally not private, but shared between all the tenants.  But it has a range of private (in the sense of RFC 1918) IPv4 addresses—a subnet in OpenStack terminology—that is used to assign “fixed” addresses to instances.

There is another network called public, which provides global connectivity.  Users cannot connect instances to it directly, but they can use Neutron routers (which include NAT functionality) to route between the private (RFC 1918) addresses of their instances on tenant networks (whether the shared “private” or their own) and the public network, and by extension, the Internet.  By default, they get “outbound-only” connectivity using 1:N NAT, like users behind a typical broadband router.  But they can also request a Floating IP, which can be associated with a particular instance port.  In this case, a 1:1 NAT provides both outbound and inbound connectivity to the instance.

The router between the shared private network and the external public network was provisioned by us; it is called private-router.  Users who build their own tenant networks and want to connect them with the outside world need to set up their own routers.

This is a fairly standard setup for OpenStack installations, although some operators, especially in the “public cloud” business, forgo private addresses and NAT, and let customers connect their VMs directly to a network with publicly routable addresses.  (Sometimes I wish we’d have done that when we built SWITCHengines—but IPv4 address conservation arguments were strong in our minds at the time.  Now it seems hard to move to such a model for IPv4.  But let’s assume that IPv6 will eventually displace IPv4, so this will become moot.)

Adding IPv6: Subnet, router port, return route—that’s it

So at the outset, we have

  • a shared internal network called private
  • a provider network with Internet connectivity called public
  • a shared router between private and public called private-router

We use the “Kilo” (2015.1) version of OpenStack.

As another requirement, the “real” network underlying the public network (in our case a VLAN) needs connectivity to the IPv6 Internet.

Create an IPv6 Subnet with the appropriate options

And of course we need a fresh range of IPv6 addresses that we can route on the Internet.  A single /64 will be sufficient.  We use this to define a new Subnet in Neutron:

neutron subnet-create --ip-version 6 --name private-ipv6 \
  --ipv6-ra-mode dhcpv6-stateless --ipv6-address-mode dhcpv6-stateless \
  private 2001:620:5ca1:80f0::/64

Note that we use dhcpv6-stateless for both ra-mode and address-mode.  This will actually use SLAAC (stateless address autoconfiguration) and router advertisements to configure IPv6 on the instance.  Stateless DHCPv6 could be used to convey information such as name server addresses, but I don’t think we’re actively using that now.

We should now see a radvd process running in an appropriate namespace on the network node.  And instances—both new and pre-existing!—will start to get IPv6 addresses if they are configured to use SLAAC, as is the default for most modern OSes.

Create a new port on the shared router to connect the IPv6 Subnet

Next, we need to add a port to the shared private-router that connects this new subnet with the outside world via the public network:

neutron router-interface-add private-router private-ipv6

Configure a return route on each upstream router

Now the outside world also needs a route back to our IPv6 subnet.  The subnet is already part of a larger aggregate that is routed toward our two upstream routers.  It is sufficient to add a static route for our subnet on each of them.  But where do we point that route to, i.e. what should be the “next hop”? We use the link-local address of the external (gateway) port of our Neutron router, which we can find out by looking inside the namespace for the router on the network node.  Our router private-router has UUID 2b8d1b4f-1df1-476a-ab77-f69bb0db3a59.  So we can run the following command on the network node:

$ sudo ip netns exec qrouter-2b8d1b4f-1df1-476a-ab77-f69bb0db3a59 ip -6 addr list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536
 inet6 ::1/128 scope host
 valid_lft forever preferred_lft forever
55: qg-2d73d3fb-f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
 inet6 2001:620:5ca1:80fd:f816:3eff:fe00:30d7/64 scope global dynamic
 valid_lft 2591876sec preferred_lft 604676sec
 inet6 fe80::f816:3eff:fe00:30d7/64 scope link
 valid_lft forever preferred_lft forever
93: qr-02b9a67d-24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
 inet6 2001:620:5ca1:80f0::1/64 scope global
 valid_lft forever preferred_lft forever
 inet6 fe80::f816:3eff:fe7d:755b/64 scope link
 valid_lft forever preferred_lft forever
98: qr-6aaf629f-19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
 inet6 fe80::f816:3eff:feb6:85f4/64 scope link
 valid_lft forever preferred_lft forever

The port we’re looking for is the one whose name starts with qg-, the gateway port towards the public network.  The address we’re looking for is the one starting with fe80:, its link-local address (here fe80::f816:3eff:fe00:30d7).

The “internal” subnet has the prefix 2001:620:5ca1:80f0::/64, and VLAN 908 (ve908 in router-ese) is the VLAN that connects our network node to the upstream router.  So this is what we configure on each of our routers using the “industry-standard CLI”:

ipv6 route 2001:620:5ca1:80f0::/64 ve 908 fe80::f816:3eff:fe00:30d7

And we’re done! IPv6 packets can flow between instances on our private network and the Internet.

Coming up

Of course this is not the end of the story.  While our customers were mostly happy that they suddenly got IPv6, there are a few surprises that came up.  In a future episode, we’ll tell you more about them and how they can be addressed.

IPv6 Finally Arriving on SWITCHengines

As you may have heard or noticed, the Internet is running out of addresses. It’s time to upgrade from the 35-year-old IPv4 protocol, which doesn’t even have a single public address per human on the earth, to the brand new (?) IPv6, which offers enough addresses for every grain of sand in the known universe, or something like that.

SWITCH is a pioneer in IPv6 adoption, and has been supporting IPv6 on all network connections and most services in parallel with IPv4 (“dual stack”) for many years.

To our embarrassment, we hadn’t been able to integrate IPv6 support into SWITCHengines from the start. While OpenStack had some IPv6 support, the implementation wasn’t mature, and we didn’t know how to fit it into our network model in a user-friendly way.

IPv6: “On by default” and globally routable

About a month ago we took a big step to change this: IPv6 is now enabled by default for all instances on the shared internal network (“private”).  So if you have an instance running on SWITCHengines, and it isn’t connected to a tenant network of your own, then the instance probably has an IPv6 address right now, in addition to the IPv4 address(es) it always had.  Note that this is true even for instances that were created or last rebooted before we turned on IPv6. On Linux-derived systems you can check using ifconfig eth0 or ip -6 addr list dev eth0; if you see an address that starts with 2001:620:5ca1:, then your instance can speak IPv6.

Note that these IPv6 addresses are “globally unique” and routable, i.e. they are recognized by the general Internet.  In contrast, the IPv4 addresses on the default network are “private” and can only be used locally inside the cloud; communication with the general Internet requires Network Address Translation (NAT).

What you can do with an IPv6 address

Your instance will now be able to talk to other Internet hosts over IPv6. For example, try ping6 mirror.switch.ch or traceroute6 www.facebook.com. This works just like IPv4, except that so far only a subset of hosts on the Internet speak IPv6. Fortunately, this subset already includes important services and is growing.  Because IPv6 doesn’t need NAT, routing between your instances and the Internet is less resource-intensive and a tiny bit faster than with IPv4.

But you will also be able to accept connections from other Internet hosts over IPv6. This is different from before: To accept connections over IPv4, you need(ed) a separate public address, a Floating IP in OpenStack terminology.  So if you can get by with IPv6, for example because you only need (SSH or other) access from hosts that have IPv6, then you don’t need to reserve a Floating IP anymore.  This saves you not just work but also money—public IPv4 addresses are scarce, so we need to charge a small “rent” for each Floating IP reserved.  IPv6 addresses are plentiful, so we don’t charge for them.

But isn’t this dangerous?

Instances are now globally reachable by default, but they are still protected by OpenStack’s Security Groups (corresponding to packet filters or access control lists).  The default Security Group only allows outbound connections: Your instance can connect to servers elsewhere, but attempts to connect to your instance will be blocked.  You have probably opened some ports such as TCP port 22 (for SSH) or 80 or 443 (for HTTP/HTTPS) by adding corresponding rules to your own Security Groups.  In these rules, you need to specify address “prefixes” that define where you want to accept traffic from.  These prefixes can be IPv4 or IPv6—if you want to accept both, you need two rules.

If you want to accept traffic from anywhere, your IPv4 rules will contain 0.0.0.0/0 as the prefix. To accept IPv6 traffic as well, simply add identical rules with ::/0 as the prefix instead—this is the IPv6 version of the “global” prefix.
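For example, using the neutron CLI, a pair of rules that open SSH to the world over both protocols could look like this (a sketch; “default” stands for whichever Security Group you actually use):

neutron security-group-rule-create --direction ingress --ethertype IPv4 \
  --protocol tcp --port-range-min 22 --port-range-max 22 \
  --remote-ip-prefix 0.0.0.0/0 default
neutron security-group-rule-create --direction ingress --ethertype IPv6 \
  --protocol tcp --port-range-min 22 --port-range-max 22 \
  --remote-ip-prefix ::/0 default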

What about domain names?

These IPv6 addresses can be entered in the DNS using “AAAA” records. For Floating IPs, we provided pre-registered hostnames of the form fl-34-56.zhdk.cloud.switch.ch. We cannot do that in IPv6, because there are just too many possible addresses. If you require your IPv6 address to map back to a hostname, please let us know and we can add it manually.
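Such a manual entry is just an ordinary AAAA record; in zone-file syntax it would look roughly like this (hypothetical hostname and interface identifier):

myvm.example.org.  3600  IN  AAAA  2001:620:5ca1:80f0:f816:3eff:fe00:1234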

OpenStack will learn how to (optionally) register such hostnames in the DNS automatically; but that feature was only added to the latest release (“Mitaka”), and it will be several months before we can deploy this in SWITCHengines.


We would like to also offer IPv6 connectivity to user-created “tenant networks”. Our version of OpenStack almost supports this, but it cannot be fully automated yet. If you need IPv6 on your non-shared network right now, please let us know via the normal support channel, and we’ll set something up manually. But eventually (hopefully soon), getting a globally routable IPv6 prefix for your network should be (almost) as easy as getting a globally routable Floating IP is now.

You can also expect services running on SWITCHengines (SWITCHdrive, SWITCHfilesender and more) to become exposed over IPv6 over the next couple of months. Stay tuned!

New version, new features

We are constantly working on SWITCHengines, updating, tweaking stuff. Most of the time, little of this process is visible to users, but sometimes we release features that make a difference in the user experience.

A major change was the upgrade to OpenStack Kilo that we did in mid-March. OpenStack is the software that powers our cloud, and it gets an update every 6 months. The releases are named alphabetically. Our cloud’s history started with the “Icehouse” release, moved to “Juno”, and now we are on “Kilo”. Yesterday “Mitaka” was released, so we are 2 releases (or 12 months) behind.

Upgrading the cloud infrastructure is major work. Our goal is to upgrade “in place”, with all virtual machines running uninterrupted during the upgrade. Other cloud operators install the new version on a minimal set of hardware, then migrate the customer machines to the new hardware and convert the hypervisors one by one. This is certainly feasible, but it causes downtime – something we’d like to avoid.

Therefore we spend a lot of time testing the upgrade path. The upgrade from “Icehouse” to “Juno” took over 6 months (first we needed to figure out how to do the upgrade in the first place, then had to implement and test it). The upgrade from “Juno” to “Kilo” then took only 4 months (with Christmas and New Year in the middle). Now we are working on the upgrade to “Liberty”, which is planned to happen before June/July. This time, we plan to be even faster, because we are going to upgrade the many components of OpenStack individually. The upgrade to the just-released “Mitaka” should be done before “Newton” is released in October. Our plan is to be at most 6 months behind the official release schedule.

So what does Kilo bring you, the end user? A slightly different user interface, loads of internal changes and a few new major features:

There is also stuff coming in the next few weeks:

  • Access to the SWIFT object store
  • Backup of Volumes (that is something we are testing right now)
  • IPv6 addresses for virtual machines

We have streamlined the deployment process of changes – while we did releases once a week during the last year, we now can deploy new features as soon as they are finished and tested.



Upgrading a Ceph Cluster from 170 to 200 Disks, in One Image

The infrastructure underlying SWITCHengines includes two Ceph storage clusters, one in Lausanne and one in Zurich. The Zurich one (which notably serves SWITCHdrive) filled up over the past year. In December 2015 we acquired new servers to upgrade its capacity.

The upgrade involves the introduction of a new “leaf-spine” network architecture based on “whitebox” switches and Layer-3 (IP) routing to ensure future scalability. The pre-existing servers are still connected to the “old” network consisting of two switches and a single Layer 2 (Ethernet) domain.

First careful steps: 160→161→170

This change in network topology, and in particular the necessity to support both the old and new networks, caused us to be very careful when adding the new servers. The old cluster consisted of 160 Ceph OSDs, running on sixteen servers with ten 4TB hard disks each. We first added a single server with a single disk (OSD) and observed that it worked well. Then we added nine more OSDs on that first new server to bring the cluster total up to 170 OSDs. That also worked flawlessly.

Now for real: 170→200

As the next step, we added three new servers with ten disks each to the cluster at once, to bring the total OSD count from 170 to 200. We did this over the weekend because it causes a massive shuffling of data within the cluster, which slows down normal user I/O.

What should we expect to happen?

All in all, 28.77% of the existing storage objects in the system had to be migrated, corresponding to about 106 Terabytes of raw data. Most of the data movement is from the 170 old towards the 30 new disks.

How long should this take? One can make some back-of-the-envelope calculations. In a perfect world, writing 106 Terabytes to 30 disks, each of which sustains a write rate of 170 MB/s, would take around 5.8 hours. In Ceph, every byte written to an OSD has to go through a persistent “journal”, which is implemented using an SSD (flash-based solid-state disk). Our systems have two SSDs, each of which sustains a write rate of about 520 MB/s. Taking this bottleneck into account, the lower bound increases to 9.5 hours.
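Spelled out, the two estimates are (the three new servers together contribute 30 data disks and 6 journal SSDs):

106 TB / (30 disks × 170 MB/s) ≈ 106,000,000 MB / 5,100 MB/s ≈ 20,800 s ≈ 5.8 hours
106 TB / (6 SSDs × 520 MB/s)  ≈ 106,000,000 MB / 3,120 MB/s ≈ 34,000 s ≈ 9.5 hours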

However, this is still a very theoretical number, because it fails to include many other bottlenecks and types of overhead: disk controller and bus capacity limitations, processing overhead, network delays, reading data from the old disks etc. But most importantly, the Ceph cluster is actively used, and performs other maintenance tasks such as scrubbing, all of which compete with the movement of data to the new disks.

What do we actually see?

Here is a graph that illustrates what happens after the 30 new disks (OSDs) are added:


The y axis is the disk usage (as per output of the df command). The thin grey lines—there are 170 of them—correspond to each of the old OSDs. The thin red lines correspond to the 30 new OSDs. The blue line is the average disk usage across the old OSDs, the green line the average of the new OSDs. At the end of the process, the blue and green lines should (roughly) meet.

So in practice, the process takes about 30 hours. To put this in perspective, that is still quite fast: it corresponds to a mean overall data-movement rate of about 1 GB/s or 8 Gbit/s. The green and blue lines show that the overall process moves data from the old to the new OSDs very steadily.

Looking at the individual line “bundles”, we see that the process is not all that homogeneous. First, even within the old line bundle, we see quite a bit of variation across the fill levels of the 170 disks. There is some variation at the outset, and it seems to get worse throughout the process. An interesting case is the lowest grey line—this is an OSD that has significantly less data than the others. I had hoped that the reshuffling would be an opportunity to make it approach the others (by shedding less data), but the opposite happened.

Anyway, a single under-utilized disk is not a big problem. Individual over-utilized disks are a problem, though. And we see that there is one OSD that has significantly higher occupancy. We can address this by explicit “reweighting” if and when this becomes a problem as the cluster fills up again. But then, we still have a couple of disk servers that we can add to the cluster over the coming months, to make sure that overall utilization remains in a comfortable range.


The graph above (“df_160+1+9+30”) has been created using Graphite. The base data was collected by CollectD’s standard “df” plugin. zhdk00{44,51,52} are the new OSD servers, the others are the pre-existing ones.

Zooming out a bit shows the previous small extension steps mentioned above. As you see, adding nine disks doesn’t take much longer than adding a single one.