SWITCH Cloud Blog



SWITCHdrive Over IPv6

When we built the SWITCHdrive service on the OpenStack platform that was to become SWITCHengines, that platform didn’t really support IPv6 yet. But since Spring 2016 it does. This week, we enabled IPv6 in SWITCHdrive and performed some internal tests. Today around noon, we published its IPv6 address (“AAAA record”) in the DNS. We quickly saw around 5% of accesses use IPv6 instead of IPv4.

[Screenshot: share of SWITCHdrive requests made over IPv6, 28 September 2017]

In the evening, this percentage climbed to about 14%. This shows the relatively good support for IPv6 on Swiss broadband (home) networks, notably by the good folks at Swisscom.

The lower percentage during office (and lecture, etc.) hours shows that the IPv6 roll-out to higher education campuses still has some way to go. Our SWITCHlan backbone has been running “dual-stack” (IPv4 and IPv6 in parallel) in production for more than 10 years, and most institutions have added IPv6 configuration to their connections to us. But campus networks are wonderfully complex, so getting IPv6 deployed to every network plug and every wireless access point is a daunting task. Some schools are almost there, including some large ones that don’t use SWITCHdrive—yet!?—so the 5% may underestimate the extent of the roll-out for the overall SWITCH community. The others will follow in their footsteps. They can count on the help of the community and benefit from IPv6 training courses organized by our colleagues in the security and network teams. Contact us if you need help!


IPv6 Address Assignment in OpenStack

In an inquiry “IPv6 and Liberty (or Mitaka)” on the openstack mailing list, Ken D’Ambrosio writes:
> Hey, all. I have a Liberty cloud, and decided for the heck of it to
> start dipping my toe into IPv6. I do have some confusion, however. I
> can choose between SLAAC, DHCPv6 stateful and DHCPv6 stateless — and
> I see some writeups on what they do, but I don’t understand what
> differentiates them. As far as I can tell, they all do pretty much
> the same thing, just with different pieces doing different things.
> E.g., the chart, found here
> (http://docs.openstack.org/liberty/networking-guide/adv-config-ipv6.html
> — page down a little) shows those three options, but it isn’t clear:
> * How to configure the elements involved
> * What they exactly do (e.g., “optional info”? What’s that?)
> * Why there even *are* different choices. Do they offer functionally
> different results?

SLAAC and DHCPv6-stateless use the same mechanism (SLAAC) to provide instances with IPv6 addresses. The only difference between them is that with DHCPv6-stateless, the instance can also use DHCPv6 requests to obtain information other than its own address, such as nameserver addresses. So between SLAAC and DHCPv6-stateless, I would always prefer DHCPv6-stateless—it’s a strict superset in terms of functionality, and I don’t see any particular risks associated with it.

DHCPv6-stateful is a different beast: It will use DHCPv6 to give an instance its IPv6 address. DHCPv6 actually fits OpenStack’s model better than SLAAC.

Why DHCPv6-Stateful Fits OpenStack Better

OpenStack (Nova) sees it as part of its job to control the IP address(es) that an instance uses. In IPv4 it uses DHCP (always did). DHCP assigns complete addresses—which are under control of OpenStack. In IPv6, stateful DHCPv6 would be the equivalent.

SLAAC is different in that the node (instance) actually chooses its address based on information it gets from the router. The most common method is that the node uses an “EUI-64” address as the local part (host ID) of the address. The EUI-64 is derived from the MAC address by a fixed algorithm. This can work with OpenStack because OpenStack controls the MAC addresses too, and can thus “guess” what IPv6 address an instance will auto-configure on a given network. You see how this is a little less straightforward than OpenStack simply telling the instance what IPv6 address it should use.
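As a purely hypothetical illustration (the MAC address and prefix below are invented; the steps are the standard “modified EUI-64” rules), the derivation works like this:

  MAC address of the port:             fa:16:3e:12:34:56
  1. split the MAC in the middle:      fa:16:3e | 12:34:56
  2. insert ff:fe between the halves:  fa:16:3e:ff:fe:12:34:56
  3. flip the universal/local bit:     fa becomes f8
  resulting host ID:                   f816:3eff:fe12:3456

Combined with an advertised prefix of, say, 2001:db8:1234:5678::/64, the instance would auto-configure the address 2001:db8:1234:5678:f816:3eff:fe12:3456.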

In practice, OpenStack’s guessing fails when an instance uses other methods to get the local part, for example “privacy addresses” according to RFC 4941. These will lead to conflicts with OpenStack’s built-in anti-spoofing filters. So such mechanisms need to be disabled when SLAAC is used under OpenStack (including under “DHCPv6-stateless”).

Why We Use SLAAC/DHCPv6-Stateless Anyway

Unfortunately, most GNU/Linux distributions don’t support Stateful DHCPv6 “out of the box” today.

Because we want our users to use unmodified operating systems images and still get usable IPv6, we have grudgingly decided to use DHCPv6-stateless. For configuration information, see SWITCHengines Under the Hood: Basic IPv6 Configuration.

If you decide to go for DHCPv6-stateful, then there’s a Web page that explains how to enable it client-side for a variety of GNU/Linux distributions.

It would be nice if all systems honored the “M” (Managed) flag in Router Advertisements and would use DHCPv6 if it is set, otherwise SLAAC.

[This is an edited version of my response, which I wasn’t sure I was allowed to post because I use GMANE to read the list. – SL]


Tuning Virtualized Network Node: multi-queue virtio-net

The infrastructure used by SWITCHengines is composed of about 100 servers. Each one uses two 10 Gb/s network ports. Ideally, a given instance (virtual machine) on SWITCHengines would be able to achieve 20 Gb/s of throughput when communicating with the rest of the Internet. In the real world, however, several bottlenecks limit this rate. We are working hard to address these bottlenecks and bring actual performance closer to the theoretical optimum.

An important bottleneck in our infrastructure is the network node for each region. All logical routers are implemented on this node, using Linux network namespaces and Open vSwitch (OVS). That means that all packets between the Internet and all the instances of the region need to pass through the node.

In our architecture, the OpenStack services run inside various virtual machines (“service VMs” or “pets”) on a dedicated set of redundant “provisioning” (or “prov”) servers. This is good for serviceability and reliability, but has some overhead—especially for I/O-intensive tasks such as network packet forwarding. Our network node is one of those service VMs.

In the original configuration, a single instance would never get more than about 2 Gb/s of throughput to the Internet when measured with iperf. What’s worse, the aggregate Internet throughput for multiple VMs was not much higher, which meant that a single high-traffic VM could easily “starve” all other VMs.

We had investigated many options for improving this situation: DVR, multiple network nodes, SR-IOV, DPDK, moving work to switches etc. But each of these methods has its drawbacks such as additional complexity (and thus potential for new and exciting bugs and hard-to-debug failure modes), lock-in, and in some cases, loss of features like IPv6 support. So we stayed with our inefficient but simple configuration that has worked very reliably for us so far.

Multithreading to the Rescue!

Our network node is a VM with multiple logical CPUs. But when running e.g. “top” during high network load, we noticed that only one (virtual) core was busy forwarding packets. So we started looking for a way to distribute the work over several cores. We found that we could achieve this by enabling three things:

Multi-queue virtio-net interfaces

Our service nodes run under libvirt/Qemu/KVM and use virtio-net network devices. These interfaces can be configured to expose multiple queues. Here is an example of an interface definition in libvirt XML syntax which has been configured for eight queues:

 <interface type='bridge'>
   <mac address='52:54:00:e0:e1:15'/>
   <source bridge='br11'/>
   <model type='virtio'/>
   <driver name='vhost' queues='8'/>
   <virtualport type='openvswitch'/>
 </interface>

A good rule of thumb is to set the number of queues to the number of (virtual) CPU cores of the system.

Multi-threaded forwarding in the network node VM

Within the VM, kernel threads need to be allocated to the interface queues. This can be achieved using ethtool -L:

ethtool -L eth3 combined 8

This should be done during interface initialization, for example in a “pre-up” action in /etc/network/interfaces. It also seems to be possible to change this configuration on a running interface without disruption.
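A minimal sketch of such a stanza (the interface name and queue count are examples, not necessarily our exact production values):

auto eth3
iface eth3 inet manual
    # spread the queues of this interface over 8 kernel threads
    pre-up ethtool -L $IFACE combined 8

Inside the VM, ethtool -l eth3 (lower-case l) can be used to check the current and maximum number of combined channels.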

Recent version of the Open vSwitch datapath

Much of the packet forwarding on the network node is performed by OVS. Its “datapath” portion is integrated into the Linux kernel. Our systems normally run Ubuntu 14.04, which includes the Linux 3.13 kernel. The OVS kernel module we use is not the one shipped with that kernel; it is installed separately from the openvswitch-datapath-dkms package, which corresponds to the relatively old OVS version 2.0.2. Although the OVS kernel datapath is supposed to have been multi-threaded since forever, we found that in our setup, upgrading to a newer kernel is vital for getting good (OVS) network performance.

The current Ubuntu 16.04.1 LTS release includes a fairly new Linux kernel based on 4.4. That kernel also has the OVS datapath module included by default, so that the separate DKMS package is no longer necessary. Unfortunately we cannot upgrade to Ubuntu 16.04 because that would imply upgrading all OpenStack packages to OpenStack “Mitaka”, and we aren’t quite ready for that. But thankfully, Canonical makes newer kernel packages available for older Ubuntu releases as part of their “hardware enablement” effort, so it turns out to be very easy to upgrade 14.04 to the same new kernel:

sudo apt-get install -y --install-recommends linux-generic-lts-xenial

And after a reboot, the network node should be running a fresh Linux 4.4 kernel with the OVS 2.5 datapath code inside.
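A quick sanity check after the reboot could look like this (version strings will of course vary):

uname -r            # should now report a 4.4.x "xenial" HWE kernel
modinfo openvswitch # "filename:" should point into /lib/modules/4.4...,
                    # i.e. the in-tree module rather than the old DKMS build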

Results

A simple test is to run multiple netperf TCP_STREAM tests in parallel from a single bare-metal host to six VMs running on separate nova-compute nodes behind the network node.

Each run consists of six netperf TCP_STREAM measurements started in parallel, whose throughput values are added together. Each figure is the average over ten consecutive runs with identical configuration.

The network node VM is set up with 8 vCPUs, and the two interfaces that carry traffic are configured with 8 queues each. We vary the number of queues that are actually used using ethtool -L iface combined n. (Note that even the 1-queue case does not exactly correspond to the original situation; but it’s the closest approximation that we had time to test.)
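A single run is roughly equivalent to the following sketch (the hostnames are placeholders; our actual test harness differed in the details):

#!/bin/sh
# start six 60-second TCP_STREAM tests in parallel and wait for all of them;
# the six reported throughput values are then added up to give one data point
for vm in vm1 vm2 vm3 vm4 vm5 vm6; do
    netperf -H "$vm" -t TCP_STREAM -l 60 &
done
wait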

Network node running 3.13.0-95-generic kernel

1 queue:  3.28 Gb/s
2 queues: 3.41 Gb/s
4 queues: 3.51 Gb/s
8 queues: 3.57 Gb/s

Making use of multiple queues gives very little benefit.

Network node running 4.4.0-36-generic kernel

1 queue:  3.23 Gb/s
2 queues: 6.00 Gb/s
4 queues: 8.02 Gb/s
8 queues: 8.42 Gb/s (8.75 Gb/s with 12 target VMs)

Here we see that performance scales up nicely with multiple queues.

The maximum possible throughput in our setup is lower than 10 Gb/s, because the network node VM uses a single physical 10GE interface for both sides of traffic. And traffic between the network node and the hypervisors is sent encapsulated in VXLAN, which has some overhead.

Outlook

Now we know how to enable multi-core networking for hand-configured service VMs (“pets”) such as our network node. But what about the VMs under OpenStack’s control?

Starting in Liberty, Nova supports multi-queue virtio-net. Our benchmarking cluster was still running Kilo, so we could not test that yet. But stay tuned!
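As far as we understand (and again, we have not tested this ourselves yet), Liberty exposes the feature through an image property, so enabling it would look roughly like this (the image ID is a placeholder):

glance image-update --property hw_vif_multiqueue_enabled=true <image-uuid>
# instances booted from this image still need the queues activated in the
# guest, e.g. with: ethtool -L eth0 combined <number of vCPUs>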


SWITCHengines Under the Hood: Basic IPv6 Configuration

My last post, IPv6 Finally Arriving on SWITCHengines, described what users of our IaaS offering can expect from our newly introduced IPv6 support: Instances using the shared default network (“private”) will get publicly routable IPv6 addresses.

This post explains how we set this up, and why we decided to go this route.  We hope that this is interesting to curious SWITCHengines users, and useful for other operators of OpenStack infrastructure.

Before IPv6: Neutron, Tenant Networks and Floating IP

[Feel free to skip this section if you are familiar with Tenant Networks and Floating IPs.]

SWITCHengines uses Neutron, the current generation of OpenStack Networking.  Neutron supports user-definable networks, routers and additional “service” functions such as Load Balancers or VPN gateways.  In principle, every user can build her own network (or several) isolated from the other tenants.  There is a default network accessible to all tenants.  It is called private, which I find quite confusing because it is totally not private, but shared between all the tenants.  But it has a range of private (in the sense of RFC 1918) IPv4 addresses—a subnet in OpenStack terminology—that is used to assign “fixed” addresses to instances.

There is another network called public, which provides global connectivity.  Users cannot connect instances to it directly, but they can use Neutron routers (which include NAT functionality) to route between the private (RFC 1918) addresses of their instances on tenant networks (whether the shared “private” or their own) and the public network, and by extension, the Internet.  By default, they get “outbound-only” connectivity using 1:N NAT, like users behind a typical broadband router.  But they can also request a Floating IP, which can be associated with a particular instance port.  In this case, a 1:1 NAT provides both outbound and inbound connectivity to the instance.

The router between the shared private network and the external public network was provisioned by us; it is called private-router.  Users who build their own tenant networks and want to connect them with the outside world need to set up their own routers.

This is a fairly standard setup for OpenStack installations, although some operators, especially in the “public cloud” business, forgo private addresses and NAT, and let customers connect their VMs directly to a network with publicly routable addresses.  (Sometimes I wish we’d have done that when we built SWITCHengines—but IPv4 address conservation arguments were strong in our minds at the time.  Now it seems hard to move to such a model for IPv4.  But let’s assume that IPv6 will eventually displace IPv4, so this will become moot.)

Adding IPv6: Subnet, router port, return route—that’s it

So at the outset, we have

  • a shared internal network called private
  • a provider network with Internet connectivity called public
  • a shared router between private and public called private-router

We use the “Kilo” (2015.1) version of OpenStack.

As another requirement, the “real” network underlying the public network (in our case a VLAN) needs connectivity to the IPv6 Internet.

Create an IPv6 Subnet with the appropriate options

And of course we need a fresh range of IPv6 addresses that we can route on the Internet.  A single /64 will be sufficient.  We use this to define a new Subnet in Neutron:

neutron subnet-create --ip-version 6 --name private-ipv6 \
  --ipv6-ra-mode dhcpv6-stateless --ipv6-address-mode dhcpv6-stateless \
  private 2001:620:5ca1:80f0::/64

Note that we use dhcpv6-stateless for both ra-mode and address-mode.  This will actually use SLAAC (stateless address autoconfiguration) and router advertisements to configure IPv6 on the instance.  Stateless DHCPv6 could be used to convey information such as name server addresses, but I don’t think we’re actively using that now.

We should now see a radvd process running in an appropriate namespace on the network node.  And instances—both new and pre-existing!—will start to get IPv6 addresses if they are configured to use SLAAC, as is the default for most modern OSes.
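Two quick checks (sketches only; names and addresses will differ in your deployment):

# on the network node: the L3 agent should have spawned radvd for the router
ps aux | grep radvd

# inside an instance on the shared "private" network:
ip -6 addr show dev eth0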

Create a new port on the shared router to connect the IPv6 Subnet

Next, we need to add a port to the shared private-router that connects this new subnet with the outside world via the public network:

neutron router-interface-add private-router private-ipv6

Configure a return route on each upstream router

Now the outside world also needs a route back to our IPv6 subnet.  The subnet is already part of a larger aggregate that is routed toward our two upstream routers.  It is sufficient to add a static route for our subnet on each of them.  But where do we point that route to, i.e. what should be the “next hop”? We use the link-local address of the external (gateway) port of our Neutron router, which we can find out by looking inside the namespace for the router on the network node.  Our router private-router has UUID 2b8d1b4f-1df1-476a-ab77-f69bb0db3a59.  So we can run the following command on the network node:

$ sudo ip netns exec qrouter-2b8d1b4f-1df1-476a-ab77-f69bb0db3a59 ip -6 addr list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536
 inet6 ::1/128 scope host
 valid_lft forever preferred_lft forever
55: qg-2d73d3fb-f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
 inet6 2001:620:5ca1:80fd:f816:3eff:fe00:30d7/64 scope global dynamic
 valid_lft 2591876sec preferred_lft 604676sec
 inet6 fe80::f816:3eff:fe00:30d7/64 scope link
 valid_lft forever preferred_lft forever
93: qr-02b9a67d-24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
 inet6 2001:620:5ca1:80f0::1/64 scope global
 valid_lft forever preferred_lft forever
 inet6 fe80::f816:3eff:fe7d:755b/64 scope link
 valid_lft forever preferred_lft forever
98: qr-6aaf629f-19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
 inet6 fe80::f816:3eff:feb6:85f4/64 scope link
 valid_lft forever preferred_lft forever

The port we’re looking for is the one whose name starts with qg-, i.e. the gateway port towards the public network.  The address we’re looking for is the one starting with fe80:, the link-local address.
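The same information can be extracted a bit more directly by restricting the listing to the gateway port and to link-local scope:

sudo ip netns exec qrouter-2b8d1b4f-1df1-476a-ab77-f69bb0db3a59 \
  ip -6 addr show dev qg-2d73d3fb-f3 scope link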

The “internal” subnet has address 2001:620:5ca1:80f0::/64, and VLAN 908 (ve908 in router-ese) is the VLAN that connects our network node to the upstream router.  So this is what we configure on each of our routers using the “industry-standard CLI”:

ipv6 route 2001:620:5ca1:80f0::/64 ve 908 fe80::f816:3eff:fe00:30d7

And we’re done! IPv6 packets can flow between instances on our private network and the Internet.

Coming up

Of course this is not the end of the story.  While our customers were mostly happy that they suddenly got IPv6, there are a few surprises that came up.  In a future episode, we’ll tell you more about them and how they can be addressed.


IPv6 Finally Arriving on SWITCHengines

As you may have heard or noticed, the Internet is running out of addresses. It’s time to upgrade from the 35-year-old IPv4 protocol, which doesn’t even have a single public address per human on Earth, to the brand new (?) IPv6, which offers enough addresses for every grain of sand in the known universe, or something like that.

SWITCH is a pioneer in IPv6 adoption, and has been supporting IPv6 on all network connections and most services in parallel with IPv4 (“dual stack”) for many years.

To our embarrassment, we hadn’t been able to integrate IPv6 support into SWITCHengines from the start. While OpenStack had some IPv6 support, the implementation wasn’t mature, and we didn’t know how to fit it into our network model in a user-friendly way.

IPv6: “On by default” and globally routable

About a month ago we took a big step to change this: IPv6 is now enabled by default for all instances on the shared internal network (“private”).  So if you have an instance running on SWITCHengines, and it isn’t connected to a tenant network of your own, then the instance probably has an IPv6 address right now, in addition to the IPv4 address(es) it always had.  Note that this is true even for instances that were created or last rebooted before we turned on IPv6. On Linux-derived systems you can check using ifconfig eth0 or ip -6 addr list dev eth0; if you see an address that starts with 2001:620:5ca1:, then your instance can speak IPv6.

Note that these IPv6 addresses are “globally unique” and routable, i.e. they are recognized by the general Internet.  In contrast, the IPv4 addresses on the default network are “private” and can only be used locally inside the cloud; communication with the general Internet requires Network Address Translation (NAT).

What you can do with an IPv6 address

Your instance will now be able to talk to other Internet hosts over IPv6. For example, try ping6 mirror.switch.ch or traceroute6 www.facebook.com. This works just like IPv4, except that so far only a subset of hosts on the Internet speak IPv6. Fortunately, this subset already includes important services and is growing.  Because IPv6 doesn’t need NAT, routing between your instances and the Internet is less resource-intensive and a tiny bit faster than with IPv4.

But you will also be able to accept connections from other Internet hosts over IPv6. This is different from before: To accept connections over IPv4, you need(ed) a separate public address, a Floating IP in OpenStack terminology.  So if you can get by with IPv6, for example because you only need (SSH or other) access from hosts that have IPv6, then you don’t need to reserve a Floating IP anymore.  This saves you not just work but also money—public IPv4 addresses are scarce, so we need to charge a small “rent” for each Floating IP reserved.  IPv6 addresses are plentiful, so we don’t charge for them.

But isn’t this dangerous?

Instances are now globally reachable by default, but they are still protected by OpenStack’s Security Groups (corresponding to packet filters or access control lists).  The default Security Group only allows outbound connections: Your instance can connect to servers elsewhere, but attempts to connect to your instance will be blocked.  You have probably opened some ports such as TCP port 22 (for SSH) or 80 or 443 (for HTTP/HTTPS) by adding corresponding rules to your own Security Groups.  In these rules, you need to specify address “prefixes” specifying where you want to accept traffic from.  These prefixes can be IPv4 or IPv6—if you want to accept both, you need two rules.

If you want to accept traffic from anywhere, your rules will contain 0.0.0.0/0 as the prefix. To accept IPv6 traffic as well, simply add identical rules with ::/0 as the prefix instead—this is the IPv6 version of the “global” prefix.
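For example, a rule that accepts SSH over IPv6 from anywhere could be added with the neutron CLI roughly like this (the Security Group name default is just an example; use whichever group your instance actually has):

neutron security-group-rule-create --direction ingress --ethertype IPv6 \
  --protocol tcp --port-range-min 22 --port-range-max 22 \
  --remote-ip-prefix ::/0 default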

What about domain names?

These IPv6 addresses can be entered in the DNS using “AAAA” records. For Floating IPs, we provided pre-registered hostnames of the form fl-34-56.zhdk.cloud.switch.ch. We cannot do that in IPv6, because there are just too many possible addresses. If you require your IPv6 address to map back to a hostname, please let us know and we can add it manually.
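For the curious, such a manual entry in a DNS zone file would look roughly like this (the hostname and the host part of the address are made up):

myserver.example.org.   IN  AAAA  2001:620:5ca1:80f0:f816:3eff:fe12:3456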

OpenStack will learn how to (optionally) register such hostnames in the DNS automatically; but that feature was only added to the latest release (“Mitaka”), and it will be several months before we can deploy this in SWITCHengines.

Upcoming

We would like to also offer IPv6 connectivity to user-created “tenant networks”. Our version of OpenStack almost supports this, but it cannot be fully automated yet. If you need IPv6 on your non-shared network right now, please let us know via the normal support channel, and we’ll set something up manually. But eventually (hopefully soon), getting a globally routable IPv6 prefix for your network should be (almost) as easy as getting a globally routable Floating IP is now.

You can also expect services running on SWITCHengines (SWITCHdrive, SWITCHfilesender and more) to become exposed over IPv6 over the next couple of months. Stay tuned!


Backporting the Ceph RBD Object Map Feature to OpenStack Juno

How we use Ceph at SWITCHengines

Virtual machine storage in the OpenStack public cloud SWITCHengines is provided by Ceph. We run a Ceph cluster in each OpenStack region. The compute nodes do not have any local storage resources; the virtual machines access their disks directly over the network, because libvirt can act as a Ceph client.

Using Ceph as the default storage for Glance images, Nova ephemeral disks and Cinder volumes is a very convenient choice. We are able to scale the storage capacity as needed, regardless of the disk capacity on the compute nodes. It is also easier to live-migrate Nova instances between compute nodes, because the virtual machine disks are not local to a specific compute node and do not need to be migrated along with the instance.

The performance problem

The load on our Ceph cluster increases constantly, because more and more virtual machines are running every day. In October 2015 we noticed that deleting Cinder volumes had become a very slow operation, and the bigger the volume, the longer you had to wait. Moreover, users orchestrating Heat stacks faced real performance problems when deleting several disks at once.

To identify where the bottleneck originated, we measured how long it took to create and delete RBD volumes directly with the rbd command-line client, bypassing the Cinder code completely.

The commands to do this test are simple:

time rbd -p volumes create testname --size 1024 --image-format 2
rbd -p volumes info testname
time rbd -p volumes rm testname

We quickly figured out that it was Ceph itself that was slow to delete RBD volumes. The problem was well known and had already been fixed in the Ceph Hammer release, which introduced a new feature: the object map.

When the object map feature is enabled on an image, limiting the diff to the object extents will dramatically improve performance since the differences can be computed by examining the in-memory object map instead of querying RADOS for each object within the image.

http://docs.ceph.com/docs/master/man/8/rbd/

In our practical experience, the time to delete an image decreased from several minutes to a few seconds.

How to fix your OpenStack Juno installation

We changed ceph.conf to enable the object map feature, as described very well in Sébastien Han’s blog post.

It really was as simple as adding the following two lines to ceph.conf:

rbd default format = 2
rbd default features = 13
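In case the value 13 looks like a magic number: it is simply the sum of the RBD feature bits being enabled (object map requires the exclusive-lock feature, hence the 4), so the second line could equivalently be annotated as follows:

# layering (1) + exclusive-lock (4) + object-map (8) = 13
rbd default features = 13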

With these two lines in place, we could immediately create new images with the object map enabled, as you can see in the following rbd info output:

rbd image 'volume-<uuid>':
    size 20480 MB in 2560 objects
    order 23 (8192 kB objects)
    block_name_prefix: rbd_data.<prefix>
    format: 2
    features: layering, exclusive, object map
    flags:
    parent: images/<uuid>@snap
    overlap: 1549 MB

We were happy that the fix was so easy. However, we soon realized that while everything worked with the rbd command line, all the OpenStack components were ignoring the new options in the ceph.conf file.

We started our investigation with Cinder. We learned that Cinder does not call the rbd command-line client at all; it relies on the rbd Python library. The Juno implementation of Cinder did not know about these extra features, so it simply ignored our changes to ceph.conf. Support for the object map feature was only introduced in Kilo, in commit 6211d8.

To quickly fix the performance problem before upgrading to Kilo, we decided to backport this patch to Juno. We already carry other small local patches in our infrastructure, so adding yet another patch and building a new .deb package was part of our standard procedure. After backporting the patch, Cinder started to create volumes correctly, honoring the options in ceph.conf.

Patching Cinder fixed the problem only for Cinder volumes. Virtual machines started from ephemeral disks run on Ceph RBD images created by Nova, and the Glance images uploaded by users are stored in Ceph RBD volumes by Glance, which relies on the glance_store library.

In the end, we had to patch three OpenStack projects to completely backport the ability to use the Ceph object map feature to Juno. Here we publish the links to the git branches and packages for nova, glance_store and cinder.

Conclusion

Upgrading every six months to keep the production infrastructure on the current OpenStack release is challenging. Upgrading without downtime requires a lot of testing, and it is easy to fall behind schedule. For this reason, most OpenStack installations today run Juno or Kilo.

We release these patches for all those who are running Juno, because the performance benefit is stunning. However, we strongly advise planning an upgrade to Kilo as soon as possible.