SWITCH Cloud Blog


Leave a comment

SWITCHdrive Over IPv6

When we built the SWITCHdrive service on the OpenStack platform that was to become SWITCHengines, that platform didn’t really support IPv6 yet. But since Spring 2016 it does. This week, we enabled IPv6 in SWITCHdrive and performed some internal tests. Today around noon, we published its IPv6 address (“AAAA record”) in the DNS. We quickly saw around 5% of accesses use IPv6 instead of IPv4.

Screen Shot 2017-09-28 at 22.10.45

In the evening, this percentage climbed to about 14%. This shows the relatively good support for IPv6 on Swiss broadband (home) networks, notably by the good folks at Swisscom.

The lower percentage during office (and lecture, etc.) hours shows that the IPv6 roll-out to higher education campuses still has some way to go. Our SWITCHlan backbone has been running “dual-stack” (IPv4 and IPv6 in parallel) in production for more than 10 years, and most institutions have added IPv6 configuration to their connections to us. But campus networks are wonderfully complex, so getting IPv6 deployed to every network plug and every wireless access point is a daunting task. Some schools are almost there, including some large ones that don’t use SWITCHdrive—yet!?—so the 5% may underestimate the extent of the roll-out for the overall SWITCH community. The others will follow in their footsteps. They can count on the help of the community and benefit from IPv6 training courses organized by our colleagues in the security and network teams. Contact us if you need help!


6 Comments

Starting 1000 instances on SWITCHengines

Is it really possible with Openstack to start 1000 instances, make a parallel computation, and then save the data and delete the instances ?
To answer this question we tested it on SWITCHengines. I had a lot of troubles getting this work, and I have to thank other Openstack Operators I been chatting with: Mattia Belluco, Matteo Panella and Anton Aksola.
Our Openstack control plane is deployed with a dedicated pet VM for each Openstack service (Nova, Cinder, Neutron, Glance and Keystone) and a generic controller VM where we run the mysql and the rabbitmq services. This configuration makes possible to monitor each Openstack service as an isolated VM, and it makes easier for us to identify bottlenecks in the control plane.
For this experiment we never used the web interface, but the Openstack CLI with this reference command line.

openstack server create \
--image "Ubuntu Xenial 16.04 (SWITCHengines)" \
--flavor c1.small \
--network demo-network \
--user-data cloud-init.txt \
--key-name mykey \
--min 100 \
--max 100 test

The c1.small flavor has just 1 CPU core and 1GB of RAM.

We did the experiment in 4 steps, trying with 100, 200, 400 and 1000 instances. To make sure that the instances were really started and operational, we used cloud-init to make them phone home to a registration server. This is a very easy cloud-init feature to use, here is an example cloud-init.txt file:


#cloud-config
phone_home:
url: http://x.x.x.x:8000/$INSTANCE_ID/
post: [ hostname, fqdn ]

In this github gist we share the python code to run the registration service.

The first test with 100 instances did not work. We tried a few runs and we always had a minimum of 4 to a maximum of 7 instances that did not start for various reasons. Monitoring our control plane we noticed that we were saturating the CPUs and memory of the nova and neutron pets.
We increased the resources for both the nova and the neutron pets from 4 to 16 CPU cores and we doubled the memory from 8 GB to 16 GB.
After these changes we were able to start 100 instances without problems. We noticed that the neutron pet had an higher load than the nova pet during the process of creating 100 instances.

When we tried with 200 instances, those were all reported as Running by Openstack but we always had a minimum of 8 to a maximum of 20 instances not phoning home. Looking at the serial console with the command:

openstack console log show

we noticed that these instances were not able to get an IP address from the DHCP server, and the DHCP client would give up after 300 seconds. Using the hint that the neutron pet was more loaded than the nova pet, we found out that the nova instances reached the RUNNING state while the corresponding neutron ports were still in the BUILDING phase.
Thinking of a race condition between nova instances and neutron ports, I asked on the Openstack Developers mailing list, and it turned out that we had a wrong configuration.

We changed our nova.conf as follows:

vif_plugging_is_fatal=True
vif_plugging_timeout=300

After fixing the configuration we had the same result, but instead of the instances starting and not being able to obtain an IP address, they never started and were reported in ERROR state by Openstack.
The real challenge was not to schedule 200 instances, but to allocate 200 network ports.
Troubleshooting in this direction we observed that the rabbitmq queues of the neutron dhcp agents were filling up during the ports creation. For each created port the dhcp agent had to add a corresponding line to the file /var/lib/neutron/dhcp/$UUID/host. Where $UUID is the corresponding Neutron network UUID.

We looked into the detail of what happens when a neutron port is created. Using the guru meditation report we traced down the culprit in a slow “ip route list” call.
This command is called everytime a neutron port is created:

time sudo ip netns exec qdhcp-7a1cfb7f-2960-45f5-903f-0d602450525a ip route list
default via 10.10.0.1 dev tapaf136b11-a5
10.10.0.0/16 dev tapaf136b11-a5 proto kernel scope link src 10.10.0.2
real 0m0.048s
user 0m0.000s
sys 0m0.016s

However calling the same command within neutron-rootwrap takes about 10 times longer:

time sudo neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec qdhcp-7a1cfb7f-2960-45f5-903f-0d602450525a ip route list dev tapaf136b11-a5
default via 10.10.0.1
10.10.0.0/16 proto kernel scope link src 10.10.0.2
real 0m0.713s
user 0m0.472s
sys 0m0.172s

Once we identified this bottleneck, we changed the configuration again to enable the Openstack rootwrap to work in daemon mode.
We had to change the agent section of neutron.conf

[agent]
root_helper = sudo neutron-rootwrap /etc/neutron/rootwrap.conf
root_helper_daemon=sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf

After this change, we were able to start successfully 200, 400 and 1000 instances.
With 1000 instances we still get a HTTP 504 gateway timeout error.
This is because the nova-api server takes longer than the reverse proxy timeout to answer the request. The reverse proxy replies with HTTP 504 but the nova-api server will later finish to process the request with a HTTP 200. This is easily fixed using a longer timeout, but we plan to trace the problem in detail to shorten the processing time of the request.

Finally the answer is yes, with Openstack it is really possible to start 1000 instances quickly to have compute power just when needed.


Leave a comment

Temporary elevated reachability (aka – security issue on some VMs)

TL;DR: A small percentage of VMs running in the Zurich region of SWITCHengines weren’t protected by the default firewall from 11.8 to 18.8.2017. The root problem has been fixed. We have implemented additional measures to prevent this from happening again.
Thanks to one of our users, we recently were made aware of a security problem that has affected a small percentage of our customers virtual machines. While it looked like the machines were protected by the standard OpenStack firewall rules, in effect those machines were completely open to the Internet. We were able to fix the problem within a few hours. Our investigations showed that the problem was existent for roughly one week (starting 11.8.2018, ending 18.8.2017) due to a mismatch between the software deployed on 5 specific hypervisors and the configuration applied to them.
If you were affected by this problem, you already have received an email from us. If you didn’t receive anything, your VMs were secure.

Technical background

Each VM running on SWITCHengines is completely isolated from the Internet. VMs run on a number of hypervisors (physical servers) that share an internal physical network (The hypervisors also are isolated from the Internet and can’t be reached from the outside). In order for a VM to reach the Internet, it has to use a software defined virtual network that connects it to one of our “Network Nodes”. These network nodes are special virtual machines that are have both an address on our private network and and address on the Internet – and can thus bridge between virtual machines and the Internet. They are using a technique called NAT (Network Address Translation) to provide external access to the VMs and vice versa. The software component running on the network nodes is called Neutron – it’s a part of the OpenStack project.
On each hypervisor, another part of Neutron runs. This part controls the connectivity and the security access to the virtual machines running on the hypervisor. The neutron component on the hypervisor is responsible for managing the virtual networks and the access from the VMs to this virtual network. In addition it also controls the Firewall component (we use `iptables` a standard component of the Linux operating system running on the hypervisors). By default, each VM running on the hypervisor is protected by strict security rules that disallow access to any ports on the VM. 
A user can configure these security groups by adding rules through the SWITCHengines GUI. When a security rule is modified, the Neutron Server sends a command to the Neutron component on the Hypervisor that in turn adds the relevant rules to the `iptables` configuration on that specific hypervisor.

Cause of loss of Firewall functionality

Besides upgrading the OpenStack software regularly (every 6 months) we also maintain and upgrade the operating system on the hypervisors. The last months we have been busy upgrading the hypervisors from Ubuntu Trusty (14.04) to Ubuntu Xenial (16.04). Upgrading a server takes a long time, because we live migrate all running VMs from that server to another, then upgrade the server OS (together with upgrades of any installed packages) and then take the server back into production – i.e. move VMs to it. This process has been ongoing for the last 3 months and we will be finished in September 2017.
The 5 hypervisors that were affected by the problem, were the first ones to be upgraded to Ubuntu Xenial and the OpenStack Newton components. Because they were upgraded early, they had an older version of OpenStack Newton installed. There was a bug in the Neutron component of that OpenStack release – however, that bug didn’t surface at first.
On the 11. August, we did routine configuration changes to all hypervisors running on the Newton release. This config change went well on all recent installed hypervisors, but caused the Firewall rules to be dropped on the older machines.
When we were made aware of the problem and upgraded the Newton component, the firewall rules were recreated and the VMs protected. 

Remedies

We have identified two fundamental problems through this incident:
  • Some servers have different software versions of the same components due to them being installed at different times
  • We didn’t detect the lack of firewall rules for the affected VMs
To address the first problem, we are being more strict about specifying the exact release of each software component that we install on all our servers. We strive to have identical installations everywhere.
To address the second problem, we have written a script that checks the firewall status for all running virtual machines. We will incorporate this script into our regular monitoring and testing so that we will be alerted about that problem automatically, should it happen again.
We take the security of SWITCHengines serious and we are sorry that we left some of our customers VMs unprotected. Thanks to the people reporting the problem and thank you for your understanding. We are sorry for any problems this might have caused you.
Jens-Christian Fischer
Product Owner SWITCHengines


Deploy Kubernetes on the SWITCHengines Openstack cloud

Increasing demand for container orchestration tools is coming from our users. Kubernetes has currently a lot of hype, and often it comes the question if we are providing a Kubernetes cluster at SWITCH.

At the moment we suggest that our users deploy their own Kubernetes cluster on top of SWITCHengines. To make sure our Openstack deployment works with this solution we tried ourself.

After deploying manually with kubeadm to learn the tool, I found a well written ansible playbook from Francois Deppierraz. I extended the playbook to make Kubernetes aware that SWITCHengines implements the LBaaSv2, and the patch is now merged in the original version.

The first problem I discovered deploying Kubernetes is the total lack of support for IPv6. Because instances in SWITCHengines get IPv6 addresses by default, I run into problems running the playbook and nothing was working. The first thing you should do is to create your own tenant network with a router, with IPv4 only connectivity. This is already explained in detail in our standard documentation.

Now we are ready to clone the ansible playbook:

git clone https://github.com/infraly/k8s-on-openstack

Because the ansible playbook creates instances through the Openstack API, you will have to source your Openstack configuration file. We extend a little bit the usual configuration file with more variables that are specific to this ansible playbook. Lets see a template:

export OS_USERNAME=username
export OS_PASSWORD=mypassword
export OS_PROJECT_NAME=myproject
export OS_PROJECT_ID=myproject_uuid
export OS_AUTH_URL=https://keystone.cloud.switch.ch:5000/v2.0
export OS_REGION_NAME=ZH
export KEY=keyname
export IMAGE="Ubuntu Xenial 16.04 (SWITCHengines)"
export NETWORK=k8s
export SUBNET_UUID=subnet_uuid
export FLOATING_IP_NETWORK_UUID=network_uuid

Lets review what changes. It is important to add also the variable OS_PROJECT_ID because the Kubernetes code that creates Load Balancers requires this value, and it is not able to extract it from the project name. To find the uuid just use the Openstack cli:

openstack project show myprojectname -f value -c id

The KEY is the name of an existing keypair that will be used to start the instances. The IMAGE is also self explicative, at the moment only Xenial is tested by me. The variable NETWORK is the name of the tenant network you created earlier. When you created a network you created also a subnet, and you need to set the uuid into SUBNET_UUID. The last variable is FLOATING_IP_NETWORK_UUID that tells kubernetes the network where to get floating IPs. In SWITCHengines this network is always called public, so you can extract the uuid like this:

openstack network show public -f value -c id

You can customize your configuration even more, reading the README file in the git repository you will find more options like the flavors to use or the cluster size. When your configuration file is ready you can run the playbook:

source /path/to/config_file
cd k8s-on-openstack
ansible-playbook site.yaml

It will take a few minutes to go through all the tasks. When everything is done you can ssh into the kubernetes master instance and check that everything is running as expected:

ubuntu@k8s-master:~$ kubectl get nodes
NAME         STATUS    AGE       VERSION
k8s-1        Ready     2d        v1.6.2
k8s-2        Ready     2d        v1.6.2
k8s-3        Ready     2d        v1.6.2
k8s-master   Ready     2d        v1.6.2

I found very useful adding bash completion for kubectl:

source <(kubectl completion bash)

Lets deploy an instance of nginx to test if everything works:

kubectl run my-nginx --image=nginx --replicas=2 --port=80

This will create two containers with nginx. You can monitor the progress with the commands:

kubectl get pods
kubectl get events

At this stage you have your containers running but the service is still not accessible from the outside. One option is to use the Openstack LBaaS to expose it, you can do it with this command:

kubectl expose deployment my-nginx --port=80 --type=LoadBalancer

The expose command will create the Openstack Load Balancer and will configure it. To know the public floating ip address you can use this command to describe the service:

ubuntu@k8s-master:~$ kubectl describe service my-nginx
Name:			my-nginx
Namespace:		default
Labels:			run=my-nginx
Annotations:		
Selector:		run=my-nginx
Type:			LoadBalancer
IP:			10.109.12.171
LoadBalancer Ingress:	10.8.10.15, 86.119.34.151
Port:				80/TCP
NodePort:			30620/TCP
Endpoints:		10.40.0.1:80,10.43.0.1:80
Session Affinity:	None
Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----			-------------	--------	------			-------
  1m		1m		1	service-controller			Normal		CreatingLoadBalancer	Creating load balancer
  10s		10s		1	service-controller			Normal		CreatedLoadBalancer	Created load balancer

Conclusion

Following this blog post you should be able to deploy Kubernetes on Openstack to understand how things work. For a real deployment you might want to make some customisations, we encourage you to share any patch to the ansible playbook with github pull requests.
Please note that Kubernetes is not bug free. When you will delete your deployment you might find this bug where Kubernetes is not able to delete correctly the load balancer. Hopefully this is fixed by the time you read this blog post.


Hosting and computing public scientific datasets in the cloud

SWITCH offers a cloud computing service called SWITCHengines, using OpenStack and Ceph. These computing resources are targeted to research usage, where the demand for Big Data (Hadoop, Spark) workloads is increasing. Big Data Analysis requires access to large datasets. In several use cases the analysis is done against public available datasets.  To add value to the cloud computing service our plan is to provide a place for researchers to store their data sets and make them available to others. A possible alternative at the moment is Amazon EC2, because many datasets are available for free in Amazon S3 when the computation is done within the Amazon infrastructure. Storing as many datasets as Amazon does is challenging because the real data utilisation in a object storage system is usually 3 times the real data. Each dataset has a size of 100s of TB, so storing a few of them with a 3 times replica factor means working in the PB domain.

Technical challenges

While it sounds a easy task to share some scientific data in a cloud datacenter, there are some problems you will have to address.

1) The size of the datasets is challenging. It will cost money to host the datasets. To make the service cost sustainable you must find reduced redundancy solutions, that have a reasonable cost and a reasonable risk of data loss. This is acceptable because there datasets are public and most likely there are going to be other providers hosting a copy, so in case of disaster data recovery will be possible. The key idea is to store the dataset locally to speed up access to the data during computation, not to store the dataset for persistent storage. Before we can host a dataset, we have to download it from another source. Also this first copy operation is challenging because of the size of the data involved.

2) We need a proper ACL system to control who can access the data. We can’t just publish all the datasets with public access. Some datasets require the user to sign an End User Agreement.

How we store the data

We use Ceph in our cloud datacenter as storage backend. We have Ceph pools for the OpenStack rbd volumes, but we also use the rados gateway to provide an object storage service, accessible via S3 and Swift compatible APIs. We believe that the best solution to access the Scientific Datasets is to store them on Object Storage, because the size of the data requires a technical solution that can scale horizontally spreading the data on many heterogeneous disks. Our Ceph cluster works with a replication factor of 3. This means that each object exists in 3 replicas on three different servers. To work with big amounts of data we deployed in production Erasure Coded Pools in our Ceph cluster. In this case at the cost of an higher CPU computation on the storage nodes, is possible to store the data with an expansion for 1.5 instead of 3. We found very useful the blog post cephnotes that explains how to create a new Ceph pool with Erasure Coding to be used with the rados gateway. We are now running this setup in production, and when we create a S3 bucket we can decide if the bucket will land on the normal pool or on the erasure coded pool. To make the decision we use the bucket-location command line option like in this this example:

s3cmd mb --bucket-location=:ec-placement s3://mylargedatabucket

It is important to note that the bucket-location option should be there when creating the S3 bucket. When writing objects at a later time, the objects will always go to the right pool.

How we download the first copy

How do you download a dataset of 200TB from an external source ? How long does it take ? The task if of course not trivial, to address this problem Amazon started his snowball service. The key idea is that data is shipped to you via snail mail on a physical device. Because we are a NREN, we decided to download the datasets over the GEANT network from other institutions that are hosting datasets. We used rclone for our download operations. It is a modern object storage client written in go, that supports multiple protocols. Working with big amount of data we soon found some issues with the client. The community behind rclone was very active and helped us. “Fail to copy files larger than 80GB” was the most severe problem we had. We also improved the documentation about Swift keys with slashes do not work with Ceph in swift emulation mode.

We also had some issues with the rados gateway. When we write the data, this goes into a Ceph bucket. In theory the data in the bucket can be written and read using both the S3 and the Swift APIs that the rados gateway implements. In reality we found out that there is a difference in how the checksums for the multipart objects is computed when using the S3 or the Swift API. This means that if wrote a object, you have to read it back with the same API you used for writing, otherwise your client will warn you that the object is corrupted. We notified the Ceph developers about this problem opening a bug that is still open.

Set the ACL to the dataset

When we store a dataset we would like to have read-write access for the cloud admin accounts, and read-only access for the users that are allowed to read the data. This means that after the dataset is uploaded, we should be able to easily grant and revoke access to users and projects. S3 and Swift manage the ACL in very different ways. In S3 the concept of inheritance of the ACL from a parent bucket does not exist. This means that for each individual object a specific ACL must be set. We read this in the official documentation:

Bucket and object permissions are completely independent; an object does not inherit the permissions from its bucket. For example, if you create a bucket and grant write access to another user, you will not be able to access that user’s objects unless the user explicitly grants you access.

This does not scale with the size of the dataset because making an API call for each object is really time consuming. There is a workaround, that it is to mark the bucket as completely public, bypassing the authorization for all objects. This workaround does not match with our use case where we have some users with read-only permission but the dataset is not public. As a viable workaround we create for each dataset a dedicated OpenStack tenant ‘Dataset X readonly access’ and what we do is adding and removing users from this special tenant, because this is a fast operation. It becomes very important to set the correct ACL when writing the objects for the first time. Unfortunately our favourite client rclone had no existent support for setting ACLs, so we had to use s3cmd.

We found out that also s3cmd ignores the –acl-grant flag, we notified the bug to the community but the bug is still open. This means that for each object you want to write you need two HTTP requests, one to actually write the object and one to set the proper ACL.

In OpenStack Swift you can set the ACL for a Swift container, and all the objects in the container will inherit that property. This sounds very good, but we are not running a real OpenStack Swift deployment, we serve objects in Ceph using an implementation of the Swift api.

During our tests we found out that the rados gateway Swift API implementation is not complete: ACL are not implemented. Other openstack operators reported that native Swift deployments work well with ACL and inheritance of ACL to the objects. However, when touching ACL and large objects also Swift is not bug free.

Access the dataset

To be able to compute the data without wasting time copying the data from the object store to a HDFS deployment, the best is to use Hadoop directly attached to the object store, in what is called the streaming mode. The result of the computation can be stored as well on the object store, or on a smaller HDFS cluster. To get started we published this tutorial.

Unfortunately the latest versions of Hadoop require the S3 backend to support the AWS4 signature, that will not work immediately with your SWITCHengines credentials because of this bug triggered when using Keystone together with the rados gateway. As a workaround we manually have to add the user on our rados gateway deployment, to make the AWS4 signature work correctly.

Conclusion

Making available large scientific datasets on a multi tenant cloud is still challenging but not impossible. Bring the data close to the compute power is of great importance. Letting the user compute public scientific datasets makes a scientific cloud service more attractive.

Moreoever, we envision users producing scientific datasets, and being able to share them with other researchers for further analysis.


IPv6 Address Assignment in OpenStack

In an inquiry “IPv6 and Liberty (or Mitaka)” on the openstack mailing list,

Ken D’Ambrosio writes:
> Hey, all. I have a Liberty cloud, and decided for the heck of it to
> start dipping my toe into IPv6. I do have some confusion, however. I
> can choose between SLAAC, DHCPv6 stateful and DHCPv6 stateless — and
> I see some writeups on what they do, but I don’t understand what
> differentiates them. As far as I can tell, they all do pretty much
> the same thing, just with different pieces doing different things.
> E.g., the chart, found here
> (http://docs.openstack.org/liberty/networking-guide/adv-config-ipv6.html
> — page down a little) shows those three options, but it isn’t clear:
> * How to configure the elements involved
> * What they exactly do (e.g., “optional info”? What’s that?)
> * Why there even *are* different choices. Do they offer functionally
> different results?

SLAAC and DHCPv6-stateless use the same mechanism (SLAAC) to provide instances with IPv6 addresses. The only difference between them is that with DHCPv6-stateless, the instance can also use DHCPv6 requests to get other (than its own address) information such as nameserver addresses etc. So between SLAAC and DHCPv6-stateless, I would always prefer DHCPv6-stateless—it’s a strict superset in terms of functionality, and I don’t see any particular risks associated with it.

DHCPv6-stateful is a different beast: It will use DHCPv6 to give an instance its IPv6 address. DHCPv6 actually fits OpenStack’s model better than SLAAC.

Why DHCPv6-Stateful Fits OpenStack Better

OpenStack (Nova) sees it as part of its job to control the IP address(es) that an instance uses. In IPv4 it uses DHCP (always did). DHCP assigns complete addresses—which are under control of OpenStack. In IPv6, stateful DHCPv6 would be the equivalent.

SLAAC is different in that the node (instance) actually chooses its address based on information it gets from the router. The most common method is that the node uses an “EUI-64” address as the local part (host ID) of the address. The EUI-64 is derived from the MAC address by a fixed algorithm. This can work with OpenStack because OpenStack controls the MAC addresses too, and can thus “guess” what IPv6 address an instance will auto-configure on a given network. You see how this is a little less straightforward than OpenStack simply telling the instance what IPv6 address it should use.

In practice, OpenStack’s guessing fails when an instance uses other methods to get the local part, for example “privacy addresses” according to RFC 4941. These will lead to conflicts with OpenStack’s built-in anti-spoofing filters. So such mechanisms need to be disabled when SLAAC is used under OpenStack (including under “DHCPv6-stateless”).

Why we Use SLAAC/DHCPv6-Stateless Anyway

Unfortunately, most GNU/Linux distributions don’t support Stateful DHCPv6 “out of the box” today.

Because we want our users to use unmodified operating systems images and still get usable IPv6, we have grudgingly decided to use DHCPv6-stateless. For configuration information, see SWITCHengines Under the Hood: Basic IPv6 Configuration.

If you decide to go for DHCPv6-stateful, then there’s a Web page that explains how to enable it client-side for a variety of GNU/Linux distributions.

It would be nice if all systems honored the “M” (Managed) flag in Router Advertisements and would use DHCPv6 if it is set, otherwise SLAAC.

[This is an edited version of my response, which I wasn’t sure I was allowed to post because I use GMANE to read the list– SL.]


Tuning Virtualized Network Node: multi-queue virtio-net

The infrastructure used by SWITCHengines is composed of about 100 servers. Each one uses two 10 Gb/s network ports. Ideally, a given instance (virtual machine) on SWITCHengines would be able to achieve 20 Gb/s of throughput when communicating with the rest of the Internet. In the real world, however, several bottlenecks limit this rate. We are working hard to address these bottlenecks and bring actual performance closer to the theoretical optimum.

An important bottleneck in our infrastructure is the network node for each region. All logical routers are implemented on this node, using Linux network namespaces and Open vSwitch (OVS). That means that all packets between the Internet and all the instances of the region need to pass through the node.

In our architecture, the OpenStack services run inside various virtual machines (“service VMs” or “pets”) on a dedicated set of redundant “provisioning” (or “prov”) servers. This is good for serviceability and reliability, but has some overhead—especially for I/O-intensive tasks such as network packet forwarding. Our network node is one of those service VMs.

In the original configuration, a single instance would never get more than about 2 Gb/s of throughput to the Internet when measured with iperf. What’s worse, the aggregate Internet throughput for multiple VMs was not much higher, which meant that a single high-traffic VM could easily “starve” all other VMs.

We had investigated many options for improving this situation: DVR, multiple network nodes, SR-IOV, DPDK, moving work to switches etc. But each of these methods has its drawbacks such as additional complexity (and thus potential for new and exciting bugs and hard-to-debug failure modes), lock-in, and in some cases, loss of features like IPv6 support. So we stayed with our inefficient but simple configuration that has worked very reliably for us so far.

Multithreading to the Rescue!

Our network node is a VM with multiple logical CPUs. But when running e.g. “top” during high network load, we noticed that only one (virtual) core was busy forwarding packets. So we started looking for a way to distribute the work over several cores. We found that we could achieve this by enabling three things:

Multi-queue virtio-net interfaces

Our service nodes run under libvirt/Qemu/KVM and use virtio-net network devices. These interfaces can be configured to expose multiple queues. Here is an example of an interface definition in libvirt XML syntax which has been configured for eight queues:

 <interface type='bridge'>
   <mac address='52:54:00:e0:e1:15'/>
   <source bridge='br11'/>
   <model type='virtio'/>
   <driver name='vhost' queues='8'/>
   <virtualport type='openvswitch'/>
 </interface>

A good rule of thumb is to set the number of queues to the number of (virtual) CPU cores of the system.

Multi-threaded forwarding in the network node VM

Within the VM, kernel threads need to be allocated to the interface queues. This can be achieved using ethtool -L:

ethtool -L eth3 combined 8

This should be done during interface initialization, for example in a “pre-up” action in /etc/network/interfaces. But it seems to be possible to change this configuration on a running interface without disruption.

Recent version of the Open vSwitch datapath

Much of the packet forwarding on the network node is performed by OVS. Its “datapath” portion is integrated into the Linux kernel. Our systems normally run Ubuntu 14.04, which includes the Linux 3.13 kernel. The OVS kernel module isn’t included with this package, but is installed separately from the openvswitch-datapath-dkms package, which corresponds to the relatively old OVS version 2.0.2. Although the OVS kernel datapath is supposed to have been multi-threaded since forever, we found that in our setup, upgrading to a newer kernel is vital for getting good (OVS) network performance.

The current Ubuntu 16.04.1 LTS release includes a fairly new Linux kernel based on 4.4. That kernel also has the OVS datapath module included by default, so that the separate DKMS package is no longer necessary. Unfortunately we cannot upgrade to Ubuntu 16.04 because that would imply upgrading all OpenStack packages to OpenStack “Mitaka”, and we aren’t quite ready for that. But thankfully, Canonical makes newer kernel packages available for older Ubuntu releases as part of their “hardware enablement” effort, so it turns out to be very easy to upgrade 14.04 to the same new kernel:

sudo apt-get install -y --install-recommends linux-generic-lts-xenial

And after a reboot, the network node should be running a fresh Linux 4.4 kernel with the OVS 2.5 datapath code inside.

Results

A simple test is to run multiple netperf TCP_STREAM tests in parallel from a single bare-metal host to six VMs running on separate nova-compute nodes behind the network node.

Each run consists of six netperf TCP_STREAM measurements started in parallel, whose throughput values are added together. Each figure is the average over ten consecutive runs with identical configuration.

The network node VM is set up with 8 vCPUs, and the two interfaces that carry traffic are configured with 8 queues each. We vary the number of queues that are actually used using ethtool -L iface combined n. (Note that even the 1-queue case does not exactly correspond to the original situation; but it’s the closest approximation that we had time to test.)

Network node running 3.13.0-95-generic kernel

1: 3.28 Gb/s
2: 3.41 Gb/s
4: 3.51 Gb/s
8: 3.57 Gb/s

Making use of multiple queues gives very little benefit.

Network node running 4.4.0-36-generic kernel

1: 3.23 Gb/s
2: 6.00 Gb/s
4: 8.02 Gb/s
8: 8.42 Gb/s (8.75 Gb/s with 12 target VMs)

Here we see that performance scales up nicely with multiple queues.

The maximum possible throughput in our setup is lower than 10 Gb/s, because the network node VM uses a single physical 10GE interface for both sides of traffic. And traffic between the network node and the hypervisors is sent encapsulated in VXLAN, which has some overhead.

Outlook

Now we know how to enable multi-core networking for hand-configured service VMs (“pets”) such as our network node. But what about the VMs under OpenStack’s control?

Starting in Liberty, Nova supports multi-queue virtio-net. Our benchmarking cluster was still running Kilo, so we could not test that yet. But stay tuned!