SWITCH Cloud Blog

Leave a comment

RadosGW/Keystone Integration Performance Issues—Finally Solved?

For several years we have been running OpenStack and Ceph clusters as part of SWITCHengines, an IaaS offering for the Swiss academic community. Initially, our main “job” for Ceph was to provide scalable block storage for OpenStack VMs—which it does quite well. But we also provided S3 (and Swift, but that’s outside the scope of this post) -based object storage via RadosGW from early on. This easy-to-use object storage turned out to be popular far beyond our initial expectations.

One valuable feature of RadosGW is that it integrates with Keystone, the Authentication and Authorization service in OpenStack. This meant that any user of our OpenStack offering can create, within her Project/tenant, EC2-compatible credentials to set up, and manage access to, S3 object store buckets. And they sure did! SWITCHengines users started to use our object store to store videos (and stream them directly from our object store to users’ browsers), research data for archival and dissemination, external copies from (parts of) their enterprise backup systems, and presumably many other interesting things; a “defining characteristic” of the cloud is that you don’t have to ask for permission (see “On-demand self-service” in the NIST Cloud definition)—though as a community cloud provider, we are happy to hear about, and help with, specific use cases.

Now this sounds pretty close to cloud nirvana, but… there was a problem: Each time a client made an authenticated (signed) S3 request on any bucket, RadosGW had to outsource the validation of the request signature to Keystone, which would return either the identity of the authenticated user (that RadosGW could then use for authorization purposes), or a negative reply in case the signature doesn’t validate. Unfortunately, this outsourced signature validation process turns out to bring significant per-request overhead. In fact, for “easy” requests such as reading and writing small objects, this authentication overhead easily dominates total processing time. For a sense of the magnitude, small requests without Keystone validation often take <10ms to complete (according to the logs of our NGinx-based HTTPS server that acts as a front end to the RadosGW nodes. Whereas any request involving Keystone takes at least 600ms.

One undesirable effect is that our users probably wonder why simple requests have such a high baseline response time. Transfers of large objects don’t care much, because at some point the processing time is dominated by Rados/network transfer time of user data.

But an even worse effect is that S3 users could, by using client software that “aggressively” exploited parallelism, put very high load on our Keystone service, to the point that OpenStack operations sometimes ran into timeouts when they needed to use the authentication/authorization service.

In our struggle to cope with this reoccurring issue, we found a somewhat ugly workaround: When we found a EC2 credential in Keystone whose use in S3/RadosGW contributed significant load, we extracted that credential (basically an ID/secret pair) from Keystone, and provisioned it locally in all of our RadosGW instances. This always solved the individual performance problem for that client, response times dropped by 600ms immediately, and load on our Keystone system subsided.

While the workaround fixed our immediate troubles, it was deeply unsatisfying in several ways:

  • Need to identify “problematic” S3 uses that caused high Keystone load
  • Need to (more or less manually) re-provision Keystone credentials in RadosGW
  • Risk of “credential drift” in case the Keystone credentials changed (or disappeared) after their re-provisioning in RadosGW—the result would be that clients would still be able to access resources that they shouldn’t (anymore).

But the situation was bearable for us, and we basically resigned to having to fix performance emergencies every once in a while until maybe one day, someone would write a Python script or something that would synchronize EC2 credentials between Keystone and RadosGW…

PR #26095: A New Hope

But then out of the blue, James Weaver from the BBC contributed PR #26095, rgw: Added caching for S3 credentials retrieved from keystone. This changes the approach to signature validation when credentials are found in Keystone: The key material (including secret key) found in Keystone is cached by RadosGW, and RadosGW always performs signature validation locally.

James’s change was merged into master and will presumably come out with the “O” release of Ceph. We run Nautilus, and when we got wind of this change, we were excited to try it out. We had some discussions as to whether the patch might be backported to Nautilus; in the end we considered that unlikely at the current state, because the patch unconditionally changes the behavior in a way that could violate some security assumptions (e.g. that EC2 secrets would never leave Keystone).

We usually avoid carrying local patches, but in this case we were sufficiently motivated to go and cherry-pick the change on top of the version we were running (initially v14.2.5, later v14.2.6 and v14.2.7). We basically followed the instructions on how to build Ceph, but after cloning the Ceph repo, ran

git checkout v14.2.7
git cherry-pick affb7d396f76273e885cfdbcd363c1882496726c -m 1 -v
edit debian/changelog and prepend:

ceph (14.2.7-1bionic-switch1) stable; urgency=medium

  * Cherry-picked upstream pull #26095:

    rgw: Added caching for S3 credentials retrieved from keystone

 -- Simon Leinen <simon.leinen@switch.ch>  Thu, 01 Feb 2020 19:51:21 +0000

Then, dpkg-buildpackage and wait for a couple of hours…

First Results

We tested the resulting RadosGW package in our staging environment for a couple of days before trying them in our production clusters.

When we activated the patched RadosGW in production, the effects were immediately visible: The CPU load of our Keystone system went down by orders of magnitude.

Screenshot 2020-02-02 at 10.29.54

On 2020-01-27 at around 08:00, we upgraded our first production cluster’s RadosGWs. Twenty-four hours later, we upgraded the RadosGWs on the second cluster. The baseline load on our Keystone service dropped visibly on the first upgrade, but some high load peaks could still be seen. Since the second region was upgraded, no sharp peaks anymore. There is a periodic load increase every night between 03:10 and 04:10, presumably due to some charging/accounting system doing its thing. Probably these peaks were “always” there, but they only became apparent once we started deploying the credential-caching code.

The 95th-percentile latency of “small” requests (defined as both $body_bytes_sent and $request_length being lower than 65536 was reduced from ~750ms to ~100ms:



Conclusion and Outlook

We owe the BBC a beer.

To make the patch perfect, maybe it would be cool to limit the lifetime of cached credentials to some reasonable value such as a few hours. This could limit the damage in case credentials should be invalidated. Though I guess you could just restart all RadosGW processes and lose any cached credentials immediately.

If you are interested in using our RadosGW packages made from cherry-picking PR #20965 on top of Nautilus, please contact us. Note that we only have x86_64 packages for Ubuntu 18.04 “Bionic” GNU/Linux.

Openstack Horizon runs on Kubernetes in production at SWITCH

In April we upgraded the SWITCHengines OpenStack Horizon dashboard to the OpenStack Pike version. But this upgrade was a little bit special, it was more than an Horizon upgrade from Newton to Pike.

Our Horizon deployment is now hosted on a Kubernetes cluster. The cluster is deployed using the playbook k8s-on-openstack that we actively develop. We have been testing this Kubernetes deployment for a while, but it is only when you have to deploy an application on top of it in production that you really learn and you fix real problems.

Horizon is a good application to start learning Kubernetes, because it is completely stateless and it does not require any persistent storage. It is just a GUI to the OpenStack API. The user logs in with his credentials, and Horizon will get a token and will start making API calls with the user’s credentials.

Running Horizon in a single Kubernetes pod for a demo takes probably 5 minutes, but deploying for production usage is far more complex. We needed to address the following issues:

  • Horizontally scale the number of pods, keeping a central memcached or redis cache
  • Allow both IPv4 and IPv6 access to engines.switch.ch
  • Define the Load Balancing architecture
  • Implement a persistent logging system

If you want to run to the solution of all these problems, you can have a look at the project SWITCH-openstack-horizon-k8s-deployment where we have published all the Dockerfiles and the Kubernetes descriptors to recreate our deployment.

Scale Horizontally

Horizon performs much faster when it accesses a memory cache, it is the recommended way to deploy in production. We decided to go for Redis cache.

Creating a Redis service in our namespace with the name redis-master we are able to use the special environment variable ${REDIS_MASTER_SERVICE_HOST} when booting the Horizon container, to make sure all the instances point to the same cache server.

This is a good example of how you combine two services together in a Kubernetes namespace. We can horizontally scale the Horizon pods, but the Horizon deployment is independent from the Redis deployment.

IPv4 and IPv6

We always publish our services on IPv6. In our previous Kubernetes demos we used the OpenStack LBaaS to expose services to the outside world. Unfortunately in the Newton version of OpenStack, the LBaaS lacks proper IPv6 integration. To publish a production service on Kubernetes, we suggest to use an ingress controller. There are several kinds available, but we used the standard Nginx ingress controller. The key idea is that we have a K8s node with an interface exposed to the public Internet where a privileged Docker container is running with –net=host. The container runs Nginx that can bind to IPv6 and IPv4 on the node, but of course it can also reach any other pod on the cluster network.

Define the Load Balancing architecture

I already wrote above that if you need IPv6, you should not use the Openstack LBaaSv2. However I am going to explain why I would not use that kind of load balancer even for IPv4.

The first picture shows the network diagram of a LBaaSv2 deployment. The LoadBalancer is implemented as a network namespace on the network node, called qlbaas-<uuid>, in which a HAProxy process is running. This is a L4 LoadBalancer. The bad thing of this architecture is that when an instance boots, the default gateway configured via DHCP will be the IP address of the neutron router. When we expose a service with the floating IP configured on the outer interface of the LBaaS, in order to force the traffic to follow a symmetric return path, the Load Balancer must perform a DNAT and SNAT operation. This means that the IP packets hitting the Pod have completely lost the information about the source IP address of the original client. Because it is a pure L4 load balancer, we don’t have the possibility to carry this lost information on in a HTTP header. This prevents the operator from building any useful logging system, because once the traffic arrives at the pod, the information about the client is filtered out.

In the next picture we have a look on how the Nginx ingress works. In this case the external traffic is received on a public floating IP that is configured on the virtual machine running the ingress pod, in this case on the master. We terminate the TLS connection at the nginx-ingress. This is necessary because the ingress also has to perform a SNAT and DNAT but it adds to the HTTP requests the X-Forwarded-For header that we use to populate our log files. We could not add the header if we were just moving encrypted packets around.

Another advantage of this solution is that it uses just a normal instance to implement the ingress, this means that you can use in a totally independent way from the version of OpenStack you are running on.

In the future you might be able to use the newer OpenStack Octavia Load Balancer, but at the moment I did not investigate that. All I know is that the solution is really similar, but you will have an OpenStack service VM running an Nginx instance.

Implement a persistent logging system

Pods are short lived and distributed over different VMs that are also ephemeral. To collect the logs, we run docker with the log-driver journald. Once this is set up, all the docker containers running on the host will send their logging output to journald. We then collect this information with journalbeat to send the data to our elastic search cluster. This part is not yet released into our public playbook because is not very portable. If you don’t have a ready-to-use ELK cluster, you would have no benefit from running journalbeat.


It is now almost a month that we have been running in production, and we found the system to be robust and stable. We had no complaints from our users, so we can say that the migration was seamless for our users. We have learned a lot from this experience.

In the next blog post we will describe how we implemented the metrics monitoring, to observe how much memory and CPU cores each pod is consuming. Make sure you keep an eye on our blog for updates.

Deploy Kubernetes v1.8.3 on Openstack with native Neutron networking

I wrote in the past how to deploy Kubernetes on SWITCHengines (Openstack) using this ansible playbook. When I wrote that article, I did not care about the networking setup, and I used the proposed weavenet plugin. I went to Sydney at the Openstack Summit and I saw the great presentation from Angus Lees. It was the right time to see the presentation because I recently watched this video where they explain the networking of Kubernetes when running on GCE. Going back to Openstack, Angus mentioned that the Kubernetes master can talk to neutron, to inject routes in the tenant router to provide connectivity without NAT among the pods that live in different instances. This would make easier the troubleshooting, and would leave MTU 1500 between the pods.

It looked very easy, just use:


and specify in the cloud config the router uuid.

Our first tests with version 1.7.0 did not work. First of all I had to fix the Kubernetes documentation, because the syntax to specify the router UUID was wrong. Then I had a problem with Security groups disappearing from the instances. After troubleshooting and asking for help on the Kubernetes slack channel, I found out that I was hitting a gophercloud known bug.

The bug was already fixed in gophercloud at the time of my finding, but I learned that Kubernetes freezes an older version of this library in the folder “vendor/github.com/gophercloud/gophercloud”. So the only way to get the updated library version was to upgrade to Kubernetes v1.8.0, or any newer version including this commit.

After a bit of testing every works now. The changes are summarised in this PR, or you can just use the master branch from my git repository.

After you deploy, the K8s master will assign from network ClusterCIDR (usually a /16 address space) a smaller /24 subnet per each Openstack instance. The Pods will get addresses from the subnet assigned to the instance. The kubernetes master will inject static routes to the neutron router, to be able to route packets to the Pods. It will also configure the neutron ports of the instances with the correct allowed_address_pairs value, so that the traffic is not dropped by the Openstack antispoofing rules.

This is what a show of the Openstack router looks like:

$ openstack router show b11216cb-a725-4006-9a55-7853d66e5894 -c routes
| Field  | Value                                            |
| routes | destination='', gateway=''  |
|        | destination='', gateway=''  |
|        | destination='', gateway='' |
|        | destination='', gateway='' |

And this is what the allowed_address_pairs on the port of one instance looks like:

$ openstack port show 42f2a063-a316-4fe2-808c-cd2d4ed6592f -c allowed_address_pairs
| Field                 | Value                                                      |
| allowed_address_pairs | ip_address='', mac_address='fa:16:3e:3e:34:2c' |

There is of course more work to be done.

I will improve the ansible playbook to create automatically the Openstack router and network, at the moment these steps are done manually before starting the playbook.

Working with network-plugin=kubenet is actually deprecated, so I have to understand what is the long term plan for this way of deployment.

The Kubernetes master is still running on a single VM, the playbook can be extended to have an HA setup.

I really would like to have feedback from users of Kubernetes on Openstack. If you use this playbook please let me know, and if you improve it, the Pull Requests on github are very welcome! 🙂

(Ceph) storage (server) power usage

We have been running Ceph in production for SWITCHengines since mid-2014, and are at the third generation of servers now.

SWITCHengines storage server evolution

  • First generation, since March 2014: 2U Dalco based on Intel S2600GZ, 2×E5-2650v2 CPUs, 128GB RAM, 2×200GB Intel DC S3610 SSD, 12×WD SE 4TB
  • Second generation, since Dec. 2015: 2U Dalco based on Intel S2600WTT, 1×E5-2620v4 CPU, 64GB RAM, 2×200GB Intel DC S3610 SSD, 12×WD SE 4TB
  • Third generation, since June 2017: 1U Quanta S1Q-1ULH-8, 1×Xeon D-1541 CPU, 64GB RAM, 2×240GB Micron 5100 MAX SSD, 12×HGST Ultrastar He8 (TB)

What all servers have in common: 2×10GE (SFP+ DAC) network connections, redundant power supplies, simple BMC modules connected to separate GigE network.

We run all those servers together in a single large Ceph RADOS cluster (actually we have two clusters in different towns, but for this article I focus on just the larger and more heavily loaded one). The cluster has 480 OSDs, contains about 500TiB user data, mostly RBD block devices used by OpenStack instances, and some S3 object storage, including video streaming directly from RadosGW to browsers. Cluster-wide I/O rates during my measurements were around 2’700 IOPS, 150MB/s read, 65MB/s write. We didn’t apply any particular optimization for energy or otherwise.

To understand the story behind the server types: Initially we used the same server chassis for compute and storage servers. We also used the same relatively generous CPU and RAM configurations. This would have allowed us to turn compute into storage servers (or vice-versa) relatively easily. When purchasing the second server generation we saved some money by reducing CPU power and RAM. For the third generation, we opted for increased density and efficiency made possible by a “system-on-a-chip” (Xeon D)-based server design.

Power measurement results

All these servers have IPMI-accessible power sensors. Last week my colleagues did some measurement with an external power meter, and found that (for the servers they tested—not all types, for lack of time) the values from the IPMI readers are within 5% or so of the values from the “real” power meter. Good enough!

Unfortunately we don’t yet feed IPMI measurements into any of our continuous measurement tools (Carbon/Graphite/Grafana or Nagios). If you do that, please use the comments to tell us how you set this up.

But recently I looked at the IPMI power consumption readings for these servers during a time of relatively light use (weekend) and got the following results:

  • Gen 1: 248W
  • Gen 2: 205W
  • Gen 3: 155W

Note that the Gen 3 servers have larger disks, and thus Ceph puts twice as much data on them, and thus they get double the IOPS than the old servers. Still, they use significantly less power. This is partly due to the simplified mainboard and more modern CPU, partly to the Helium-filled disks that only draw ~4.5W each (when idle) as opposed to ~7.5W for the older 4TB drives.

Cost to power a Terabyte-year of user data

Just for fun, I also performed some cost calculations in relation to usable space, under the following assumptions:

  1. We pay 0.15 €/kWh. (actually I used CHF, but doesn’t really matter—some countries will pay more than this even without overhead, others pay only half, so this would cover some of the other directly energy-dependent costs like A/C and redundancy. Anyway it’s about the relative costs 🙂
  2. We can fill disks up to an average of 70% before things get messy.

When using traditional three-way replication, storing a usable Terabyte for a year costs us the following amounts just for power:

  • Gen 1: € 29.10
  • Gen 2: € 24.05
  • Gen 3: € 9.09 (note again that these have twice the capacity)

If we assume Erasure Coding with 50% overhead, e.g. 2+1, then the power cost would go down to

  • Gen 1: € 14.55
  • Gen 2: € 12.03
  • Gen 3: € 4.55

We could consider even more space-efficient EC configurations, but I don’t have any experience with that… “left as an exercise for the reader”.

In conclusion, we could say that advances in hardware (more efficient servers, larger and more efficient disks), software (EC in Ceph), as well as our own optimizations (less spare CPU/RAM) have brought down the power component of our storage costs by a factor of 6.5 over these 3.5 years. Not bad, huh? Of course there are trade-offs: the new servers have lower IOPS per space, EC uses more CPU and disk-read operations etc.

The next frontier: powering down idle disks

Finally, I also took an unused server of the Gen 3 ones (we keep a few powered down until we need them—I powered one on for the test). It consumed 136W, not much less than the 155W under (light) load. Although the 12 8TB disks weren’t even mounted, they were spinning. Putting them all in “standby” mode with sudo hdparm -y lowered the total system load to just 82W. So for infrequently accessed “cold storage”, there’s even more room for optimization—although it might be tricky to leverage standby mode in practice with a system such as Ceph. At least scrubbing strategy would need to be adapted, I guess.

SWITCHdrive Over IPv6

When we built the SWITCHdrive service on the OpenStack platform that was to become SWITCHengines, that platform didn’t really support IPv6 yet. But since Spring 2016 it does. This week, we enabled IPv6 in SWITCHdrive and performed some internal tests. Today around noon, we published its IPv6 address (“AAAA record”) in the DNS. We quickly saw around 5% of accesses use IPv6 instead of IPv4.

In the evening, this percentage climbed to about 14%. This shows the relatively good support for IPv6 on Swiss broadband (home) networks, notably by the good folks at Swisscom.

The lower percentage during office (and lecture, etc.) hours shows that the IPv6 roll-out to higher education campuses still has some way to go. Our SWITCHlan backbone has been running “dual-stack” (IPv4 and IPv6 in parallel) in production for more than 10 years, and most institutions have added IPv6 configuration to their connections to us. But campus networks are wonderfully complex, so getting IPv6 deployed to every network plug and every wireless access point is a daunting task. Some schools are almost there, including some large ones that don’t use SWITCHdrive—yet!?—so the 5% may underestimate the extent of the roll-out for the overall SWITCH community. The others will follow in their footsteps. They can count on the help of the community and benefit from IPv6 training courses organized by our colleagues in the security and network teams. Contact us if you need help!

[Update: After a few weeks, the proportion of IPv6 traffic increased somewhat. Now we typically see around 10% during office hours and 20% during weekends. So the “retail” sector is still clearly ahead of (our academic) enterprise networks in terms of IPv6 penetration.]

Deploy Kubernetes on the SWITCHengines Openstack cloud

Increasing demand for container orchestration tools is coming from our users. Kubernetes has currently a lot of hype, and often it comes the question if we are providing a Kubernetes cluster at SWITCH.

At the moment we suggest that our users deploy their own Kubernetes cluster on top of SWITCHengines. To make sure our Openstack deployment works with this solution we tried ourself.

After deploying manually with kubeadm to learn the tool, I found a well written ansible playbook from Francois Deppierraz. I extended the playbook to make Kubernetes aware that SWITCHengines implements the LBaaSv2, and the patch is now merged in the original version.

The first problem I discovered deploying Kubernetes is the total lack of support for IPv6. Because instances in SWITCHengines get IPv6 addresses by default, I run into problems running the playbook and nothing was working. The first thing you should do is to create your own tenant network with a router, with IPv4 only connectivity. This is already explained in detail in our standard documentation.

Now we are ready to clone the ansible playbook:

git clone https://github.com/infraly/k8s-on-openstack

Because the ansible playbook creates instances through the Openstack API, you will have to source your Openstack configuration file. We extend a little bit the usual configuration file with more variables that are specific to this ansible playbook. Lets see a template:

export OS_USERNAME=username
export OS_PASSWORD=mypassword
export OS_PROJECT_NAME=myproject
export OS_PROJECT_ID=myproject_uuid
export OS_AUTH_URL=https://keystone.cloud.switch.ch:5000/v2.0
export KEY=keyname
export IMAGE="Ubuntu Xenial 16.04 (SWITCHengines)"
export NETWORK=k8s
export SUBNET_UUID=subnet_uuid
export FLOATING_IP_NETWORK_UUID=network_uuid

Lets review what changes. It is important to add also the variable OS_PROJECT_ID because the Kubernetes code that creates Load Balancers requires this value, and it is not able to extract it from the project name. To find the uuid just use the Openstack cli:

openstack project show myprojectname -f value -c id

The KEY is the name of an existing keypair that will be used to start the instances. The IMAGE is also self explicative, at the moment only Xenial is tested by me. The variable NETWORK is the name of the tenant network you created earlier. When you created a network you created also a subnet, and you need to set the uuid into SUBNET_UUID. The last variable is FLOATING_IP_NETWORK_UUID that tells kubernetes the network where to get floating IPs. In SWITCHengines this network is always called public, so you can extract the uuid like this:

openstack network show public -f value -c id

You can customize your configuration even more, reading the README file in the git repository you will find more options like the flavors to use or the cluster size. When your configuration file is ready you can run the playbook:

source /path/to/config_file
cd k8s-on-openstack
ansible-playbook site.yaml

It will take a few minutes to go through all the tasks. When everything is done you can ssh into the kubernetes master instance and check that everything is running as expected:

ubuntu@k8s-master:~$ kubectl get nodes
NAME         STATUS    AGE       VERSION
k8s-1        Ready     2d        v1.6.2
k8s-2        Ready     2d        v1.6.2
k8s-3        Ready     2d        v1.6.2
k8s-master   Ready     2d        v1.6.2

I found very useful adding bash completion for kubectl:

source <(kubectl completion bash)

Lets deploy an instance of nginx to test if everything works:

kubectl run my-nginx --image=nginx --replicas=2 --port=80

This will create two containers with nginx. You can monitor the progress with the commands:

kubectl get pods
kubectl get events

At this stage you have your containers running but the service is still not accessible from the outside. One option is to use the Openstack LBaaS to expose it, you can do it with this command:

kubectl expose deployment my-nginx --port=80 --type=LoadBalancer

The expose command will create the Openstack Load Balancer and will configure it. To know the public floating ip address you can use this command to describe the service:

ubuntu@k8s-master:~$ kubectl describe service my-nginx
Name:			my-nginx
Namespace:		default
Labels:			run=my-nginx
Selector:		run=my-nginx
Type:			LoadBalancer
LoadBalancer Ingress:,
Port:				80/TCP
NodePort:			30620/TCP
Session Affinity:	None
  FirstSeen	LastSeen	Count	From			SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----			-------------	--------	------			-------
  1m		1m		1	service-controller			Normal		CreatingLoadBalancer	Creating load balancer
  10s		10s		1	service-controller			Normal		CreatedLoadBalancer	Created load balancer


Following this blog post you should be able to deploy Kubernetes on Openstack to understand how things work. For a real deployment you might want to make some customisations, we encourage you to share any patch to the ansible playbook with github pull requests.
Please note that Kubernetes is not bug free. When you will delete your deployment you might find this bug where Kubernetes is not able to delete correctly the load balancer. Hopefully this is fixed by the time you read this blog post.

IPv6 Address Assignment in OpenStack

In an inquiry “IPv6 and Liberty (or Mitaka)” on the openstack mailing list,

Ken D’Ambrosio writes:
> Hey, all. I have a Liberty cloud, and decided for the heck of it to
> start dipping my toe into IPv6. I do have some confusion, however. I
> can choose between SLAAC, DHCPv6 stateful and DHCPv6 stateless — and
> I see some writeups on what they do, but I don’t understand what
> differentiates them. As far as I can tell, they all do pretty much
> the same thing, just with different pieces doing different things.
> E.g., the chart, found here
> (http://docs.openstack.org/liberty/networking-guide/adv-config-ipv6.html
> — page down a little) shows those three options, but it isn’t clear:
> * How to configure the elements involved
> * What they exactly do (e.g., “optional info”? What’s that?)
> * Why there even *are* different choices. Do they offer functionally
> different results?

SLAAC and DHCPv6-stateless use the same mechanism (SLAAC) to provide instances with IPv6 addresses. The only difference between them is that with DHCPv6-stateless, the instance can also use DHCPv6 requests to get other (than its own address) information such as nameserver addresses etc. So between SLAAC and DHCPv6-stateless, I would always prefer DHCPv6-stateless—it’s a strict superset in terms of functionality, and I don’t see any particular risks associated with it.

DHCPv6-stateful is a different beast: It will use DHCPv6 to give an instance its IPv6 address. DHCPv6 actually fits OpenStack’s model better than SLAAC.

Why DHCPv6-Stateful Fits OpenStack Better

OpenStack (Nova) sees it as part of its job to control the IP address(es) that an instance uses. In IPv4 it uses DHCP (always did). DHCP assigns complete addresses—which are under control of OpenStack. In IPv6, stateful DHCPv6 would be the equivalent.

SLAAC is different in that the node (instance) actually chooses its address based on information it gets from the router. The most common method is that the node uses an “EUI-64” address as the local part (host ID) of the address. The EUI-64 is derived from the MAC address by a fixed algorithm. This can work with OpenStack because OpenStack controls the MAC addresses too, and can thus “guess” what IPv6 address an instance will auto-configure on a given network. You see how this is a little less straightforward than OpenStack simply telling the instance what IPv6 address it should use.

In practice, OpenStack’s guessing fails when an instance uses other methods to get the local part, for example “privacy addresses” according to RFC 4941. These will lead to conflicts with OpenStack’s built-in anti-spoofing filters. So such mechanisms need to be disabled when SLAAC is used under OpenStack (including under “DHCPv6-stateless”).

Why we Use SLAAC/DHCPv6-Stateless Anyway

Unfortunately, most GNU/Linux distributions don’t support Stateful DHCPv6 “out of the box” today.

Because we want our users to use unmodified operating systems images and still get usable IPv6, we have grudgingly decided to use DHCPv6-stateless. For configuration information, see SWITCHengines Under the Hood: Basic IPv6 Configuration.

If you decide to go for DHCPv6-stateful, then there’s a Web page that explains how to enable it client-side for a variety of GNU/Linux distributions.

It would be nice if all systems honored the “M” (Managed) flag in Router Advertisements and would use DHCPv6 if it is set, otherwise SLAAC.

[This is an edited version of my response, which I wasn’t sure I was allowed to post because I use GMANE to read the list– SL.]