Our Infrastructure-as-a-Service (IaaS) offering SWITCHengines is based on the OpenStack platform. OpenStack releases new alphabetically-nicknamed versions every six months. When we built SWITCHengines in 2014, we based it on the then-current “Icehouse” (2014.1) release. Over the past few months, we have worked on upgrading the system to the newer “Juno” (2014.2) version. As we already announced via Twitter, this upgrade was finally completed on 26 August. The upgrade was intended to be interruption-free for running customer VMs (including the SWITCHdrive service, which is built on top of such VMs), and we mostly achieved that.
Upgrading a live infrastructure is always a risk, so we should only do so if we have good reasons. On a basic level, we see two drivers: (a) functionality and (b) reliability. Functionality: OpenStack is a very dynamic project to which new features—and entire new subsystems—are added all the time. We want to make sure that our users can benefit from these enhancements. Reliability: Like all complex software, OpenStack has bugs, and we want to offer reliable and unsurprising service. Fortunately, OpenStack also has more and more users, so bugs get reported and eventually fixed, and it has quality assurance (QA) processes that improve over time. Bugs are usually fixed in the most recent releases only. Fixes to serious bugs such as security vulnerabilities are often “backported” to one or two previous releases. But at some point it is no longer safe to use an old release.
Why did it take so long?
We wanted to make sure that the upgrade would be as smooth as possible for users. In particular, existing VMs and other resources should remain in place and continue to work throughout the upgrade. So we did a lot of testing on our internal development/staging infrastructure, and we experimented with several methods for switching over. We also needed to integrate the significant changes to the installation system recipes (from the OpenStack Puppet project) with our own customizations.
We also decided to upgrade the production infrastructure in three phases. Two of them had been announced: The LS region (in Lausanne) was upgraded on 17 August, the ZH (Zurich) region one week later. But there are some additional servers with special configuration which host a critical component of SWITCHdrive. Those were upgraded separately another day later.
Because we couldn’t upgrade all hypervisor nodes (the servers on which VMs are hosted) at the same time, we had to run in a compatibility mode that allowed Icehouse hypervisors to work against a Juno controller. After all hypervisor hosts were upgraded, this somewhat complex compatibility mechanism could be disabled again.
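To make the idea of the compatibility mode a bit more concrete, here is a minimal sketch in Python. It is not OpenStack code and not our actual configuration: the node names, the `RELEASES` list, and both functions are hypothetical and only illustrate the principle that the controller must keep "speaking" the oldest release still running on any hypervisor. In Nova, this kind of pinning is typically configured through the `[upgrade_levels]` section of `nova.conf`; the sketch below only models the decision, not the mechanism.

```python
# Illustrative sketch only: all names and data here are hypothetical.
RELEASES = ["icehouse", "juno"]  # ordered from oldest to newest


def compatibility_level(node_releases):
    """Return the oldest release still running on any hypervisor node."""
    return min(node_releases.values(), key=RELEASES.index)


def needs_compat_mode(node_releases, controller_release):
    """Compatibility mode is needed while any node lags behind the controller."""
    return compatibility_level(node_releases) != controller_release


# Mid-upgrade: one hypervisor is still on Icehouse, the controller is on Juno.
nodes = {"hv-01": "juno", "hv-02": "icehouse", "hv-03": "juno"}
print(compatibility_level(nodes), needs_compat_mode(nodes, "juno"))

# After the last hypervisor has been upgraded, the mode can be switched off.
nodes["hv-02"] = "juno"
print(compatibility_level(nodes), needs_compat_mode(nodes, "juno"))
```

The point of the design is that the switch-off condition depends on the slowest node, which is why the compatibility mode had to stay enabled until the very last hypervisor host was done.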
The whole process took us around five months. Almost as long as the interval between OpenStack releases! But we learned a lot, and we made some modifications to our setup that will make future upgrades easier. So we are confident that the next upgrade will be quicker.
So it all went swimmingly, right?
As I wrote above, “mostly”. All VMs kept running throughout the upgrade. As announced, the “control plane” was unavailable for a few hours, during which users couldn’t start new VMs. As also announced, there was a short interruption of network connectivity for every VM. Unfortunately, this interruption turned out to be much longer for some VMs behind user-defined software routers. Some of these routers were misconfigured after the upgrade, and it took us a few hours to diagnose and repair them. Sorry about that!
What changes for me as a SWITCHengines user?
Not much, so far. There are many changes “under the hood”, but only a few are visible. If you use the dashboard (“Horizon”), you will notice a few slight improvements in the user interface. For instance, the selectors for region—LS or ZH—and project—formerly called “tenant”—have been combined into a single element.
The many bug fixes between Icehouse and Juno should make the overall SWITCHengines experience more reliable. If you notice otherwise, please let us know through the usual support channel.
With the upgrade finished, we will switch back to our previous agile process of rolling out small features and fixes every week or so. There are a few old and new glitches that we know we have to fix over the coming weeks. We will also add more servers to accommodate increased usage. To support this growth, we will replace the current network in the ZH region with a more scalable “leaf/spine” network architecture based on “bare-metal” switches. We are currently testing this in a separate environment.
By the end of the year, we will have a solid infrastructure basis for SWITCHengines, which will “graduate” from its current pilot phase and become a regular service offering in January 2016. In the SCALE-UP project, which started in August 2015 with the generous support of swissuniversities’ SUC P-2 program, many partners from the Swiss university community will work together to add higher-level services and additional platform enhancements. Stay tuned!