SWITCH Cloud Blog



Temporary elevated reachability (aka – security issue on some VMs)

TL;DR: A small percentage of VMs running in the Zurich region of SWITCHengines weren’t protected by the default firewall from 11.8 to 18.8.2017. The root problem has been fixed. We have implemented additional measures to prevent this from happening again.
Thanks to one of our users, we were recently made aware of a security problem that affected a small percentage of our customers’ virtual machines. While it looked like the machines were protected by the standard OpenStack firewall rules, in effect those machines were completely open to the Internet. We were able to fix the problem within a few hours. Our investigations showed that the problem existed for roughly one week (starting 11.8.2017, ending 18.8.2017) due to a mismatch between the software deployed on 5 specific hypervisors and the configuration applied to them.
If you were affected by this problem, you have already received an email from us. If you didn’t receive anything, your VMs were secure.

Technical background

Each VM running on SWITCHengines is completely isolated from the Internet. VMs run on a number of hypervisors (physical servers) that share an internal physical network (the hypervisors are also isolated from the Internet and can’t be reached from the outside). In order for a VM to reach the Internet, it has to use a software defined virtual network that connects it to one of our “Network Nodes”. These network nodes are special virtual machines that have both an address on our private network and an address on the Internet – and can thus bridge between virtual machines and the Internet. They use a technique called NAT (Network Address Translation) to provide external access to the VMs and vice versa. The software component running on the network nodes is called Neutron – it’s a part of the OpenStack project.
On each hypervisor, another part of Neutron runs. This part controls the connectivity and the security access to the virtual machines running on the hypervisor. The Neutron component on the hypervisor is responsible for managing the virtual networks and the access from the VMs to these virtual networks. In addition, it also controls the firewall component (we use `iptables`, a standard component of the Linux operating system running on the hypervisors). By default, each VM running on the hypervisor is protected by strict security rules that disallow access to any ports on the VM.
A user can configure these security groups by adding rules through the SWITCHengines GUI. When a security rule is modified, the Neutron Server sends a command to the Neutron component on the Hypervisor that in turn adds the relevant rules to the `iptables` configuration on that specific hypervisor.
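To make the "closed by default, opened by rules" idea concrete, here is a minimal sketch in Python. It is not how Neutron or `iptables` are implemented; the `IngressRule` class and the `is_allowed` function are hypothetical names made up for this illustration, and the IP address is a documentation example address.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class IngressRule:
    """One 'allow' rule of a security group (hypothetical, simplified model)."""
    protocol: str      # "tcp" or "udp"
    port_min: int      # first port of the allowed range
    port_max: int      # last port of the allowed range
    remote_cidr: str   # source network that may connect, e.g. "0.0.0.0/0"


def is_allowed(rules, protocol, port, source_ip):
    """Default deny: traffic passes only if some rule explicitly matches it."""
    src = ip_address(source_ip)
    for rule in rules:
        if (rule.protocol == protocol
                and rule.port_min <= port <= rule.port_max
                and src in ip_network(rule.remote_cidr)):
            return True
    return False


# A VM without any rules accepts nothing; adding a rule opens exactly that port.
no_rules = []
ssh_only = [IngressRule("tcp", 22, 22, "0.0.0.0/0")]

print(is_allowed(no_rules, "tcp", 22, "203.0.113.7"))  # False
print(is_allowed(ssh_only, "tcp", 22, "203.0.113.7"))  # True
print(is_allowed(ssh_only, "tcp", 80, "203.0.113.7"))  # False
```

The important property is the final `return False`: if no rule matches, the traffic is dropped. The incident described in this post corresponds to that fallback being absent for the affected VMs.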

Cause of loss of Firewall functionality

Besides upgrading the OpenStack software regularly (every 6 months), we also maintain and upgrade the operating system on the hypervisors. Over the last months we have been busy upgrading the hypervisors from Ubuntu Trusty (14.04) to Ubuntu Xenial (16.04). Upgrading a server takes a long time, because we live migrate all running VMs from that server to another, then upgrade the server OS (together with upgrades of any installed packages) and then take the server back into production – i.e. move VMs to it. This process has been ongoing for the last 3 months and we will be finished in September 2017.
The 5 hypervisors that were affected by the problem were the first ones to be upgraded to Ubuntu Xenial and the OpenStack Newton components. Because they were upgraded early, they had an older version of OpenStack Newton installed. There was a bug in the Neutron component of that OpenStack release – however, that bug didn’t surface at first.
On 11 August, we made routine configuration changes to all hypervisors running the Newton release. This config change went well on all recently installed hypervisors, but caused the firewall rules to be dropped on the older machines.
When we were made aware of the problem and upgraded the Newton component, the firewall rules were recreated and the VMs protected. 

Remedies

We have identified two fundamental problems through this incident:
  • Some servers have different software versions of the same components due to them being installed at different times
  • We didn’t detect the lack of firewall rules for the affected VMs
To address the first problem, we are being more strict about specifying the exact release of each software component that we install on all our servers. We strive to have identical installations everywhere.
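As an illustration of what "identical installations everywhere" means in practice, here is a small sketch that reports every package installed in more than one version across a set of servers. This is not our actual tooling; the function name `find_version_drift`, the host names, the package names and the version numbers are all made up for the example.

```python
from collections import defaultdict


def find_version_drift(inventory):
    """Return {package: {version: [hosts]}} for packages with more than one version.

    `inventory` maps a host name to a {package: version} dictionary.
    """
    by_package = defaultdict(lambda: defaultdict(list))
    for host, packages in inventory.items():
        for package, version in packages.items():
            by_package[package][version].append(host)
    return {
        package: {version: sorted(hosts) for version, hosts in versions.items()}
        for package, versions in by_package.items()
        if len(versions) > 1
    }


# Hypothetical inventory: hv-01 was installed earlier than the other two hosts.
inventory = {
    "hv-01": {"network-agent": "1.0.0", "compute-agent": "2.1.0"},
    "hv-02": {"network-agent": "1.0.3", "compute-agent": "2.1.0"},
    "hv-03": {"network-agent": "1.0.3", "compute-agent": "2.1.0"},
}

drift = find_version_drift(inventory)
print(drift)
```

With this sample inventory only `network-agent` is reported, because `compute-agent` has the same version on all three hosts.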
To address the second problem, we have written a script that checks the firewall status for all running virtual machines. We will incorporate this script into our regular monitoring and testing so that we will be alerted about that problem automatically, should it happen again.
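The idea behind such a check can be sketched as follows. This is not the script we run; it only assumes that the firewall state is available as text in the format written by `iptables-save` (chain declarations start with `:`, rules start with `-A <chain>`). The mapping from a VM to the chain that should protect it, and the names `vm-chain-a` and so on, are hypothetical and would have to come from the real deployment.

```python
def rules_per_chain(iptables_save_text):
    """Count the '-A <chain> ...' rule lines per chain in iptables-save style text."""
    counts = {}
    for raw_line in iptables_save_text.splitlines():
        line = raw_line.strip()
        if line.startswith(":"):
            # Chain declaration, e.g. ":vm-chain-a - [0:0]"
            counts.setdefault(line[1:].split()[0], 0)
        elif line.startswith("-A "):
            # Rule appended to a chain, e.g. "-A vm-chain-a -j DROP"
            chain = line.split()[1]
            counts[chain] = counts.get(chain, 0) + 1
    return counts


def unprotected_vms(iptables_save_text, expected_chain_by_vm):
    """Return the VMs whose expected chain is missing or contains no rules."""
    counts = rules_per_chain(iptables_save_text)
    return sorted(
        vm for vm, chain in expected_chain_by_vm.items()
        if counts.get(chain, 0) == 0
    )


# Hypothetical dump: vm-a is protected, vm-b has an empty chain, vm-c has none.
SAMPLE_DUMP = """\
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:vm-chain-a - [0:0]
:vm-chain-b - [0:0]
-A FORWARD -j vm-chain-a
-A vm-chain-a -p tcp -m tcp --dport 22 -j RETURN
-A vm-chain-a -j DROP
COMMIT
"""

expected = {"vm-a": "vm-chain-a", "vm-b": "vm-chain-b", "vm-c": "vm-chain-c"}
print(unprotected_vms(SAMPLE_DUMP, expected))  # ['vm-b', 'vm-c']
```

A check like this only tells us that rules exist for a VM, not that they are the right rules, so it complements rather than replaces an actual connection test from the outside.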
We take the security of SWITCHengines seriously and we are sorry that we left some of our customers’ VMs unprotected. Thanks to the people who reported the problem, and thank you for your understanding. We are sorry for any problems this might have caused you.
Jens-Christian Fischer
Product Owner SWITCHengines


New version, new features

We are constantly working on SWITCHengines, updating, tweaking stuff. Most of the time, little of this process is visible to users, but sometimes we release features that make a difference in the user experience.

A major change was the upgrade to OpenStack Kilo that we did in mid-March. OpenStack is the software that powers our cloud, and it gets an update every 6 months. The releases are named alphabetically. Our cloud’s history started with the “Icehouse” release, moved to “Juno” and now we are on “Kilo”. Yesterday “Mitaka” was released, so we are 2 releases (or 12 months) behind.
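The arithmetic behind "2 releases (or 12 months) behind" is simple enough to write down. The sketch below uses only the release names mentioned in this post and the 6 month release cycle; the function name `lag` is made up for the example.

```python
# Release names in alphabetical (and chronological) order, as used in this post.
RELEASES = ["Icehouse", "Juno", "Kilo", "Liberty", "Mitaka", "Newton"]
MONTHS_PER_RELEASE = 6


def lag(deployed, latest):
    """Return (releases behind, months behind) for a deployed release."""
    behind = RELEASES.index(latest) - RELEASES.index(deployed)
    return behind, behind * MONTHS_PER_RELEASE


print(lag("Kilo", "Mitaka"))  # (2, 12)
```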

Upgrading the cloud infrastructure is major work. Our goal is to upgrade “in place” with all virtual machines running uninterrupted during the upgrade. Other cloud operators install a new version on minimal hardware, then start to migrate the customer machines one by one to the new hardware, and converting the hypervisors one by one. This is certainly feasible, but it causes downtime – something we’d like to avoid.

Therefore we spend a lot of time testing the upgrade path. The upgrade from “Icehouse” to “Juno” took over 6 months (first we needed to figure out how to do the upgrade in the first place, then had to implement and test it). The upgrade from “Juno” to “Kilo” then only took 4 months (with x-mas and New Year in it). Now we are working on the upgrade to “Liberty”, which is planned to happen before June / July. This time, we plan to be even faster, because we are going to upgrade the many components of OpenStack individually. The upgrade to the just released “Mitaka” should be done before “Newton” is released in October. Our plan is to be at most 6 months behind the official release schedule.

So what does Kilo bring you, the end user? A slightly different user interface, loads of internal changes and a few new major features:

There is also stuff coming in the next few weeks:

  • Access to the SWIFT object store
  • Backup of Volumes (that is something we are testing right now)
  • IPv6 addresses for virtual machines

We have streamlined the deployment process of changes – while we did releases once a week during the last year, we now can deploy new features as soon as they are finished and tested.

 


Is there a chance for a Swiss Academic Cloud?

At our recent ICT Focus Meeting, where SWITCH customers and SWITCH employees meet to discuss the needs of customers, Edouard Bugnion, one of the founders of VMware and now professor of Computer Science at EPFL, gave an interesting keynote speech, “Towards Data Center Systems”. Our project lead, Patrik Schnellmann, had the opportunity to conduct an interview with Edouard about Swiss Academic Clouds.

We are of course hard at work building a cloud offering for Swiss Academia – SWITCHengines – and Edouard’s views justify what we are doing.


Doing the right thing

I am returning from GridKA school, held annually at the KIT in Karlsruhe, where I co-hosted a two day workshop on installing OpenStack with Antonio Messina and Tyanko Alekseiev from the University of Zurich. (You can find the course notes and tutorials over on Github.) I don’t want to talk about the workshop so much (it was fun, our attendees were enthusiastic and we ended up with 8 complete OpenStack Grizzly clouds) as about the things that I experienced in the plenary sessions.
A bit of background on me: I joined SWITCH in April 2013 to work on the cloud. Before that, I had been self-employed, running my own companies, and worked in a number of startups. I left academia in 1987 (without a degree) and returned to it in 2010 when I started (and finished) a Masters in Science. Early on, friends and family told me that I should pursue an academic career, but I always wanted to prove myself in the commercial world… Well, being a bit closer to Academia was one of the reasons I joined SWITCH.
Back to GridKA: Presenting at the workshop, teaching and helping people with complex technical software is something I have done quite a bit over the last 20 years, and something I’m quite good at (or so my students tell me). Nothing special, business as usual so to speak.
There was also a plenary program with presentations from various people attending GridKA school. And although I only got to see a few of those due to my schedule, I was absolutely blown away by what I heard. Dr. Urban Liebel talked about microscopes in Life Sciences – the ability to automatically take pictures of thousands of samples and use image recognition algorithms to classify them. He told us about some of the results they discovered (Ibuprofen is doing damage to kidneys in children and increases the risk of kidney cancer, something science didn’t know until recently) now that they can investigate more samples faster.
José Luis Vázquez-Poletti in his talk “Cloud Computing: Expanding Humanity’s Limits to Planet Mars” talked about installing meteorological sensors on Mars and how to use cloud computing resources to help pinpoint the location of those sensors, once they had been deployed on Mars (basically by just dropping them down on the surface – ballistic entry). By looking at the transitions of Phobos, the moon of Mars, they are able to determine the location of the landed sensor.
Benedikt Hegener from CERN talked about “Effective Programming and Multicore Computing” in which he described the trials and tribulations the CERN programmers have to go through to parallelize 5 million lines of code in order to make the code take advantage of multi-core computers.
There were several other talks that I unfortunately didn’t have a chance to attend. The point of all this?
During those talks it hit me that the work these scientists are doing is creating value on a much deeper level than what most startups are creating. By working on the methods to automatically take microscopic pictures and analyse them, and increasing the throughput, these people directly work on the improvement of our living conditions. While the Mars and CERN experiments don’t seem to have immediate benefits, both space research and high energy physics have greatly contributed to our lives as well. A startup that is creating yet another social network, yet another photo sharing site, all with the intent of making investors happy (by generating loads of money) just doesn’t have the same impact on society.
My work here at SWITCH doesn’t really have the same impact, but I think that the work of building Cloud infrastructure can help some researchers out there in Switzerland achieve their work more easily, faster or cheaper. In which case, my work at least contributed in a “supporting act”. What more could one want?


The PetaSolutions Blog

Welcome, dear reader, to the Peta Solutions Blog. “Another blog?”, you ask – yes very much so…

Let me start by providing a bit of background to who we are and what we are doing, this might help set the context for the diversity of things you are going to read here.

The Peta Solutions team is located in the “Researchers and Lecturers” Division of SWITCH. Peta (of course) means big (bigger than Tera, anyway) and gives an indication of what we are working with:
Big things… We are here to help researchers with, shall we say, specialised needs in their ICT infrastructure. This started several years ago with Grid activities (several of our team members have been working in Grid related projects over the last years) and has since grown to include Cloud (we have been busy building our own cloud over the last months), SDN (Software Defined Networking), Network performance (our PERT – Performance Emergency Response Team – stands by in case of performance problems) and more.

We work directly with researchers, and help them get up to speed on these issues.

So what should you expect from this blog? We have a couple of ideas, some of us have blogged for quite a while, some are taking a wait and see attitude – the normal mix in other words.

We plan to talk about our experiences building, maintaining and operating infrastructure, maybe providing you with the crucial nugget of information that helps you solve a problem. We invite researchers we are working with to share their experiences. We will sometimes wax philosophical about things that are on our collective minds.

In any case, we are happy if all of this turns into a discourse: you are most welcome to respond.

Yours
Alessandra, Alessandro, Jens-Christian, Kurt, Placi, Rüdiger, Sam, Simon, Valery