TL;DR: A small percentage of VMs running in the Zurich region of SWITCHengines weren’t protected by the default firewall from 11.8 to 18.8.2017. The root problem has been fixed. We have implemented additional measures to prevent this from happening again.
Thanks to one of our users, we were recently made aware of a security problem that affected a small percentage of our customers’ virtual machines. While it looked like the machines were protected by the standard OpenStack firewall rules, in effect those machines were completely open to the Internet. We were able to fix the problem within a few hours. Our investigation showed that the problem existed for roughly one week (from 11.8.2017 to 18.8.2017) due to a mismatch between the software deployed on 5 specific hypervisors and the configuration applied to them.
If you were affected by this problem, you have already received an email from us. If you didn’t receive anything, your VMs were secure.
Each VM running on SWITCHengines is completely isolated from the Internet. VMs run on a number of hypervisors (physical servers) that share an internal physical network. The hypervisors are also isolated from the Internet and can’t be reached from the outside. In order for a VM to reach the Internet, it has to use a software-defined virtual network that connects it to one of our “Network Nodes”. These network nodes are special virtual machines that have both an address on our private network and an address on the Internet, and can thus bridge between virtual machines and the Internet. They use a technique called NAT (Network Address Translation) to provide external access to the VMs and vice versa. The software component running on the network nodes is called Neutron; it is part of the OpenStack project.
On each hypervisor, another part of Neutron runs. This part controls the connectivity of, and the security access to, the virtual machines running on that hypervisor. The Neutron component on the hypervisor is responsible for managing the virtual networks and the access from the VMs to these virtual networks. In addition, it controls the firewall component (we use `iptables`, a standard component of the Linux operating system running on the hypervisors). By default, each VM running on the hypervisor is protected by strict security rules that disallow access to any ports on the VM.
A user can configure these security groups by adding rules through the SWITCHengines GUI. When a security rule is modified, the Neutron Server sends a command to the Neutron component on the hypervisor, which in turn adds the relevant rules to the `iptables` configuration on that specific hypervisor.
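To make the default-deny model more concrete, here is a minimal sketch of how a security group rule could be turned into `iptables` rule lines. It is a simplified illustration, not Neutron’s actual code: the function names, the rule dictionary keys (`protocol`, `port_min`, `port_max`, `remote_ip_prefix`) and the chain name are assumptions made for this example.

```python
from ipaddress import ip_network


def rule_to_iptables(chain, rule):
    """Render one ingress allow rule as an iptables rule line.

    `rule` is a hypothetical dict; its keys are assumptions for this sketch.
    """
    protocol = rule["protocol"]
    if protocol not in ("tcp", "udp"):
        raise ValueError(f"unsupported protocol: {protocol}")

    # Validate and normalise the source network; default is "anywhere".
    source = ip_network(rule.get("remote_ip_prefix", "0.0.0.0/0"))

    port_min = rule["port_min"]
    port_max = rule.get("port_max", port_min)
    if not 1 <= port_min <= port_max <= 65535:
        raise ValueError("invalid port range")
    ports = str(port_min) if port_min == port_max else f"{port_min}:{port_max}"

    return (f"-A {chain} -s {source} -p {protocol} -m {protocol} "
            f"--dport {ports} -j ACCEPT")


def chain_rules(chain, rules):
    """Allow rules first, then a final DROP: anything not allowed is denied."""
    lines = [rule_to_iptables(chain, r) for r in rules]
    lines.append(f"-A {chain} -j DROP")
    return lines
```

With an empty rule list, `chain_rules` returns only the final `DROP` line, which corresponds to the default state described above. If the rules for a VM are missing from `iptables` altogether, nothing is filtered, which is the failure mode of this incident.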
Cause of loss of Firewall functionality
Besides upgrading the OpenStack software regularly (every 6 months), we also maintain and upgrade the operating system on the hypervisors. Over the last few months we have been busy upgrading the hypervisors from Ubuntu Trusty (14.04) to Ubuntu Xenial (16.04). Upgrading a server takes a long time, because we live-migrate all running VMs from that server to another, then upgrade the server OS (together with any installed packages), and then take the server back into production, i.e. move VMs back onto it. This process has been ongoing for the last 3 months and we will be finished in September 2017.
The 5 hypervisors that were affected by the problem were the first ones to be upgraded to Ubuntu Xenial and the OpenStack Newton components. Because they were upgraded early, they had an older version of OpenStack Newton installed. There was a bug in the Neutron component of that OpenStack release; however, that bug didn’t surface at first.
On 11 August 2017, we made routine configuration changes to all hypervisors running the Newton release. This configuration change went well on all recently installed hypervisors, but caused the firewall rules to be dropped on the older machines.
Once we were made aware of the problem, we upgraded the Newton component on the affected hypervisors. The firewall rules were then recreated and the VMs were protected again.
We have identified two fundamental problems through this incident:
- Some servers have different software versions of the same components due to them being installed at different times
- We didn’t detect the lack of firewall rules for the affected VMs
To address the first problem, we are being more strict about specifying the exact release of each software component that we install on all our servers. We strive to have identical installations everywhere.
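As an illustration of what checking for identical installations can look like, the following sketch compares package versions across hosts and reports every package that is installed in more than one version. The inventory format (host name mapped to a dictionary of package name and version) and the function name are assumptions for this example; how the inventory is collected is left open.

```python
from collections import defaultdict


def find_version_drift(inventory):
    """Return packages installed in more than one version.

    `inventory` maps host name -> {package name: version string}.
    The result maps package -> {version: sorted list of hosts}.
    """
    versions = defaultdict(lambda: defaultdict(list))
    for host, packages in inventory.items():
        for package, version in packages.items():
            versions[package][version].append(host)

    return {
        package: {v: sorted(hosts) for v, hosts in by_version.items()}
        for package, by_version in versions.items()
        if len(by_version) > 1  # more than one version in use means drift
    }


if __name__ == "__main__":
    # Host names and version strings below are made up for the example.
    example = {
        "hv1": {"neutron-agent": "1.0.0"},
        "hv2": {"neutron-agent": "1.0.1"},
        "hv3": {"neutron-agent": "1.0.1"},
    }
    print(find_version_drift(example))
```

An empty result means every package has the same version on all hosts that have it installed.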
To address the second problem, we have written a script that checks the firewall status for all running virtual machines. We will incorporate this script into our regular monitoring and testing so that we will be alerted about that problem automatically, should it happen again.
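The idea behind such a check can be sketched as follows. This is not our actual script, only a minimal illustration: it takes the text output of `iptables-save` from a hypervisor and a list of VM port IDs, and reports the ports for which no per-port chain with rules exists. The chain naming scheme (a prefix followed by the first characters of the port ID) is an assumption here and is exposed as parameters so it can be adapted.

```python
def chains_with_rules(iptables_save_output):
    """Return the names of all chains that have at least one '-A' rule."""
    chains = set()
    for line in iptables_save_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "-A":
            chains.add(parts[1])
    return chains


def unprotected_ports(port_ids, iptables_save_output,
                      chain_prefix="neutron-openvswi-i", id_length=10):
    """Return the port IDs that have no ingress chain with rules.

    chain_prefix and id_length describe an assumed naming scheme:
    chain name = prefix + first `id_length` characters of the port ID.
    """
    present = chains_with_rules(iptables_save_output)
    return [
        port_id for port_id in port_ids
        if f"{chain_prefix}{port_id[:id_length]}" not in present
    ]
```

A monitoring job could run this per hypervisor and raise an alert whenever the returned list is not empty.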
We take the security of SWITCHengines seriously and we are sorry that we left some of our customers’ VMs unprotected. Thanks to the people who reported the problem, and thank you for your understanding. We apologize for any problems this might have caused you.
Product Owner SWITCHengines