SWITCH Cloud Blog



Starting 1000 instances on SWITCHengines

Is it really possible with OpenStack to start 1000 instances, run a parallel computation, and then save the data and delete the instances?
To answer this question we tested it on SWITCHengines. I had a lot of trouble getting this to work, and I have to thank the other OpenStack operators I have been chatting with: Mattia Belluco, Matteo Panella and Anton Aksola.
Our OpenStack control plane is deployed with a dedicated pet VM for each OpenStack service (Nova, Cinder, Neutron, Glance and Keystone) and a generic controller VM where we run the MySQL and RabbitMQ services. This configuration makes it possible to monitor each OpenStack service as an isolated VM, and it makes it easier for us to identify bottlenecks in the control plane.
For this experiment we never used the web interface, but only the OpenStack CLI, with this reference command line:

openstack server create \
--image "Ubuntu Xenial 16.04 (SWITCHengines)" \
--flavor c1.small \
--network demo-network \
--user-data cloud-init.txt \
--key-name mykey \
--min 100 \
--max 100 test

The c1.small flavor has just 1 CPU core and 1GB of RAM.

We did the experiment in 4 steps, trying with 100, 200, 400 and 1000 instances. To make sure that the instances were really started and operational, we used cloud-init to make them phone home to a registration server. This is a very easy cloud-init feature to use; here is an example cloud-init.txt file:


#cloud-config
phone_home:
  url: http://x.x.x.x:8000/$INSTANCE_ID/
  post: [ hostname, fqdn ]

In this GitHub gist we share the Python code to run the registration service.
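The gist itself is not reproduced here; a minimal sketch of such a registration service (illustrative only, listening on port 8000 to match the cloud-init example above) could look like this:

#!/usr/bin/env python3
# Minimal phone-home registration server (illustrative sketch, not the actual gist).
# cloud-init POSTs hostname and fqdn to http://x.x.x.x:8000/$INSTANCE_ID/;
# we record which instance IDs have phoned home.
from http.server import BaseHTTPRequestHandler, HTTPServer

registered = set()

class PhoneHomeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        instance_id = self.path.strip('/')
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)  # form data sent by cloud-init (hostname, fqdn)
        registered.add(instance_id)
        print("%d instances phoned home, last: %s" % (len(registered), instance_id))
        self.send_response(200)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8000), PhoneHomeHandler).serve_forever()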

The first test with 100 instances did not work. We tried a few runs and we always had a minimum of 4 to a maximum of 7 instances that did not start for various reasons. Monitoring our control plane we noticed that we were saturating the CPUs and memory of the nova and neutron pets.
We increased the resources for both the nova and the neutron pets from 4 to 16 CPU cores and we doubled the memory from 8 GB to 16 GB.
After these changes we were able to start 100 instances without problems. We noticed that the neutron pet had a higher load than the nova pet during the process of creating 100 instances.

When we tried with 200 instances, they were all reported as running by OpenStack, but we always had a minimum of 8 to a maximum of 20 instances not phoning home. Looking at the serial console with the command:

openstack console log show <server>

we noticed that these instances were not able to get an IP address from the DHCP server, and the DHCP client would give up after 300 seconds. Using the hint that the neutron pet was more loaded than the nova pet, we found out that the nova instances reached the RUNNING state while the corresponding neutron ports were still in the BUILDING phase.
Suspecting a race condition between nova instances and neutron ports, I asked on the OpenStack developers mailing list, and it turned out that we had a wrong configuration.

We changed our nova.conf as follows, so that Nova waits for Neutron to confirm that the network port has been plugged and marks the instance as failed if that confirmation does not arrive within the timeout:

vif_plugging_is_fatal=True
vif_plugging_timeout=300

After fixing the configuration we had the same result, but instead of the instances starting and not being able to obtain an IP address, they never started and were reported in ERROR state by OpenStack.
The real challenge was not to schedule 200 instances, but to allocate 200 network ports.
Troubleshooting in this direction, we observed that the RabbitMQ queues of the neutron dhcp agents were filling up during port creation. For each created port, the dhcp agent has to add a corresponding line to the file /var/lib/neutron/dhcp/$UUID/host, where $UUID is the UUID of the corresponding Neutron network.
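As an illustration (the values here are made up), every line in that dnsmasq host file maps the MAC address of a Neutron port to a hostname and an IP address, roughly like this:

fa:16:3e:12:34:56,host-10-10-0-5.openstacklocal,10.10.0.5
fa:16:3e:ab:cd:ef,host-10-10-0-6.openstacklocal,10.10.0.6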

We looked into the details of what happens when a neutron port is created. Using the guru meditation report we traced the culprit down to a slow “ip route list” call.
This command is called every time a neutron port is created:

time sudo ip netns exec qdhcp-7a1cfb7f-2960-45f5-903f-0d602450525a ip route list
default via 10.10.0.1 dev tapaf136b11-a5
10.10.0.0/16 dev tapaf136b11-a5 proto kernel scope link src 10.10.0.2
real 0m0.048s
user 0m0.000s
sys 0m0.016s

However, calling the same command through neutron-rootwrap takes about 10 times longer:

time sudo neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec qdhcp-7a1cfb7f-2960-45f5-903f-0d602450525a ip route list dev tapaf136b11-a5
default via 10.10.0.1
10.10.0.0/16 proto kernel scope link src 10.10.0.2
real 0m0.713s
user 0m0.472s
sys 0m0.172s

Once we identified this bottleneck, we changed the configuration again to enable the OpenStack rootwrap to work in daemon mode. In the default mode, each rootwrap call starts a new Python interpreter and parses the filter configuration before running the command; in daemon mode a single persistent privileged process executes all the commands, which removes most of that overhead.
We had to change the agent section of neutron.conf:

[agent]
root_helper = sudo neutron-rootwrap /etc/neutron/rootwrap.conf
root_helper_daemon=sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf

After this change, we were able to successfully start 200, 400 and 1000 instances.
With 1000 instances we still got an HTTP 504 Gateway Timeout error.
This happens because the nova-api server takes longer than the reverse proxy timeout to answer the request. The reverse proxy replies with HTTP 504, but the nova-api server later finishes processing the request with an HTTP 200. This is easily fixed by using a longer timeout, but we plan to trace the problem in detail to shorten the processing time of the request.
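For example, if the reverse proxy were haproxy (shown here only as an illustration, with made-up values), the timeout could be raised like this:

defaults
    timeout client 300s
    timeout server 300s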

Finally, the answer is yes: with OpenStack it is really possible to quickly start 1000 instances and have compute power just when it is needed.


Temporary elevated reachability (aka – security issue on some VMs)

TL;DR: A small percentage of VMs running in the Zurich region of SWITCHengines weren’t protected by the default firewall from 11.8 to 18.8.2017. The root problem has been fixed. We have implemented additional measures to prevent this from happening again.
Thanks to one of our users, we recently were made aware of a security problem that affected a small percentage of our customers' virtual machines. While it looked like the machines were protected by the standard OpenStack firewall rules, in effect those machines were completely open to the Internet. We were able to fix the problem within a few hours. Our investigations showed that the problem existed for roughly one week (starting 11.8.2017, ending 18.8.2017) due to a mismatch between the software deployed on 5 specific hypervisors and the configuration applied to them.
If you were affected by this problem, you already have received an email from us. If you didn’t receive anything, your VMs were secure.

Technical background

Each VM running on SWITCHengines is completely isolated from the Internet. VMs run on a number of hypervisors (physical servers) that share an internal physical network (the hypervisors are also isolated from the Internet and can't be reached from the outside). In order for a VM to reach the Internet, it has to use a software defined virtual network that connects it to one of our “Network Nodes”. These network nodes are special virtual machines that have both an address on our private network and an address on the Internet, and can thus bridge between virtual machines and the Internet. They use a technique called NAT (Network Address Translation) to provide external access to the VMs and vice versa. The software component running on the network nodes is called Neutron; it is part of the OpenStack project.
On each hypervisor, another part of Neutron runs. It controls the connectivity and security of the virtual machines on that hypervisor: it manages the virtual networks and the access of the VMs to these networks, and it also controls the firewall component (we use `iptables`, a standard component of the Linux operating system running on the hypervisors). By default, each VM running on the hypervisor is protected by strict security rules that disallow access to any ports on the VM.
A user can configure these security groups by adding rules through the SWITCHengines GUI. When a security rule is modified, the Neutron server sends a command to the Neutron component on the hypervisor, which in turn adds the relevant rules to the `iptables` configuration on that specific hypervisor.
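As an illustration (not taken from this incident), a user opening TCP port 22 in the default security group would run something like the following OpenStack CLI command, which Neutron then translates into `iptables` rules on the hypervisor hosting the VM:

openstack security group rule create --protocol tcp --dst-port 22 --remote-ip 0.0.0.0/0 default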

Cause of loss of Firewall functionality

Besides upgrading the OpenStack software regularly (every 6 months), we also maintain and upgrade the operating system on the hypervisors. Over the last months we have been busy upgrading the hypervisors from Ubuntu Trusty (14.04) to Ubuntu Xenial (16.04). Upgrading a server takes a long time, because we live migrate all running VMs from that server to another, then upgrade the server OS (together with upgrades of any installed packages) and then take the server back into production, i.e. move VMs to it. This process has been ongoing for the last 3 months and we will be finished in September 2017.
The 5 hypervisors that were affected by the problem were the first ones to be upgraded to Ubuntu Xenial and the OpenStack Newton components. Because they were upgraded early, they had an older version of OpenStack Newton installed. There was a bug in the Neutron component of that OpenStack release; however, that bug didn't surface at first.
On 11 August, we made routine configuration changes to all hypervisors running the Newton release. This configuration change went well on all recently installed hypervisors, but caused the firewall rules to be dropped on the older machines.
When we were made aware of the problem and upgraded the Newton components, the firewall rules were recreated and the VMs protected again.

Remedies

We have identified two fundamental problems through this incident:
  • Some servers have different software versions of the same components due to them being installed at different times
  • We didn’t detect the lack of firewall rules for the affected VMs
To address the first problem, we are being more strict about specifying the exact release of each software component that we install on all our servers. We strive to have identical installations everywhere.
To address the second problem, we have written a script that checks the firewall status for all running virtual machines. We will incorporate this script into our regular monitoring and testing so that we will be alerted about that problem automatically, should it happen again.
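A simplified sketch of such a check (illustrative only; the real script also maps tap interfaces back to VMs and their security groups) verifies that every tap device on a hypervisor is referenced by at least one `iptables` rule:

#!/usr/bin/env python3
# Simplified firewall check (illustrative sketch): for every tap interface on
# this hypervisor, verify that at least one iptables rule references it.
# Must run as root on the hypervisor.
import subprocess

links = subprocess.check_output(['ip', '-o', 'link', 'show']).decode()
taps = [line.split(':')[1].strip().split('@')[0]
        for line in links.splitlines() if ': tap' in line]

rules = subprocess.check_output(['iptables-save']).decode()

for tap in taps:
    if tap not in rules:
        print('WARNING: no iptables rules reference %s' % tap)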
We take the security of SWITCHengines seriously, and we are sorry that we left some of our customers' VMs unprotected. Thanks to the people who reported the problem, and thank you for your understanding. We are sorry for any problems this might have caused you.
Jens-Christian Fischer
Product Owner SWITCHengines