SWITCH Cloud Blog



Hosting and computing public scientific datasets in the cloud

SWITCH offers a cloud computing service called SWITCHengines, built on OpenStack and Ceph. These computing resources are targeted at research usage, where the demand for Big Data (Hadoop, Spark) workloads is increasing. Big Data analysis requires access to large datasets, and in several use cases the analysis is done against publicly available datasets. To add value to the cloud computing service, our plan is to provide a place for researchers to store their datasets and make them available to others. A possible alternative at the moment is Amazon EC2, because many datasets are available for free in Amazon S3 when the computation is done within the Amazon infrastructure. Storing as many datasets as Amazon does is challenging, because the raw space consumed in an object storage system is usually 3 times the size of the stored data. Each dataset is hundreds of TB in size, so storing a few of them with a replication factor of 3 means working in the PB domain.

Technical challenges

While it sounds like an easy task to share some scientific data in a cloud datacenter, there are some problems you will have to address.

1) The size of the datasets is challenging. It costs money to host the datasets. To make the service cost sustainable, you must find reduced-redundancy solutions with a reasonable cost and a reasonable risk of data loss. This is acceptable because these datasets are public and most likely other providers host a copy, so in case of disaster, recovery of the data will be possible. The key idea is to store the dataset locally to speed up access to the data during computation, not to provide persistent long-term storage. Before we can host a dataset, we have to download it from another source. This first copy operation is also challenging because of the size of the data involved.

2) We need a proper ACL system to control who can access the data. We can’t just publish all the datasets with public access. Some datasets require the user to sign an End User Agreement.

How we store the data

We use Ceph as the storage backend in our cloud datacenter. We have Ceph pools for the OpenStack rbd volumes, but we also use the rados gateway to provide an object storage service, accessible via S3 and Swift compatible APIs. We believe the best solution is to store the scientific datasets on object storage, because the size of the data requires a technical solution that can scale horizontally, spreading the data across many heterogeneous disks. Our Ceph cluster works with a replication factor of 3, which means that each object exists in 3 replicas on three different servers. To cope with large amounts of data we deployed Erasure Coded pools in our Ceph cluster in production. In this case, at the cost of higher CPU load on the storage nodes, it is possible to store the data with an expansion factor of 1.5 instead of 3. We found the cephnotes blog post very useful: it explains how to create a new Ceph pool with Erasure Coding for use with the rados gateway. We are now running this setup in production, and when we create an S3 bucket we can decide whether the bucket will land on the normal pool or on the erasure coded pool. To make that decision we use the bucket-location command line option, as in this example:

s3cmd mb --bucket-location=:ec-placement s3://mylargedatabucket

It is important to note that the bucket-location option only needs to be given when creating the S3 bucket. Objects written at a later time will always go to the right pool.
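
For reference, this is roughly what the Ceph side of the setup can look like. The commands below are only a sketch: the profile parameters (k=4, m=2 gives the 1.5 expansion factor mentioned above) and the pool name are examples, and the wiring of the ec-placement placement target into the rados gateway region/zone configuration is described in the cephnotes post.

ceph osd erasure-code-profile set ecprofile-k4m2 k=4 m=2
ceph osd pool create .rgw.buckets.ec 128 128 erasure ecprofile-k4m2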

How we download the first copy

How do you download a dataset of 200TB from an external source? How long does it take? The task is of course not trivial; to address this problem Amazon started its Snowball service, where the data is shipped to you via snail mail on a physical device. Because we are an NREN, we decided to download the datasets over the GEANT network from other institutions that host them. We used rclone for our download operations. It is a modern object storage client written in Go that supports multiple protocols. Working with large amounts of data, we soon found some issues with the client. The community behind rclone was very active and helped us. “Fail to copy files larger than 80GB” was the most severe problem we had. We also improved the documentation about Swift keys with slashes, which do not work with Ceph in Swift emulation mode.
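
As an illustration, a typical transfer looked roughly like the following. The remote names are hypothetical and are defined beforehand with rclone config: one remote pointing at the source institution, one at our rados gateway.

rclone copy --transfers 8 --checkers 16 source-institution:dataset-bucket engines:mylargedatabucket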

We also had some issues with the rados gateway. When we write the data, it goes into a Ceph bucket. In theory the data in the bucket can be written and read using both the S3 and the Swift APIs that the rados gateway implements. In practice we found out that there is a difference in how the checksums for multipart objects are computed when using the S3 or the Swift API. This means that if you wrote an object, you have to read it back with the same API you used for writing, otherwise your client will warn you that the object is corrupted. We notified the Ceph developers about this problem, opening a bug that is still open.

Set the ACL to the dataset

When we store a dataset we would like to have read-write access for the cloud admin accounts, and read-only access for the users who are allowed to read the data. This means that after the dataset is uploaded, we should be able to easily grant and revoke access for users and projects. S3 and Swift manage ACLs in very different ways. In S3 there is no concept of an object inheriting the ACL from its parent bucket, which means that a specific ACL must be set for each individual object. The official documentation states:

Bucket and object permissions are completely independent; an object does not inherit the permissions from its bucket. For example, if you create a bucket and grant write access to another user, you will not be able to access that user’s objects unless the user explicitly grants you access.

This does not scale with the size of the dataset, because making an API call for each object is really time consuming. There is a workaround: mark the bucket as completely public, bypassing the authorization for all objects. This does not match our use case, where some users have read-only permission but the dataset is not public. As a viable alternative, we create for each dataset a dedicated OpenStack tenant ‘Dataset X readonly access’ and simply add and remove users from this special tenant, because this is a fast operation. It then becomes very important to set the correct ACL when writing the objects for the first time. Unfortunately our favourite client rclone had no support for setting ACLs, so we had to use s3cmd.
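
Granting and revoking access then boils down to project membership operations, roughly like the following. The user, project and role names are hypothetical and depend on the local Keystone setup.

openstack role add --project dataset-x-readonly --user alice _member_
openstack role remove --project dataset-x-readonly --user alice _member_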

We found out that s3cmd also ignores the --acl-grant flag when writing objects; we reported the bug to the community, but it is still open. This means that for each object you want to write you need two HTTP requests: one to actually write the object and one to set the proper ACL.
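
In practice the upload loop therefore looks roughly like this for every object; the object name and the grantee id are placeholders.

s3cmd put chunk-0001.dat s3://mylargedatabucket/chunk-0001.dat
s3cmd setacl --acl-grant=read:<user_canonical_id> s3://mylargedatabucket/chunk-0001.dat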

In OpenStack Swift you can set the ACL on a Swift container, and all the objects in the container inherit that property. This sounds very good, but we are not running a real OpenStack Swift deployment; we serve objects from Ceph using an implementation of the Swift API.

During our tests we found out that the rados gateway Swift API implementation is not complete: ACLs are not implemented. Other OpenStack operators reported that native Swift deployments work well with ACLs and their inheritance by objects. However, when touching ACLs and large objects, Swift is not bug free either.
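
For comparison, on a native Swift deployment the container-level read ACL we would have liked looks roughly like this; the project and container names are hypothetical.

swift post -r "dataset-x-readonly:*" dataset-x-container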

Access the dataset

To compute the data without wasting time copying it from the object store to an HDFS deployment, the best approach is to point Hadoop directly at the object store, in what is called streaming mode. The result of the computation can be stored on the object store as well, or on a smaller HDFS cluster. To get started we published this tutorial.
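
As a rough sketch, reading a dataset bucket through Hadoop's S3A connector looks like the following; the endpoint and the credentials are placeholders for the values of your object storage service.

hadoop fs -D fs.s3a.endpoint=<s3-endpoint> \
          -D fs.s3a.access.key=<EC2_ACCESS_KEY> \
          -D fs.s3a.secret.key=<EC2_SECRET_KEY> \
          -ls s3a://mylargedatabucket/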

Unfortunately the latest versions of Hadoop require the S3 backend to support the AWS4 signature, which will not work immediately with your SWITCHengines credentials because of this bug, triggered when using Keystone together with the rados gateway. As a workaround we have to manually add the user on our rados gateway deployment to make the AWS4 signature work correctly.
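
The manual workaround is roughly the following; the uid and display name are examples, and the uid has to match the Keystone user.

radosgw-admin user create --uid=<keystone_user_id> --display-name="<user name>"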

Conclusion

Making large scientific datasets available on a multi-tenant cloud is still challenging, but not impossible. Bringing the data close to the compute power is of great importance, and letting users compute on public scientific datasets makes a scientific cloud service more attractive.

Moreover, we envision users producing scientific datasets themselves and being able to share them with other researchers for further analysis.


Backport to OpenStack Juno the Ceph rbd object map feature

How we use Ceph at SWITCHengines

Virtual machine storage in the OpenStack public cloud SWITCHengines is provided with Ceph. We run a Ceph cluster in each OpenStack region. The compute nodes do not have any local storage resource; the virtual machines access their disks directly over the network, because libvirt can act as a Ceph client.

Using Ceph as the default storage for glance images, nova ephemeral disks, and cinder volumes, is a very convenient choice. We are able to scale the storage capacity as needed, regardless of the disk capacity on the compute nodes. It is also easier to live migrate nova instances between compute nodes, because the virtual machine disks are not local to a specific compute node and they don’t need to be migrated.

The performance problem

The load on our Ceph cluster constantly increases, because of the growing number of virtual machines running every day. In October 2015 we noticed that deleting Cinder volumes had become a very slow operation, and the bigger the volume, the longer the wait. Moreover, users orchestrating Heat stacks faced real performance problems when deleting several disks at once.

To identify where the bottleneck originated, we measured how long it took to create and delete rbd volumes directly with the rbd command line client, excluding the Cinder code completely.

The commands to do this test are simple:

time rbd -p volumes create testname --size 1024 --image-format 2
rbd -p volumes info testname
time rbd -p volumes rm testname

We quickly figured out that it was Ceph itself that was slow to delete the rbd volumes. The problem was well known and already fixed in the Ceph Hammer release, which introduced a new feature: the object map.

When the object map feature is enabled on an image, limiting the diff to the object extents will dramatically improve performance since the differences can be computed by examining the in-memory object map instead of querying RADOS for each object within the image.

http://docs.ceph.com/docs/master/man/8/rbd/

In our practical experience the time to delete an image decreased from several minutes to a few seconds.

How to fix your OpenStack Juno installation

We changed the ceph.conf to enable the object map feature as described very well in the blog post from Sébastien Han.

It was as simple as adding the following two lines to ceph.conf:

rbd default format = 2
rbd default features = 13

The value 13 is the sum of the rbd feature bits for layering (1), exclusive-lock (4) and object map (8). We could immediately create new images with the object map, as you can see in the following output:

rbd image 'volume-<uuid>':
    size 20480 MB in 2560 objects
    order 23 (8192 kB objects)
    block_name_prefix: rbd_data.<prefix>
    format: 2
    features: layering, exclusive, object map
    flags:
    parent: images/<uuid>@snap
    overlap: 1549 MB

We were happy it was so easy to fix. However, we soon realized that while everything worked with the rbd command line, all the OpenStack components were ignoring the new options in the ceph.conf file.

We started our investigation with Cinder. We understood that Cinder does not call the rbd command line client at all, but relies on the rbd python library. The Cinder implementation in Juno did not know about these extra features, so it simply ignored our changes in ceph.conf. Support for the object map feature was introduced only in Kilo, in commit 6211d8.

To quickly fix the performance problem before upgrading to Kilo, we decided to backport this patch to Juno. We already carry other small local patches in our infrastructure, so it was part of our standard procedure to add yet another patch and create a new .deb package. After backporting the patch, Cinder started to create volumes correctly, honoring the options in ceph.conf.
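
A quick way to verify the backport is to create a test volume through Cinder and check that the features line of the corresponding rbd image lists the object map; the uuid below is a placeholder.

rbd -p volumes info volume-<uuid>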

Patching Cinder fixed the problem only for Cinder volumes. Virtual machines started from ephemeral disks run on Ceph rbd images created by Nova, and the Glance images uploaded by users are stored in Ceph rbd volumes by Glance, which relies on the glance_store library.

At the end of the story we had to patch three OpenStack projects to completely backport to Juno the ability to use the Ceph object map feature. Here we publish the links to the git branches and packages for nova, glance_store and cinder.

Conclusion

Upgrading every six months to keep the production infrastructure on the current OpenStack release is challenging. Upgrading without downtime requires a lot of testing, and it is easy to fall behind schedule. For this reason most OpenStack installations today run on Juno or Kilo.

We release these patches for all those who are running Juno, because the performance benefit is stunning. However, we strongly advise planning an upgrade to Kilo as soon as possible.

 


Hack Neutron to add more IP addresses to an existing subnet

When we designed our OpenStack cloud at SWITCH, we created a network in the service tenant, and we called it private.

This network is shared with all tenants and it is the default choice when you start a new instance. The name private comes from the fact that you will get a private IP via DHCP. The subnet we chose for this network is 10.0.0.0/24. The allocation pool goes from 10.0.0.2 to 10.0.0.254 and cannot be enlarged any further. This is a problem because we need IP addresses for many more instances.

In this article we explain how we successfully enlarged this subnet to a wider range: 10.0.0.0/16. This operation is not supported by Neutron in Juno, so we show how to hack into Neutron internals. We were able to enlarge the subnet and modify the allocation pool without interrupting the service for the existing instances.

In the following we assume that the network we are talking about has only 1 router; however, this procedure can easily be extended to more complex setups.

What you should know about Neutron is that a Neutron network has two important namespaces on the OpenStack network node:

  • The qrouter is the router namespace. In our setup one interface is attached to the private network we need to enlarge and a second interface is attached to the external physical network.
  • The qdhcp namespace has only one interface, attached to the private network. On your OpenStack network node you will find a dnsmasq process bound to this interface, providing IP addresses via DHCP.
[Figure: Neutron Architecture]

In the figure Neutron Architecture we try to give an overview of the overall system. A virtual machine (VM) can run on any remote compute node. The compute node has an Open vSwitch process running that collects the traffic from the VM and, with proper VXLAN encapsulation, delivers the traffic to the network node. The Open vSwitch at the network node has a bridge containing both the qrouter namespace internal interface and the qdhcp namespace, so the VMs see both the default gateway and the DHCP server on the virtual L2 network. The qrouter namespace has a second interface to the external network.

Step 1: hack the Neutron database

In the Neutron database, look for the subnet; you can easily find it in the subnets table by matching the service tenant id:

select * from subnets WHERE tenant_id='d447c836b6934dfab41a03f1ff96d879';

Take note of the id (which in this table is the subnet_id) and the network_id of the subnet. In our example we had these values:

id (subnet_id) = 2e06c039-b715-4020-b609-779954fa4399
network_id = 1dc116e9-1ec9-49f6-9d92-4483edfefc9c
tenant_id = d447c836b6934dfab41a03f1ff96d879

Now let’s look into the routers database table:

select * from routers WHERE tenant_id='d447c836b6934dfab41a03f1ff96d879';

Again filter for the service tenant. We take note of the router ID.

 id (router_id) = aba1e526-05ca-4aca-9a80-01601cdee79d

At this point we have all the information we need to enlarge the subnet in the Neutron database.

update subnets set cidr='NET/MASK' WHERE id='subnet_id';

So in our example:

update subnets set cidr='10.0.0.0/16' WHERE id='2e06c039-b715-4020-b609-779954fa4399';
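
You can check that the change was applied by reading the row back:

select id, cidr from subnets WHERE id='2e06c039-b715-4020-b609-779954fa4399';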

Nothing happens immediately after you update the values in the Neutron MySQL database. You could reboot your network node and Neutron would rebuild the virtual routers with the new database values. However, we show a better solution that avoids downtime.

Step 2: Update the interface of the qrouter namespace

On the network node there is a namespace qrouter-<router_id>. Let’s have a look at the interfaces using iproute2:

sudo ip netns exec qrouter-<router_id> ip addr show

With the values in our example:

sudo ip netns exec qrouter-aba1e526-05ca-4aca-9a80-01601cdee79d ip addr show

You will see the typical Linux output with all the interfaces that live in this namespace. Take note of the interface name with the address 10.0.0.1/24 that we want to change, in our case

 qr-396e87de-4b

Now that we know the interface name we can change IP address and mask:

sudo ip netns exec qrouter-aba1e526-05ca-4aca-9a80-01601cdee79d ip addr add 10.0.0.1/16 dev qr-396e87de-4b
sudo ip netns exec qrouter-aba1e526-05ca-4aca-9a80-01601cdee79d ip addr del 10.0.0.1/24 dev qr-396e87de-4b

Step 3: Update the interface of the qdhcp namespace

Still on the network node there is a namespace qdhcp-<network_id>. Exactly as we did for the qrouter namespace, we find the interface name and change the IP address to the updated netmask.

sudo ip netns exec qdhcp-1dc116e9-1ec9-49f6-9d92-4483edfefc9c ip addr show
sudo ip netns exec qdhcp-1dc116e9-1ec9-49f6-9d92-4483edfefc9c ip addr add 10.0.0.2/16 dev tapadebc2ff-10
sudo ip netns exec qdhcp-1dc116e9-1ec9-49f6-9d92-4483edfefc9c ip addr show
sudo ip netns exec qdhcp-1dc116e9-1ec9-49f6-9d92-4483edfefc9c ip addr del 10.0.0.2/24 dev tapadebc2ff-10
sudo ip netns exec qdhcp-1dc116e9-1ec9-49f6-9d92-4483edfefc9c ip addr show

The dnsmasq process bound to the interface in the qdhcp namespace is smart enough to automatically detect the change in the interface configuration. This means that from this point on, new instances will get a /16 netmask via DHCP.

Step 4: (Optional) Adjust the subnet name in Horizon

The subnet was named 10.0.0.0/24. For purely cosmetic reasons, we logged in to the Horizon web interface as admin and changed the name of the subnet to 10.0.0.0/16.

Step 5: Adjust the allocation pool for the subnet

Now that the subnet is wider, the neutron client will let you configure a wider allocation pool. First check the existing allocation pool:

$ neutron subnet-list | grep 2e06c039-b715-4020-b609-779954fa4399

| 2e06c039-b715-4020-b609-779954fa4399 | 10.0.0.0/16     | 10.0.0.0/16      | {"start": "10.0.0.2", "end": "10.0.0.254"}           |

You can easily resize the allocation pool like this:

neutron subnet-update 2e06c039-b715-4020-b609-779954fa4399 --allocation-pool start='10.0.0.2',end='10.0.255.254'

Step 6: Check status of the VMs

At this point the new instances will get an IP address from the new allocation pool.

As for the existing instances, they will continue to work with the /24 address mask. If they reboot, they will get the same IP address via DHCP but with the new address mask. Also, when the DHCP lease expires, depending on the DHCP client implementation, they will hopefully get the updated netmask. This is not the case with the default Ubuntu dhclient, which will not refresh the netmask when the IP address offered by the DHCP server does not change.
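
Where waiting is not an option, the netmask can also be corrected by hand inside the guest. A minimal sketch, assuming a hypothetical instance with the address 10.0.0.55 on eth0:

sudo ip addr add 10.0.0.55/16 dev eth0
sudo ip addr del 10.0.0.55/24 dev eth0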

The worst case scenario is a machine that keeps the old /24 address mask for a long time. Its outbound traffic to machines in the private network outside 10.0.0.0/24 will take a suboptimal route through the network node, which is used as the default gateway.

Conclusion

We successfully expanded a Neutron network to a wider IP range without service interruption. With an understanding of Neutron internals, it is possible to make changes that go beyond the features Neutron offers. It is very important to understand how the values in the Neutron database are used to create the network namespaces.

We understood that a better design for our cloud would be to have a default Neutron network per tenant, instead of a shared default network for all tenants.



Buffering issues when publishing OpenStack dashboard and API services behind an HTTP reverse proxy

At SWITCH we operate SWITCHengines, a public OpenStack cloud for Swiss universities. To expose our services to the public Internet, we use the popular open source nginx reverse proxy. For the sake of simplicity, the following figure shows a simplified schema of our infrastructure, with only the components relevant to this article. Every API endpoint and the Horizon dashboard are published behind a reverse proxy.

[Figure: SWITCHengines reverse proxy]

The Problem:

Our users reported not being able to upload images using the Horizon web interface when the images were large files, over 10GB.

We did some tests ourselves and noticed that the image upload process was very slow. Looking at the log files, we saw that the upload of a 10GB image was slow enough to make the Keystone auth token expire before the end of the process. Why was uploading so slow?

The analysis:

The usual scenario for the reverse proxy is load balancing to a pool of web servers. There is a slow network, like the Internet, between the users and the proxy, and there is a fast network between the proxy and the web servers.

Typical Reverse Proxy Architecture

Typical Reverse Proxy Architecture

The goal is to keep the web servers busy for the shortest possible time while serving client requests. To achieve this, the reverse proxy buffers each request and only interacts with the web server once the request is completely cached. The web server then talks only to the proxy on the fast network and is not affected by the latency of the slow network.
If we look at the default settings of nginx we note that proxy_buffering is enabled:

When buffering is enabled, nginx receives a response from the proxied server as soon as possible, saving it into the buffers set by the proxy_buffer_size and proxy_buffers directives. If the whole response does not fit into memory, a part of it can be saved to a temporary file on the disk. Writing to temporary files is controlled by the proxy_max_temp_file_size and proxy_temp_file_write_size directives.

However, the proxy_buffering directive refers to the traffic from the web server to the user: the HTTP response, which is the bulk of the traffic when a user downloads a web page.

In our case the user is uploading an image to the web server, so the largest traffic is in the user request, not in the server response. Luckily, nginx 1.7 introduced a new configuration option: proxy_request_buffering.

This is also enabled by default:

When buffering is enabled, the entire request body is read from the client before sending the request to a proxied server.
When buffering is disabled, the request body is sent to the proxied server immediately as it is received. In this case, the request cannot be passed to the next server if nginx already started sending the request body.

But what happens if the user’s network is also a very fast network, such as SWITCHlan? And does it make sense to have such large buffers for files over 10GB?

Let’s see what happens when a user tries to upload an image from their computer to Glance using the Horizon web interface. You will be surprised to learn that the image is buffered 3 times.

[Figure: Components involved in the image upload process]

The user has to wait for the image to be fully uploaded to the first nginx server in front of the Horizon server; then the Horizon application buffers the complete image again. The public Glance API is itself published behind an nginx reverse proxy, so we have to wait once more for the image to be buffered before the final transfer to Glance.

This triple buffering leads to 4 upload operations from one component to another. A 10GB image therefore requires 10GB of traffic over the Internet and 30GB of machine-to-machine traffic in the OpenStack LAN.

The solution:

Buffering does not make sense in our scenario and introduces long waiting times while the buffers fill up.

To improve this situation we upgraded nginx to 1.8 and set both proxy_buffering and proxy_request_buffering to off. With this new configuration the uploaded images are buffered only once, at the Horizon server. Uploading an image through the web interface is now reasonably fast and we no longer see Keystone auth tokens expiring.
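
The relevant part of the proxy configuration ends up looking roughly like the sketch below; the upstream name is a placeholder, and the client_max_body_size line is an additional setting commonly needed so that nginx accepts large uploads at all.

location / {
    proxy_pass http://horizon_backend;   # hypothetical upstream
    proxy_buffering off;                 # do not buffer the response
    proxy_request_buffering off;         # stream the request body to the backend (nginx >= 1.7.11)
    client_max_body_size 0;              # do not reject large request bodies
}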