SWITCH Cloud Blog


SDN/NFV Paradigm: Where Industry Meets Academia

On Thursday, 16 June 2016, the Software Defined Networking (SDN) Switzerland community met for its SDN Workshop, co-located with the Open Cloud Day at the ZHAW School of Engineering in Winterthur. It was the first time the SDN workshop ran as a separate, dedicated track at a larger event, and we expected synergy between cloud computing and SDN topics, especially from an industry point of view. Participants were free to attend the Open Cloud Day main track, the SDN event, or both.

The objectives of the SDN workshop were basically the same as last time: to share knowledge, run hands-on sessions, push best practices and implementations that support operations, and present current research topics and prototypes in SDN. This time the focus was on SDNization, sometimes called “Softwarization”, which will have a deep impact on techno-economic aspects. Examples of Softwarization are reducing costs by digitalizing and automating processes, optimizing the usage of resources, and creating new forms of coordination as well as competition along the value chain. The consequence is the emergence of new business models.

Up to 30 experts, mostly from industry and academia, took an interest in the SDN session (Slides). Lively discussions came up on SDN/NFV infrastructure deployments, service presentations and implementation forms, establishing microservice architectures, making cloud-native applications more agile, and how to open the field to innovation.

Furthermore, two open source projects, ONOS and CORD, were introduced and discussed. These two conceptual architecture approaches are set up modularly and address scalability, high availability, performance, and NB/SB abstraction, which allows communities providing research and production networks to establish their infrastructure leveraging open source software and white boxes. The first SDN deployments powered by ONOS started at the beginning of 2015 with GEANT and GARR in Europe, Internet2 in the USA, and AMLIGHT and FIU in the Americas. The target audience of such a global SDN network deployment are RENs, network operators, and users. The motivations for a global SDN network are manifold, but can be summarized as (1) enabling network and service innovation and (2) learning and improvement through an agile collaboration model. Furthermore, new apps enabling network innovation can be offered: Castor (providing L2/L3 connectivity for SDX), SDN-IP (transforming an SDN into a transit IP network, i.e. an SDN AS that uses BGP to communicate with its neighbors and provides L3 connectivity without legacy routers), SDX L2/L3, and VPLS.

Another interesting trend to follow: combining open source tools and technologies, where e.g. the Snabb switch technology meets vMX, a full-featured carrier-grade router running in a Docker container, to build a high-performance, carrier-grade lightweight 4over6 implementation. In a demonstration, a service directory licensing model that delivers a vADC (virtual Application Delivery Controller) as a service was presented.

The answer to the question “What can the network do for clouds?” was given by the contribution from Cumulus Networks. With Cumulus VX and OpenStack, routing to the host allows server admins to utilize multipath capabilities on the server by using multiple uplinks, and thus to take an active role in the network without being bound to L2 networking. Various stages of Cumulus network implementations are possible – from a full MLAG (Multi-Chassis Link Aggregation) fabric with MLAG in the backbone, LACP (Link Aggregation Control Protocol) from the servers and L2 connectivity with limited scalability, to a full Layer 3 fabric with highly capable, scalable networking and an IP fabric to the hosts based on Cumulus’ Quagga improvements.

In the context of big data and applications like Hadoop and MapReduce in a 10–100 Gb/s data centre network, there are many monitoring challenges. One of them is the need for faster and more scalable network monitoring methods. A deep understanding of workflows and their communication patterns is essential for designing future data centres. Since existing monitoring tools are built on outdated software, high-resolution and non-intrusive monitoring in the data plane has to be addressed. Thus a high-resolution network monitoring architecture called zMon, which targets large-scale data centre and IXP networks, was presented and discussed.

The statement “A bird in the hand is worth two in the bush” means boosting existing networks with SDN, and implies the question: wouldn’t it be nice to be able to apply SDN principles on top of existing network architectures? The answer would be YES – but how do you transfer a legacy network to an SDN-enabled environment? Three issues were discussed: SDN requires (1) upgrading network devices, (2) upgrading management systems and (3) “upgrading” network operators. Using SDN should therefore mean: (a) small investment, i.e. providing benefits under partial deployments (often a single switch), (b) low risk, i.e. minimal impact on operational practices and compatibility with existing technologies, and last but not least (c) high return, i.e. solving a timely problem. Two approaches were presented: (A) Fibbing (intra-domain routing), an architecture that allows central control of routers’ forwarding tables on top of distributed routing, and (B) SDX (inter-domain routing), an approach that highlights flexible Internet policies and open, flexible APIs. Fibbing (“lying to the network”) received special attention, as it combines flexibility, expressivity and manageability – the advantages of SDN. Technically speaking: fake nodes and links are introduced into an underlying link-state routing protocol (e.g. OSPF) so that routers compute their forwarding tables based on the extended (fake/real) network topology. Flexible load balancing, traffic engineering and backup routes can be achieved with Fibbing.

With “Supercharging SDN security”, two projects were presented: (1) a secure SDN architecture that confines the damage caused by a compromised controller or switch by isolating their processes, and (2) an SDN security extension for path enforcement and path validation. The motivation for (1) was built on the OpenDaylight Netconf vulnerability (DoS, information disclosure, topology spoofing), the ONOS deserialization bug (DoS), and privilege escalation on the Cisco Application Policy Infrastructure Controller and the Cisco Nexus 9000 Series ACI Mode Switch. The question came up of what a secure SDN architecture should look like. The building blocks of the architecture follow the principle of isolated virtual controllers and switches, connected to each other by an SB API (OpenFlow channel) secured with TLS (Transport Layer Security); since isolation is per tenant, communication across isolated environments is not possible. A prototype implementation was demonstrated. The driver for (2) is the lack of data plane accountability: no policy enforcement or data plane policy validation mechanisms are in place. An SDN security extension should therefore support the enforcement of network paths and reactively inspect data plane behavior. Path enforcement mechanisms and path validation procedures were deployed and evaluated with respect to network latency and throughput.

In conclusion, SDN/NFV infrastructure deployments are coming closer to vendors’ and ISPs’ understanding of SDN, e.g. focusing on vDC and DC use cases. The question “What can the network do for clouds?” became important for vendor-specific implementations using OpenDaylight, Docker, OpenStack, etc., where services are orchestrated and provided across a cloud network topology. Furthermore, there was a lively discussion on NB and SB APIs supporting control- and data-plane programmability. OpenFlow, the traditional SB API, provides only a simple “match-action” paradigm and lacks stateful processing in the SDN data plane. More flexibility, as pointed out for P4, is needed and will appear in future SDN approaches: protocol independence (P4 is a programming language that defines how a switch processes packets), target independence (P4 can describe everything from high-performance ASICs to virtual switches), and field reconfigurability (P4 allows reconfiguring how switches process packets in the field). Finally, combining open source tools and software with closed-source/proprietary protocols speeds up the SDNization process and brings together researchers, DevOps teams and industry.

The SDN Switzerland group is an independent initiative started by SWITCH and the ICCLab (ZHAW) in 2013. Its aim is to organize SDN workshops addressing topics from research, academic ICT (operations), and industry (SDN implementation forms). This setup allows us to bring together knowledge and use the synergy of an interdisciplinary group for future steps, including collaboration.


IPv6 Finally Arriving on SWITCHengines

As you may have heard or noticed, the Internet is running out of addresses. It’s time to upgrade from the 35-year-old IPv4 protocol, which doesn’t even offer a single public address per human on earth, to the brand new (?) IPv6, which offers enough addresses for every grain of sand in the known universe, or something like that.

SWITCH is a pioneer in IPv6 adoption, and has been supporting IPv6 on all network connections and most services in parallel with IPv4 (“dual stack”) for many years.

To our embarrassment, we hadn’t been able to integrate IPv6 support into SWITCHengines from the start. While OpenStack had some IPv6 support, the implementation wasn’t mature, and we didn’t know how to fit it into our network model in a user-friendly way.

IPv6: “On by default” and globally routable

About a month ago we took a big step to change this: IPv6 is now enabled by default for all instances on the shared internal network (“private”).  So if you have an instance running on SWITCHengines, and it isn’t connected to a tenant network of your own, then the instance probably has an IPv6 address right now, in addition to the IPv4 address(es) it always had.  Note that this is true even for instances that were created or last rebooted before we turned on IPv6. On Linux-derived systems you can check using ifconfig eth0 or ip -6 addr list dev eth0; if you see an address that starts with 2001:620:5ca1:, then your instance can speak IPv6.
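
A minimal check from inside an instance, assuming the primary interface is called eth0 (adjust the name if your image uses a different one):

# Show IPv6 addresses on eth0; look for one starting with 2001:620:5ca1:
ip -6 addr list dev eth0

# On images that still ship net-tools, ifconfig works as well
ifconfig eth0 | grep -i inet6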

Note that these IPv6 addresses are “globally unique” and routable, i.e. they are recognized by the general Internet.  In contrast, the IPv4 addresses on the default network are “private” and can only be used locally inside the cloud; communication with the general Internet requires Network Address Translation (NAT).

What you can do with an IPv6 address

Your instance will now be able to talk to other Internet hosts over IPv6. For example, try ping6 mirror.switch.ch or traceroute6 www.facebook.com. This works just like IPv4, except that so far only a subset of hosts on the Internet speaks IPv6. Fortunately, this subset already includes important services and is growing.  Because IPv6 doesn’t need NAT, routing between your instances and the Internet is less resource-intensive and a tiny bit faster than with IPv4.

But you will also be able to accept connections from other Internet hosts over IPv6. This is different from before: To accept connections over IPv4, you need(ed) a separate public address, a Floating IP in OpenStack terminology.  So if you can get by with IPv6, for example because you only need (SSH or other) access from hosts that have IPv6, then you don’t need to reserve a Floating IP anymore.  This saves you not just work but also money—public IPv4 addresses are scarce, so we need to charge a small “rent” for each Floating IP reserved.  IPv6 addresses are plentiful, so we don’t charge for them.

But isn’t this dangerous?

Instances are now globally reachable by default, but they are still protected by OpenStack’s Security Groups (corresponding to packet filters or access control lists).  The default Security Group only allows outbound connections: Your instance can connect to servers elsewhere, but attempts to connect to your instance will be blocked.  You have probably opened some ports such as TCP port 22 (for SSH) or 80 or 443 (for HTTP/HTTPS) by adding corresponding rules to your own Security Groups.  In these rules, you need to specify address “prefixes” specifying where you want to accept traffic from.  These prefixes can be IPv4 or IPv6—if you want to accept both, you need two rules.

If you want to accept traffic from anywhere, your rules will contain 0.0.0.0/0 as the prefix. To accept IPv6 traffic as well, simply add identical rules with ::/0 as the prefix instead—this is the IPv6 version of the “global” prefix.
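
As a sketch using the neutron command-line client of that era (the security group name “default” and port 22 are only examples), the IPv6 twin of an “allow SSH from anywhere” rule could be created like this:

# Existing IPv4 rule: allow SSH (TCP port 22) from anywhere
neutron security-group-rule-create --direction ingress --ethertype IPv4 \
  --protocol tcp --port-range-min 22 --port-range-max 22 \
  --remote-ip-prefix 0.0.0.0/0 default

# The IPv6 twin of the same rule
neutron security-group-rule-create --direction ingress --ethertype IPv6 \
  --protocol tcp --port-range-min 22 --port-range-max 22 \
  --remote-ip-prefix ::/0 default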

What about domain names?

These IPv6 addresses can be entered in the DNS using “AAAA” records. For Floating IPs, we provided pre-registered hostnames of the form fl-34-56.zhdk.cloud.switch.ch. We cannot do that in IPv6, because there are just too many possible addresses. If you require your IPv6 address to map back to a hostname, please let us know and we can add it manually.

OpenStack will learn how to (optionally) register such hostnames in the DNS automatically; but that feature was only added to the latest release (“Mitaka”), and it will be several months before we can deploy this in SWITCHengines.

Upcoming

We would like to also offer IPv6 connectivity to user-created “tenant networks”. Our version of OpenStack almost supports this, but it cannot be fully automated yet. If you need IPv6 on your non-shared network right now, please let us know via the normal support channel, and we’ll set something up manually. But eventually (hopefully soon), getting a globally routable IPv6 prefix for your network should be (almost) as easy as getting a globally routable Floating IP is now.

You can also expect services running on SWITCHengines (SWITCHdrive, SWITCHfilesender and more) to become exposed over IPv6 over the next couple of months. Stay tuned!


New version, new features

We are constantly working on SWITCHengines, updating, tweaking stuff. Most of the time, little of this process is visible to users, but sometimes we release features that make a difference in the user experience.

A major change was the upgrade to OpenStack Kilo that we did in mid-March. OpenStack is the software that powers our cloud, and it gets an update every 6 months. The releases are named alphabetically. Our cloud’s history started with the “Icehouse” release, moved to “Juno” and now we are on “Kilo”. Yesterday “Mitaka” was released, so we are 2 releases (or 12 months) behind.

Upgrading the cloud infrastructure is major work. Our goal is to upgrade “in place” with all virtual machines running uninterrupted during the upgrade. Other cloud operators install a new version on minimal hardware, then migrate the customer machines to it one by one, converting the hypervisors as they go. This is certainly feasible, but it causes downtime – something we’d like to avoid.

Therefore we spend a lot of time testing the upgrade path. The upgrade from “Icehouse” to “Juno” took over 6 months (first we needed to figure out how to do the upgrade in the first place, then we had to implement and test it). The upgrade from “Juno” to “Kilo” then took only 4 months (with Christmas and New Year in between). Now we are working on the upgrade to “Liberty”, which is planned to happen before June / July. This time, we plan to be even faster, because we are going to upgrade the many components of OpenStack individually. The upgrade to the just-released “Mitaka” should be done before “Newton” is released in October. Our plan is to be at most 6 months behind the official release schedule.

So what does Kilo bring you, the end user? A slightly different user interface, loads of internal changes and a few major new features.

There is also stuff coming in the next few weeks:

  • Access to the SWIFT object store
  • Backup of Volumes (that is something we are testing right now)
  • IPv6 addresses for virtual machines

We have streamlined the deployment process of changes – while we did releases once a week during the last year, we now can deploy new features as soon as they are finished and tested.

 


Backporting the Ceph RBD object map feature to OpenStack Juno

How we use Ceph at SWITCHengines

Virtual machine storage in the OpenStack public cloud SWITCHengines is provided by Ceph. We run a Ceph cluster in each OpenStack region. The compute nodes do not have any local storage; the virtual machines access their disks directly over the network, because libvirt can act as a Ceph client.

Using Ceph as the default storage for glance images, nova ephemeral disks, and cinder volumes is a very convenient choice. We are able to scale the storage capacity as needed, regardless of the disk capacity on the compute nodes. It is also easier to live migrate nova instances between compute nodes, because the virtual machine disks are not local to a specific compute node and they don’t need to be migrated.

The performance problem

The load on our Ceph cluster constantly increases, because more and more virtual machines are running every day. In October 2015 we noticed that deleting cinder volumes had become a very slow operation, and the bigger the cinder volumes, the longer you had to wait. Moreover, users orchestrating Heat stacks faced real performance problems when deleting several disks at once.

To identify where the bottleneck had its origin, we measured how long it took to create and delete RBD volumes directly with the rbd command-line client, bypassing the Cinder code completely.

The commands to do this test are simple:

time rbd -p volumes create testname --size 1024 --image-format 2
rbd -p volumes info testname
time rbd -p volumes rm testname

We quickly figured out that it was Ceph itself being slow to delete the rbd volumes. The problem was well known and already fixed in the Ceph Hammer release, introducing a new feature: the object map.

When the object map feature is enabled on an image, limiting the diff to the object extents will dramatically improve performance since the differences can be computed by examining the in-memory object map instead of querying RADOS for each object within the image.

http://docs.ceph.com/docs/master/man/8/rbd/

In our practical experience the time to delete an image decreased from several minutes to a few seconds.

How to fix your OpenStack Juno installation

We changed ceph.conf to enable the object map feature, as described very well in the blog post by Sébastien Han.

It worked great once ceph.conf had the following two lines:

rbd default format = 2
rbd default features = 13
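
The value 13 is a feature bitmask; a short sketch of how it decomposes (bit values as defined by librbd):

# RBD feature bits:
#   layering       = 1
#   striping v2    = 2
#   exclusive-lock = 4
#   object-map     = 8   (requires exclusive-lock)
# 13 = 1 + 4 + 8  ->  layering + exclusive-lock + object-map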

We could immediately create new images with the object map feature, as you can see in the following output:

rbd image 'volume-<uuid>':
    size 20480 MB in 2560 objects
    order 23 (8192 kB objects)
    block_name_prefix: rbd_data.<prefix>
    format: 2
    features: layering, exclusive, object map
    flags:
    parent: images/<uuid>@snap
    overlap: 1549 MB

We were happy that the fix was so easy. However, we soon realized that while everything worked with the rbd command line, all the OpenStack components were ignoring the new options in the ceph.conf file.

We started our investigation with Cinder. We understood that Cinder does not call the rbd command-line client at all, but relies on the rbd Python library. The Cinder implementation in Juno did not know about these extra features, so it simply ignored our changes in ceph.conf. Support for the object map feature was introduced only in Kilo, with commit 6211d8.

To quickly fix the performance problem before upgrading to Kilo, we decided to backport this patch to Juno. We already carry other small local patches in our infrastructure, so it was part of our standard procedure to add yet another patch and create a new .deb package. After backporting the patch, Cinder started to create volumes correctly, honoring the options in ceph.conf.

Patching Cinder fixed the problem only for Cinder volumes. Virtual machines started from ephemeral disks run on Ceph RBD images created by Nova, and the Glance images uploaded by users are stored in Ceph RBD volumes by Glance, which relies on the glance_store library.

In the end we had to patch three OpenStack projects to completely backport the ability to use the Ceph object map feature to Juno. Here we publish the links to the git branches and packages for nova, glance_store and cinder.

Conclusion

Upgrading every six months to keep the production infrastructure on the current OpenStack release is challenging. Upgrading without downtime needs a lot of testing, and it is easy to fall behind schedule. For this reason most OpenStack installations today run on Juno or Kilo.

We release these patches for all those who are still running Juno, because the performance benefit is stunning. However, we strongly advise planning an upgrade to Kilo as soon as possible.

 



Upgrading a Ceph Cluster from 170 to 200 Disks, in One Image

The infrastructure underlying SWITCHengines includes two Ceph storage clusters, one in Lausanne and one in Zurich. The Zurich one (which notably serves SWITCHdrive) filled up over the past year. In December 2015 we acquired new servers to upgrade its capacity.

The upgrade involves the introduction of a new “leaf-spine” network architecture based on “whitebox” switches and Layer-3 (IP) routing to ensure future scalability. The pre-existing servers are still connected to the “old” network consisting of two switches and a single Layer 2 (Ethernet) domain.

First careful steps: 160→161→170

This change in network topology, and in particular the necessity to support both the old and new networks, caused us to be very careful when adding the new servers. The old cluster consisted of 160 Ceph OSDs, running on sixteen servers with ten 4TB hard disks each. We first added a single server with a single disk (OSD) and observed that it worked well. Then we added nine more OSDs on that first new server to bring the cluster total up to 170 OSDs. That also worked flawlessly.

Now for real: 170→200

As the next step, we added three new servers with ten disks each to the cluster at once, to bring the total OSD count from 170 to 200. We did this over the weekend because it causes a massive shuffling of data within the cluster, which slows down normal user I/O.

What should we expect to happen?

All in all, 28.77% of the existing storage objects in the system had to be migrated, corresponding to about 106 Terabytes of raw data. Most of the data movement is from the 170 old towards the 30 new disks.

How long should this take? One can make some back-of-the-envelope calculations. In a perfect world, writing 106 Terabytes to 30 disks, each of which sustains a write rate of 170 MB/s, would take around 5.8 hours. In Ceph, every byte written to an OSD has to go through a persistent “journal”, which is implemented using an SSD (flash-based solid-state disk). Our systems have two SSDs, each of which sustains a write rate of about 520 MB/s. Taking this bottleneck into account, the lower bound increases to 9.5 hours.
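
The same back-of-the-envelope numbers as a quick sketch, assuming the 30 new OSDs write in parallel and the three new servers contribute two journal SSDs each (six SSDs in total):

# 106 TB written to 30 disks at ~170 MB/s each:
echo '106 * 10^12 / (30 * 170 * 10^6) / 3600' | bc -l    # ~5.8 hours
# Journal bottleneck: 6 SSDs at ~520 MB/s each:
echo '106 * 10^12 / (6 * 520 * 10^6) / 3600' | bc -l     # ~9.4 hours, i.e. the ~9.5 h bound above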

However this is still a very theoretical number, because it fails to include many other bottlenecks and types of overhead: disk controller and bus capacity limitations, processing overhead, network delays, reading data from the old disks etc. But most importantly, the Ceph cluster is actively used, and performs other maintenance tasks such as scrubbing, all of which competes with the movement of data to the new disks.

What do we actually see?

Here is a graph that illustrates what happens after the 30 new disks (OSDs) are added:

[Graph df_170+30: disk usage (df) of the old and new OSDs after adding the 30 new disks]

The y axis is the disk usage (as per output of the df command). The thin grey lines—there are 170 of them—correspond to each of the old OSDs. The thin red lines correspond to the 30 new OSDs. The blue line is the average disk usage across the old OSDs, the green line the average of the new OSDs. At the end of the process, the blue and green line should (roughly) meet.

So in practice, the process takes about 30 hours. In perspective, this is still quite fast and corresponds to a mean overall data-movement rate of about 1 GB/s or 8 Gbit/s. The green and blue lines show that the overall process seems very steady as it moves data from the old to the new OSDs.

Looking at the individual line “bundles”, we see that the process is not all that homogeneous. First, even within the old line bundle, we see quite a bit of variation across the fill levels of the 170 disks. There is some variation at the outset, and it seems to get worse throughout the process. An interesting case is the lowest grey line—this is an OSD that has significantly less data than the others. I had hoped that the reshuffling would be an opportunity for it to approach the others (by shedding less data), but the opposite happened.

Anyway, a single under-utilized disk is not a big problem. Individual over-utilized disks are a problem, though. And we see that there is one OSD that has significantly higher occupancy. We can address this by explicit “reweighting” if and when this becomes a problem as the cluster fills up again. But then, we still have a couple of disk servers that we can add to the cluster over the coming months, to make sure that overall utilization remains in a comfortable range.
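
For reference, the reweighting mentioned above can be done per OSD; a minimal sketch (the OSD id and weight are made-up examples):

# Ask CRUSH to place roughly 10% less data on osd.42
ceph osd reweight 42 0.9

# Or let Ceph lower the weight of the most over-utilized OSDs automatically
ceph osd reweight-by-utilization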

Coda

The graph above has been created using Graphite with the following graph definition:

[
 {
 "target": [
 "lineWidth(alpha(color(collectd.zhdk00{06,07,11,15,17,18,19,20,21,22,23,24,27,29,30,32,43}_*.df-var-lib-ceph-osd-ceph-*.df_complex-used,'black'),0.5),0.5)",
 "lineWidth(color(collectd.zhdk00{44,51,52}_*.df-var-lib-ceph-osd-ceph-*.df_complex-used,'red'),0.5)",
 "lineWidth(color(avg(collectd.zhdk00{06,07,11,15,17,18,19,20,21,22,23,24,27,29,30,32,43}_*.df-var-lib-ceph-osd-ceph-*.df_complex-used),'blue'),2)",
 "lineWidth(color(avg(collectd.zhdk00{44,51,52}_*.df-var-lib-ceph-osd-ceph-*.df_complex-used),'green'),2)"
 ],
 "height": 600
 }
]

[Graph df_160+1+9+30: the same disk-usage data, zoomed out to include the earlier extension steps]

The base data was collected by CollectD’s standard “df” plugin. zhdk00{44,51,52} are the new OSD servers, the others are the pre-existing ones.

Zooming out a bit shows the previous small extension steps mentioned above. As you see, adding nine disks doesn’t take much longer than adding a single one.

 

 


Server Power Measurement: Quick Experiment

In December 2015, we received a set of servers to extend the infrastructure that powers SWITCHengines (and indirectly SWITCHdrive, SWITCHfilesender and other services).  Putting these in production will take some time, because this also requires a change in our network setup, but users should start benefiting from it starting in February.

Before the upgrade, we used a single server chassis type for both “compute” nodes—i.e. where SWITCHengines instances are executed as virtual machines—and “storage” nodes where all the virtual disks and other persistent objects are stored.  The difference simply was that some servers were full of high-capacity disks, whereas the others had many empty slots.  We knew this was wasteful in terms of rack utilization, but it gave us more flexibility while we were learning how our infrastructure was used.

The new servers are different: Storage nodes look very much like the old storage nodes (which, as mentioned, look very similar to the old compute nodes), just with newer motherboard and newer (but also fewer and less powerful) processors.

The compute nodes are very different though: The chassis have the same size as the old ones, but instead of one server or “node”, the new compute chassis contain four.  All four nodes in a chassis share the same set of power supplies and fans, two of each for redundancy.

Now we use tools such as IPMI to remotely monitor our infrastructure to make sure we notice when fans or power supplies fail, or temperature starts to increase to concerning levels.  Each server has a “Baseboard Management Controller” (BMC) that exposes a set of sensors for that.  The BMC also allows resetting or even powering down/up the server (except for the BMC itself!), and getting to the serial or graphical console over the network, all of which can be useful for maintenance.

Each node has its own BMC, and each BMC gives sensor information about the (two) power supplies.  This is a little weird because there are only two power supplies in the chassis, but we can monitor eight—two per node/BMC, of which there are four.  Which raises some doubts: Am I measuring the two power supplies in the chassis at all? Or are the measurements from some kind of internal power supplies that each node has (and that feed from the central power supplies)?

As a small experiment, I started with a chassis that had all four nodes powered up and running.  I started polling the power consumption readings on one of the four servers every ten seconds.  While that was running, I shut down the three other servers.  Here are the results:

$ while true; do date; \
  sudo ipmitool sensor list | grep 'Power In'; \
  sleep 8; done
Thu Jan 14 12:53:34 CET 2016
PS1 Power In | 310.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:53:43 CET 2016
PS1 Power In | 310.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:53:53 CET 2016
PS1 Power In | 310.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:02 CET 2016
PS1 Power In | 320.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:11 CET 2016
PS1 Power In | 240.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:20 CET 2016
PS1 Power In | 240.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:30 CET 2016
PS1 Power In | 180.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:39 CET 2016
PS1 Power In | 110.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:48 CET 2016
PS1 Power In | 110.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na

One observation is that the resolution of the power measurement seems to be 10W.  Another observation is that PS2 consistently draws 10W—which might mean anything between 5 and 15.  Obviously the two power supplies function in active/standby mode, and PS1 is the active one.

But the central result is that the power draw of PS1 falls from 310W when all four nodes are running (but not really doing much outside running the operating system) to 110W when only one is running.  This suggests that we’re actually measuring the shared power supplies, and not something specific to the node we were polling.  It also suggests that each node consumes about 70W in this “baseline” state, and that there is a base load of 40W for the chassis.  Of course these numbers are highly unscientific and imprecise, given the trivial number of experiments (one) and the bad sensor resolution and, presumably, precision.
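
The arithmetic behind those two estimates, as a rough sketch:

# PS1 input dropped from 310 W (4 nodes) to 110 W (1 node):
echo '(310 - 110) / 3' | bc -l    # ~67 W per idle node, call it 70 W
# One node at ~70 W leaves the chassis base load (fans, PSU overhead):
echo '110 - 70' | bc              # ~40 W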



Impressions from 19th TF-Storage workshop in Pisa

National Research and Education Networks (NRENs) such as SWITCH exist in every European country. They have a long tradition of working together. One example of this are the task forces on different topics under the umbrella of the GÉANT Association (formerly TERENA). One of them is TF-Storage, which since 2008 has been a forum to exchange knowledge about various storage technologies and their application in the NREN/academic IT context. Its 19th meeting took place in Pisa last week (13/14 October). It was the first one that I attended on site, but I had been following the group via its mailing list for several years, and the agenda included several topics relevant to our work, so I was looking forward to learning from the presentations and to chatting with people from other NRENs (and some universities) who run systems similar to ours.

Getting there

Zurich is extremely well connected transport-wise, but getting to Pisa without spending an extra night proved to be challenging. I decided to take an early flight to Florence, then drive a rented car to Pisa. That went smoothly until I got a little lost in the suburbs of Pisa, but after two rounds on the one-way lungarni (Arno promenades) I finally had the car parked at the hotel and walked the 100m or so to the venue at the university. Unfortunately I arrived at the meeting more than an hour after it had started.

view from Lungarno Pacinotti on the river Arno

View of the river Arno from Lungarno Pacinotti. The meeting venue is one of the buildings on the right.

Day 1: Ceph, Ceph, Ceph…

The meeting started with two hours of presentations by Joao Eduardo Luis from SUSE about various aspects of Ceph, the distributed storage system that we use heavily in SWITCHengines. In the part that I didn’t miss, Joao talked about numerous new features in different stages of development. Sometimes I think it would be better to make the current functionality more robust and easier to use. Especially the promise of more tuning knobs being added seems unattractive to me—from an operator’s point of view it would be much nicer if less tuning were necessary.

The ensuing round-table discussion was interesting. Clearly several people in the room had extensive experience with running Ceph clusters. Especially Panayiotis Gotsis from GRNET asked many questions which showed a deep familiarity with the system.

Next, Axel Rosenberg from Sandisk talked about their work on optimizing Ceph for use with Flash (SSD) storage. Sandisk has built a product called “IFOS” based on Ubuntu GNU/Linux and an enhanced version of Ceph. They identified many bottlenecks in the Ceph code that show up when the disk bottleneck is lifted by the use of fast SSDs. Sandisk’s changes resulted in a speedup of some benchmarks by a factor of ten—notably with the same type of disks. The improvements will hopefully find their way into “upstream” Ceph and be thoroughly quality-assured. The most interesting slide to me was about work to reduce the impact of recovery from a failed disk. By adding some prioritization (I think), they were able to massively improve performance of user I/O during recovery—let’s say rather than being ten times slower than usual, it would only be 40% slower—while the recovery process took only a little bit longer than without the prioritization. This is an area that needs a lot of work in Ceph.

Karan Singh from CSC (which is “the Finnish SWITCH”, but also/primarily “the Finnish CSCS”) presented how CSC uses Ceph, as well as their Ceph dashboard. Karan has actually written a book on Ceph! CSC plans to use Ceph as the basis for two OpenStack installations, cPouta (classic public/community cloud service) and ePouta (for sensitive research data). They have been doing extensive research on Ceph, including some advanced features such as erasure coding—which we don’t consider for SWITCHengines just yet. Karan also talked about tuning the system and diagnosing issues, which can lead to discovering low-level problems such as, in one case he reported, network cabling issues.

Simone Spinelli from the hosting University of Pisa talked about how they use Ceph to support an OpenStack-based virtual machine hosting service. I discovered that they do many things in a similar way to us, using Puppet, Foreman and Graphite to support the installation and operation of their system. An interesting twist is that they have multiple smaller sites distributed across the city, and their Ceph cluster spans these sites. In contrast, at SWITCH we operate separate clusters in our two locations in Lausanne and Zurich. There are several technical reasons for doing so, although we are considering adding a third cluster that would span the two locations (plus a tiny third site) for special applications that require resilience against the total failure of a data center or its connection to the network.

Day 2: Scality, OpenStack, ownCloud

The second day was opened by Bradley King from Scality presenting on object stores vs. file stores. This was a wonderful presentation that would be worth a blog post of its own. Although it was naturally focused on Scality’s “RING” product, it didn’t come over as marketing at all, and contained many interesting insights about distributed storage design trade-offs, stories from actual deployments—Scality has several in the multi-Petabyte range—and also some future perspectives, for example about “IP drives”. These are disk drives with Ethernet/IP interfaces rather than the traditional SATA or SAS attachments, and which support S3-like object interfaces. What was new to me was that new disk technologies such as SMR (shingled magnetic recording) and HAMR (heat-assisted magnetic recording) seem to be driving disk vendors towards this kind of interface, as traditional block semantics are becoming quite hard to emulate with these types of disk. My takeaway was that Scality RING looks like a well-designed system, similarly elegant as Ceph, but with some trade-offs leaning towards simplicity and operational ease. To me the big drawback compared to Ceph is that it (like several other “software-defined storage” systems) is closed-source.

The following three were about collaboration activities between NRENs (and, in some cases, vendors):

Maciej Brzeźniak from PSNC (the Polish “SWITCH+CSCS”) talked about the TCO Calculator for (mainly Ceph-based) software-defined storage systems that some TF-Storage members have been working on for several months. Maciej is looking for more volunteers to contribute data to it. One thing that is missing are estimates for network (port) costs. I volunteered to provide some numbers for 10G/40G leaf/spine networks built from “whitebox” switches, because we just went through a procurement exercise for such a project.

Next, yours truly talked about the OSO get-together, a loosely organized group of operators of OpenStack-based IaaS installations that meets every other Friday over videoconferencing. I talked about how the group evolved and how it works, and suggested that this could serve as a blueprint for closer cooperation between some TF-Storage members on some specific topics like building and running Ceph clusters. Because there is significant overlap between the OSO (IaaS) and (in particular Ceph) storage operators, we decided that interested TF-Storage people should join the OSO mailing list and the meetings, and that we see where this will take us. [The next OSO meeting was two days later, and a few new faces showed up, mostly TF-Storage members, so it looks like this could become a success.]

Finally Peter Szegedi from the GÉANT Association talked about the liaison with OpenCloudMesh, which is one aspect of a collaboration of various NRENs (including AARnet from Australia) and other organizations (such as CERN) who use the ownCloud software to provide file synchronization and sharing service to their users. SWITCH also participates in this collaboration, which lets us share our experience running the SWITCHdrive service, and in return provides us with valuable insights from others.

The meeting closed with the announcement that the next meeting would be in Poznań at some date to be chosen later, carefully avoiding clashes with the OpenStack meeting in April 2016. Lively discussions ensued after the official end of the meeting.

Getting back

Driving back from Pisa to Florence airport turned out to be interesting, because the rain, which had been intermittent, had become quite heavy during the day. Other than that, the return trip was uneventful. Unfortunately I didn’t even have time to see the leaning tower, although it would probably have been a short walk from the hotel/venue. But the tiny triangle between meeting venue, my hotel, and the restaurant where we had dinner made a very pleasant impression on me, so I’ll definitely try to come back to see more of this city.


Waiting if the car in front of me makes it safely through the flooded stretch under the bridge… yup, it did.


Hack Neutron to add more IP addresses to an existing subnet

When we designed our OpenStack cloud at SWITCH, we created a network in the service tenant, and we called it private.

This network is shared with all tenants and it is the default choice when you start a new instance. The name private comes from the fact that you will get a private IP address via DHCP. The subnet we chose for this network is 10.0.0.0/24. The allocation pool goes from 10.0.0.2 to 10.0.0.254 and cannot be enlarged any further. This is a problem because we need IP addresses for many more instances.

In this article we explain how we successfully enlarged this subnet to a wider range: 10.0.0.0/16. This operation is not supported by Neutron in Juno, so we show how to hack into Neutron’s internals. We were able to enlarge the subnet and modify the allocation pool without interrupting the service for the existing instances.

In the following we assume that the network in question has only one router; however, this procedure can easily be extended to more complex setups.

What you should know about Neutron is that a Neutron network has two important namespaces on the OpenStack network node:

  • The qrouter namespace is the router namespace. In our setup one interface is attached to the private network we need to enlarge, and a second interface is attached to the external physical network.
  • The qdhcp namespace has only one interface, attached to the private network. On your OpenStack network node you will find a dnsmasq process bound to this interface, providing IP addresses via DHCP.

Neutron Architecture

In the figure Neutron Architecture we try to give an overview of the overall system. A virtual machine (VM) can run on any remote compute node. Each compute node runs an Open vSwitch process that collects the traffic from the VMs and, with proper VXLAN encapsulation, delivers it to the network node. The Open vSwitch on the network node has a bridge containing both the qrouter namespace’s internal interface and the qdhcp namespace, which makes the VMs see both the default gateway and the DHCP server on the virtual L2 network. The qrouter namespace has a second interface to the external network.
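
Both namespaces can be listed directly on the network node; a quick sketch (the IDs shown in the comments are the ones used later in this article):

# List the Neutron namespaces on the network node
sudo ip netns list | grep -E 'qrouter|qdhcp'
# qrouter-aba1e526-05ca-4aca-9a80-01601cdee79d
# qdhcp-1dc116e9-1ec9-49f6-9d92-4483edfefc9c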

Step 1: hack the Neutron database

In the Neutron database, look for the subnet; you can easily find it in the subnets table by matching the service tenant id:

select * from subnets WHERE tenant_id='d447c836b6934dfab41a03f1ff96d879';

Take note of id (that in this table is the subnet_id) and network_id of the subnet. In our example we had these values:

id (subnet_id) = 2e06c039-b715-4020-b609-779954fa4399
network_id = 1dc116e9-1ec9-49f6-9d92-4483edfefc9c
tenant_id = d447c836b6934dfab41a03f1ff96d879

Now let’s look into the routers database table:

select * from routers WHERE tenant_id='d447c836b6934dfab41a03f1ff96d879';

Again filter for the service tenant. We take note of the router ID.

 id (router_id) = aba1e526-05ca-4aca-9a80-01601cdee79d

At this point we have all the information we need to enlarge the subnet in the Neutron database.

update subnets set cidr='NET/MASK' WHERE id='subnet_id';

So in our example:

update subnets set cidr='10.0.0.0/16' WHERE id='2e06c039-b715-4020-b609-779954fa4399';

Nothing will happen immediately after you update the values in the Neutron MySQL database. You could reboot your network node and Neutron would rebuild the virtual routers with the new database values. However, we show a better solution that avoids downtime.

Step 2: Update the interface of the qrouter namespace

On the network node there is a namespace qrouter-<router_id>. Let’s have a look at the interfaces using iproute2:

sudo ip netns exec qrouter-<router_id> ip addr show

With the values in our example:

sudo ip netns exec qrouter-aba1e526-05ca-4aca-9a80-01601cdee79d ip addr show

You will see the typical Linux output with all the interfaces that live in this namespace. Take note of the name of the interface with the address 10.0.0.1/24 that we want to change; in our case it is

 qr-396e87de-4b

Now that we know the interface name we can change IP address and mask:

sudo ip netns exec qrouter-aba1e526-05ca-4aca-9a80-01601cdee79d ip addr add 10.0.0.1/16 dev qr-396e87de-4b
sudo ip netns exec qrouter-aba1e526-05ca-4aca-9a80-01601cdee79d ip addr del 10.0.0.1/24 dev qr-396e87de-4b

Step 3: Update the interface of the qdhcp namespace

Still on the network node there is a namespace qdhcp-<network_id>. Exactly as we did for the qrouter namespace, we find the interface name and change the IP address to the updated netmask.

sudo ip netns exec qdhcp-1dc116e9-1ec9-49f6-9d92-4483edfefc9c ip addr show
sudo ip netns exec qdhcp-1dc116e9-1ec9-49f6-9d92-4483edfefc9c ip addr add 10.0.0.2/16 dev tapadebc2ff-10
sudo ip netns exec qdhcp-1dc116e9-1ec9-49f6-9d92-4483edfefc9c ip addr show
sudo ip netns exec qdhcp-1dc116e9-1ec9-49f6-9d92-4483edfefc9c ip addr del 10.0.0.2/24 dev tapadebc2ff-10
sudo ip netns exec qdhcp-1dc116e9-1ec9-49f6-9d92-4483edfefc9c ip addr show

The dnsmasq process bound to the interface in the qdhcp namespace is smart enough to automatically detect the change in the interface configuration. This means that from this point on, new instances will get a /16 netmask via DHCP.

Step 4: (Optional) Adjust the subnet name in Horizon

We called the subnet name 10.0.0.0/24. For pure cosmetic we logged in the Horizon web interface as admin and changed the name of the subnet to 10.0.0.0/16.

Step 5: Adjust the allocation pool for the subnet

Now that the subnet is wider, the neutron client will let you configure a wider allocation pool. First check the existing allocation pool:

$ neutron subnet-list | grep 2e06c039-b715-4020-b609-779954fa4399

| 2e06c039-b715-4020-b609-779954fa4399 | 10.0.0.0/16     | 10.0.0.0/16      | {"start": "10.0.0.2", "end": "10.0.0.254"}           |

You can easily resize the allocation pool like this:

neutron subnet-update 2e06c039-b715-4020-b609-779954fa4399 --allocation-pool start='10.0.0.2',end='10.0.255.254'
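
To double-check the result, the subnet can be inspected again; a small sketch using the same example UUID:

neutron subnet-show 2e06c039-b715-4020-b609-779954fa4399
# allocation_pools should now read {"start": "10.0.0.2", "end": "10.0.255.254"}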

Step 6: Check status of the VMs

At this point the new instances will get an IP address from the new allocation pool.

As for the existing instances, they will continue to work with the /24 address mask. If rebooted, they will get the same IP address via DHCP but with the new address mask. Also, when the DHCP lease expires, depending on the DHCP client implementation, they will hopefully pick up the updated netmask. This is not the case with the default Ubuntu dhclient, which does not refresh the netmask when the IP address offered by the DHCP server does not change.

The worst-case scenario is a machine that keeps the old /24 address mask for a long time. Its outbound traffic to other machines in the private network may then take a suboptimal route through the network node, which is used as the default gateway.
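
If waiting for the DHCP client is not an option, the netmask can also be corrected by hand inside an affected instance; a minimal sketch (10.0.0.42 stands for the instance’s own address):

# Replace the stale /24 with the new /16 on the instance itself
sudo ip addr add 10.0.0.42/16 dev eth0
sudo ip addr del 10.0.0.42/24 dev eth0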

Conclusion

We successfully expanded a Neutron network to a wider IP range without service interruption. With an understanding of Neutron internals, it is possible to make changes that go beyond the features Neutron offers. It is very important to understand how the values in the Neutron database are used to create the network namespaces.

We understood that a better design for our cloud would be to have a default Neutron network per tenant, instead of a shared default network for all tenants.


SWITCHengines upgraded to OpenStack 2014.2 “Juno”

Our Infrastructure-as-a-Service (IaaS) offering SWITCHengines is based on the OpenStack platform.  OpenStack releases new alphabetically-nicknamed versions every six months.  When we built SWITCHengines in 2014, we based it on the then-current “Icehouse” (2014.1) release.  Over the past few months, we have worked on upgrading the system to the newer “Juno” (2014.2) version.  As we already announced via Twitter, this upgrade was finally completed on 26 August.  The upgrade was intended to be interruption-free for running customer VMs (including the SWITCHdrive service, which is built on top of such VMs), and we mostly achieved that.

Why upgrade?

Upgrading a live infrastructure is always a risk, so we should only do so if we have good reasons.  On a basic level, we see two drivers: (a) functionality and (b) reliability.  Functionality: OpenStack is a very dynamic project to which new features—and entire new subsystems—are added all the time.  We want to make sure that our users can benefit from these enhancements.  Reliability: Like all complex software, OpenStack has bugs, and we want to offer reliable and unsurprising service.  Fortunately, OpenStack also has more and more users, so bugs get reported and eventually fixed, and it has quality assurance (QA) processes that improve over time.  Bugs are usually fixed in the most recent releases only.  Fixes to serious bugs such as security vulnerabilities are often “backported” to one or two previous releases.  But at some point it is no longer safe to use an old release.

Why did it take so long?

We wanted to make sure that the upgrade be as smooth as possible for users.  In particular, existing VMs and other resources should remain in place and continue to work throughout the upgrade.  So we did a lot of testing on our internal development/staging infrastructure.  And we experimented with various different methods for switching over.  We also needed to integrate the—significant—changes to the installation system recipes (from the OpenStack Puppet project) with our own customizations.

We also decided to upgrade the production infrastructure in three phases.  Two of them had been announced: The LS region (in Lausanne) was upgraded on 17 August, the ZH (Zurich) region one week later.  But there are some additional servers with special configuration which host a critical component of SWITCHdrive.  Those were upgraded separately another day later.

Because we couldn’t upgrade all hypervisor nodes (the servers on which VMs are hosted) at the same time, we had to run in a compatibility mode that allowed Icehouse hypervisors to work against a Juno controller.  After all hypervisor hosts were upgraded, this somewhat complex compatibility mechanism could be disabled again.

The whole process took us around five months.  Almost as long as the interval between OpenStack releases! But we learned a lot, and we made some modifications to our setup that will make future upgrades easier.  So we are confident that the next upgrade will be quicker.

So it all went swimmingly, right?

Like I wrote above, “mostly”.  All VMs kept running throughout the upgrade.  As announced, the “control plane” was unavailable for a few hours, during which users couldn’t start new VMs.  As also announced, there was a short interruption of network connectivity for every VM.  Unfortunately, this interruption turned out to be much longer for some VMs behind user-defined software routers.  Some of these routers were misconfigured after the upgrade, and it took us a few hours to diagnose and repair those.  Sorry about that!

What changes for me as a SWITCHengines user?


OpenStack dashboard in Juno: The new combined region and project selector

Not much, so far.  There are many changes “under the hood”, but only a few are visible.  If you use the dashboard (“Horizon”), you will notice a few slight improvements in the user interface.  For instance, the selectors for region—LS or ZH—and project—formerly called “tenant”—have been combined into a single element.

The many bug fixes between Icehouse and Juno should make the overall SWITCHengines experience more reliable.  If you notice otherwise, please let us know through the usual support channel.

What’s next?

With the upgrade finished, we will switch back to our previous agile process of rolling out small features and fixes every week or so.  There are a few old and new glitches that we know we have to fix over the next weeks.  We will also add more servers to accommodate increased usage.  To support this upgrade, we will replace the current network in the ZH region with a more scalable “leaf/spine” network architecture based on “bare-metal” switches.  We are currently testing this in a separate environment.

By the end of the year, we will have a solid infrastructure basis for SWITCHengines, which will “graduate” from its current pilot phase and become a regular service offering in January 2016.  In the SCALE-UP project, which started in August 2015 with the generous support of swissuniversities’ SUC P-2 program, many partners from the Swiss university community will work together to add higher-level services and additional platform enhancements.  Stay tuned!



Buffering issues when publishing the OpenStack dashboard and API services behind an HTTP reverse proxy

At SWITCH we operate SWITCHengines, a public OpenStack cloud for Swiss universities. To expose our services to the public Internet, we use the popular open source nginx reverse proxy. For the sake of simplicity we show in the following figure a simplified schema of our infrastructure, with only the components relevant to this article. Every endpoint API service and the Horizon Dashboard are available behind a reverse proxy.


SWITCHEngines reverse proxy

The Problem:

Our users reported not being able to upload images using the Horizon web interface when images were large files over 10GB.

We did some tests ourselves and noticed that the image upload process was too slow. Looking at the log files, we saw that uploading a 10GB image was slow enough for the Keystone auth token to expire before the end of the process.
Why was uploading so slow?

The analysis:

The usual scenario for the reverse proxy is load balancing to a pool of web servers. There is a slow network, like the Internet, between the users and the proxy, and there is a fast network between the proxy and the web servers.


Typical Reverse Proxy Architecture

The goal is to keep the web servers busy serving client requests for the shortest possible time. To achieve this, the reverse proxy buffers each request and interacts with the web server only once the request is completely cached. The web server then talks only to the proxy over the fast network and is not exposed to the latency of the slow network.
If we look at the default settings of nginx, we note that proxy_buffering is enabled.

When buffering is enabled, nginx receives a response from the proxied server as soon as possible, saving it into the buffers set by the proxy_buffer_size and proxy_buffers directives. If the whole response does not fit into memory, a part of it can be saved to a temporary file on the disk. Writing to temporary files is controlled by the proxy_max_temp_file_size and proxy_temp_file_write_size directives.

However, the proxy_buffering directive applies to the traffic from the web server to the user, i.e. the HTTP response, which is the bulk of the traffic when a user downloads a web page.

In our case the user is uploading an image to the web server, so the bulk of the traffic is in the request, not in the response. Luckily, nginx 1.7 introduced a new configuration option: proxy_request_buffering.

This is also enabled by default:

When buffering is enabled, the entire request body is read from the client before sending the request to a proxied server.
When buffering is disabled, the request body is sent to the proxied server immediately as it is received. In this case, the request cannot be passed to the next server if nginx already started sending the request body.

But what happens if the user’s network is also very fast, such as SWITCHlan? And does it make sense to have such large buffers for big files over 10GB?

Let’s see what happens when a user tries to upload an image from his computer to Glance using the Horizon web interface. You might be surprised to learn that the image is buffered three times.

Components involved in the image upload process

The user first has to wait for the image to be fully uploaded to the nginx server in front of Horizon; then the Horizon application buffers the complete image once more. The public Glance API is itself published behind an nginx reverse proxy, so we wait yet again while that proxy buffers the image, before the final transfer to Glance.

This triple buffering results in four upload operations from one component to the next. A 10GB image therefore causes 10GB of traffic over the Internet and 30GB of machine-to-machine traffic in the OpenStack LAN.

The solution:

Buffering does not make sense in our scenario and introduces long waits while the buffers fill up.

To improve the situation we upgraded nginx to 1.8 and set both proxy_buffering and proxy_request_buffering to off. With this new configuration the uploaded images are buffered only once, at the Horizon server. Uploading an image through the web interface is now reasonably fast, and Keystone auth tokens no longer expire mid-upload.
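The relevant part of such a proxy configuration looks roughly like the sketch below. This is not our literal production configuration: the server name, backend address and port are placeholders, and the TLS certificate directives are omitted for brevity; the two buffering directives are the point here.

    # Sketch of an nginx (>= 1.7.11) reverse proxy in front of the Glance API.
    # Hostnames and ports are placeholders, not our actual deployment.
    server {
        listen 443 ssl;
        server_name glance.example.org;
        # ssl_certificate and ssl_certificate_key omitted for brevity

        location / {
            proxy_pass http://glance-backend.example.org:9292;

            # do not buffer the backend's response before relaying it to the client
            proxy_buffering off;

            # do not buffer the client's request body (the uploaded image)
            # before forwarding it to the backend (available since nginx 1.7.11)
            proxy_request_buffering off;
        }
    }

The same two directives go into the proxy definition in front of Horizon.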


Is there a chance for a Swiss Academic Cloud?

At our recent ICT Focus Meeting, where SWITCH customers and SWITCH employees meet to discuss the customers’ needs, Edouard Bugnion, one of the founders of VMware and now a professor of Computer Science at EPFL, held an interesting keynote speech, “Towards Data Center Systems”. Our project lead, Patrik Schnellmann, had the opportunity to interview Edouard about Swiss academic clouds.

We are, of course, hard at work building a cloud offering for Swiss academia, SWITCHengines, and Edouard’s views support what we are doing.


Adding 60 Terabytes to a Ceph Cluster

[Note: This post was republished from the now-defunct “petablog”]

BCC – an Experiment that “Escaped the Lab”

Starting in Fall 2012, we built a small prototype “cloud” consisting of about ten commodity servers running OpenStack and Ceph.  That project was called Building Cloud Competence (BCC), and the primary intended purpose was to acquire experience running such systems.  We worked with some “pilot” users, both external (mostly researchers) and internal (“experimental” services).  As these things go, experiments become beta tests, and people start relying on them… so this old BCC cluster now supports several visible applications such as SWITCH’s SourceForge mirror, the SWITCHdrive sync & share service, as well as SWITCHtube, our new video distribution platform.  In particular, SWITCHtube uses our “RadosGW” service (similar to Amazon’s S3) to stream HTML5 video directly from our Ceph storage system.

Our colleagues from the Interaction Enabling team would like to enhance SWITCHtube by adding more of the content that has been produced by users of the SWITCHcast lecture recording system.  This could amount to 20-30TB of new data on our Ceph cluster.  Until last week, the cluster consisted of fifty-three 3TB disks distributed across eight hosts, for a total raw storage capacity of 159TB, corresponding to 53TB of usable storage given the three-way replication that Ceph uses by default.  That capacity was already about 40% used.  In order to accommodate the expected influx of data, we decided to upgrade capacity by adding new disks.

Since we bought the original disks less than two years ago, the maximum capacity for this type of disk – the low-power/low-cost/low-performance disks intended for “home NAS” applications – has increased from 3TB to 6TB.  So by adding just ten new disks, we could increase total capacity by almost 38%.  We found that for a modest investment, we could significantly extend the usable lifetime of the cluster.  Our friendly neighborhood hardware store had 14 in stock, so we quickly ordered ten and built them into our servers the next morning.
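A quick back-of-the-envelope check of these numbers (nominal disk capacities, ignoring filesystem overhead):

    # 53 x 3TB = 159TB raw before the upgrade; 10 x 6TB = 60TB of additional raw capacity
    awk 'BEGIN { printf "before: %dTB raw, added: %dTB, increase: %.1f%%\n", 53*3, 10*6, 100*(10*6)/(53*3) }'
    # -> before: 159TB raw, added: 60TB, increase: 37.7%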

Rebalancing the Cluster

Now, the Ceph cluster adapts to changes such as disks being added or lost (e.g. by failing) by redistributing data across the cluster.  This is a great feature because it makes the infrastructure very easy to grow and very robust to failures.  It is also quite impressive to watch, because redistribution makes use of the entire capacity of the cluster.  Unfortunately this tends to have a noticeable impact on the performance as experienced by other users of the storage system, in particular when writing to storage.  So in order to minimize annoyance to users, we scheduled the integration of the new disks for Friday late in the afternoon.
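From the command line, this looks roughly as follows (standard Ceph CLI commands; the exact output format varies between Ceph releases): once the new disks are up as OSDs, the cluster starts backfilling on its own and you can follow the progress.

    # Show the CRUSH tree; the freshly added OSDs appear under their host buckets.
    ceph osd tree

    # Cluster summary; during rebalancing the placement groups show up as
    # backfilling/recovering, together with the fraction of objects still to move.
    ceph -s

    # Follow cluster events and recovery progress in real time.
    ceph -w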

Per-server disk write rates during the 48 hours around the insertion of the 6TB disks

We use Graphite for performance monitoring of various aspects of the BCC cluster.  Here is one of the Ceph-related graphs showing what happened when the disks were added to the cluster as new “OSDs” (Object Storage Daemons).  The graph shows, for each physical disk server in the Ceph cluster, the rate of data written to disk, summed up across all disks of a given server.  The grey, turquoise, and yellow graphs correspond to servers h4, h1s, and h0s, respectively.  These servers are the ones that got the new 6TB disks: h4 got 5, h1s got 3, and h0s got 2.
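For the curious, such a per-server graph can be produced with a Graphite render target along the lines of the request below; the metric path is hypothetical and depends entirely on how the local collection agent names its series (sumSeries and the render API are standard Graphite features).

    # Hypothetical: sum the write rates of all disks of host h4 over the last 48 hours
    https://graphite.example.org/render?from=-48h&target=sumSeries(servers.h4.disk.*.write_bytes_per_second)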

We can see that the process of reshuffling data took about 20 hours, starting shortly after 1700 on Friday, and terminating around 1330 on Saturday.  The rate of writing to the new disks exceeded a Gigabyte per second for several hours.

Throughput is limited by the speed at which the local filesystem can write to the disks (in particular the new ones) over the 6 Gb/s SATA channels, and by how fast the data copies can be retrieved from the old disks.  As most of the replication is done across the network, the network could also become a bottleneck; but each of our servers has dual 10GE connections, so the network supports more throughput per server than the disks can handle.  Why does it get slower over time? I guess one reason is that writing to a fresh file system is faster than writing to one that already has data on it, but I’m not sure whether that is a sufficient explanation.
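As a rough sanity check of those limits (theoretical line rates; the NAS-class disks themselves sustain far less than their SATA channel allows):

    # dual 10 Gigabit Ethernet ~ 2 x 1.25 GB/s per server;
    # a 6 Gb/s SATA channel with 8b/10b encoding tops out around 0.6 GB/s per disk
    awk 'BEGIN { printf "dual 10GE: %.2f GB/s per server, 6 Gb/s SATA: %.2f GB/s per channel\n", 2*10/8, 6*0.8/8 }'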

Outlook

Based on the experience from the BCC project, SWITCH decided to start building cloud infrastructure in earnest.  We secured extensible space in two university data center locations in Lausanne and Zurich, and started deploying new clusters in Spring 2014.  We are now fine-tuning the configuration to ensure reliable operation, scalability, and fast and future-proof networking.  These activities are supported by the CUS program P-2 “information scientifique” as project “SCALE”.  We hope to make OpenStack-based self-service VMs available to our first external users this fall.


Doing the right thing

I am returning from GridKa School, held annually at the KIT in Karlsruhe, where I co-hosted a two-day workshop on installing OpenStack with Antonio Messina and Tyanko Alekseiev from the University of Zurich. (You can find the course notes and tutorials over on GitHub.) I don’t want to talk so much about the workshop itself (it was fun, our attendees were enthusiastic and we ended up with 8 complete OpenStack Grizzly clouds) as about the things I experienced in the plenary sessions.
A bit of background on me: I joined SWITCH in April 2013 to work on the cloud. Before that, I had been self-employed, run my own companies and worked in a number of startups. I left academia in 1987 (without a degree) and returned to it in 2010, when I started (and finished) a Master of Science. Early on, friends and family told me that I should pursue an academic career, but I always wanted to prove myself in the commercial world… Well, being a bit closer to academia was one of the reasons I joined SWITCH.
Back to GridKa: presenting at the workshop, teaching and helping people with complex technical software is something I have done quite a bit of over the last 20 years, and something I’m quite good at (or so my students tell me). Nothing special, business as usual so to speak.
There was also a plenary program with presentations from various people attending GridKa School. And although I only got to see a few of those due to my schedule, I was absolutely blown away by what I heard. Dr. Urban Liebel talked about microscopes in the life sciences – the ability to automatically take pictures of thousands of samples and use image recognition algorithms to classify them. He told us about some of the results they discovered now that they can investigate more samples faster (for example that Ibuprofen damages kidneys in children and increases the risk of kidney cancer, something science didn’t know until recently).
José Luis Vázquez-Poletti, in his talk “Cloud Computing: Expanding Humanity’s Limits to Planet Mars”, talked about installing meteorological sensors on Mars and how to use cloud computing resources to help pinpoint the location of those sensors once they have been deployed on Mars (basically by just dropping them onto the surface – ballistic entry). By looking at the transits of Phobos, one of Mars’s moons, they are able to determine the location of a landed sensor.
Benedikt Hegener from CERN talked about “Effective Programming and Multicore Computing”, in which he described the trials and tribulations the CERN programmers have to go through to parallelize 5 million lines of code in order to make it take advantage of multi-core computers.
There were several other talks that I unfortunately didn’t have a chance to attend. The point of all this?
During those talks it hit me that the work these scientists are doing creates value on a much deeper level than what most startups create. By working on the methods to automatically take microscopic pictures and analyse them, and by increasing the throughput, these people directly work on improving our living conditions. While the Mars and CERN experiments don’t seem to have immediate benefits, both space research and high-energy physics have greatly contributed to our lives as well. A startup that is creating yet another social network or yet another photo sharing site, all with the intent of making investors happy (by generating loads of money), just doesn’t have the same impact on society.
My work here at SWITCH doesn’t really have the same impact, but I think that the work of building cloud infrastructure can help some researchers out there in Switzerland do their work more easily, faster or more cheaply. In that case, my work at least contributes as a “supporting act”. What more could one want?


The PetaSolutions Blog

Welcome, dear reader, to the Peta Solutions Blog. “Another blog?”, you ask – yes, very much so…

Let me start by providing a bit of background on who we are and what we are doing; this might help set the context for the diversity of things you are going to read here.

The Peta Solutions team is located in the “Researchers and Lecturers” Division of SWITCH. Peta (of course) means big (bigger than Tera, anyway) and gives an indication of what we are working with:
Big things… We are here to help researchers with, shall we say, specialised needs in their ICT infrastructure. This started several years ago with Grid activities (several of our team members have worked on Grid-related projects over the last years), and now includes Cloud (we have been busy building our own cloud over the last months), SDN (Software Defined Networking), network performance (our PERT – Performance Emergency Response Team – stands by in case of performance problems) and more.

We work directly with researchers and help them get up to speed on these topics.

So what should you expect from this blog? We have a couple of ideas; some of us have blogged for quite a while, some are taking a wait-and-see attitude – the normal mix, in other words.

We plan to talk about our experiences building, maintaining and operating infrastructure, maybe providing you with the crucial nugget of information that helps you solve a problem. We will invite researchers we are working with to share their experiences. And sometimes we will wax philosophical about things that are on our collective minds.

In any case, we are happy if all of this turns into a discourse: you are most welcome to respond.

Yours
Alessandra, Alessandro, Jens-Christian, Kurt, Placi, Rüdiger, Sam, Simon, Valery