SWITCH Cloud Blog


Server Power Measurement: Quick Experiment

In December 2015, we received a set of servers to extend the infrastructure that powers SWITCHengines (and indirectly SWITCHdrive, SWITCHfilesender and other services).  Putting these into production will take some time, because it also requires a change to our network setup, but users should start benefiting from the new capacity in February.

Before the upgrade, we used a single server chassis type for both “compute” nodes, i.e. those on which SWITCHengines instances are executed as virtual machines, and “storage” nodes, where all the virtual disks and other persistent objects are stored.  The difference was simply that some servers were full of high-capacity disks, whereas the others had many empty slots.  We knew this was wasteful in terms of rack utilization, but it gave us more flexibility while we were learning how our infrastructure was used.

The new servers are different: The storage nodes look very much like the old storage nodes (which, as mentioned, look very similar to the old compute nodes), just with a newer motherboard and newer (but also fewer and less powerful) processors.

The compute nodes are very different, though: The chassis are the same size as the old ones, but instead of one server or “node”, each new compute chassis contains four.  All four nodes in a chassis share the same set of power supplies and fans, two of each for redundancy.

Now we use tools such as IPMI to remotely monitor our infrastructure to make sure we notice when fans or power supplies fail, or temperature starts to increase to concerning levels.  Each server has a “Baseboard Management Controller” (BMC) that exposes a set of sensors for that.  The BMC also allows resetting or even powering down/up the server (except for the BMC itself!), and getting to the serial or graphical console over the network, all of which can be useful for maintenance.
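For illustration, these are the kinds of ipmitool commands this boils down to; a rough sketch, where the BMC address and credentials are placeholders rather than our real ones:

# query a remote BMC over the network for fan and temperature readings
$ ipmitool -I lanplus -H bmc.example.org -U admin -P secret sdr type Fan
$ ipmitool -I lanplus -H bmc.example.org -U admin -P secret sdr type Temperature
# power control of the server (the BMC itself stays up)
$ ipmitool -I lanplus -H bmc.example.org -U admin -P secret chassis power status
$ ipmitool -I lanplus -H bmc.example.org -U admin -P secret chassis power cycle
# serial console over the network (“serial over LAN”)
$ ipmitool -I lanplus -H bmc.example.org -U admin -P secret sol activate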

Each node has its own BMC, and each BMC reports sensor readings for the (two) power supplies.  This is a little odd, because there are only two power supplies in the chassis, yet we can monitor eight: two per node/BMC, of which there are four.  That raises some doubts: am I measuring the two power supplies in the chassis at all? Or do the readings come from some kind of internal power supply that each node has (and that feeds from the central power supplies)?

As a small experiment, I started with a chassis that had all four nodes powered up and running.  I started polling the power consumption readings on one of the four servers roughly every ten seconds (an eight-second sleep plus the time the query itself takes).  While that was running, I shut down the three other servers.  Here are the results:

$ while true; do date; \
  sudo ipmitool sensor list | grep 'Power In'; \
  sleep 8; done
Thu Jan 14 12:53:34 CET 2016
PS1 Power In | 310.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:53:43 CET 2016
PS1 Power In | 310.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:53:53 CET 2016
PS1 Power In | 310.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:02 CET 2016
PS1 Power In | 320.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:11 CET 2016
PS1 Power In | 240.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:20 CET 2016
PS1 Power In | 240.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:30 CET 2016
PS1 Power In | 180.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:39 CET 2016
PS1 Power In | 110.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:48 CET 2016
PS1 Power In | 110.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na

One observation is that the resolution of the power measurement seems to be 10W.  Another is that PS2 consistently reads 10W, which at that resolution might mean anything between 5 and 15 watts.  Evidently the two power supplies operate in active/standby mode, and PS1 is the active one.

But the central result is that the power draw of PS1 falls from 310W when all four nodes are running (but not really doing much beyond running the operating system) to 110W when only one is running.  This suggests that we’re actually measuring the shared power supplies, and not something specific to the node we were polling.  It also suggests that each node consumes about 70W in this “baseline” state, and that there is a base load of about 40W for the chassis.  Of course these numbers are highly unscientific, given the single experiment, the coarse sensor resolution and, presumably, its limited precision.
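As a sanity check, here is the back-of-the-envelope arithmetic behind those two figures, assuming a simple linear model (total draw = chassis base load + number of running nodes × per-node draw):

# three nodes shut down account for the drop from 310W to 110W
$ echo $(( (310 - 110) / 3 ))
66
# chassis base load: the single-node reading minus one node’s share
$ echo $(( 110 - 66 ))
44

Given the 10W sensor resolution, 66 and 44 are consistent with the “about 70W per node” and “about 40W for the chassis” estimates above.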


Impressions from 19th TF-Storage workshop in Pisa

National Research and Education Networks (NRENs) such as SWITCH exist in every European country. They have a long tradition of working together. Examples of this are the Task Forces on various topics under the umbrella of the GÉANT Association (formerly TERENA). One of them is TF-Storage, which since 2008 has been a forum for exchanging knowledge about various storage technologies and their application in the NREN/academic IT context. Its 19th meeting took place in Pisa last week (13/14 October). It was the first one I attended on site, but I had been following the group via its mailing list for several years, and the agenda included several topics relevant to our work, so I was looking forward to learning from the presentations and to chatting with people from other NRENs (and some universities) who run systems similar to ours.

Getting there

Zurich is extremely well connected transport-wise, but getting to Pisa without spending an extra night proved to be challenging. I decided to take an early flight to Florence, then drive a rented car to Pisa. That went smoothly until I got a little lost in the suburbs of Pisa, but after two rounds on the one-way lungarni (Arno promenades) I finally had the car parked at the hotel and walked the 100m or so to the venue at the university. Unfortunately I arrived at the meeting more than an hour after it had started.

View of the river Arno from Lungarno Pacinotti. The meeting venue is one of the buildings on the right.

Day 1: Ceph, Ceph, Ceph…

The meeting started with two hours of presentations by Joao Eduardo Luis from SUSE about various aspects of Ceph, the distributed storage system that we use heavily in SWITCHengines. In the part that I didn’t miss, Joao talked about numerous new features in different stages of development. Sometimes I think it would be better to make the current functionality more robust and easier to use. In particular, the promise of ever more tuning knobs seems unattractive to me: from an operator’s point of view it would be much nicer if less tuning were necessary.

The ensuing round-table discussion was interesting. Clearly several people in the room had extensive experience with running Ceph clusters. Especially Panayiotis Gotsis from GRNET asked many questions which showed a deep familiarity with the system.

Next, Axel Rosenberg from SanDisk talked about their work on optimizing Ceph for use with flash (SSD) storage. SanDisk has built a product called “IFOS” based on Ubuntu GNU/Linux and an enhanced version of Ceph. They identified many bottlenecks in the Ceph code that show up once the disk bottleneck is lifted by the use of fast SSDs. SanDisk’s changes resulted in a speedup of some benchmarks by a factor of ten, notably with the same type of disks. The improvements will hopefully find their way into “upstream” Ceph and be thoroughly quality-assured. The most interesting slide to me was about work to reduce the impact of recovery from a failed disk. By adding some prioritization (I think), they were able to massively improve the performance of user I/O during recovery (say, rather than being ten times slower than usual, it would only be 40% slower), while the recovery process took only a little longer than without the prioritization. This is an area that needs a lot of work in Ceph.
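As an aside, and not part of SanDisk’s work: the knobs operators typically reach for today to trade recovery speed against client I/O are the OSD backfill/recovery settings. A rough sketch of such throttling, with values chosen purely for illustration:

# de-prioritize recovery traffic in favour of client I/O (illustrative values)
$ ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'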

Karan Singh from CSC (which is “the Finnish SWITCH”, but also/primarily “the Finnish CSCS”) presented how CSC uses Ceph, as well as their Ceph dashboard. Karan has actually written a book on Ceph! CSC plans to use Ceph as the basis for two OpenStack installations, cPouta (a classic public/community cloud service) and ePouta (for sensitive research data). They have done extensive research on Ceph, including some advanced features such as erasure coding, which we don’t consider for SWITCHengines just yet. Karan also talked about tuning the system and diagnosing issues, which in one case he reported led to the discovery of low-level problems such as network cabling issues.

Simone Spinelli from the hosting university of Pisa talked about how they use Ceph to support an OpenStack-based virtual machine hosting service. I discovered that they do many things in a similar way to us, using Puppet, Foreman and Graphite to support installation and operation of their system. An interesting twist is that they have multiple smaller sites distributed across the city, and their Ceph cluster spans these sites. In contrast, at SWITCH we operate separate clusters in our two locations in Lausanne and Zurich. There are several technical reasons for doing so, although we are considering adding a tiny third cluster that would span the two locations, for special applications that require resilience against the total failure of a data center or its connection to the network.

Day 2: Scality, OpenStack, ownCloud

The second day was opened by Bradley King from Scality presenting on object stores vs. file stores. This was a wonderful presentation that would be worth a blog post of its own. Although it was naturally focused on Scality’s “RING” product, it didn’t come across as marketing at all, and contained many interesting insights about distributed storage design trade-offs, stories from actual deployments (Scality has several in the multi-petabyte range), and also some future perspectives, for example about “IP drives”. These are disk drives with Ethernet/IP interfaces rather than the traditional SATA or SAS attachments, and they support S3-like object interfaces. What was new to me was that new disk technologies such as SMR (shingled magnetic recording) and HAMR (heat-assisted magnetic recording) seem to be driving disk vendors towards this kind of interface, as traditional block semantics are becoming quite hard to emulate with these types of disk. My takeaway was that Scality RING looks like a well-designed system, similarly elegant to Ceph, but with some trade-offs leaning towards simplicity and operational ease. To me the big drawback compared to Ceph is that it (like several other “software-defined storage” systems) is closed source.

The following three presentations were about collaboration activities between NRENs (and, in some cases, vendors):

Maciej Brzeźniak from PSNC (the Polish “SWITCH+CSCS”) talked about the TCO calculator for (mainly Ceph-based) software-defined storage systems that some TF-Storage members have been working on for several months. Maciej is looking for more volunteers to contribute data to it. One thing that is still missing is estimates of network (port) costs. I volunteered to provide some numbers for 10G/40G leaf/spine networks built from “whitebox” switches, because we just went through a procurement exercise for such a project.

Next, yours truly talked about the OSO get-together, a loosely organized group of operators of OpenStack-based IaaS installations that meets every other Friday over videoconferencing. I talked about how the group evolved and how it works, and suggested that it could serve as a blueprint for closer cooperation between some TF-Storage members on specific topics such as building and running Ceph clusters. Because there is significant overlap between the OSO (IaaS) operators and the (in particular Ceph) storage operators, we decided that interested TF-Storage people should join the OSO mailing list and the meetings, and that we would see where this takes us. [The next OSO meeting was two days later, and a few new faces showed up, mostly TF-Storage members, so it looks like this could become a success.]

Finally Peter Szegedi from the GÉANT Association talked about the liaison with OpenCloudMesh, which is one aspect of a collaboration of various NRENs (including AARnet from Australia) and other organizations (such as CERN) who use the ownCloud software to provide file synchronization and sharing service to their users. SWITCH also participates in this collaboration, which lets us share our experience running the SWITCHdrive service, and in return provides us with valuable insights from others.

The meeting closed with the announcement that the next meeting would be in Poznań at some date to be chosen later, carefully avoiding clashes with the OpenStack meeting in April 2016. Lively discussions ensued after the official end of the meeting.

Getting back

Driving back from Pisa to Florence airport turned out to be interesting, because the rain, which had been intermittent, had become quite heavy during the day. Other than that, the return trip was uneventful. Unfortunately I didn’t even have time to see the leaning tower, although it would probably have been a short walk from the hotel/venue. But the tiny triangle between meeting venue, my hotel, and the restaurant where we had dinner made a very pleasant impression on me, so I’ll definitely try to come back to see more of this city.


Waiting to see whether the car in front of me makes it safely through the flooded stretch under the bridge… yup, it did.


SWITCHengines upgraded to OpenStack 2014.2 “Juno”

Our Infrastructure-as-a-Service (IaaS) offering SWITCHengines is based on the OpenStack platform.  OpenStack releases new alphabetically-nicknamed versions every six months.  When we built SWITCHengines in 2014, we based it on the then-current “Icehouse” (2014.1) release.  Over the past few months, we have worked on upgrading the system to the newer “Juno” (2014.2) version.  As we already announced via Twitter, this upgrade was finally completed on 26 August.  The upgrade was intended to be interruption-free for running customer VMs (including the SWITCHdrive service, which is built on top of such VMs), and we mostly achieved that.

Why upgrade?

Upgrading a live infrastructure is always a risk, so we should only do so if we have good reasons.  On a basic level, we see two drivers: (a) functionality and (b) reliability.  Functionality: OpenStack is a very dynamic project to which new features—and entire new subsystems—are added all the time.  We want to make sure that our users can benefit from these enhancements.  Reliability: Like all complex software, OpenStack has bugs, and we want to offer reliable and unsurprising service.  Fortunately, OpenStack also has more and more users, so bugs get reported and eventually fixed, and it has quality assurance (QA) processes that improve over time.  Bugs are usually fixed in the most recent releases only.  Fixes to serious bugs such as security vulnerabilities are often “backported” to one or two previous releases.  But at some point it is no longer safe to use an old release.

Why did it take so long?

We wanted to make sure that the upgrade would be as smooth as possible for users.  In particular, existing VMs and other resources should remain in place and continue to work throughout the upgrade.  So we did a lot of testing on our internal development/staging infrastructure, and we experimented with various methods for switching over.  We also needed to integrate the significant changes to the installation recipes (from the OpenStack Puppet project) with our own customizations.

We also decided to upgrade the production infrastructure in three phases.  Two of them had been announced: The LS region (in Lausanne) was upgraded on 17 August, the ZH (Zurich) region one week later.  But there are some additional servers with a special configuration that host a critical component of SWITCHdrive; those were upgraded separately a day later.

Because we couldn’t upgrade all hypervisor nodes (the servers on which VMs are hosted) at the same time, we had to run in a compatibility mode that allowed Icehouse hypervisors to work against a Juno controller.  After all hypervisor hosts were upgraded, this somewhat complex compatibility mechanism could be disabled again.
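For the curious, here is a minimal sketch of what such a pin can look like, assuming Nova’s RPC version-pinning mechanism is the knob in question (file locations, tools and service names are illustrative and may differ from our actual setup):

# on the Juno controller, pin the compute RPC API to the Icehouse level,
# i.e. set “compute = icehouse” in the [upgrade_levels] section of nova.conf
$ sudo crudini --set /etc/nova/nova.conf upgrade_levels compute icehouse
$ sudo service nova-conductor restart && sudo service nova-scheduler restart
# once every hypervisor runs Juno, drop the pin again
$ sudo crudini --del /etc/nova/nova.conf upgrade_levels compute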

The whole process took us around five months.  Almost as long as the interval between OpenStack releases! But we learned a lot, and we made some modifications to our setup that will make future upgrades easier.  So we are confident that the next upgrade will be quicker.

So it all went swimmingly, right?

Like I wrote above, “mostly”.  All VMs kept running throughout the upgrade.  As announced, the “control plane” was unavailable for a few hours, during which users couldn’t start new VMs.  As also announced, there was a short interruption of network connectivity for every VM.  Unfortunately, this interruption turned out to be much longer for some VMs behind user-defined software routers.  Some of these routers were misconfigured after the upgrade, and it took us a few hours to diagnose and repair those.  Sorry about that!

What changes for me as a SWITCHengines user?

OpenStack dashboard in Juno: The new combined region and project selector

Not much, so far.  There are many changes “under the hood”, but only a few are visible.  If you use the dashboard (“Horizon”), you will notice a few slight improvements in the user interface.  For instance, the selectors for region (LS or ZH) and project (formerly called “tenant”) have been combined into a single element.

The many bug fixes between Icehouse and Juno should make the overall SWITCHengines experience more reliable.  If you notice otherwise, please let us know through the usual support channel.

What’s next?

With the upgrade finished, we will switch back to our previous agile process of rolling out small features and fixes every week or so.  There are a few old and new glitches that we know we have to fix over the coming weeks.  We will also add more servers to accommodate increased usage.  To support this growth, we will replace the current network in the ZH region with a more scalable “leaf/spine” architecture based on “bare-metal” switches.  We are currently testing this in a separate environment.

By the end of the year, we will have a solid infrastructure basis for SWITCHengines, which will “graduate” from its current pilot phase and become a regular service offering in January 2016.  In the SCALE-UP project, which started in August 2015 with the generous support of swissuniversities’ SUC P-2 program, many partners from the Swiss university community will work together to add higher-level services and additional platform enhancements.  Stay tuned!


Adding 60 Terabytes to a Ceph Cluster

[Note: This post was republished from the now-defunct “petablog”]

BCC – an Experiment that “Escaped the Lab”

Starting in Fall 2012, we built a small prototype “cloud” consisting of about ten commodity servers running OpenStack and Ceph.  That project was called Building Cloud Competence (BCC), and the primary intended purpose was to acquire experience running such systems.  We worked with some “pilot” users, both external (mostly researchers) and internal (“experimental” services).  As these things go, experiments become beta tests, and people start relying on them… so this old BCC cluster now supports several visible applications such as SWITCH’s SourceForge mirror, the SWITCHdrive sync & share service, as well as SWITCHtube, our new video distribution platform.  In particular, SWITCHtube uses our “RadosGW” service (similar to Amazon’s S3) to stream HTML5 video directly from our Ceph storage system.
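To illustrate what “S3-like” means in practice, here is roughly how a video ends up being served straight from the Ceph cluster; the bucket name, file name and endpoint below are made up for the example:

# upload a video and make it world-readable (any S3 client works; s3cmd shown here)
$ s3cmd put --acl-public lecture-01.mp4 s3://tube-videos/lecture-01.mp4
# RadosGW then serves the object over HTTP, so a URL such as
#   http://os.example.org/tube-videos/lecture-01.mp4
# can be used directly as the source of an HTML5 video element.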

Our colleagues from the Interaction Enabling team would like to enhance SWITCHtube by adding more of the content that has been produced by users of the SWITCHcast lecture recording system.  This could amount to 20-30TB of new data on our Ceph cluster.  Until last week, the cluster consisted of fifty-three 3TB disks distributed across eight hosts, for a total raw storage capacity of 159TB, corresponding to 53TB of usable storage given the three-way replication that Ceph uses by default.  About 40% of that capacity was already in use.  In order to accommodate the expected influx of data, we decided to upgrade capacity by adding new disks.

Since we bought the original disks less than two years ago, the maximum capacity for this type of disk – the low-power/low-cost/low-performance disks intended for “home NAS” applications – has increased from 3TB to 6TB.  So by adding just ten new disks, we could increase total capacity by almost 38%.  We found that for a modest investment, we could significantly extend the usable lifetime of the cluster.  Our friendly neighborhood hardware store had 14 items in stock, so we quickly ordered ten and built them into our servers the next morning.
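For reference, the arithmetic behind those capacity figures:

# raw capacity before the upgrade: fifty-three 3TB disks
$ echo $(( 53 * 3 ))
159
# usable capacity with three-way replication
$ echo $(( 159 / 3 ))
53
# relative increase in raw capacity from ten new 6TB disks (integer division)
$ echo $(( 100 * 10 * 6 / 159 ))
37

The exact figure is 60/159 ≈ 37.7%, hence the “almost 38%” above.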

Rebalancing the Cluster

Now, the Ceph cluster adapts to changes such as disks being added or lost (e.g. by failing) by redistributing data across the cluster.  This is a great feature because it makes the infrastructure very easy to grow and very robust to failures.  It is also quite impressive to watch, because redistribution makes use of the entire capacity of the cluster.  Unfortunately this tends to have a noticeable impact on the performance as experienced by other users of the storage system, in particular when writing to storage.  So in order to minimize annoyance to users, we scheduled the integration of the new disks for Friday late in the afternoon.
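For those unfamiliar with Ceph, the redistribution itself needs no special commands: once the new OSDs are in the cluster, Ceph starts moving data on its own, and one watches the progress with the standard status tools; a sketch:

# the new OSDs appear in the CRUSH tree
$ ceph osd tree
# follow the rebalancing: placement groups pass through backfilling/recovery states
$ ceph status
$ ceph -w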

[Graph: per-server disk write rates in the Ceph cluster during the roughly 48 hours around the insertion of the 6TB disks]

We use Graphite for performance monitoring of various aspects of the BCC cluster.  Here is one of the Ceph-related graphs showing what happened when the disks were added to the cluster as new “OSDs” (Object Storage Daemons).  The graph shows, for each physical disk server in the Ceph cluster, the rate of data written to disk, summed up across all disks for a given server.  The grey, turquoise, and yellow graphs correspond to servers h4, h1s, and h0s, respectively.  These servers are the ones that got the new 6TB disks: h4 got 5, h1s got 3, and h0s got 2.

We can see that the process of reshuffling data took about 20 hours, starting shortly after 1700 on Friday, and terminating around 1330 on Saturday.  The rate of writing to the new disks exceeded a Gigabyte per second for several hours.

Throughput is limited by the speed at which the local filesystem can write to the disks (in particular the new ones) over the 6 Gb/s SATA channels, and by how fast the data copies can be retrieved from the old disks.  As most of the replication traffic crosses the network, the network could also become a bottleneck.  But each of our servers has dual 10GE connections, so the network supports more throughput per server than the disks can handle.  Why does it get slower over time? I guess one reason is that writing to a fresh file system is faster than writing to one that already has data on it, but I’m not sure whether that is a sufficient explanation.
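A rough plausibility check of that claim, with the per-disk write speed being an assumption on my part (something like 180 MB/s of sustained sequential writes for a 6TB NAS-class drive) rather than a measured value:

# network ceiling per server: dual 10GE links, expressed in MB/s
$ echo $(( 2 * 10000 / 8 ))
2500
# rough disk-side ceiling for h4, which received five of the new disks
$ echo $(( 5 * 180 ))
900

Together with the writes going to the existing disks, that is in the same ballpark as the roughly one gigabyte per second we observed, and comfortably below the server’s network ceiling.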

Outlook

Based on the experiences from the BCC project, SWITCH decided to start building cloud infrastructure in earnest.  We secured expandable space in two university data center locations in Lausanne and Zurich, and started deploying new clusters there in Spring 2014.  We are now in the process of fine-tuning the configuration in order to ensure reliable operation, scalability, and fast and future-proof networking.  These activities are supported by the CUS program P-2 “information scientifique” as project “SCALE”.  We hope to make OpenStack-based self-service VMs available to the first external users this Fall.