SWITCH Cloud Blog


SWITCHengines Under the Hood: Basic IPv6 Configuration

My last post, IPv6 Finally Arriving on SWITCHengines, described what users of our IaaS offering can expect from our newly introduced IPv6 support: Instances using the shared default network (“private”) will get publicly routable IPv6 addresses.

This post explains how we set this up, and why we decided to go this route.  We hope that this is interesting to curious SWITCHengines users, and useful for other operators of OpenStack infrastructure.

Before IPv6: Neutron, Tenant Networks and Floating IP

[Feel free to skip this section if you are familiar with Tenant Networks and Floating IPs.]

SWITCHengines uses Neutron, the current generation of OpenStack Networking.  Neutron supports user-definable networks, routers and additional “service” functions such as Load Balancers or VPN gateways.  In principle, every user can build her own network (or several) isolated from the other tenants.  There is a default network accessible to all tenants.  It is called private, which I find quite confusing because it is totally not private, but shared between all the tenants.  But it has a range of private (in the sense of RFC 1918) IPv4 addresses—a subnet in OpenStack terminology—that is used to assign “fixed” addresses to instances.

There is another network called public, which provides global connectivity.  Users cannot connect instances to it directly, but they can use Neutron routers (which include NAT functionality) to route between the private (RFC 1918) addresses of their instances on tenant networks (whether the shared “private” or their own) and the public network, and by extension, the Internet.  By default, they get “outbound-only” connectivity using 1:N NAT, like users behind a typical broadband router.  But they can also request a Floating IP, which can be associated with a particular instance port.  In this case, a 1:1 NAT provides both outbound and inbound connectivity to the instance.

The router between the shared private network and the external public network was provisioned by us; it is called private-router.  Users who build their own tenant networks and want to connect them with the outside world need to set up their own routers.

This is a fairly standard setup for OpenStack installations, although some operators, especially in the “public cloud” business, forgo private addresses and NAT, and let customers connect their VMs directly to a network with publicly routable addresses.  (Sometimes I wish we’d have done that when we built SWITCHengines—but IPv4 address conservation arguments were strong in our minds at the time.  Now it seems hard to move to such a model for IPv4.  But let’s assume that IPv6 will eventually displace IPv4, so this will become moot.)

Adding IPv6: Subnet, router port, return route—that’s it

So at the outset, we have

  • a shared internal network called private
  • a provider network with Internet connectivity called public
  • a shared router between private and public called private-router

We use the “Kilo” (2015.1) version of OpenStack.

As another requirement, the “real” network underlying the public network (in our case a VLAN) needs connectivity to the IPv6 Internet.

Create an IPv6 Subnet with the appropriate options

And of course we need a fresh range of IPv6 addresses that we can route on the Internet.  A single /64 will be sufficient.  We use this to define a new Subnet in Neutron:

neutron subnet-create --ip-version 6 --name private-ipv6 \
  --ipv6-ra-mode dhcpv6-stateless --ipv6-address-mode dhcpv6-stateless \
  private 2001:620:5ca1:80f0::/64

Note that we use dhcpv6-stateless for both ra-mode and address-mode.  This will actually use SLAAC (stateless address autoconfiguration) and router advertisements to configure IPv6 on the instance.  Stateless DHCPv6 could be used to convey information such as name server addresses, but I don’t think we’re actively using that now.

We should now see a radvd process running in an appropriate namespace on the network node.  And instances—both new and pre-existing!—will start to get IPv6 addresses if they are configured to use SLAAC, as is the default for most modern OSes.

Create a new port on the shared router to connect the IPv6 Subnet

Next, we need to add a port to the shared private-router that connects this new subnet with the outside world via the public network:

neutron router-interface-add private-router private-ipv6

Configure a return route on each upstream router

Now the outside world also needs a route back to our IPv6 subnet.  The subnet is already part of a larger aggregate that is routed toward our two upstream routers.  It is sufficient to add a static route for our subnet on each of them.  But where do we point that route to, i.e. what should be the “next hop”? We use the link-local address of the external (gateway) port of our Neutron router, which we can find out by looking inside the namespace for the router on the network node.  Our router private-router has UUID 2b8d1b4f-1df1-476a-ab77-f69bb0db3a59.  So we can run the following command on the network node:

$ sudo ip netns exec qrouter-2b8d1b4f-1df1-476a-ab77-f69bb0db3a59 ip -6 addr list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536
 inet6 ::1/128 scope host
 valid_lft forever preferred_lft forever
55: qg-2d73d3fb-f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
 inet6 2001:620:5ca1:80fd:f816:3eff:fe00:30d7/64 scope global dynamic
 valid_lft 2591876sec preferred_lft 604676sec
 inet6 fe80::f816:3eff:fe00:30d7/64 scope link
 valid_lft forever preferred_lft forever
93: qr-02b9a67d-24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
 inet6 2001:620:5ca1:80f0::1/64 scope global
 valid_lft forever preferred_lft forever
 inet6 fe80::f816:3eff:fe7d:755b/64 scope link
 valid_lft forever preferred_lft forever
98: qr-6aaf629f-19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
 inet6 fe80::f816:3eff:feb6:85f4/64 scope link
 valid_lft forever preferred_lft forever

The port we’re looking for is the one whose name starts with qr-, the gateway port.  The address we’re looking for is the one starting with fe80:, the link-local address.

The “internal” subnet has address 2001:620:5ca1:80f0::/64, and VLAN 908 (ve908 in router-ese) is the VLAN that connects our network node to the upstream router.  So this is what we configure on each of our routers using the “industry-standard CLI”:

ipv6 route 2001:620:5ca1:80f0::/64 ve 908 fe80::f816:3eff:fe00:30d7

And we’re done! IPv6 packets can flow between instances on our private network and the Internet.

Coming up

Of course this is not the end of the story.  While our customers were mostly happy that they suddenly got IPv6, there are a few surprises that came up.  In a future episode, we’ll tell you more about them and how they can be addressed.


IPv6 Finally Arriving on SWITCHengines

As you may have heard or noticed, the Internet is running out of addresses. It’s time to upgrade from the 35 years old IPv4 protocol, which doesn’t even have a single public address per human on the earth, to the brand new (?) IPv6, which offers enough addresses for every grain of sand in the known universe, or something like that.

SWITCH is a pioneer in IPv6 adoption, and has been supporting IPv6 on all network connections and most services in parallel with IPv4 (“dual stack”) for many years.

To our embarrassment, we hadn’t been able to integrate IPv6 support into SWITCHengines from the start. While OpenStack had some IPv6 support, the implementation wasn’t mature, and we didn’t know how to fit it into our network model in a user-friendly way.

IPv6: “On by default” and globally routable

About a month ago we took a big step to change this: IPv6 is now enabled by default for all instances on the shared internal network (“private”).  So if you have an instance running on SWITCHengines, and it isn’t connected to a tenant network of your own, then the instance probably has an IPv6 address right now, in addition to the IPv4 address(es) it always had.  Note that this is true even for instances that were created or last rebooted before we turned on IPv6. On Linux-derived systems you can check using ifconfig eth0 or ip -6 addr list dev eth0; if you see an address that starts with 2001:620:5ca1:, then your instance can speak IPv6.

Note that these IPv6 addresses are “globally unique” and routable, i.e. they are recognized by the general Internet.  In contrast, the IPv4 addresses on the default network are “private” and can only be used locally inside the cloud; communication with the general Internet requires Network Address Translation (NAT).

What you can do with an IPv6 address

Your instance will now be able to talk to other Internet hosts over IPv6. For example, try ping6 mirror.switch.ch or traceroute6 www.facebook.com. This works just like IPv4, except that only a subset of hosts on the Internet speaks IPv6 yet. Fortunately, this subset already includes important services and is growing.  Because IPv6 doesn’t need NAT, routing between your instances and the Internet is less resource-intensive and a tiny bit faster than with IPv4.

But you will also be able to accept connections from other Internet hosts over IPv6. This is different from before: To accept connections over IPv4, you need(ed) a separate public address, a Floating IP in OpenStack terminology.  So if you can get by with IPv6, for example because you only need (SSH or other) access from hosts that have IPv6, then you don’t need to reserve a Floating IP anymore.  This saves you not just work but also money—public IPv4 addresses are scarce, so we need to charge a small “rent” for each Floating IP reserved.  IPv6 addresses are plentiful, so we don’t charge for them.

But isn’t this dangerous?

Instances are now globally reachable by default, but they are still protected by OpenStack’s Security Groups (corresponding to packet filters or access control lists).  The default Security Group only allows outbound connections: Your instance can connect to servers elsewhere, but attempts to connect to your instance will be blocked.  You have probably opened some ports such as TCP port 22 (for SSH) or 80 or 443 (for HTTP/HTTPS) by adding corresponding rules to your own Security Groups.  In these rules, you need to specify address “prefixes” specifying where you want to accept traffic from.  These prefixes can be IPv4 or IPv6—if you want to accept both, you need two rules.

If you want to accept traffic from anywhere, your rules will contain 0.0.0.0/0 as the prefix. To accept IPv6 traffic as well, simply add identical rules with ::/0 as the prefix instead—this is the IPv6 version of the “global” prefix.

What about domain names?

These IPv6 addresses can be entered in the DNS using “AAAA” records. For Floating IPs, we provided pre-registered hostnames of the form fl-34-56.zhdk.cloud.switch.ch. We cannot do that in IPv6, because there are just too many possible addresses. If you require your IPv6 address to map back to a hostname, please let us know and we can add it manually.

OpenStack will learn how to (optionally) register such hostnames in the DNS automatically; but that feature was only added to the latest release (“Mitaka”), and it will be several months before we can deploy this in SWITCHengines.

Upcoming

We would like to also offer IPv6 connectivity to user-created “tenant networks”. Our version of OpenStack almost supports this, but it cannot be fully automated yet. If you need IPv6 on your non-shared network right now, please let us know via the normal support channel, and we’ll set something up manually. But eventually (hopefully soon), getting a globally routable IPv6 prefix for your network should be (almost) as easy as getting a globally routable Floating IP is now.

You can also expect services running on SWITCHengines (SWITCHdrive, SWITCHfilesender and more) to become exposed over IPv6 over the next couple of months. Stay tuned!


New version, new features

We are constantly working on SWITCHengines, updating, tweaking stuff. Most of the time, little of this process is visible to users, but sometimes we release features that make a difference in the user experience.

A major change was the upgrade to OpenStack Kilo that we did mid March. OpenStack is the software that powers our cloud, and it gets an update every 6 months. The releases are named alphabetically. Our clouds history started with the “Icehouse” release, moved to “Juno” and now we are on “Kilo”. Yesterday “Mitaka” was released, so we are 2 releases (or 12 months) behind.

Upgrading the cloud infrastructure is major work. Our goal is to upgrade “in place” with all virtual machines running uninterrupted during the upgrade. Other cloud operators install a new version on minimal hardware, then start to migrate the customer machines one by one to the new hardware, and converting the hypervisors one by one. This is certainly feasible, but it causes downtime – something we’d like to avoid.

Therefore we spend a lot of time, testing the upgrade path. The upgrade from “Icehouse” to “Juno”took over 6 months (first we needed to figure out how to do the upgrade in the first place, then had to implement and test it). The upgrade from “Juno” to “Kilo” then only took 4 months (with x-mas and New Year in it). Now we are working on the upgrade to “Liberty” which is planned to happen before June / July. This time, we plan to be even faster, because we are going to upgrade the many components of OpenStack individually. The just release “Mitaka” release should be done before “Newton” is release in October. Our plan is to be at most 6 months behind the official release schedule.

So what does Kilo bring you, the end user? A slightly different user interface, loads of internal changes and a few new major features:

There is also stuff coming in the next few weeks:

  • Access to the SWIFT object store
  • Backup of Volumes (that is something we are testing right now)
  • IPv6 addresses for virtual machines

We have streamlined the deployment process of changes – while we did releases once a week during the last year, we now can deploy new features as soon as they are finished and tested.

 


3 Comments

Upgrading a Ceph Cluster from 170 to 200 Disks, in One Image

The infrastructure underlying SWITCHengines includes two Ceph storage clusters, one in Lausanne and one in Zurich. The Zurich one (which notably serves SWITCHdrive) filled up over the past year. In December 2015 we acquired new servers to upgrade its capacity.

The upgrade involves the introduction of a new “leaf-spine” network architecture based on “whitebox” switches and Layer-3 (IP) routing to ensure future scalability. The pre-existing servers are still connected to the “old” network consisting of two switches and a single Layer 2 (Ethernet) domain.

First careful steps: 160→161→170

This change in network topology, and in particular the necessity to support both the old and new networks, caused us to be very careful when adding the new servers. The old cluster consisted of 160 Ceph OSDs, running on sixteen servers with ten 4TB hard disks each. We first added a single server with a single disk (OSD) and observed that it worked well. Then we added nine more OSDs on that first new server to bring the cluster total up to 170 OSDs. That also worked flawlessly.

Now for real: 170→200

As the next step, we added three new servers with ten disks each to the cluster at once, to bring the total OSD count from 170 to 200. We did this over the weekend because it causes a massive shuffling of data within the cluster, which slows down normal user I/O.

What should we expect to happen?

All in all, 28.77% of the existing storage objects in the system had to be migrated, corresponding to about 106 Terabytes of raw data. Most of the data movement is from the 170 old towards the 30 new disks.

How long should this take? One can make some back-of-the-envelope calculations. In a perfect world, writing 106 Terabytes to 30 disks, each of which sustains a write rate of 170 MB/s, would take around 5.8 hours. In Ceph, every byte written to an OSD has to go through a persistent “journal”, which is implemented using an SSD (flash-based solid-state disk). Our systems have two SSDs, each of which sustains a write rate of about 520 MB/s. Taking this bottleneck into account, the lower bound increases to 9.5 hours.

However this is still a very theoretical number, because it fails to include many other bottlenecks and types of overhead: disk controller and bus capacity limitations, processing overhead, network delays, reading data from the old disks etc. But most importantly, the Ceph cluster is actively used, and performs other maintenance tasks such as scrubbing, all of which competes with the movement of data to the new disks.

What do we actually see?

Here is a graph that illustrates what happens after the 30 new disks (OSDs) are added:

df_170+30

The y axis is the disk usage (as per output of the df command). The thin grey lines—there are 170 of them—correspond to each of the old OSDs. The thin red lines correspond to the 30 new OSDs. The blue line is the average disk usage across the old OSDs, the green line the average of the new OSDs. At the end of the process, the blue and green line should (roughly) meet.

So in practice, the process takes about 30 hours. In perspective, this is still quite fast and corresponds to a mean overall data-movement rate of about 1 GB/s or 8 Gbit/s. The green and blue lines show that the overall process seems very steady as it moves data from the old to the new OSDs.

Looking at the individual line “bundles”, we see that the process is not all that homogeneous. First, even within in the old line bundle, we see quite a bit of variation across the fill levels of the 170 disks. There is some variation at the outset, and it seems to get worse throughout the process. An interesting case is the lowest grey line—this is an OSD that has significantly less data that the others. I had hoped that the reshuffling would be an opportunity to make it approach the others (by shedding less data), but the opposite happened.

Anyway, a single under-utilized disk is not a big problem. Individual over-utilized disks are a problem, though. And we see that there is one OSD that has significantly higher occupancy. We can address this by explicit “reweighting” if and when this becomes a problem as the cluster fills up again. But then, we still have a couple of disk servers that we can add to the cluster over the coming months, to make sure that overall utilization remains in a comfortable range.

Coda

The graph above has been created using Graphite with the following graph definition:

[
 {
 "target": [
 "lineWidth(alpha(color(collectd.zhdk00{06,07,11,15,17,18,19,20,21,22,23,24,27,29,30,32,43}_*.df-var-lib-ceph-osd-ceph-*.df_complex-used,'black'),0.5),0.5)",
 "lineWidth(color(collectd.zhdk00{44,51,52}_*.df-var-lib-ceph-osd-ceph-*.df_complex-used,'red'),0.5)",
 "lineWidth(color(avg(collectd.zhdk00{06,07,11,15,17,18,19,20,21,22,23,24,27,29,30,32,43}_*.df-var-lib-ceph-osd-ceph-*.df_complex-used),'blue'),2)",
 "lineWidth(color(avg(collectd.zhdk00{44,51,52}_*.df-var-lib-ceph-osd-ceph-*.df_complex-used),'green'),2)"
 ],
 "height": 600
 }
]

df_160+1+9+30The base data was collected by CollectD’s standard “df” plugin. zhdk00{44,51,52} are the new OSD servers, the others are the pre-existing ones.

Zooming out a bit shows the previous small extension steps mentioned above. As you see, adding nine disks doesn’t take much longer than adding a single one.

 

 


Server Power Measurement: Quick Experiment

In December 2015, we received a set of servers to extend the infrastructure that powers SWITCHengines (and indirectly SWITCHdrive, SWITCHfilesender and other services).  Putting these in production will take some time, because this also requires a change in our network setup, but users should start benefiting from it starting in February.

Before the upgrade, we used a single server chassis type for both “compute” nodes—i.e. where SWITCHengines instances are executed as virtual machines—and “storage” nodes where all the virtual disks and other persistent objects are stored.  The difference simply was that some servers were full of high-capacity disks, where the others had many empty slots.  We knew this was wasteful in terms of rack utilization, but it gave us more flexibility while we were learning how our infrastructure was used.

The new servers are different: Storage nodes look very much like the old storage nodes (which, as mentioned, look very similar to the old compute nodes), just with newer motherboard and newer (but also fewer and less powerful) processors.

The compute nodes are very different though: The chassis have the same size as the old ones, but instead of one server or “node”, the new compute chassis contain four.  All four nodes in a chassis share the same set of power supplies and fans, two of each for redundancy.

Now we use tools such as IPMI to remotely monitor our infrastructure to make sure we notice when fans or power supplies fail, or temperature starts to increase to concerning levels.  Each server has a “Baseboard Management Controller” (BMC) that exposes a set of sensors for that.  The BMC also allows resetting or even powering down/up the server (except for the BMC itself!), and getting to the serial or graphical console over the network, all of which can be useful for maintenance.

Each node has its own BMC, and each BMC gives sensor information about the (two) power supplies.  This is a little weird because there are only two power supplies in the chassis, but we can monitor eight—two per node/BMC, of which there are four.  Which raises some doubts: Am I measuring the two power supplies in the chassis at all? Or are the measurements from some kind of internal power supplies that each node has (and that feeds from the central power supplies)?

As a small experiment, I started with a chassis that had all four nodes powered up and running.  I started polling the power consumption readings on one of the four servers every ten seconds.  While that was running, I shut down the three other servers.  Here are the results:

$ while true; do date; \
  sudo ipmitool sensor list | grep 'Power In'; \
  sleep 8; done
Thu Jan 14 12:53:34 CET 2016
PS1 Power In | 310.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:53:43 CET 2016
PS1 Power In | 310.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:53:53 CET 2016
PS1 Power In | 310.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:02 CET 2016
PS1 Power In | 320.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:11 CET 2016
PS1 Power In | 240.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:20 CET 2016
PS1 Power In | 240.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:30 CET 2016
PS1 Power In | 180.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:39 CET 2016
PS1 Power In | 110.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
Thu Jan 14 12:54:48 CET 2016
PS1 Power In | 110.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na
PS2 Power In | 10.000 | Watts | ok | na | na | na | 2030.000 | 2300.000 | na

One observation is that the resolution of the power measurement seems to be 10W.  Another observation is that PS2 consistently draws 10W—which might mean anything between 5 and 15.  Obviously the two power supplies function in active/standby modes and PS1 is the active one.

But the central result is that the power draw of PS1 falls from 310W when all four nodes are running (but not really doing much outside running the operating system) to 110W when only one is running.  This suggests that we’re actually measuring the shared power supplies, and not something specific to the node we were polling.  It also suggests that each node consumes about 70W in this “baseline” state, and that there is a base load of 40W for the chassis.  Of course these numbers are highly unscientific and imprecise, given the trivial number of experiments (one) and the bad sensor resolution and, presumably, precision.

view from Lungarno Pacinotti on the river Arno


Impressions from 19th TF-Storage workshop in Pisa

National Research and Education Networks (NRENs) such as SWITCH exist in every European country. They have a long tradition of working together. An example for this are Task Forces on different topics under the umbrella of the GÉANT Association (formerly TERENA). One of them is TF-Storage, which since 2008 has been a forum to exchange knowledge about various storage technologies and their application in the NREN/academic IT context. Its 19th meeting took place in Pisa last week (13/14 October). It was the first one that I attended on site. But I had been following the group via its mailing list for several years, and the agenda included several topics relevant to our work, so I was looking forward to learning from the presentations and to chatting with people from other NRENs (and some universities) who run systems similar to ours.

Getting there

Zurich is extremely well connected transport-wise, but getting to Pisa without spending an extra night proved to be challenging. I decided to take an early flight to Florence, then drive a rented car to Pisa. That went smoothly until I got a little lost in the suburbs of Pisa, but after two rounds on the one-way lungarni (Arno promenades) I finally had the car parked at the hotel and walked the 100m or so to the venue at the university. Unfortunately I arrived at the meeting more than an hour after it had started.

view from Lungarno Pacinotti on the river Arno

View of the river Arno from Lungarno Pacinotti. The meeting venue is one of the buildings on the right.

Day 1: Ceph, Ceph, Ceph…

The meeting started with two hours of presentations by Joao Eduardo Luis from SUSE about various aspects of Ceph, a distributed file system that we use heavily in SWITCHengines. In the part that I didn’t miss, Joao talked about numerous new features in different stages of development. Sometimes I think it would be better to make the current functionality more robust and easier to use. Especially the promise of more tuning knobs being added seems unattractive to me—from an operator’s point of view it would be much nicer if less tuning were necessary.

The ensuing round-table discussion was interesting. Clearly several people in the room had extensive experience with running Ceph clusters. Especially Panayiotis Gotsis from GRNET asked many questions which showed a deep familiarity with the system.

Next, Axel Rosenberg from Sandisk talked about their work on optimizing Ceph for use with Flash (SSD) storage. Sandisk has built a product called “IFOS” based on Ubuntu GNU/Linux and an enhanced version of Ceph. They identified many bottlenecks in the Ceph code that show up when the disk bottleneck is lifted by use of fast SSDs. Sandisk’s changes resulted in speedup of some benchmarks by a factor of ten—notably with the same type of disks. The improvements will hopefully find their way into “upstream” Ceph and be thoroughly quality-assured. The most interesting slide to me was about work to reduce the impact of recovery from a failed disk. By adding some priorization (I think), they were able to massively improve performance of user I/O during recovery—let’s say rather than being ten times slower than usual, it would only be 40% slower—while the recovery process took only a little bit longer than without the priorization. This is an area that needs a lot of work in Ceph.

Karan Singh from CSC (which is “the Finnish SWITCH”, but also/primarily “the Finnish CSCS”) presented how CSC uses Ceph as well as their Ceph dashboard. Karan has actually written a book on Ceph! CSC plans to use Ceph as a basis for two OpenStack installations, cPouta (classic public/community cloud service) and ePouta (for sensitive research data). They have been doing extensive research of Ceph including some advanced features such as Erasure Coding—which we don’t consider for SWITCHengines just yet. Karan also talked about tuning the system and diagnosing issues, which can lead to discover low-level problems such as network cabling issues in one case he reported.

Simone Spinelli from the hosting university of Pisa talked about how they use Ceph to support an OpenStack based virtual machine hosting service. I discovered that they did many things in a similar way to us, using Puppet, Foreman, Graphite to support installation and operation of their system. An interesting twist is they have multiple smaller sites distributed across the city, and their Ceph cluster spans these sites. In contrast, at SWITCH we operate separate clusters in our two locations in Lausanne and Zurich. There are several technical reasons for doing so, although we consider adding a third cluster that would span the two locations (and adding a tiny third one) for special applications that require resilience against the total failure of a data center or its connection to the network.

Day 2: Scality, OpenStack, ownCloud

The second day was opened by Bradley King from Scality presenting on object stores vs. file stores. This was a wonderful presentation that would be worth a blog post of its own. Although it was naturally focused on Scality’s “RING” product, it didn’t come over as marketing at all, and contained many interesting insights about distributed storage design trade-offs, stories from actual deployments—Scality has several in the multi-Petabyte range—and also some future perspectives, for example about “IP drives”. These are disk drives with Ethernet/IP interfaces rather than the traditional SATA or SAS attachments, and which support S3-like object interfaces. What was new to me was that new disk technologies such as SMR (shingled magnetic recording) and HAMR (heat-assisted magnetic recording) seem to be driving disk vendors towards this kind of interface, as traditional block semantics are becoming quite hard to emulate with these types of disk. My takeaway was that Scality RING looks like a well-designed system, similarly elegant as Ceph, but with some trade-offs leaning towards simplicity and operational ease. To me the big drawback compared to Ceph is that it (like several other “software-defined storage” systems) is closed-source.

The following three were about collaboration activities between NRENs (and, in some cases, vendors):

Maciej Brzeźniak from PSNC (the Polish “SWITCH+CSCS”) talked about the TCO Calculator for (mainly Ceph-based) software-defined storage systems that some TF-Storage members have been working on for several months. Maciej is looking for more volunteers to contribute data to it. One thing that is missing are estimates for network (port) costs. I volunteered to provide some numbers for 10G/40G leaf/spine networks built from “whitebox” switches, because we just went through a procurement exercise for such a project.

Next, yours truly talked about the OSO get-together, a loosely organized group of operators of OpenStack-based IaaS installations that meets every other Friday over videoconferencing. I talked about how the group evolved and how it works, and suggested that this could serve as a blueprint for closer cooperation between some TF-Storage members on some specific topics like building and running Ceph clusters. Because there is significant overlap between the OSO (IaaS) and (in particular Ceph) storage operators, we decided that interested TF-Storage people should join the OSO mailing list and the meetings, and that we see where this will take us. [The next OSO meeting was two days later, and a few new faces showed up, mostly TF-Storage members, so it looks like this could become a success.]

Finally Peter Szegedi from the GÉANT Association talked about the liaison with OpenCloudMesh, which is one aspect of a collaboration of various NRENs (including AARnet from Australia) and other organizations (such as CERN) who use the ownCloud software to provide file synchronization and sharing service to their users. SWITCH also participates in this collaboration, which lets us share our experience running the SWITCHdrive service, and in return provides us with valuable insights from others.

The meeting closed with the announcement that the next meeting would be in Poznań at some date to be chosen later, carefully avoiding clashes with the OpenStack meeting in April 2016. Lively discussions ensued after the official end of the meeting.

Getting back

Driving back from Pisa to Florence airport turned out to be interesting, because the rain, which had been intermittent, had become quite heavy during the day. Other than that, the return trip was uneventful. Unfortunately I didn’t even have time to see the leaning tower, although it would probably have been a short walk from the hotel/venue. But the tiny triangle between meeting venue, my hotel, and the restaurant where we had dinner made a very pleasant impression on me, so I’ll definitely try to come back to see more of this city.

rainy-small

Waiting if the car in front of me makes it safely through the flooded stretch under the bridge… yup, it did.