National Research and Education Networks (NRENs) such as SWITCH exist in every European country. They have a long tradition of working together. One example is the set of Task Forces on different topics under the umbrella of the GÉANT Association (formerly TERENA). One of them is TF-Storage, which since 2008 has been a forum for exchanging knowledge about various storage technologies and their application in the NREN/academic IT context. Its 19th meeting took place in Pisa last week (13/14 October). It was the first one that I attended on site, but I had been following the group via its mailing list for several years. The agenda included several topics relevant to our work, so I was looking forward to learning from the presentations and to chatting with people from other NRENs (and some universities) who run systems similar to ours.
Zurich is extremely well connected transport-wise, but getting to Pisa without spending an extra night proved challenging. I decided to take an early flight to Florence, then drive a rented car to Pisa. That went smoothly until I got a little lost in the suburbs of Pisa, but after two rounds on the one-way lungarni (Arno promenades) I finally had the car parked at the hotel and walked the 100 m or so to the venue at the university. Unfortunately I arrived at the meeting more than an hour after it had started.
Day 1: Ceph, Ceph, Ceph…
The meeting started with two hours of presentations by Joao Eduardo Luis from SUSE about various aspects of Ceph, a distributed storage system that we use heavily in SWITCHengines. In the part that I didn’t miss, Joao talked about numerous new features in different stages of development. Sometimes I think it would be better to make the current functionality more robust and easier to use. In particular, the promise of even more tuning knobs seems unattractive to me: from an operator’s point of view, it would be much nicer if less tuning were necessary.
The ensuing round-table discussion was interesting. Clearly several people in the room had extensive experience with running Ceph clusters. Panayiotis Gotsis from GRNET in particular asked many questions that showed a deep familiarity with the system.
Next, Axel Rosenberg from SanDisk talked about their work on optimizing Ceph for use with flash (SSD) storage. SanDisk has built a product called “IFOS” based on Ubuntu GNU/Linux and an enhanced version of Ceph. They identified many bottlenecks in the Ceph code that show up once the disk bottleneck is lifted by the use of fast SSDs. SanDisk’s changes sped up some benchmarks by a factor of ten, notably with the same type of disks. The improvements will hopefully find their way into “upstream” Ceph and be thoroughly quality-assured. The most interesting slide to me was about work to reduce the impact of recovery from a failed disk. By adding some prioritization (I think), they were able to massively improve the performance of user I/O during recovery (let’s say that rather than being ten times slower than usual, it would only be 40% slower), while the recovery process took only a little longer than without the prioritization. This is an area that needs a lot of work in Ceph.
Karan Singh from CSC (which is “the Finnish SWITCH”, but also/primarily “the Finnish CSCS”) presented how CSC uses Ceph, as well as their Ceph dashboard. Karan has actually written a book on Ceph! CSC plans to use Ceph as a basis for two OpenStack installations, cPouta (classic public/community cloud service) and ePouta (for sensitive research data). They have done extensive research on Ceph, including some advanced features such as erasure coding, which we are not considering for SWITCHengines just yet. Karan also talked about tuning the system and diagnosing issues, which can lead to the discovery of low-level problems; in one case he reported, the culprit turned out to be network cabling.
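For readers wondering why erasure coding is attractive at all, here is a minimal sketch of the capacity arithmetic. It is my own illustration, not something from Karan’s talk: the function names and the 1000 TB figure are made up, and the 3-copy and 4+2 layouts are only common examples, not CSC’s or our actual settings. With n-way replication every usable byte costs n raw bytes, while a k+m erasure code costs (k+m)/k raw bytes per usable byte.

```python
def raw_per_usable_replication(copies: int) -> float:
    """Raw bytes stored per usable byte with n-way replication."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return float(copies)


def raw_per_usable_erasure(k: int, m: int) -> float:
    """Raw bytes stored per usable byte with k data + m coding chunks."""
    if k < 1 or m < 0:
        raise ValueError("need k >= 1 and m >= 0")
    return (k + m) / k


def usable_capacity_tb(raw_tb: float, raw_per_usable: float) -> float:
    """Usable capacity that a given raw capacity yields."""
    return raw_tb / raw_per_usable


if __name__ == "__main__":
    raw_tb = 1000.0  # illustrative raw capacity, not a real cluster size
    for label, ratio in [
        ("3x replication", raw_per_usable_replication(3)),
        ("erasure coding 4+2", raw_per_usable_erasure(4, 2)),
    ]:
        print(f"{label}: {usable_capacity_tb(raw_tb, ratio):.0f} TB usable")
```

The sketch ignores everything that makes the decision hard in practice, such as the extra CPU and network load for reads, writes, and recovery with erasure-coded pools.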
Simone Spinelli from the hosting university of Pisa talked about how they use Ceph to support an OpenStack-based virtual machine hosting service. I discovered that they do many things in a similar way to us, using Puppet, Foreman, and Graphite to support installation and operation of their system. An interesting twist is that they have multiple smaller sites distributed across the city, and their Ceph cluster spans these sites. In contrast, at SWITCH we operate separate clusters in our two locations in Lausanne and Zurich. There are several technical reasons for doing so, although we are considering adding a third cluster that would span the two locations (plus a tiny third one) for special applications that require resilience against the total failure of a data center or its connection to the network.
Day 2: Scality, OpenStack, ownCloud
The second day was opened by Bradley King from Scality presenting on object stores vs. file stores. This was a wonderful presentation that would be worth a blog post of its own. Although it was naturally focused on Scality’s “RING” product, it didn’t come across as marketing at all. It contained many interesting insights about distributed storage design trade-offs, stories from actual deployments (Scality has several in the multi-petabyte range), and also some future perspectives, for example about “IP drives”. These are disk drives that have Ethernet/IP interfaces rather than the traditional SATA or SAS attachments and that support S3-like object interfaces. What was new to me was that new disk technologies such as SMR (shingled magnetic recording) and HAMR (heat-assisted magnetic recording) seem to be driving disk vendors towards this kind of interface, as traditional block semantics are becoming quite hard to emulate with these types of disk. My takeaway was that Scality RING looks like a well-designed system, as elegant as Ceph, but with some trade-offs leaning towards simplicity and operational ease. To me the big drawback compared to Ceph is that it (like several other “software-defined storage” systems) is closed-source.
The following three talks were about collaboration activities between NRENs (and, in some cases, vendors):
Maciej Brzeźniak from PSNC (the Polish “SWITCH+CSCS”) talked about the TCO Calculator for (mainly Ceph-based) software-defined storage systems that some TF-Storage members have been working on for several months. Maciej is looking for more volunteers to contribute data to it. One thing still missing is an estimate of network (port) costs. I volunteered to provide some numbers for 10G/40G leaf/spine networks built from “whitebox” switches, because we just went through a procurement exercise for such a project.
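To make the idea of such a calculation concrete, here is a minimal sketch of how a yearly cost per usable terabyte could be derived. This is not the TF-Storage TCO Calculator, whose actual model I am not reproducing here: the class, the field names, the formula, and all numbers are hypothetical placeholders of mine. The only point is to show where a network (port) cost term would enter.

```python
from dataclasses import dataclass


@dataclass
class ClusterCosts:
    """Hypothetical cost inputs; every field is an assumption of this sketch."""
    server_capex: float        # one-time cost of storage servers and disks
    network_port_capex: float  # one-time cost of switch ports, optics, cabling
    annual_opex: float         # power, hosting, staff, support per year
    lifetime_years: float      # period over which the hardware is written off
    raw_tb: float              # raw disk capacity in TB
    raw_per_usable: float      # e.g. 3.0 for three-way replication


def annual_cost_per_usable_tb(c: ClusterCosts) -> float:
    """Yearly cost per usable TB: (capex spread over lifetime + opex) / usable TB."""
    annual_capex = (c.server_capex + c.network_port_capex) / c.lifetime_years
    usable_tb = c.raw_tb / c.raw_per_usable
    return (annual_capex + c.annual_opex) / usable_tb


if __name__ == "__main__":
    # Placeholder figures for illustration only, not real procurement data.
    example = ClusterCosts(
        server_capex=100_000,
        network_port_capex=20_000,
        annual_opex=10_000,
        lifetime_years=5,
        raw_tb=1000,
        raw_per_usable=3.0,
    )
    print(f"{annual_cost_per_usable_tb(example):.2f} per usable TB per year")
```

Even in this toy form it is easy to see that leaving out the port costs would make a cluster look cheaper than it really is, which is why that input matters.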
Next, yours truly talked about the OSO get-together, a loosely organized group of operators of OpenStack-based IaaS installations that meets every other Friday over videoconferencing. I talked about how the group evolved and how it works, and suggested that this could serve as a blueprint for closer cooperation between some TF-Storage members on specific topics such as building and running Ceph clusters. Because there is significant overlap between the OSO (IaaS) operators and the storage (in particular Ceph) operators, we decided that interested TF-Storage people should join the OSO mailing list and the meetings, and that we would see where this takes us. [The next OSO meeting was two days later, and a few new faces showed up, mostly TF-Storage members, so it looks like this could become a success.]
Finally, Peter Szegedi from the GÉANT Association talked about the liaison with OpenCloudMesh, which is one aspect of a collaboration of various NRENs (including AARNet from Australia) and other organizations (such as CERN) that use the ownCloud software to provide file synchronization and sharing services to their users. SWITCH also participates in this collaboration, which lets us share our experience running the SWITCHdrive service, and in return provides us with valuable insights from others.
The meeting closed with the announcement that the next meeting would be in Poznań at some date to be chosen later, carefully avoiding clashes with the OpenStack meeting in April 2016. Lively discussions ensued after the official end of the meeting.
Driving back from Pisa to Florence airport turned out to be interesting, because the rain, which had been intermittent, had become quite heavy during the day. Other than that, the return trip was uneventful. Unfortunately I didn’t even have time to see the leaning tower, although it would probably have been a short walk from the hotel/venue. But the tiny triangle between meeting venue, my hotel, and the restaurant where we had dinner made a very pleasant impression on me, so I’ll definitely try to come back to see more of this city.