SWITCH Cloud Blog

Upgrading a Ceph Cluster from 170 to 200 Disks, in One Image


The infrastructure underlying SWITCHengines includes two Ceph storage clusters, one in Lausanne and one in Zurich. The Zurich one (which notably serves SWITCHdrive) filled up over the past year. In December 2015 we acquired new servers to upgrade its capacity.

The upgrade involves the introduction of a new “leaf-spine” network architecture based on “whitebox” switches and Layer-3 (IP) routing to ensure future scalability. The pre-existing servers are still connected to the “old” network consisting of two switches and a single Layer 2 (Ethernet) domain.

First careful steps: 160→161→170

This change in network topology, and in particular the necessity to support both the old and new networks, caused us to be very careful when adding the new servers. The old cluster consisted of 160 Ceph OSDs, running on sixteen servers with ten 4TB hard disks each. We first added a single server with a single disk (OSD) and observed that it worked well. Then we added nine more OSDs on that first new server to bring the cluster total up to 170 OSDs. That also worked flawlessly.

Now for real: 170→200

As the next step, we added three new servers with ten disks each to the cluster at once, to bring the total OSD count from 170 to 200. We did this over the weekend because it causes a massive shuffling of data within the cluster, which slows down normal user I/O.

What should we expect to happen?

All in all, 28.77% of the existing storage objects in the system had to be migrated, corresponding to about 106 Terabytes of raw data. Most of the data movement is from the 170 old towards the 30 new disks.

How long should this take? One can make some back-of-the-envelope calculations. In a perfect world, writing 106 Terabytes to 30 disks, each of which sustains a write rate of 170 MB/s, would take around 5.8 hours. In Ceph, every byte written to an OSD has to go through a persistent “journal”, which is implemented using an SSD (flash-based solid-state disk). Our systems have two SSDs, each of which sustains a write rate of about 520 MB/s. Taking this bottleneck into account, the lower bound increases to 9.5 hours.

However this is still a very theoretical number, because it fails to include many other bottlenecks and types of overhead: disk controller and bus capacity limitations, processing overhead, network delays, reading data from the old disks etc. But most importantly, the Ceph cluster is actively used, and performs other maintenance tasks such as scrubbing, all of which competes with the movement of data to the new disks.

What do we actually see?

Here is a graph that illustrates what happens after the 30 new disks (OSDs) are added:


The y axis is the disk usage (as per output of the df command). The thin grey lines—there are 170 of them—correspond to each of the old OSDs. The thin red lines correspond to the 30 new OSDs. The blue line is the average disk usage across the old OSDs, the green line the average of the new OSDs. At the end of the process, the blue and green line should (roughly) meet.

So in practice, the process takes about 30 hours. In perspective, this is still quite fast and corresponds to a mean overall data-movement rate of about 1 GB/s or 8 Gbit/s. The green and blue lines show that the overall process seems very steady as it moves data from the old to the new OSDs.

Looking at the individual line “bundles”, we see that the process is not all that homogeneous. First, even within in the old line bundle, we see quite a bit of variation across the fill levels of the 170 disks. There is some variation at the outset, and it seems to get worse throughout the process. An interesting case is the lowest grey line—this is an OSD that has significantly less data that the others. I had hoped that the reshuffling would be an opportunity to make it approach the others (by shedding less data), but the opposite happened.

Anyway, a single under-utilized disk is not a big problem. Individual over-utilized disks are a problem, though. And we see that there is one OSD that has significantly higher occupancy. We can address this by explicit “reweighting” if and when this becomes a problem as the cluster fills up again. But then, we still have a couple of disk servers that we can add to the cluster over the coming months, to make sure that overall utilization remains in a comfortable range.


The graph above has been created using Graphite with the following graph definition:

 "target": [
 "height": 600

df_160+1+9+30The base data was collected by CollectD’s standard “df” plugin. zhdk00{44,51,52} are the new OSD servers, the others are the pre-existing ones.

Zooming out a bit shows the previous small extension steps mentioned above. As you see, adding nine disks doesn’t take much longer than adding a single one.



3 thoughts on “Upgrading a Ceph Cluster from 170 to 200 Disks, in One Image

  1. Hi Simon, that’s very useful information, thanks for the writeup! Would you mind sharing your CRUSH tunables setting, and changes you made to your CRUSH tunables during the upgrade, if any?

    Liked by 1 person

    • Sorry for the late reply—I needed Saverio to tell me how to extract those. I hope WordPress doesn’t mess up the formatting too badly:

      $ ceph osd crush show-tunables
      “choose_local_tries”: 0,
      “choose_local_fallback_tries”: 0,
      “choose_total_tries”: 50,
      “chooseleaf_descend_once”: 1,
      “chooseleaf_vary_r”: 0,
      “straw_calc_version”: 1,
      “allowed_bucket_algs”: 22,
      “profile”: “unknown”,
      “optimal_tunables”: 0,
      “legacy_tunables”: 0,
      “require_feature_tunables”: 1,
      “require_feature_tunables2”: 1,
      “require_feature_tunables3”: 0,
      “has_v2_rules”: 0,
      “has_v3_rules”: 0,
      “has_v4_buckets”: 0