SWITCHengines storage server evolution
- First generation, since March 2014: 2U Dalco based on Intel S2600GZ, 2×E5-2650v2 CPUs, 128GB RAM, 2×200GB Intel DC S3610 SSD, 12×WD SE 4TB
- Second generation, since Dec. 2015: 2U Dalco based on Intel S2600WTT, 1×E5-2620v4 CPU, 64GB RAM, 2×200GB Intel DC S3610 SSD, 12×WD SE 4TB
- Third generation, since June 2017: 1U Quanta S1Q-1ULH-8, 1×Xeon D-1541 CPU, 64GB RAM, 2×240GB Micron 5100 MAX SSD, 12×HGST Ultrastar He8 (TB)
What all servers have in common: 2×10GE (SFP+ DAC) network connections, redundant power supplies, simple BMC modules connected to separate GigE network.
We run all those servers together in a single large Ceph RADOS cluster (actually we have two clusters in different towns, but for this article I focus on just the larger and more heavily loaded one). The cluster has 480 OSDs, contains about 500TiB user data, mostly RBD block devices used by OpenStack instances, and some S3 object storage, including video streaming directly from RadosGW to browsers. Cluster-wide I/O rates during my measurements were around 2’700 IOPS, 150MB/s read, 65MB/s write. We didn’t apply any particular optimization for energy or otherwise.
To understand the story behind the server types: Initially we used the same server chassis for compute and storage servers. We also used the same relatively generous CPU and RAM configurations. This would have allowed us to turn compute into storage servers (or vice-versa) relatively easily. When purchasing the second server generation we saved some money by reducing CPU power and RAM. For the third generation, we opted for increased density and efficiency made possible by a “system-on-a-chip” (Xeon D)-based server design.
Power measurement results
All these servers have IPMI-accessible power sensors. Last week my colleagues did some measurement with an external power meter, and found that (for the servers they tested—not all types, for lack of time) the values from the IPMI readers are within 5% or so of the values from the “real” power meter. Good enough!
Unfortunately we don’t yet feed IPMI measurements into any of our continuous measurement tools (Carbon/Graphite/Grafana or Nagios). If you do that, please use the comments to tell us how you set this up.
But recently I looked at the IPMI power consumption readings for these servers during a time of relatively light use (weekend) and got the following results:
- Gen 1: 248W
- Gen 2: 205W
- Gen 3: 155W
Note that the Gen 3 servers have larger disks, and thus Ceph puts twice as much data on them, and thus they get double the IOPS than the old servers. Still, they use significantly less power. This is partly due to the simplified mainboard and more modern CPU, partly to the Helium-filled disks that only draw ~4.5W each (when idle) as opposed to ~7.5W for the older 4TB drives.
Cost to power a Terabyte-year of user data
Just for fun, I also performed some cost calculations in relation to usable space, under the following assumptions:
- We pay 0.15 €/kWh. (actually I used CHF, but doesn’t really matter—some countries will pay more than this even without overhead, others pay only half, so this would cover some of the other directly energy-dependent costs like A/C and redundancy. Anyway it’s about the relative costs 🙂
- We can fill disks up to an average of 70% before things get messy.
When using traditional three-way replication, storing a usable Terabyte for a year costs us the following amounts just for power:
- Gen 1: € 29.10
- Gen 2: € 24.05
- Gen 3: € 9.09 (note again that these have twice the capacity)
If we assume Erasure Coding with 50% overhead, e.g. 2+1, then the power cost would go down to
- Gen 1: € 14.55
- Gen 2: € 12.03
- Gen 3: € 4.55
We could consider even more space-efficient EC configurations, but I don’t have any experience with that… “left as an exercise for the reader”.
In conclusion, we could say that advances in hardware (more efficient servers, larger and more efficient disks), software (EC in Ceph), as well as our own optimizations (less spare CPU/RAM) have brought down the power component of our storage costs by a factor of 6.5 over these 3.5 years. Not bad, huh? Of course there are trade-offs: the new servers have lower IOPS per space, EC uses more CPU and disk-read operations etc.
The next frontier: powering down idle disks
Finally, I also took an unused server of the Gen 3 ones (we keep a few powered down until we need them—I powered one on for the test). It consumed 136W, not much less than the 155W under (light) load. Although the 12 8TB disks weren’t even mounted, they were spinning. Putting them all in “standby” mode with sudo hdparm -y lowered the total system load to just 82W. So for infrequently accessed “cold storage”, there’s even more room for optimization—although it might be tricky to leverage standby mode in practice with a system such as Ceph. At least scrubbing strategy would need to be adapted, I guess.