For several years we have been running OpenStack and Ceph clusters as part of SWITCHengines, an IaaS offering for the Swiss academic community. Initially, our main “job” for Ceph was to provide scalable block storage for OpenStack VMs—which it does quite well. But from early on we also provided S3-based object storage via RadosGW (and Swift, but that’s outside the scope of this post). This easy-to-use object storage turned out to be popular far beyond our initial expectations.
One valuable feature of RadosGW is that it integrates with Keystone, the Authentication and Authorization service in OpenStack. This meant that any user of our OpenStack offering can create, within her Project/tenant, EC2-compatible credentials to set up, and manage access to, S3 object store buckets. And they sure did! SWITCHengines users started to use our object store to store videos (and stream them directly from our object store to users’ browsers), research data for archival and dissemination, external copies from (parts of) their enterprise backup systems, and presumably many other interesting things; a “defining characteristic” of the cloud is that you don’t have to ask for permission (see “On-demand self-service” in the NIST Cloud definition)—though as a community cloud provider, we are happy to hear about, and help with, specific use cases.
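The self-service part really is a single CLI call. As a template (placeholders in angle brackets; your client setup may differ), creating and retrieving EC2-style credentials looks like this:

```shell
# Create an EC2-style access/secret pair in Keystone, scoped to the
# current project; any S3 client can then use it against RadosGW.
openstack ec2 credentials create

# Recover existing pairs later:
openstack ec2 credentials list
```

The resulting access/secret pair goes straight into whatever S3 client the user prefers.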
Now this sounds pretty close to cloud nirvana, but… there was a problem: Each time a client made an authenticated (signed) S3 request on any bucket, RadosGW had to outsource the validation of the request signature to Keystone, which would return either the identity of the authenticated user (which RadosGW could then use for authorization purposes), or a negative reply in case the signature didn’t validate. Unfortunately, this outsourced signature validation turns out to carry significant per-request overhead. In fact, for “easy” requests such as reading and writing small objects, the authentication overhead easily dominates total processing time. For a sense of the magnitude: according to the logs of our nginx-based HTTPS server that acts as a front end to the RadosGW nodes, small requests without Keystone validation often take <10ms to complete, whereas any request involving Keystone takes at least 600ms.
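To make the round trip concrete: RadosGW essentially forwards the request’s string-to-sign, the access key ID, and the client’s signature to Keystone (via the OS-S3 token API), and Keystone recomputes the signature from the stored secret. A minimal sketch of the SigV2-style computation and the kind of payload involved—credentials are made up and the payload field names are illustrative, not copied from RadosGW’s source:

```python
import base64
import hmac
from hashlib import sha1


def sign_v2(secret_key: str, string_to_sign: str) -> str:
    """AWS signature v2: base64(HMAC-SHA1(secret, string-to-sign))."""
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(), sha1).digest()
    return base64.b64encode(digest).decode()


# Made-up credentials and request, for illustration only.
access = "AKIAEXAMPLE"
secret = "example-secret"
string_to_sign = "GET\n\n\nThu, 01 Feb 2020 19:51:21 +0000\n/my-bucket/my-object"

# Roughly the information RadosGW has to send to Keystone so that
# Keystone can redo this computation and confirm the identity.
payload = {
    "credentials": {
        "access": access,
        "token": base64.b64encode(string_to_sign.encode()).decode(),
        "signature": sign_v2(secret, string_to_sign),
    }
}
```

The HMAC itself is microseconds of work; it is the extra HTTPS round trip to Keystone per S3 request that costs the ≥600ms.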
One undesirable effect is that our users probably wonder why simple requests have such a high baseline response time. Transfers of large objects are less affected, because at some point processing time is dominated by the Rados/network transfer of the user data.
But an even worse effect is that S3 users could, by using client software that “aggressively” exploited parallelism, put very high load on our Keystone service, to the point that OpenStack operations sometimes ran into timeouts when they needed to use the authentication/authorization service.
In our struggle to cope with this recurring issue, we found a somewhat ugly workaround: When we found an EC2 credential in Keystone whose use in S3/RadosGW contributed significant load, we extracted that credential (basically an ID/secret pair) from Keystone and provisioned it locally in all of our RadosGW instances. This always solved the individual performance problem for that client: response times dropped by 600ms immediately, and load on our Keystone system subsided.
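The re-provisioning step amounted to recreating the same access/secret pair as a local RadosGW credential. A template of the kind of radosgw-admin invocations involved (user ID and key material are placeholders):

```shell
# One-time: create a local RadosGW user to own the credential.
radosgw-admin user create --uid=<uid> --display-name="<name>"

# Provision the Keystone EC2 pair as a local S3 key, so signature
# validation happens in RadosGW without a Keystone round trip.
radosgw-admin key create --uid=<uid> --key-type=s3 \
    --access-key=<access> --secret-key=<secret>
```

Once the key exists locally, RadosGW finds it before falling back to Keystone.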
While the workaround fixed our immediate troubles, it was deeply unsatisfying in several ways:
- Need to identify “problematic” S3 uses that caused high Keystone load
- Need to (more or less manually) re-provision Keystone credentials in RadosGW
- Risk of “credential drift” in case the Keystone credentials changed (or disappeared) after their re-provisioning in RadosGW—the result would be that clients would still be able to access resources that they shouldn’t (anymore).
But the situation was bearable for us, and we basically resigned ourselves to fixing performance emergencies every once in a while, until maybe one day someone would write a Python script or something that would synchronize EC2 credentials between Keystone and RadosGW…
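That script never materialized on our side, but its core would be small. A hypothetical sketch of the mapping step: Keystone stores EC2 credentials as records of type ec2 whose blob is a JSON string containing the access and secret keys, and each such record maps to one radosgw-admin invocation. The record below is made up; a real script would fetch records from Keystone’s /v3/credentials API and also handle updates and deletions:

```python
import json


def ec2_blob_to_radosgw_cmd(credential: dict) -> list[str]:
    """Map one Keystone credential record (type 'ec2') to the
    radosgw-admin command that provisions it locally."""
    blob = json.loads(credential["blob"])  # {"access": ..., "secret": ...}
    return [
        "radosgw-admin", "key", "create",
        "--uid", credential["user_id"],
        "--key-type", "s3",
        "--access-key", blob["access"],
        "--secret-key", blob["secret"],
    ]


# Made-up Keystone credential record, for illustration.
record = {
    "type": "ec2",
    "user_id": "alice",
    "blob": json.dumps({"access": "AKIAEXAMPLE", "secret": "example-secret"}),
}
cmd = ec2_blob_to_radosgw_cmd(record)
```

The commands could then be executed via subprocess on each RadosGW node, or fed into configuration management.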
PR #26095: A New Hope
But then, out of the blue, James Weaver from the BBC contributed PR #26095, rgw: Added caching for S3 credentials retrieved from keystone. This changes the approach to signature validation when credentials are found in Keystone: the key material (including the secret key) retrieved from Keystone is cached by RadosGW, and RadosGW then always performs signature validation locally.
James’s change was merged into master and will presumably ship with the “O” release of Ceph. We run Nautilus, and when we got wind of this change, we were excited to try it out. We had some discussions as to whether the patch might be backported to Nautilus; in the end we considered that unlikely in its current state, because the patch unconditionally changes the behavior in a way that could violate some security assumptions (e.g. that EC2 secrets never leave Keystone).
We usually avoid carrying local patches, but in this case we were sufficiently motivated to go and cherry-pick the change on top of the version we were running (initially v14.2.5, later v14.2.6 and v14.2.7). We basically followed the instructions on how to build Ceph, but after cloning the Ceph repo, ran
```shell
git checkout v14.2.7
git cherry-pick affb7d396f76273e885cfdbcd363c1882496726c -m 1 -v
```

Then we edited debian/changelog and prepended:

```
ceph (14.2.7-1bionic-switch1) stable; urgency=medium

  * Cherry-picked upstream pull #26095:
    rgw: Added caching for S3 credentials retrieved from keystone

 -- Simon Leinen <email@example.com>  Thu, 01 Feb 2020 19:51:21 +0000
```
Then, dpkg-buildpackage and wait for a couple of hours…
We tested the resulting RadosGW packages in our staging environment for a couple of days before trying them in our production clusters.
When we activated the patched RadosGW in production, the effects were immediately visible: The CPU load of our Keystone system went down by orders of magnitude.
On 2020-01-27 at around 08:00, we upgraded our first production cluster’s RadosGWs. Twenty-four hours later, we upgraded the RadosGWs on the second cluster. The baseline load on our Keystone service dropped visibly after the first upgrade, but some high load peaks could still be seen. Since the second region was upgraded, there have been no sharp peaks anymore. There is a periodic load increase every night between 03:10 and 04:10, presumably due to some charging/accounting system doing its thing. These peaks were probably “always” there, but they only became apparent once we started deploying the credential-caching code.
The 95th-percentile latency of “small” requests (defined as both $body_bytes_sent and $request_length being lower than 65536) was reduced from ~750ms to ~100ms:
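These numbers come from our nginx access logs, and the measurement is easy to reproduce. A sketch that computes the 95th-percentile request time over “small” requests from pre-parsed log fields—the keys mirror nginx’s $request_time, $body_bytes_sent and $request_length variables, and the sample records are made up:

```python
import math


def p95_small_request_time(records: list[dict], limit: int = 65536) -> float:
    """95th-percentile request time (seconds) over requests whose
    body_bytes_sent and request_length are both below `limit`."""
    small = sorted(
        r["request_time"]
        for r in records
        if r["body_bytes_sent"] < limit and r["request_length"] < limit
    )
    if not small:
        raise ValueError("no small requests in sample")
    # Nearest-rank definition of the 95th percentile.
    idx = max(math.ceil(0.95 * len(small)) - 1, 0)
    return small[idx]


# Made-up sample: 90 fast small requests, a slow tail of 10, and one
# large transfer that the size filter should exclude.
sample = (
    [{"request_time": 0.05, "body_bytes_sent": 1000, "request_length": 500}] * 90
    + [{"request_time": 0.75, "body_bytes_sent": 1000, "request_length": 500}] * 10
    + [{"request_time": 12.0, "body_bytes_sent": 10**8, "request_length": 500}]
)
```

Feeding such records straight from a parsed access log gives the before/after comparison above.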
Conclusion and Outlook
We owe the BBC a beer.
To make the patch perfect, maybe it would be cool to limit the lifetime of cached credentials to some reasonable value such as a few hours. This would limit the damage in case credentials need to be invalidated. Though I guess you could always restart all RadosGW processes to drop any cached credentials immediately.
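Bounding the cache lifetime is straightforward in principle; a toy TTL-cache sketch (our illustration, not RadosGW’s actual data structure—in a real deployment one would expect the TTL to be a configuration option):

```python
import time


class TTLCredentialCache:
    """Cache access-key -> secret mappings for at most `ttl` seconds, so
    that revoking a credential in Keystone takes effect within `ttl`."""

    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable for testing
        self._entries: dict[str, tuple[float, str]] = {}

    def put(self, access_key: str, secret: str) -> None:
        self._entries[access_key] = (self.clock(), secret)

    def get(self, access_key: str):
        entry = self._entries.get(access_key)
        if entry is None:
            return None  # miss: caller must re-fetch from Keystone
        stored_at, secret = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[access_key]  # expired: force re-validation
            return None
        return secret
```

An expired entry simply forces the next request back through the slow Keystone path, refreshing the cache.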
If you are interested in using our RadosGW packages made by cherry-picking PR #26095 on top of Nautilus, please contact us. Note that we only have x86_64 packages for Ubuntu 18.04 “Bionic” GNU/Linux.