Low cost file serving on bare metal with tiered storage

2018-05-13

I once worked on a project that stored and pushed hundreds of TBs a month, and cost was an important consideration.

AWS S3 is cheap-ish, until you get into multi-TB and PB territory.
AWS bandwidth has always been outrageously expensive.
Google Cloud was cheaper, but only by some factor like 'half the price of AWS.'

Keeping costs low involved custom-building servers by hand, with RAID storage and off-the-shelf consumer components. We colocated them in various datacenters around the US. This was bare-metal world, and a back-of-the-napkin calculation showed TCO was between 10X and 30X cheaper than a naïve AWS implementation, depending on assumptions.

As we grew, one obvious problem was distributing service load between multiple distinct servers. Once there were more than a handful, some got saturated with requests while others sat comparatively idle, and the data was effectively sharded (you had to request your file from the right server, or the request would fail). At the peak of this, there were close to a dozen heterogeneous boxes - more like pets than cattle.

The other problem was that the servers tended to need very large RAID arrays to hold enough files to keep the CPU and network saturated (at the time, 10 Gbps was becoming commonplace). This was expensive, and while the RAID gave us good data safety guarantees, each server itself was a single point of failure for high availability: there were no hot spare servers at first, since the architecture described so far would've required doubling the server count.

Since having offsite backups was important, we were storing a copy of all files in a separate location anyway. This gave us an idea.

We ended up implementing a solution where the backup service actually served as a warm data store for the servers. Whenever a server ran low on disk space, it would 'evict' some of the files in its local storage, similar to how an operating system might page out memory. Whenever a request came in for a file it didn't have locally, the server would fetch it from the warm store / backup, then serve it. It seemed like an elegant solution.
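
The core serve path, then, looks something like this rough sketch (the names here, `warm_store` and `file_record`, are hypothetical stand-ins, not the actual code):

```ruby
# Rough sketch of the fetch-on-miss serve path (hypothetical names, not the
# production code). `warm_store` stands in for a client of the offsite backup.
def serve(file_record)
  path = file_record.local_path

  unless File.exist?(path)
    # Cache miss: page the file back in from the warm store before serving it.
    warm_store.download(file_record.remote_key, to: path)
  end

  send_file path   # hand the actual byte transfer off to nginx (more on this below)
end
```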

In practice, this had a number of interesting and unforeseen complications, while other details I obsessed about ended up being non-issues. First, I worried about the eviction algorithm. Keeping a list of least-recently-used files seemed like an obvious idea, but it was expensive to create and update (as most true LRU implementations are). A naïve random algorithm actually worked great, as long as eviction was restricted to 'aged' files - only files at least two months old were candidates. The reason is that repeated runs of a very infrequent random eviction tend to perform similarly to LRU in practice: popular files almost immediately reappear in the hot cache, and infrequently accessed files just stay in the warm store.
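
A sketch of what an eviction pass along those lines might look like (the directory layout is assumed, and mtime is used here as a stand-in for the file's age):

```ruby
# Hypothetical sketch of a random eviction pass. Only "aged" local copies are
# candidates; every file also exists in the warm store, so deletion is safe.
EVICTION_AGE = 60 * 24 * 60 * 60   # ~2 months, in seconds

def evict!(bytes_needed, cache_dir)
  freed = 0
  candidates = Dir.glob(File.join(cache_dir, "**", "*"))
                  .select { |p| File.file?(p) && Time.now - File.mtime(p) > EVICTION_AGE }
                  .shuffle           # random order stands in for an LRU list

  candidates.each do |path|
    break if freed >= bytes_needed
    freed += File.size(path)
    File.delete(path)
  end
  freed
end
```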

Version 1 was very, very, deeply problematic. Serving an evicted file seems relatively straightforward - it simply requires fetching it from the warm store first. The problem with this design is that it doesn't say what happens when multiple requests come in for the same file. A naïve implementation ends up making parallel requests to the backup for the same file, needlessly hogging resources at the precise moment we wanted to complete a request as quickly as possible.

In Version 2, a mutex solved this problem, but created another one: massive unnecessary queuing. It turns out that popular files by definition generate a lot of requests, and serializing those requests even for a few seconds while a file is restored could very quickly lead to resource exhaustion. Compounding the problem was that the server was a Rails server, and a process-based Rails server at that.

Some background: as a very efficient file server, this Rails app only takes a few milliseconds to do some bookkeeping (checking permissions, updating analytics, and doing a quick spam/abuse check) and then hands the request off to nginx to actually transmit the file. Each Rails worker was a separate process, which is very RAM-inefficient [1]. We had plenty of RAM to handle those requests in Rails, and plenty more to run nginx, with no queuing. So using a RAM-inefficient process model for Rails turned out not to matter in the usual case. But it mattered a lot if requests could occasionally spike to 3-5 seconds in Rails, causing a 100 RPS server to dip to 0 RPS for a few seconds at a time because all the available processes were busy waiting on a lock.
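
The usual Rails-plus-nginx pattern for that kind of handoff is X-Accel-Redirect via `send_file`; a rough sketch, which may not match the original code exactly (controller, model, and helper names here are made up):

```ruby
# Sketch of the usual Rails -> nginx handoff (X-Accel-Redirect). Names like
# FileRecord, authorized?, and record_analytics are hypothetical.
#
# config/environments/production.rb:
#   config.action_dispatch.x_sendfile_header = "X-Accel-Redirect"

class DownloadsController < ApplicationController
  def show
    file_record = FileRecord.find(params[:id])

    return head(:forbidden) unless authorized?(file_record)
    record_analytics(file_record)   # bookkeeping: analytics, spam/abuse checks

    # With the x_sendfile_header config above, send_file only emits a header;
    # nginx (via an `internal` location) streams the bytes, freeing the worker.
    send_file file_record.local_path
  end
end
```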

The solution I implemented in Version 3 was to make the entire architecture non-blocking. If a Ruby process failed to acquire the lock, it meant a different process was in the middle of restoring the file, so it would simply fail the request (I forget what status code, maybe 503?). For the end user, retrying the download a few seconds later would succeed (our users were very forgiving). Lesson learned: there's a reason Rails defaults to a shared-nothing architecture in its web request-response model. In practice, failed requests were exceptionally rare, and this was much better than the alternative of random latency and availability spikes. This design scaled really well, and is an example of an 80/20 solution that beat out more complicated architectures.
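
A minimal sketch of the non-blocking version, using an advisory file lock (`flock` with `LOCK_NB`) as the mutex; the real implementation may well have used a different locking primitive:

```ruby
# Hypothetical sketch of Version 3: only one process restores a given file;
# everyone else fails fast instead of queuing behind a lock.
def serve(file_record)
  path = file_record.local_path
  return send_file(path) if File.exist?(path)

  File.open("#{path}.lock", File::RDWR | File::CREAT) do |lock|
    # LOCK_NB returns false immediately if another process holds the lock
    # (i.e. is already mid-restore), instead of blocking this worker.
    unless lock.flock(File::LOCK_EX | File::LOCK_NB)
      return head :service_unavailable   # the client just retries shortly
    end

    warm_store.download(file_record.remote_key, to: path)
  end

  send_file path
end
```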

There were other complications. A file is 'paged in' by writing it directly to its final destination [2]. This leads to potential integrity issues - what happens if the Ruby process is killed during a restart before it has finished writing a file? What if there's a network hiccup? It was possible for a partially downloaded file to be served after a restart. This was solved by checking the file size and treating it like a cache miss if the size on disk didn't match the size in the database.
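
The check itself is about as simple as it sounds; a sketch, assuming the expected size lives on the database record:

```ruby
# Hypothetical sketch: a partially written file fails the size check and is
# treated as a cache miss rather than being served.
def local_copy_intact?(file_record)
  path = file_record.local_path
  File.exist?(path) && File.size(path) == file_record.byte_size   # byte_size from the DB
end
```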

There was also the (extremely) rare file corruption to contend with, so file hashes were occasionally compared to known-good values in the database. Hashing the file before serving it every time was expensive and unnecessary, so instead the check was run with a fixed probability on each request. A hash mismatch was treated as a cache miss, and the file was re-retrieved from backup. This was admittedly a band-aid, with a number of limitations: it checks the integrity of popular files too frequently, and rarely accessed files not frequently enough. If a file were also corrupted in the backup, this would lead to an irreparable situation and a never-ending cache-miss loop. That risk was judged to be low enough to ignore, and luckily, I don't know of any instances where it actually happened.
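
A sketch of the probabilistic check (the probability value and the choice of SHA-256 are assumptions for illustration; the known-good hash is whatever the database stores):

```ruby
require "digest"

# Hypothetical sketch of the probabilistic integrity check.
HASH_CHECK_PROBABILITY = 0.01   # made-up value

def spot_check!(file_record)
  return unless rand < HASH_CHECK_PROBABILITY

  actual = Digest::SHA256.file(file_record.local_path).hexdigest
  if actual != file_record.content_hash    # known-good value from the database
    File.delete(file_record.local_path)    # force a cache miss on the next request
  end
end
```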

This tiered data storage solution worked so well that it ended up being used for load balancing as well. No longer was a file pinned to a specific server - it could be requested from any server, and if that server didn't have it locally, it simply treated the situation like a cache miss. This greatly simplified the network and application architecture: load could be round-robined, and all the servers could be treated as interchangeable. Spreading load more evenly also meant less unused capacity, further lowering both server capex and opex.

I'm pretty sure we ended up achieving the 30X cost reduction.

[1] Puma, a threaded Ruby server, wasn't available when this app was first built.

[2] Yes, I know the correct answer is usually to write the file to a tmp location and then rename it when it's finished, but I no longer remember the precise reason we didn't do this. It might have been because tmp was on a different filesystem, which meant 'mv' required a physical copy rather than a simple rename.