Databases as State

Luddy Harrison, July 26, 2023

This is part 3 in our Introduction series

A Brief History of Databases in the Cloud

Databases have evolved alongside cloud computing, and the developments on one side of this divide have informed and shaped developments on the other side. Some of these developments, like the rise of non-SQL databases, are likely already familiar to us. Others, however, are less obvious, but are at the same time an inevitable consequence of the technical direction that cloud computing has taken.

Before getting to that, let's review the most visible change in database technology that has happened in the past couple of decades.

In the beginning was the relational database. The relational database stored neatly organized data in a normalized form that economized physical storage and supported flexible queries. The fundamental operations that gave these databases their power were the join and the pivot. Without going into how exactly these work (there is a rich and mature literature on this subject), we can remark that both of them are powerful and useful operations, but they weren't designed to scale up to a very large number of concurrent database queries.

Relational databases were enormously successful and are still used heavily. But as serverless computation became more popular, pressure mounted on these databases and their performance. Serverless computing is an extreme form of distributed computing. Would-be clients of a database are not running on the server that operates the database. Building an efficient client-server architecture around a database must begin by disentangling individual accesses from one another so far as possible, and minimizing the number of disk accesses required to support each query. The disk accesses required for joins and pivots clash with this requirement. In contrast, a simple key-value store fits the bill perfectly. We already know a lot about implementing associative memory efficiently, and key-value stores are just a form of associative memory. Various alternatives to relational databases sprang up, beginning with variations on the key-value store, optionally with extensions for compound keys. Many other kinds of NoSQL database designs have appeared, however. This is a good overview.

From this telling, it looks as though modern database design was driven mostly by performance and scaling pressures on relational databases, created by cloud computing and serverless computing in particular. The evolution of database design changed how databases are organized physically, but didn't fundamentally change what they are used for. There is another dimension to the story, however, and understanding it helps a lot when thinking about the architecture of cloud back ends.

Serverless Computing and the Problem of State

As we remarked in the first article in this series, the various kinds of compute available in the cloud are basically distinguished by how long they run before they vanish. Serverless functions are one endpoint on this spectrum: in the limit, they vanish as soon as they respond to a single request, and any state they have built up vanishes with them.

Let's take a simple example. Suppose that you want to build a cloud back end that manages stock trades. Buyers and sellers send requests to the back end, with offers to buy or sell shares of a certain stock at a certain price. The back end matches these buyers and sellers, consummating transactions.

If we do this in a server, a simple design keeps a table in memory that maps stock symbols to lists of pending buy and sell orders for that stock symbol. When a new buy or sell request comes along, we enter it into the table and process a trade if a suitable matching sell or buy order is already in the table. Ignoring scale, reliability, latency, etc., this is fundamentally an easy thing to do. And the reason it is easy is that the memory of the server makes it easy to maintain a table of pending buy and sell requests.
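The server design just described can be sketched in a few lines of Python. This is a minimal illustration rather than a realistic matching engine: the `Order` type, the `pending` table, and the `handle_order` function are names invented for this example, and the sketch assumes that orders only match when their share counts are equal and the buy price is at least the sell price.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Order:
    side: str      # "buy" or "sell"
    symbol: str
    price: float
    shares: int


# The server's in-memory state: stock symbol -> list of pending orders.
pending = defaultdict(list)


def handle_order(order):
    """Match the new order against a pending one, or record it as pending.

    Returns a (buy, sell) pair if a trade was made, otherwise None.
    Simplifying assumption: only whole orders of equal size are matched.
    """
    book = pending[order.symbol]
    for resting in book:
        if resting.side == order.side or resting.shares != order.shares:
            continue
        buy, sell = (order, resting) if order.side == "buy" else (resting, order)
        if buy.price >= sell.price:
            book.remove(resting)  # the resting order is consumed by the trade
            return buy, sell
    book.append(order)  # no match: the order waits in server memory
    return None
```

Calling `handle_order(Order("sell", "ACME", 10.0, 100))` leaves the order pending, and a later `handle_order(Order("buy", "ACME", 10.5, 100))` finds it and returns the matched pair. The second call only works because `pending` is still in memory from the first one.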

Here's a picture of what this might look like:

This table is what we might call the state of the server. State is a somewhat slippery notion. It includes both in-memory structures like the table in this example and structures on permanent, non-volatile storage like files and databases. It includes items that are physically near to the CPU we're currently running on, and data that is in a remote location. But this example focuses on in-memory state because it's the tough nut that must be cracked if we want to move back and forth between servers and serverless execution for an application.

To see this, let's consider what happens if we try to create a serverless back end for this application. Let's just copy the server design blindly and see what happens. In comes a buy or sell request and we go to enter it into the table, but since this is a brand new instance of our serverless function, the table is initially empty. We enter the single order we have received into it, and then return a response. The table disappears, and the next request that comes along runs in another instance of the serverless function, which likewise starts with an empty table. This will clearly not work.
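The failure is easy to reproduce in a small simulation. The sketch below assumes the worst case described above, where every request lands on a fresh instance; `serverless_handler` is a hypothetical name and orders are plain dictionaries for brevity.

```python
from collections import defaultdict


def serverless_handler(order):
    """One invocation of a simulated, cold-started serverless function."""
    # The table is created fresh on every invocation, so it is always empty.
    pending = defaultdict(list)
    symbol = order["symbol"]

    # Look for orders on the opposite side; there are never any to find.
    matches = [o for o in pending[symbol] if o["side"] != order["side"]]
    seen_before = len(pending[symbol])
    pending[symbol].append(order)

    # When the function returns, `pending` is discarded with the instance.
    return {"matched": len(matches), "pending_seen": seen_before}
```

Sending a sell order and then a matching buy order through this handler reports zero matches and zero previously pending orders both times, because neither invocation can see what the other one stored.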

We can make the following generalization which, even though it sounds quite sweeping, is strikingly accurate:

The handling of state is the most important difference in expressive power between servers and serverless computing models.

Servers can build up state in memory, whereas serverless functions cannot¹.

Databases as Serverless State

As noted above, databases evolved from relational to key-value stores in response mostly to performance and scaling pressures created by cloud computing and serverless computing in particular. This evolution changed how databases are organized but not their basic purpose.

But it turns out that we have a new problem for which the modern key-value database might be a good solution! We would like to create serverless back ends, perhaps like the stock trading one above, that require persistent state. The in-memory solution we used when implementing this back end using a server is a key-value store (a table in which stock symbols are the key and pending trades are the values).

Let's think through the serverless design of this back end again, with this new tool in mind.

A serverless function is run in response to a buy or sell request. The table of pending trades is now held in a key-value database. In a single transaction (ideally), we add the new trade to this table. Either in the same serverless function, or a downstream process of some kind, we process the pending trades and resolve them by matching buyers and sellers as we did in the server, but using the database as the repository of pending orders. The result might look something like this:

Database As State With Serverless Functions

In this figure we've shown each order as a record (row) in the database, so that retrieving the orders for a single symbol means retrieving multiple rows. This is a common arrangement that some modern database designs support especially well.
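The arrangement of one row per order, keyed by symbol plus an order identifier, can be sketched as follows. Several things here are assumptions made for illustration: SQLite from the Python standard library stands in for a cloud key-value database (it is a relational engine, used only because it runs anywhere), the file name `orders.db`, the column names, and the `handle_order` function are invented, and matching is again restricted to whole orders of equal size. Each call opens its own connection, as a fresh serverless instance would, so the only state that survives between calls is what is in the database.

```python
import sqlite3
import uuid

DB_PATH = "orders.db"  # hypothetical location of the shared database


def connect(path=DB_PATH):
    """Open the database and make sure the orders table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS orders (
               symbol   TEXT NOT NULL,
               order_id TEXT NOT NULL,
               side     TEXT NOT NULL,
               price    REAL NOT NULL,
               shares   INTEGER NOT NULL,
               PRIMARY KEY (symbol, order_id)  -- compound key: one row per order
           )"""
    )
    return conn


def handle_order(side, symbol, price, shares, path=DB_PATH):
    """One simulated serverless invocation; all state lives in the database."""
    if side == "buy":
        query = ("SELECT order_id, price FROM orders "
                 "WHERE symbol = ? AND side = 'sell' AND shares = ? AND price <= ? "
                 "ORDER BY price ASC LIMIT 1")
    else:
        query = ("SELECT order_id, price FROM orders "
                 "WHERE symbol = ? AND side = 'buy' AND shares = ? AND price >= ? "
                 "ORDER BY price DESC LIMIT 1")
    conn = connect(path)
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
            row = conn.execute(query, (symbol, shares, price)).fetchone()
            if row is not None:
                # A suitable pending order exists: consume it.
                conn.execute(
                    "DELETE FROM orders WHERE symbol = ? AND order_id = ?",
                    (symbol, row[0]),
                )
                return {"matched": row[0], "price": row[1]}
            # No match: store the new order as pending state.
            order_id = uuid.uuid4().hex
            conn.execute(
                "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
                (symbol, order_id, side, price, shares),
            )
            return {"matched": None, "order_id": order_id}
    finally:
        conn.close()
```

Unlike the in-memory version, a sell order stored by one call is visible to a later call that handles the matching buy order, even though the two calls share no memory. The lookup and the update happen inside one transaction so that two concurrent invocations cannot both consume the same pending order.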

A database that is used to represent state and to coordinate updates to that state in a serverless setting will typically be updated (written) very frequently, perhaps as frequently as it is queried (read).

There is of course a very real question of performance and cost here. It might be (and likely is) that for various reasons, this is not a competitive design compared to a server-based implementation. In any case, too many details that are required to have a realistic trading implementation have been omitted here. But the important point, one that transfers to many problems and domains in cloud computing, is that it is possible to create persistent state that is updated frequently using a high-performance key-value database. If not in precisely the example case we have chosen, nevertheless for many cloud back ends, this shift of state from the memory of a server into a modern database is in fact a good tradeoff. And sometimes this is for reasons that go beyond the basic tradeoffs of servers versus serverless. The database representation of the state of pending orders is for example more robust against failures than the RAM-based table in a server. It also provides persistent storage and a natural communication point to other parts of the back end, for the orders that are processed. What begins as a way to increase the power and capability of serverless functions can in fact give benefits of other kinds in compensation.

¹ This is overstated a bit. Warm-start of serverless functions does in fact allow state to persist in memory across invocations of serverless functions, and this state can be very useful. It is possible for example to do limited forms of caching in a serverless function, and it can be quite effective if instances of the function are warm started over a long enough period before being dropped and forced to cold-start. Similarly, invariants that will be used by all the instances of a serverless function can be computed once at warm-start and re-used. Nevertheless, the capricious nature of warm- versus cold-start, and the fact that separate instances of a serverless function don't share the same memory state, limit the uses we can make of this idea.