Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed
NoSQL Database Service
Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog
Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath
Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, Akshat Vig
Amazon Web Services
Abstract
Amazon DynamoDB is a NoSQL cloud database service that
provides consistent performance at any scale. Hundreds of
thousands of customers rely on DynamoDB for its fundamen-
tal properties: consistent performance, availability, durability,
and a fully managed serverless experience. In 2021, during the
66-hour Amazon Prime Day shopping event, Amazon systems,
including Alexa, the Amazon.com sites, and Amazon fulfill-
ment centers, made trillions of API calls to DynamoDB, peak-
ing at 89.2 million requests per second, while experiencing
high availability with single-digit millisecond performance.
Since the launch of DynamoDB in 2012, its design and im-
plementation have evolved in response to our experiences
operating it. The system has successfully dealt with issues
related to fairness, traffic imbalance across partitions, moni-
toring, and automated system operations without impacting
availability or performance. Reliability is essential, as even
the slightest disruption can significantly impact customers.
This paper presents our experience operating DynamoDB at
a massive scale and how the architecture continues to evolve
to meet the ever-increasing demands of customer workloads.
1 Introduction
Amazon DynamoDB is a NoSQL cloud database service that
supports fast and predictable performance at any scale. Dy-
namoDB is a foundational AWS service that serves hundreds
of thousands of customers using a massive number of servers
located in data centers around the world. DynamoDB powers
multiple high-traffic Amazon properties and systems includ-
ing Alexa, the Amazon.com sites, and all Amazon fulfillment
centers. Moreover, many AWS services such as AWS Lambda,
AWS Lake Formation, and Amazon SageMaker are built on
DynamoDB, as well as hundreds of thousands of customer
applications.
These applications and services have demanding opera-
tional requirements with respect to performance, reliability,
durability, efficiency, and scale. The users of DynamoDB rely
on its ability to serve requests with consistent low latency. For
DynamoDB customers, consistent performance at any scale
is often more important than median request service times be-
cause unexpectedly high latency requests can amplify through
higher layers of applications that depend on DynamoDB and
lead to a bad customer experience. The goal of the design of
DynamoDB is to complete all requests with low single-digit
millisecond latencies. In addition, the large and diverse set
of customers who use DynamoDB rely on an ever-expanding
feature set as shown in Figure 1. As DynamoDB has evolved
over the last ten years, a key challenge has been adding fea-
tures without impacting operational requirements. To benefit
customers and application developers, DynamoDB uniquely
integrates the following six fundamental system properties:
DynamoDB is a fully managed cloud service. Using the
DynamoDB API, applications create tables and read and write
data without regard for where those tables are stored or how
they’re managed. DynamoDB frees developers from the bur-
den of patching software, managing hardware, configuring a
distributed database cluster, and managing ongoing cluster
operations. DynamoDB handles resource provisioning, au-
tomatically recovers from failures, encrypts data, manages
software upgrades, performs backups, and accomplishes other
tasks required of a fully-managed service.
DynamoDB employs a multi-tenant architecture. Dy-
namoDB stores data from different customers on the same
physical machines to ensure high utilization of resources, en-
abling us to pass the cost savings to our customers. Resource
reservations, tight provisioning, and monitored usage provide
isolation between the workloads of co-resident tables.
DynamoDB achieves boundless scale for tables. There
are no predefined limits for the amount of data each table
can store. Tables grow elastically to meet the demand of the
customers’ applications. DynamoDB is designed to scale the
resources dedicated to a table from several servers to many
thousands as needed. DynamoDB spreads an application’s
data across more servers as the amount of data storage and
the demand for throughput requirements grow.
Figure 1: DynamoDB timeline

DynamoDB provides predictable performance. The simple
DynamoDB API with GetItem and PutItem operations allows
it to respond to requests with consistent low latency. An ap-
plication running in the same AWS Region as its data will
typically see average service-side latencies in the low single-
digit millisecond range for a 1 KB item. Most importantly,
DynamoDB latencies are predictable. Even as tables grow
from a few megabytes to hundreds of terabytes, latencies re-
main stable due to the distributed nature of data placement
and request routing algorithms in DynamoDB. DynamoDB
handles any level of traffic through horizontal scaling and
automatically partitions and re-partitions data to meet an ap-
plication’s I/O performance requirements.
DynamoDB is highly available. DynamoDB replicates data
across multiple data centers—called Availability Zones in
AWS—and automatically re-replicates in the case of disk
or node failures to meet stringent availability and durability
requirements. Customers can also create global tables that are
geo-replicated across selected Regions for disaster recovery
and provide low latency access from anywhere. DynamoDB
offers an availability SLA of 99.99 percent for regular tables and
99.999 percent for global tables (where DynamoDB replicates
tables across multiple AWS Regions).
DynamoDB supports flexible use cases. DynamoDB
doesn’t force developers into a particular data model or con-
sistency model. DynamoDB tables don’t have a fixed schema
but instead allow each data item to contain any number of at-
tributes with varying types, including multi-valued attributes.
Tables use a key-value or document data model. Developers
can request strong or eventual consistency when reading items
from a table.
In this paper, we describe how DynamoDB evolved as a
distributed database service to meet the needs of its customers
without losing its key aspect of providing a single-tenant ex-
perience to every customer using a multi-tenant architecture.
The paper explains the challenges faced by the system and
how the service evolved to handle those challenges while con-
necting the required changes to a common theme of durability,
availability, scalability, and predictable performance.
The paper captures the following lessons that we have learnt over the years:
- Adapting to customers' traffic patterns to reshape the physical partitioning scheme of the database tables improves customer experience.
- Performing continuous verification of data-at-rest is a reliable way to protect against both hardware failures and software bugs in order to meet high durability goals.
- Maintaining high availability as a system evolves requires careful operational discipline and tooling. Mechanisms such as formal proofs of complex algorithms, game days (chaos and load tests), upgrade/downgrade tests, and deployment safety provide the freedom to safely adjust and experiment with the code without the fear of compromising correctness.
- Designing systems for predictability over absolute efficiency improves system stability. While components such as caches can improve performance, do not allow them to hide the work that would be performed in their absence, ensuring that the system is always provisioned to handle the unexpected.
The structure of this paper is as follows: Section 2 expands
on the history of DynamoDB and explains its origins, which
derive from the original Dynamo system. Section 3 presents
the architectural overview of DynamoDB. Section 4 covers
the journey of DynamoDB from provisioned to on-demand
tables. Section 5 covers how DynamoDB ensures strong dura-
bility. Section 6 describes the availability challenges faced
and how these challenges were handled. Section 7 provides
some experimental results based on the Yahoo! Cloud Serving
Benchmark (YCSB), and Section 8 concludes
the paper.
2 History
The design of DynamoDB was motivated by our experiences
with its predecessor Dynamo [9], which was the first NoSQL
database system developed at Amazon. Dynamo was created
in response to the need for a highly scalable, available, and
durable key-value database for shopping cart data. In the early
years, Amazon learned that providing applications with direct
access to traditional enterprise database instances led to scal-
ing bottlenecks such as connection management, interference
between concurrent workloads, and operational problems with
tasks such as schema upgrades. Thus, a service-oriented ar-
chitecture was adopted to encapsulate an application’s data
behind service-level APIs that allowed sufficient decoupling
to address tasks like reconfiguration without having to disrupt
clients.
High availability is a critical property of a database ser-
vice as any downtime can impact customers that depend on
the data. Another critical requirement for Dynamo was pre-
dictable performance so that applications could provide a
consistent experience to their users. To achieve these goals,
Amazon had to start from first principles when designing
Dynamo. The adoption of Dynamo widened to serve several
use cases within Amazon because it was the only database
service that provided high reliability at scale. However, Dy-
namo still carried the operational complexity of self-managed
large database systems. Dynamo was a single-tenant system
and teams were responsible for managing their own Dynamo
installations. Teams had to become experts on various parts of
the database service and the resulting operational complexity
became a barrier to adoption.
During this period, Amazon launched new services (no-
tably Amazon S3 and Amazon SimpleDB) that focused on a
managed and elastic experience in order to remove this op-
erational burden. Amazon engineers preferred to use these
services instead of managing their own systems like Dynamo,
even though the functionality of Dynamo was often better
aligned with their applications’ needs. Managed elastic ser-
vices freed developers from administrating databases and
allowed them to focus on their applications.
The first database-as-a-service from Amazon was Sim-
pleDB [1], a fully managed elastic NoSQL database service.
SimpleDB provided multi-datacenter replication, high avail-
ability, and high durability without the need for customers to
set up, configure, or patch their database. Like Dynamo, Sim-
pleDB also had a very simple table interface with a restricted
query set that served as a building block for many developers.
While SimpleDB was successful and powered many applica-
tions, it had some limitations. One limitation was that tables
had a small capacity in terms of storage (10GB) and of request
throughput. Another limitation was the unpredictable query
and write latencies, which stemmed from the fact that all table
attributes were indexed, and the index needed to be updated
with every write. These limitations created a new kind of
operational burden for developers. They had to divide data
between multiple tables to meet their application’s storage
and throughput requirements.
We realized that the goal of removing the limitations of Sim-
pleDB and providing a scalable NoSQL database service with
predictable performance could not be met with the SimpleDB
APIs. We concluded that a better solution would combine the
best parts of the original Dynamo design (incremental scala-
bility and predictable high performance) with the best parts of
SimpleDB (ease of administration of a cloud service, consis-
tency, and a table-based data model that is richer than a pure
key-value store). These architectural discussions culminated
in Amazon DynamoDB, a public service launched in 2012
that shared most of the name of the previous Dynamo system
but little of its architecture. Amazon DynamoDB was the
result of everything we'd learned from building large-scale,
non-relational databases for Amazon.com and has evolved
based on our experiences building highly scalable and reliable
cloud computing services at AWS.

Operation    Description
PutItem      Inserts a new item, or replaces an old item with a new item.
UpdateItem   Updates an existing item, or adds a new item to the table if it doesn't already exist.
DeleteItem   Deletes a single item from the table by the primary key.
GetItem      Returns a set of attributes for the item with the given primary key.

Table 1: DynamoDB CRUD APIs for items
3 Architecture
A DynamoDB table is a collection of items, and each item is
a collection of attributes. Each item is uniquely identified by
a primary key. The schema of the primary key is specified at
the table creation time. The primary key schema contains a
partition key or a partition and sort key (a composite primary
key). The partition key’s value is always used as an input to
an internal hash function. The output from the hash function
and the sort key value (if present) determine where the item
will be stored. Multiple items can have the same partition key
value in a table with a composite primary key. However, those
items must have different sort key values.
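As a concrete illustration of this key scheme, the following minimal Python sketch hashes a partition key to pick a partition and keeps the sort key to order items within it; the hash function and partition count are illustrative assumptions, not DynamoDB internals.

```python
# Minimal sketch of composite-primary-key placement as described above:
# the partition key is hashed to choose a partition, and the sort key
# orders items within that partition. The hash and partition count are
# illustrative assumptions only.
import hashlib
from typing import Optional

NUM_PARTITIONS = 8  # hypothetical partition count for illustration

def partition_for(partition_key: str) -> int:
    # Hash the partition key; the hash output selects the partition.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

def storage_location(partition_key: str, sort_key: Optional[str] = None):
    # Items sharing a partition key land on the same partition and are
    # kept there in sort-key order; their sort keys must differ.
    return (partition_for(partition_key), sort_key)

print(storage_location("customer#42", "order#2021-07-11"))
```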
DynamoDB also supports secondary indexes to provide
enhanced querying capability. A table can have one or more
secondary indexes. A secondary index allows querying the
data in the table using an alternate key, in addition to queries
against the primary key. DynamoDB provides a simple in-
terface to store or retrieve items from a table or an index.
Table 1 contains the primary operations available to clients
for reading and writing items in DynamoDB tables. Any oper-
ation that inserts, updates, or deletes an item can be specified
with a condition that must be satisfied in order for the op-
eration to succeed. DynamoDB supports ACID transactions,
enabling applications to update multiple items with atomicity,
consistency, isolation, and durability guarantees across
items without compromising the scalability, availability, and
performance characteristics of DynamoDB tables.
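As an illustration of the item APIs in Table 1 and conditional writes, the following is a small sketch using the AWS SDK for Python (boto3); it assumes AWS credentials, an existing table named "Orders", and a partition key attribute "OrderId", all of which are hypothetical.

```python
# A minimal sketch (not the paper's code) of a conditional PutItem with
# boto3. Table and attribute names are hypothetical placeholders.
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Orders")  # hypothetical table with partition key "OrderId"

try:
    table.put_item(
        Item={"OrderId": "1234", "Status": "PLACED"},
        # The write succeeds only if no item with this key already exists.
        ConditionExpression="attribute_not_exists(OrderId)",
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("Item already exists; conditional put rejected")
    else:
        raise
```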
A DynamoDB table is divided into multiple partitions to
handle the throughput and storage requirements of the table.
Each partition of the table hosts a disjoint and contiguous part
of the table’s key-range. Each partition has multiple replicas
distributed across different Availability Zones for high availability
and durability.

Figure 2: Storage replica on a storage node

Figure 3: Log replica on a log node

The replicas for a partition form a repli-
cation group. The replication group uses Multi-Paxos [14]
for leader election and consensus. Any replica can trigger a
round of the election. Once elected leader, a replica can main-
tain leadership as long as it periodically renews its leadership
lease.
Only the leader replica can serve write and strongly consis-
tent read requests. Upon receiving a write request, the leader
of the replication group for the key being written generates
a write-ahead log record and sends it to its peers (replicas). A
write is acknowledged to the application once a quorum of
peers persists the log record to their local write-ahead logs.
DynamoDB supports strongly and eventually consistent reads.
Any replica of the replication group can serve eventually con-
sistent reads. The leader of the group extends its leadership
using a lease mechanism. If the leader of the group is deemed
failed (unhealthy or unavailable) by any of its peers, that
peer can propose a new round of the election to elect itself
as the new leader. The new leader won't serve
any writes or consistent reads until the previous leader’s lease
expires.
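A minimal, illustrative sketch of the lease rule described above (not DynamoDB's implementation) might look as follows; the lease duration and method names are assumptions.

```python
# Illustrative sketch of the leadership-lease rule: only a replica holding
# an unexpired lease serves writes and strongly consistent reads, and a
# newly elected leader waits for the previous leader's lease to expire.
import time

LEASE_SECONDS = 3.0  # hypothetical lease duration

class Replica:
    def __init__(self, name):
        self.name = name
        self.is_leader = False
        self.lease_expiry = 0.0              # our own lease, if leader
        self.prev_leader_lease_expiry = 0.0  # must wait this out after election

    def renew_lease(self):
        """Called periodically by the current leader."""
        self.lease_expiry = time.monotonic() + LEASE_SECONDS

    def become_leader(self, prev_lease_expiry):
        """Won an election; must not serve until the old lease expires."""
        self.is_leader = True
        self.prev_leader_lease_expiry = prev_lease_expiry
        self.renew_lease()

    def can_serve_write_or_consistent_read(self):
        now = time.monotonic()
        return (self.is_leader
                and now < self.lease_expiry
                and now >= self.prev_leader_lease_expiry)
```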
A replication group consists of storage replicas that contain
both the write-ahead logs and the B-tree that stores the key-
value data as shown in Figure 2. To improve availability and
durability, a replication group can also contain replicas that
only persist recent write-ahead log entries as shown in Figure
3. These replicas are called log replicas. Log replicas are
akin to acceptors in Paxos. Log replicas do not store key-
value data. Sections 5 and 6 discuss how log replicas
help DynamoDB improve its availability and durability.
DynamoDB consists of tens of microservices. Some of the
core services in DynamoDB are the metadata service, the
request routing service, the storage nodes, and the autoadmin
service, as shown in Figure 4.

Figure 4: DynamoDB architecture

The metadata service stores
routing information about the tables, indexes, and replication
groups for keys for a given table or index. The request rout-
ing service is responsible for authorizing, authenticating, and
routing each request to the appropriate server. For example,
all read and update requests are routed to the storage nodes
hosting the customer data. The request routers look up the
routing information from the metadata service. All resource
creation, update, and data definition requests are routed to
the autoadmin service. The storage service is responsible for
storing customer data on a fleet of storage nodes. Each of the
storage nodes hosts many replicas of different partitions.
The autoadmin service is built to be the central nervous sys-
tem of DynamoDB. It is responsible for fleet health, partition
health, scaling of tables, and execution of all control plane
requests. The service continuously monitors the health of
all the partitions and replaces any replicas deemed unhealthy
(slow or not responsive or being hosted on bad hardware). The
service also performs health checks of all core components
of DynamoDB and replaces any hardware that is failing or
has failed. For example, if the autoadmin service detects a
storage node to be unhealthy, it kicks off a recovery process
that replaces the replicas hosted on that node to bring the
system back to a stable state.
Other DynamoDB services not shown in Figure 4 sup-
port features such as point-in-time restore, on-demand back-
ups, update streams, global admission control, global tables,
global secondary indices, and transactions.
4 Journey from provisioned to on-demand
When DynamoDB launched, we introduced an internal ab-
straction, partitions, as a way to dynamically scale both the ca-
pacity and performance of tables. In the original DynamoDB
release, customers explicitly specified the throughput that a
table required in terms of read capacity units (RCUs) and
write capacity units (WCUs). For items up to 4 KB in size,
one RCU can perform one strongly consistent read request
per second. For items up to 1 KB in size, one WCU can
perform one standard write request per second. RCUs and
WCUs collectively are called provisioned throughput. The
original system split a table into partitions that allow its con-
tents to be spread across multiple storage nodes and mapped
to both the available space and performance on those nodes.
As the demands from a table changed (because it grew in
size or because the load increased), partitions could be further
split and migrated to allow the table to scale elastically. The
partition abstraction proved to be really valuable and continues
to be central to the design of DynamoDB. However, this early
version tightly coupled the assignment of both capacity and
performance to individual partitions, which led to challenges.
DynamoDB uses admission control to ensure that storage
nodes don’t become overloaded, to avoid interference between
co-resident table partitions, and to enforce the throughput lim-
its requested by customers. Admission control in DynamoDB
has evolved over the past decade. Initially, admission control was
the shared responsibility of all the storage nodes for a table. Stor-
age nodes independently performed admission control based
on the allocations of their locally stored partitions. Given
that a storage node hosts partitions from multiple tables, the
allocated throughput of each partition was used to isolate
the workloads. DynamoDB enforced a cap on the maximum
throughput that could be allocated to a single partition, and
ensured that the total throughput of all the partitions hosted
by a storage node was less than or equal to the maximum al-
lowed throughput on the node as determined by the physical
characteristics of its storage drives.
The throughput allocated to partitions was adjusted when
the overall table’s throughput was changed or its partitions
were split into child partitions. When a partition was split
for size, the allocated throughput of the parent partition was
equally divided among the child partitions. When a partition
was split for throughput, the new partitions were allocated
throughput based on the table’s provisioned throughput. For
example, assume that a partition can accommodate a maxi-
mum provisioned throughput of 1000 WCUs. When a table
is created with 3200 WCUs, DynamoDB created four parti-
tions that each would be allocated 800 WCUs. If the table’s
provisioned throughput was increased to 3600 WCUs, then
each partition’s capacity would increase to 900 WCUs. If the
table’s provisioned throughput was increased to 6000 WCUs,
then the partitions would be split to create eight child parti-
tions, and each partition would be allocated 750 WCUs. If
the table’s capacity was decreased to 5000 WCUs, then each
partition’s capacity would be decreased to 675 WCUs.
The uniform distribution of throughput across partitions
is based on the assumptions that an application uniformly
accesses keys in a table and that splitting a partition for size
equally splits the performance. However, we discovered that
application workloads frequently have non-uniform access
patterns both over time and over key ranges. When the request
rate within a table is non-uniform, splitting a partition and
dividing performance allocation proportionately can result
in the hot portion of the partition having less available per-
formance than it did before the split. Since throughput was
allocated statically and enforced at a partition level, these non-
uniform workloads occasionally resulted in an application’s
reads and writes being rejected, called throttling, even though
the total provisioned throughput of the table was sufficient to
meet its needs.
The two most common challenges faced by applications
were hot partitions and throughput dilution. Hot partitions
arose in applications that had traffic going consistently to-
wards a few items of their tables. The hot items could belong
to a stable set of partitions or could hop around to different
partitions over time. Throughput dilution was common for
tables where partitions were split for size. Splitting a partition
for size would cause the throughput of the partition to be
divided equally among the newly created child partitions, and
hence the per partition throughput would decrease.
In both cases, from the customer’s perspective, throttling
caused their application to experience periods of unavailabil-
ity even though the service was behaving as expected. Cus-
tomers who experienced throttling would work around it by
increasing a table’s provisioned throughput and not use all
the capacity. That is, tables would be over-provisioned. While
this allowed them to achieve the performance they needed, it
was a poor experience because it was difficult to estimate the
right level of performance provisioning for their tables.
4.1 Initial improvements to admission control
As we mentioned at the start of this section, hot partitions and
throughput dilution stemmed from tightly coupling a rigid
performance allocation to each partition, and dividing that
allocation as partitions split. We liked that enforcing alloca-
tions at an individual partition level avoided the need for the
complexities of distributed admission control, but it became
clear these controls weren’t sufficient. Shortly after launch-
ing, DynamoDB introduced two improvements, bursting and
adaptive capacity, to address these concerns.
4.1.1 Bursting
The key observation that partitions had non-uniform access
also led us to observe that not all partitions hosted by a storage
node used their allocated throughput simultaneously. Hence,
to absorb temporal spikes in workloads at a partition level, Dy-
namoDB introduced the concept of bursting. The idea behind
bursting was to let applications tap into the unused capacity
at a partition level on a best effort basis to absorb short-lived
spikes. DynamoDB retained a portion of a partition’s unused
capacity for later bursts of throughput usage for up to 300
seconds and utilized it when consumed capacity exceeded the
provisioned capacity of the partition. The unused capacity is
called burst capacity.
DynamoDB still maintained workload isolation by ensuring
that a partition could only burst if there was unused throughput
at the node level. The capacity was managed on the storage
node using multiple token buckets: two for each partition
(allocated and burst) and one for the node. These buckets
provided admission control. When a read or write request
arrived on a storage node, if there were tokens in the parti-
tion’s allocated token bucket, then the request was admitted
and tokens were deducted from the partition and node level
bucket. Once a partition had exhausted all the provisioned
tokens, requests were allowed to burst only when tokens were
available both in the burst token bucket and the node level
token bucket. Read requests were accepted based on the local
token buckets. Write requests using burst capacity required
an additional check on the node-level token bucket of other
member replicas of the partition. The leader replica of the
partition periodically collected information about the node-level
capacity of each of its members. In section 4.3 we explain how
we increased a node’s ability to burst.
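The token-bucket scheme described above can be sketched as follows; bucket capacities, refill rates, and the replica-capacity check (omitted here) are simplified assumptions rather than the service's actual parameters.

```python
# Simplified illustration of the admission-control token buckets described
# above: one allocated bucket and one burst bucket per partition, plus one
# node-level bucket. Rates and capacities are hypothetical.
import time

class TokenBucket:
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def has_tokens(self, n=1.0):
        self._refill()
        return self.tokens >= n

    def try_consume(self, n=1.0):
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

def admit(partition_allocated, partition_burst, node):
    # Within the partition's allocation: admit and deduct from the
    # partition bucket and the node-level bucket.
    if partition_allocated.has_tokens() and node.has_tokens():
        partition_allocated.try_consume()
        node.try_consume()
        return True
    # Allocation exhausted: burst only if both the burst bucket and the
    # node-level bucket (unused capacity on the node) have tokens.
    if partition_burst.has_tokens() and node.has_tokens():
        partition_burst.try_consume()
        node.try_consume()
        return True
    return False
```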
4.1.2 Adaptive capacity
DynamoDB launched adaptive capacity to better absorb long-
lived spikes that cannot be absorbed by the burst capacity.
Adaptive capacity allowed DynamoDB to better absorb work-
loads that had heavily skewed access patterns across parti-
tions. Adaptive capacity actively monitored the provisioned
and consumed capacity of all the tables. If a table experienced
throttling and the table level throughput was not exceeded,
then it would automatically increase (boost) the allocated
throughput of the partitions of the table using a proportional
control algorithm. If the table was consuming more than its
provisioned capacity, the capacity of the partitions that had
received the boost would be decreased. The autoadmin system
ensured that partitions receiving a boost were relocated to an
appropriate node that had the capacity to serve the increased
throughput. Like bursting, adaptive capacity was best-effort,
but it eliminated over 99.99% of the throttling due to skewed
access patterns.
4.2 Global admission control
Even though DynamoDB had substantially reduced the
throughput problem for non-uniform access using bursting
and adaptive capacity, both the solutions had limitations.
Bursting was only helpful for short-lived spikes in traffic
and it was dependent on the node having throughput to sup-
port bursting. Adaptive capacity was reactive and kicked in
only after throttling had been observed. This meant that the
application using the table had already experienced a brief
period of unavailability. The salient takeaway from bursting and
adaptive capacity was that we had tightly coupled partition-
level capacity to admission control. Admission control was
distributed and performed at a partition level. We realized
it would be beneficial to remove admission control from
the partition and let partitions always burst, while still
providing workload isolation.
To solve the problem of admission control, DynamoDB re-
placed adaptive capacity with global admission control (GAC).
GAC builds on the same idea of token buckets. The GAC
service centrally tracks the total consumption of the table
capacity in terms of tokens. Each request router maintains a
local token bucket to make admission decisions and commu-
nicates with GAC to replenish tokens at regular intervals (in
the order of few seconds). GAC maintains an ephemeral state
computed on the fly from client requests. Each GAC server
can be stopped and restarted without any impact on the overall
operation of the service. Each GAC server can track one or
more token buckets configured independently. All the GAC
servers are part of an independent hash ring. Request routers
manage several time-limited tokens locally. When a request
from the application arrives, the request router deducts tokens.
Eventually, the request router will run out of tokens because
of consumption or expiry. When the request router runs out of
tokens, it requests more tokens from GAC. The GAC instance
uses the information provided by the client to estimate the
global token consumption and vends the client's share of the
tokens available for the next time unit. Thus, it
ensures that non-uniform workloads that send traffic to only
a subset of items can execute up to the maximum partition
capacity.
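A rough sketch of the request-router/GAC interaction described above follows; the proportional share policy, token amounts, and replenishment interval are assumptions for illustration only.

```python
# Rough sketch of the request-router/GAC interaction described above.
# The share policy, token amounts, and interval are illustrative only.
import time

class GacServer:
    """Tracks estimated global consumption for one table-level token bucket."""
    def __init__(self, table_capacity_per_sec):
        self.capacity = table_capacity_per_sec
        self.reported = {}  # router_id -> last reported consumption rate

    def vend(self, router_id, consumed_since_last, interval):
        # Estimate global consumption from what each router reports.
        self.reported[router_id] = consumed_since_last / max(interval, 1e-9)
        global_rate = sum(self.reported.values())
        # Vend this router a proportional share of the next interval's capacity.
        share = self.reported[router_id] / global_rate if global_rate else 1.0
        return share * self.capacity * interval

class RequestRouter:
    def __init__(self, router_id, gac, interval=5.0):
        self.router_id = router_id
        self.gac = gac
        self.interval = interval
        self.tokens = 0.0      # time-limited local tokens
        self.consumed = 0.0
        self.expiry = time.monotonic()

    def admit(self, cost=1.0):
        now = time.monotonic()
        if self.tokens < cost or now >= self.expiry:
            # Out of tokens (or they expired): replenish from GAC.
            self.tokens = self.gac.vend(self.router_id, self.consumed, self.interval)
            self.consumed = 0.0
            self.expiry = now + self.interval
        if self.tokens >= cost:
            self.tokens -= cost
            self.consumed += cost
            return True
        return False
```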
In addition to the global admission control scheme, the
partition-level token buckets were retained for defense-in-
depth. The capacity of these token buckets is then capped to
ensure that one application doesn’t consume all or a signifi-
cant share of the resources on the storage nodes.
4.3 Balancing consumed capacity
Letting partitions always burst required DynamoDB to man-
age burst capacity effectively. DynamoDB runs on a variety
of hardware instance types. These instance types vary by
throughput and storage capabilities. The latest generation
of storage nodes hosts thousands of partition replicas. The
partitions hosted on a single storage node could be wholly un-
related and belong to different tables. Hosting replicas from
multiple tables on a storage node, where each table could
be from a different customer and have varied traffic patterns,
requires defining an allocation scheme that decides which
replicas can safely co-exist without violating critical prop-
erties such as availability, predictable performance, security,
and elasticity.
Colocation was a straightforward problem for provisioned
throughput tables: partitions were static, which made the
allocation scheme reasonably simple. In the case
of provisioned tables without bursting and adaptive capacity,
allocation involved finding storage nodes that could accom-
modate a partition based on its allocated capacity. Partitions
were never allowed to take more traffic than their allocated
capacity, and hence there were no noisy neighbors. Because
not all partitions on a storage node utilized their total capacity
at any given instant, the system packed storage nodes with a
set of replicas whose combined allocated capacity exceeded the
node's overall provisioned capacity. Bursting in reaction to a
changing workload meant that a storage node might go above
its prescribed capacity, which made the colocation of tenants
a more complex challenge. DynamoDB implemented a system to
proactively balance the partitions allocated across the storage
nodes based on throughput consumption and storage to miti-
gate availability risks caused by tightly packed replicas. Each
storage node independently monitors the overall throughput
and data size of all its hosted replicas. In case the throughput
is beyond a threshold percentage of the maximum capacity
of the node, it reports to the autoadmin service a list of can-
didate partition replicas to move from the current node. The
autoadmin finds a new storage node for the partition in the
same or another Availability Zone that doesn’t have a replica
of this partition.
4.4 Splitting for consumption
Even with GAC and the ability for partitions to burst al-
ways, tables could experience throttling if their traffic was
skewed to a specific set of items. To address this problem,
DynamoDB automatically scales out partitions based on the
throughput consumed. Once the consumed throughput of a
partition crosses a certain threshold, the partition is split for
consumption. The split point in the key range is chosen based
on the key distribution the partition has observed. The observed
key distribution serves as a proxy for the application’s access
pattern and is more effective than splitting the key range in
the middle. Partition splits usually complete in the order of
minutes. There are still classes of workloads that cannot benefit
from splitting for consumption. For example, a partition receiv-
ing high traffic to a single item or a partition where the key
range is accessed sequentially will not benefit from a split. Dy-
namoDB detects such access patterns and avoids splitting the
partition.
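A simplified illustration of choosing a split point from an observed key distribution, rather than the middle of the key range, might look like the following; the sampled access counts and the 50 percent cut are assumptions, not the actual algorithm.

```python
# Simplified illustration of choosing a split point from an observed key
# (access) distribution rather than the middle of the key range. The
# sampled access counts and the 50% cut are illustrative assumptions.
from bisect import bisect_left
from itertools import accumulate

def choose_split_point(observed_keys_sorted, access_counts):
    """Pick the key at which cumulative observed traffic reaches half of
    the total, so each child partition receives roughly equal load."""
    cumulative = list(accumulate(access_counts))
    half = cumulative[-1] / 2
    idx = bisect_left(cumulative, half)
    return observed_keys_sorted[min(idx, len(observed_keys_sorted) - 1)]

# Example: traffic skewed toward the low end of the key range.
keys = ["a", "b", "c", "d", "e", "f"]
counts = [40, 30, 10, 10, 5, 5]
print(choose_split_point(keys, counts))  # "b": splits the hot range, not the middle
```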
4.5 On-demand provisioning
Many applications that migrated to DynamoDB previously
ran on-premises or on self-hosted databases. In either sce-
nario, the application developer had to provision servers. Dy-
namoDB provides a simplified serverless operational model
and a new model for provisioning - read and write capacity
units. Because the concept of capacity units was new to cus-
tomers, some found it challenging to forecast the provisioned
throughput. As mentioned in the beginning of this section,
customers either over-provisioned, which resulted in low uti-
lization, or under-provisioned, which resulted in throttling. To
improve the customer experience for spiky workloads, we
launched on-demand tables. On-demand tables remove the
burden from our customers of figuring out the right provision-
ing for tables. DynamoDB provisions the on-demand tables
based on the consumed capacity by collecting the signal of
reads and writes and instantly accommodates up to double
the previous peak traffic on the table. If an application needs
more than double the previous peak on the table, DynamoDB
automatically allocates more capacity as the traffic volume
increases to ensure that the workload does not experience
throttling. On-demand scales a table by splitting partitions for
consumption. The split decision algorithm is based on traffic.
GAC allows DynamoDB to monitor and protect the system
from one application consuming all the resources. The ability
to balance based on consumed capacity effectively means
partitions of on-demand tables can be placed intelligently so
as to not run into node level limits.
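The on-demand sizing rule described above can be sketched as follows; this is an illustration of the "double the previous peak" behavior, not DynamoDB code.

```python
# Illustrative sketch of the on-demand sizing rule: the table is kept ready
# to instantly absorb up to double its previous peak, and the tracked peak
# grows as traffic grows. Not DynamoDB code.
class OnDemandCapacityTracker:
    def __init__(self, initial_peak_rps):
        self.previous_peak = initial_peak_rps

    def accommodated_rate(self):
        # Instantly accommodates up to double the previous peak.
        return 2 * self.previous_peak

    def observe(self, current_rps):
        # As traffic grows, the tracked peak (and hence capacity) grows too.
        self.previous_peak = max(self.previous_peak, current_rps)

tracker = OnDemandCapacityTracker(initial_peak_rps=1000)
print(tracker.accommodated_rate())  # 2000
tracker.observe(1800)               # within 2x of the previous peak
tracker.observe(2500)               # new peak observed; capacity follows
print(tracker.accommodated_rate())  # 5000
```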
5 Durability and correctness
Data should never be lost after it has been committed. In
practice, data loss can occur because of hardware failures,
software bugs, or hardware bugs. DynamoDB is designed for
high durability by having mechanisms to prevent, detect, and
correct any potential data losses.
5.1 Hardware failures
As with most database management systems, the write-ahead
logs [15] in DynamoDB are central for providing durability
and crash recovery. Write ahead logs are stored in all three
replicas of a partition. For higher durability, the write ahead
logs are periodically archived to S3, an object store that is
designed for 11 nines of durability. Each replica still contains
the most recent write-ahead logs that are usually waiting to
be archived. The unarchived logs are typically a few hundred
megabytes in size. In a large service, hardware failures such
as memory and disk failures are common. When a node fails,
all replication groups hosted on the node are down to two
copies. The process of healing a storage replica can take
several minutes because the repair process involves copying
the B-tree and write-ahead logs. Upon detecting an unhealthy
storage replica, the leader of a replication group adds a log
replica to ensure there is no impact on durability. Adding a
log replica takes only a few seconds because the system has to
copy only the recent write-ahead logs from a healthy replica
to the new replica without the B-tree. Thus, quick healing of
impacted replication groups using log replicas ensures high
durability of most recent writes.
5.2 Silent data errors
Some hardware failures can cause incorrect data to be stored
[5, 7]. In our experience, these errors can happen because of
the storage media, CPU, or memory [5]. Unfortunately, it’s
very difficult to detect these and they can happen anywhere in
the system. DynamoDB makes extensive use of checksums to
detect silent errors. By maintaining checksums within every
log entry, message, and log file, DynamoDB validates data
integrity for every data transfer between two nodes. These
checksums serve as guardrails to prevent errors from spread-
ing to the rest of the system. For example, a checksum is
computed for every message between nodes or components
and is verified because these messages can go through various
layers of transformations before they reach their destination.
Without such checks, any of the layers could introduce a silent
error.
Every log file that is archived to S3 has a manifest that
contains information about the log, such as the table, the partition,
and the start and end markers for the data stored in the log file.
The agent responsible for archiving log files to S3 performs
various checks before uploading the data. These include, but
are not limited to, verification of every log entry to ensure that
it belongs to the correct table and partition, verification of
checksums to detect any silent errors, and verification that
the log file doesn’t have any holes in the sequence numbers.
Once all the checks are passed, the log file and its manifest
are archived. Log archival agents run on all three replicas of
the replication group. If one of the agents finds that a log file
is already archived, the agent downloads the uploaded file to
verify the integrity of the data by comparing it with its local
write-ahead log. Every log file and manifest file are uploaded
to S3 with a content checksum. The content checksum is
checked by S3 as part of the put operation, which guards
against any errors during data transit to S3.
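The pre-archival checks described above can be sketched roughly as follows; the record layout and checksum function are assumptions for illustration.

```python
# Illustrative sketch of the pre-archival checks described above: every log
# entry must belong to the expected table and partition, per-entry checksums
# must verify, and sequence numbers must have no holes. The record layout
# and the CRC32 checksum are assumptions for illustration.
import zlib
from dataclasses import dataclass

@dataclass
class LogEntry:
    table_id: str
    partition_id: str
    sequence_number: int
    payload: bytes
    checksum: int  # CRC32 of payload, in this sketch

def verify_log_file(entries, expected_table, expected_partition):
    prev_seq = None
    for e in entries:
        # 1. Entry belongs to the correct table and partition.
        if e.table_id != expected_table or e.partition_id != expected_partition:
            return False
        # 2. Checksum verifies (detects silent data errors).
        if zlib.crc32(e.payload) != e.checksum:
            return False
        # 3. No holes in the sequence numbers.
        if prev_seq is not None and e.sequence_number != prev_seq + 1:
            return False
        prev_seq = e.sequence_number
    return True
```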
5.3 Continuous verification
DynamoDB also continuously verifies data at rest. Our goal
is to detect any silent data errors or bit rot in the system.
An example of such a continuous verification system is the
scrub process. The goal of scrub is to detect errors that we
had not anticipated, such as bit rot. The scrub process runs
and verifies two things: all three copies of the replicas in a
replication group have the same data, and the data of the live
replicas matches with a copy of a replica built offline using
the archived write-ahead log entries. The process of building
a replica using archived logs is explained in section 5.5 below.
The verification is done by computing the checksum of the
live replica and matching that with a snapshot of one gener-
ated from the log entries archived in S3. The scrub mechanism
acts as a defense in depth to detect divergences between the
live storage replicas and replicas built using the history
of logs from the inception of the table. These comprehensive
checks have been very beneficial in providing confidence in
the running system. A similar technique of continuous veri-
fication is used to verify replicas of global tables. Over the
years, we have learned that continuous verification of data-at-
rest is the most reliable method of protecting against hardware
failures, silent data corruption, and even software bugs.
5.4 Software bugs
DynamoDB is a distributed key-value store that’s built on a
complex substrate. High complexity increases the probability
of human error in design, code, and operations. Errors in the
system could cause loss or corruption of data, or violate other
interface contracts that our customers depend on. We use
formal methods [16] extensively to ensure the correctness
of our replication protocols. The core replication protocol
was specified using TLA+ [12,13]. When new features that
affect the replication protocol are added, they are incorporated
into the specification and model checked. Model checking
has allowed us to catch subtle bugs that could have led to
durability and correctness issues before the code went into
production. Other services such as S3 [6] have also found
model-checking useful in similar scenarios.
We also employ extensive failure injection testing and
stress testing to ensure the correctness of every piece of soft-
ware deployed. In addition to testing and verifying the repli-
cation protocol of the data plane, formal methods have also
been used to verify the correctness of our control plane and
features such as distributed transactions.
5.5 Backups and restores
In addition to guarding against physical media corruption, Dy-
namoDB also supports backup and restore to protect against
any logical corruption due to a bug in a customer’s application.
Backups or restores don’t affect performance or availability
of the table as they are built using the write-ahead logs that
are archived in S3. The backups are consistent across mul-
tiple partitions up to the nearest second. The backups are
full copies of DynamoDB tables and are stored in an Ama-
zon S3 bucket. Data from a backup can be restored to a new
DynamoDB table at any time.
DynamoDB also supports point-in-time restore. Using
point-in-time restore, customers can restore the contents of
a table that existed at any time in the previous 35 days to
a different DynamoDB table in the same region. For tables
with the point-in-time restore enabled, DynamoDB creates
periodic snapshots of the partitions that belong to the table
and uploads them to S3. The periodicity at which a partition
is snapshotted is decided based on the amount of write-ahead
logs accumulated for the partition. The snapshots, in conjunc-
tion with the write-ahead logs, are used to do point-in-time restore.
When a point-in-time restore is requested for a table, Dy-
namoDB identifies the closest snapshots to the requested time
for all the partitions of the tables, applies the logs up to the
timestamp in the restore request, creates a snapshot of the
table, and restores it.
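A simplified sketch of the point-in-time restore flow for a single partition follows; the snapshot and log-record structures are assumptions for illustration, not DynamoDB's formats.

```python
# Illustrative sketch of point-in-time restore for one partition: pick the
# closest snapshot at or before the requested time, then apply archived
# write-ahead log records up to that time. Data structures are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Snapshot:
    taken_at: float                      # epoch seconds
    state: Dict[str, str] = field(default_factory=dict)

@dataclass
class WalRecord:
    timestamp: float
    key: str
    value: str

def restore_partition(snapshots: List[Snapshot],
                      wal: List[WalRecord],
                      restore_time: float) -> Dict[str, str]:
    # Closest snapshot at or before the requested restore time.
    base = max((s for s in snapshots if s.taken_at <= restore_time),
               key=lambda s: s.taken_at,
               default=Snapshot(0.0))
    state = dict(base.state)
    # Apply archived log records between the snapshot and the restore time.
    for rec in sorted(wal, key=lambda r: r.timestamp):
        if base.taken_at < rec.timestamp <= restore_time:
            state[rec.key] = rec.value
    return state
```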
6 Availability
To achieve high availability, DynamoDB tables are distributed
and replicated across multiple Availability Zones (AZ) in a
Region. DynamoDB regularly tests resilience to node, rack,
and AZ failures. For example, to test the availability and
durability of the overall service, power-off tests are exercised.
Using realistic simulated traffic, random nodes are powered
off using a job scheduler. At the end of all the power-off tests,
the test tools verify that the data stored in the database is
logically valid and not corrupted. This section expands on
some of the challenges solved in the last decade to ensure
high availability.
6.1 Write and consistent read availability
A partition’s write availability depends on its ability to have a
healthy leader and a healthy write quorum. A healthy write
quorum in the case of DynamoDB consists of two out of
the three replicas from different AZs. A partition remains
available as long as there are enough healthy replicas for a
write quorum and a leader. A partition will become unavail-
able for writes if the number of healthy replicas falls below
the minimum quorum. If one of the replicas
is unresponsive, the leader adds a log replica to the group.
Adding a log replica is the fastest way to ensure that the write
quorum of the group is always met. This minimizes disrup-
tion to write availability due to an unhealthy write quorum.
The leader replica serves consistent reads. Introducing log
replicas was a big change to the system, and the formally
proven implementation of Paxos provided us the confidence
to safely tweak and experiment with the system to achieve
higher availability. We have been able to run millions of Paxos
groups in a Region with log replicas. Eventually consistent
reads can be served by any of the replicas. In case a leader
replica fails, other replicas detect its failure and elect a new
leader to minimize disruptions to the availability of consistent
reads.
6.2 Failure detection
A newly elected leader will have to wait for the expiry of the
old leader’s lease before serving any traffic. While this only
takes a couple of seconds, the elected leader cannot accept
any new writes or consistent read traffic during that period,
thus disrupting availability. One of the critical components
for a highly available system is failure detection for the leader.
Failure detection must be quick and robust to minimize dis-
ruptions. False positives in failure detection can lead to more
disruptions in availability. Failure detection works well for
failure scenarios where every replica of the group loses con-
nection to the leader. However, nodes can experience gray
network failures. Gray network failures can happen because
of communication issues between a leader and follower, is-
sues with outbound or inbound communication of a node, or
front-end routers facing communication issues with the leader
even though the leader and followers can communicate with
each other. Gray failures can disrupt availability because there
might be a false positive in failure detection or no failure de-
tection. For example, a replica that isn’t receiving heartbeats
from a leader will try to elect a new leader. As mentioned in
the section above, this can disrupt availability. To solve the
availability problem caused by gray failures, a follower that
wants to trigger a failover sends a message to other replicas
in the replication group asking if they can communicate with
the leader. If replicas respond with a healthy leader message,
the follower drops its attempt to trigger a leader election. This
change in the failure detection algorithm used by DynamoDB
significantly minimized the number of false positives in the
system, and hence the number of spurious leader elections.
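The confirmation step described above can be sketched as follows; the message exchange and the exact rule are assumptions for illustration.

```python
# Illustrative sketch of the gray-failure-aware failover check described
# above: before triggering an election, a follower that has stopped
# receiving heartbeats asks its peers whether they can still reach the
# leader, and drops the attempt if any peer reports a healthy leader.
# Message names and the exact rule are assumptions for illustration.
def should_trigger_election(local_heartbeat_ok, peers):
    if local_heartbeat_ok:
        return False  # leader looks healthy from here; nothing to do
    # Ask the other replicas in the group whether they can reach the leader.
    for peer in peers:
        if peer.can_communicate_with_leader():
            # At least one peer sees a healthy leader: likely a gray failure
            # local to this follower, so don't disrupt availability.
            return False
    # No peer can reach the leader either: propose a new leader election.
    return True
```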
6.3 Measuring availability
DynamoDB is designed for 99.999 percent availability for
global tables and 99.99 percent availability for Regional tables.
Availability is calculated for each 5-minute interval as the per-
centage of requests processed by DynamoDB that succeed.
To ensure these goals are being met, DynamoDB continu-
ously monitors availability at service and table levels. The
tracked availability data is used to analyze customer perceived
availability trends and trigger alarms if customers see errors
above a certain threshold. These alarms are called customer-
facing alarms (CFA). The goal of these alarms is to report
any availability-related problems and proactively mitigate the
problem either automatically or through operator interven-
tion. In addition to real-time tracking, the system runs daily
jobs that calculate aggregate availability metrics per customer.
The results of the aggregation are
uploaded to S3 for regular analysis of availability trends.
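The interval-based availability metric described above can be computed as in the following sketch; the request-log format is an assumption for illustration.

```python
# Illustrative sketch of the availability metric described above: for each
# 5-minute interval, availability is the percentage of requests that
# succeeded. The request-log format is an assumption.
from collections import defaultdict

INTERVAL_SECONDS = 5 * 60

def availability_by_interval(requests):
    """requests: iterable of (epoch_seconds, succeeded) tuples."""
    totals = defaultdict(lambda: [0, 0])  # interval -> [succeeded, total]
    for ts, succeeded in requests:
        bucket = int(ts // INTERVAL_SECONDS)
        totals[bucket][1] += 1
        if succeeded:
            totals[bucket][0] += 1
    return {bucket: 100.0 * ok / total for bucket, (ok, total) in totals.items()}
```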
DynamoDB also measures and alarms on availability ob-
served on the client-side. There are two sets of clients used
to measure the user-perceived availability. The first set of
clients are internal Amazon services using DynamoDB as
the data store. These services share the availability metrics
for DynamoDB API calls as observed by their software. The
second set of clients is our DynamoDB canary applications.
These applications are run from every AZ in the Region, and
they talk to DynamoDB through every public endpoint. Real
application traffic allows us to reason about DynamoDB avail-
ability and latencies as seen by our customers and catch gray
failures [10, 11]. They are a good representation of what our
customers might be experiencing, as both long- and short-term
trends.
6.4 Deployments
Unlike a traditional relational database, DynamoDB takes
care of deployments without the need for maintenance win-
dows and without impacting the performance and availability
that customers experience. Software deployments are done
for various reasons, including new features, bug fixes, and
performance optimizations. Often deployments involve updat-
ing numerous services. DynamoDB pushes software updates
at a regular cadence. A deployment takes the software from
one state to another state. The new software being deployed
goes through a full development and test cycle to build confi-
dence in the correctness of the code. Over the years, across
multiple deployments, DynamoDB has learned that it’s not
just the end state and the start state that matter; there could
be times when the newly deployed software doesn’t work
and needs a rollback. The rolled-back state might be different
from the initial state of the software. The rollback procedure
is often missed in testing and can lead to customer impact.
DynamoDB runs a suite of upgrade and downgrade tests at a
component level before every deployment. Then, the software
is rolled back on purpose and tested by running functional
tests. DynamoDB has found this process valuable for catch-
ing issues that otherwise would make it hard to roll back if
needed.
Deploying software on a single node is quite different from
deploying software to multiple nodes. The deployments are
not atomic in a distributed system, and, at any given time,
there will be software running the old code on some nodes
and new code on other parts of the fleet. The additional chal-
lenge with distributed deployments is that the new software
might introduce a new type of message or change the protocol
in a way that old software in the system doesn’t understand.
DynamoDB handles these kinds of changes with read-write
deployments. Read-write deployment is completed as a multi-
step process. The first step is to deploy the software to read
the new message format or protocol. Once all the nodes can
handle the new message, the software is updated to send new
messages. New messages are enabled with software deploy-
ment as well. Read-write deployments ensure that both types
of messages can coexist in the system. Even in the case of
rollbacks, the system can understand both old and new mes-
sages.
All the deployments are done on a small set of nodes before
pushing them to the entire fleet of nodes. The strategy reduces
the potential impact of faulty deployments. DynamoDB sets
alarm thresholds on availability metrics (mentioned in section
6.3). If error rates or latency exceed the threshold values
during deployments, the system triggers automatic rollbacks.
Software deployments to storage nodes trigger leader failovers
that are designed to avoid any impact to availability. The
leader replicas relinquish leadership and hence the group’s
new leader doesn’t have to wait for the old leader’s lease to
expire.
6.5 Dependencies on external services
To ensure high availability, all the services that DynamoDB
depends on in the request path should be more highly avail-
able than DynamoDB. Alternatively, DynamoDB should be
able to continue to operate even when the services on which
it depends are impaired. Examples of services DynamoDB
depends on for the request path include AWS Identity and
Access Management Services (IAM) [2], and AWS Key Man-
agement Service (AWS KMS) [3] for tables encrypted using
customer keys. DynamoDB uses IAM and AWS KMS to au-
thenticate every customer request. While these services are
highly available, DynamoDB is designed to operate when
these services are unavailable without sacrificing any of the
security properties that these systems provide.
In the case of IAM and AWS KMS, DynamoDB employs
a statically stable design [18], where the overall system keeps
working even when a dependency becomes impaired. Per-
haps the system doesn’t see any updated information that its
dependency was supposed to have delivered. However, every-
thing before the dependency became impaired continues to
work despite the impaired dependency. DynamoDB caches
results from IAM and AWS KMS in the request routers that
perform the authentication of every request. DynamoDB peri-
odically refreshes the cached results asynchronously. If IAM
or KMS were to become unavailable, the routers will continue
to use the cached results for a pre-determined extended period.
Clients that send operations to request routers that don’t have
the cached results will see an impact. However, we have seen
a minimal impact in practice when AWS KMS or IAM is
impaired. Moreover, caches improve response times by re-
moving the need to do an off-box call, which is especially
valuable when the system is under high load.
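The statically stable caching pattern described above can be sketched roughly as follows; for brevity this sketch refreshes on access rather than asynchronously, and the TTL values and fetch interface are assumptions.

```python
# Illustrative sketch of statically stable caching: authentication results
# are cached in the request router; if the dependency (e.g. IAM/KMS) is
# impaired, cached results continue to be used for an extended period.
# TTLs and the fetch interface are assumptions; refresh here is on access
# for brevity, whereas the service refreshes asynchronously.
import time

FRESH_TTL = 60            # seconds before we try to refresh the entry
EXTENDED_TTL = 6 * 3600   # cached result remains usable if dependency is down

class StaticallyStableCache:
    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn   # call to the dependency (e.g. IAM/KMS)
        self.entries = {}          # key -> (value, fetched_at)

    def get(self, key):
        now = time.time()
        entry = self.entries.get(key)
        if entry and now - entry[1] < FRESH_TTL:
            return entry[0]
        try:
            value = self.fetch_fn(key)       # normal path: refresh from dependency
            self.entries[key] = (value, now)
            return value
        except Exception:
            # Dependency impaired: keep serving the stale-but-valid result
            # for an extended period rather than failing the request.
            if entry and now - entry[1] < EXTENDED_TTL:
                return entry[0]
            raise                            # no cached result: impact is visible
```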
6.6 Metadata availability
One of the most important pieces of metadata the request
routers need is the mapping between a table's primary keys
and storage nodes. At launch, DynamoDB stored the metadata
in DynamoDB itself. This routing information consisted of
all the partitions for a table, the key range of each partition,
and the storage nodes hosting the partition. When a router
received a request for a table it had not seen before, it down-
loaded the routing information for the entire table and cached
it locally. Since the configuration information about partition
replicas rarely changes, the cache hit rate was approximately
99.75 percent. The downside is that caching introduces bi-
modal behavior. In the case of a cold start where request
routers have empty caches, every DynamoDB request would
result in a metadata lookup, and so the service had to scale
to serve requests at the same rate as DynamoDB. This effect
has been observed in practice when new capacity is added
to the request router fleet. Occasionally the metadata service
traffic would spike up to 75 percent. Thus, introducing new
request routers impacted the performance and could make the
system unstable. In addition, an ineffective cache can cause
cascading failures to other parts of the system as the source
of data falls over from too much direct load [4].
DynamoDB wanted to remove, or at least significantly reduce,
the reliance on the local cache for request routers and other
metadata clients without impacting the latency of the cus-
tomer requests. When servicing a request, the router needs
only information about the partition hosting the key for the
request. Therefore, it was wasteful to get the routing infor-
mation for the entire table, especially for large tables with
many partitions. To mitigate against metadata scaling and
availability risks in a cost-effective fashion, DynamoDB built
an in-memory distributed datastore called MemDS. MemDS
stores all the metadata in memory and replicates it across the
MemDS fleet. MemDS scales horizontally to handle the entire
incoming request rate of DynamoDB. The data is highly com-
pressed. The MemDS process on a node encapsulates a Perkle
data structure, a hybrid of a Patricia tree [17] and a Merkle
tree. The Perkle tree allows keys and associated values to be
inserted for subsequent lookup using the full key or a key
prefix. Additionally, as keys are stored in sorted order, range
queries such as lessThan, greaterThan, and between are also
supported. The MemDS Perkle tree additionally supports two
special lookup operations: floor and ceiling. The floor
operation accepts a key and returns a stored entry from the
Perkle whose key is less than or equal to the given key. The
ceiling operation is similar but returns the entry whose key
is greater than or equal to the given key.
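The floor and ceiling lookups can be illustrated with a sorted list and binary search standing in for the Perkle structure; this is a sketch of the lookup semantics, not the MemDS implementation.

```python
# Illustrative sketch of floor/ceiling lookups over sorted keys, as
# described for the Perkle structure above. A sorted list with binary
# search stands in for the actual Patricia/Merkle hybrid.
from bisect import bisect_left, bisect_right

class SortedKeyIndex:
    def __init__(self, entries):
        # entries: dict of key -> value; keys kept in sorted order.
        self.keys = sorted(entries)
        self.entries = dict(entries)

    def floor(self, key):
        """Entry whose key is the greatest key <= the given key, or None."""
        i = bisect_right(self.keys, key)
        return (self.keys[i - 1], self.entries[self.keys[i - 1]]) if i else None

    def ceiling(self, key):
        """Entry whose key is the smallest key >= the given key, or None."""
        i = bisect_left(self.keys, key)
        if i == len(self.keys):
            return None
        return (self.keys[i], self.entries[self.keys[i]])

# Routing use case: partitions keyed by the start of their key range, so
# floor(hashed_key) finds the partition whose range contains the key.
index = SortedKeyIndex({"00": "partition-A", "40": "partition-B", "80": "partition-C"})
print(index.floor("57"))    # ('40', 'partition-B')
print(index.ceiling("15"))  # ('40', 'partition-B')
```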
A new partition map cache was deployed on each request
router host to avoid the bi-modality of the original request
router caches. In the new cache, a cache hit also results in
an asynchronous call to MemDS to refresh the cache. Thus,
the new cache ensures the MemDS fleet is always serving a
constant volume of traffic regardless of cache hit ratio. The
constant traffic to the MemDS fleet increases the load on the
metadata fleet compared to the conventional caches where
the traffic to the backend is determined by cache hit ratio, but
prevents cascading failures to other parts of the system when
the caches become ineffective.
DynamoDB storage nodes are the authoritative source of
partition membership data. Partition membership updates
are pushed from storage nodes to MemDS. Each partition
membership update is propagated to all MemDS nodes. If the
partition membership provided by MemDS is stale, then the
incorrectly contacted storage node either responds with the
latest membership if known or responds with an error code
that triggers another MemDS lookup by the request router.
7 Micro benchmarks
To show that scale doesn’t affect the latencies observed by
applications, we ran YCSB [8] workloads of types A (50
percent reads and 50 percent updates) and B (95 percent reads
and 5 percent updates). Both benchmarks used a uniform
key distribution and items of size 900 bytes. The workloads
were run against production DynamoDB in the North Virginia
region. The workloads were scaled from 100 thousand total
operations per second to 1 million total operations per second.

Figure 5: Summary of YCSB read latencies

Figure 6: Summary of YCSB write latencies
Figure 5 shows the read latencies of both workloads at the
50th and 99th percentiles. The purpose of the graph is to show
that, even at different throughputs, DynamoDB read latencies
show very little variance and remain identical even as the
throughput of the workload is increased. The read throughput
of Workload B is twice that of Workload A, and still the
latencies show very little variance. Figure 6 shows the write
latencies of both workloads at the 50th and 99th percentiles.
Like the read latencies, the write latencies remain constant
no matter the throughput of the workload. In the case of YCSB,
Workload A drives a higher write throughput than Workload B,
but the write latency profile for both workloads is similar.
8 Conclusion
DynamoDB has pioneered the space of cloud-native NoSQL
databases. It is a critical component of thousands of applica-
tions used daily for shopping, food, transportation, banking,
entertainment, and so much more. Developers rely on its
ability to scale data workloads while providing steady perfor-
mance, high availability, and low operational complexity. For
more than 10 years, DynamoDB has maintained these key
properties and extended its appeal to application developers
with game-changing features such as on-demand capacity,
point-in-time backup and restore, multi-Region replication,
and atomic transactions.
9 Acknowledgements
DynamoDB has benefited greatly from its customers whose
continuous feedback pushed us to innovate on their behalf. We
have been lucky to have an amazing team working with us on
this journey. We thank Shawn Bice, Rande Blackman, Marc
Brooker, Lewis Bruck, Andrew Certain, Raju Gulabani, James
Hamilton, Long Huang, Yossi Levanoni, David Lutz, Max-
imiliano Maccanti, Rama Pokkunuri, Tony Petrossian, Jim
Scharf, Khawaja Shams, Stefano Stefani, Allan Vermuellen,
Wei Xiao and the entire DynamoDB team for the impact-
ful contributions during the course of this evolution. Many
people have helped to improve this paper. We thank the anony-
mous reviewers who helped shape the paper. Special thanks
to Darcy Jayne, Kiran Reddy, and Andy Warfield for going
above and beyond to help.
References
[1] Amazon SimpleDB: Simple Database Service. https://aws.amazon.com/simpledb/.
[2] AWS Identity and Access Management Service. https://aws.amazon.com/iam/.
[3] AWS Key Management Service. https://aws.amazon.com/kms/.
[4] Summary of the Amazon DynamoDB service disruption and related impacts in the US-East Region. 2015. https://aws.amazon.com/message/5467D2/.
[5] L. N. Bairavasundaram, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, G. R. Goodson, and B. Schroeder. An analysis of data corruption in the storage stack. ACM Transactions on Storage (TOS), 4(3):1–28, 2008.
[6] J. Bornholt, R. Joshi, V. Astrauskas, B. Cully, B. Kragl, S. Markle, K. Sauri, D. Schleit, G. Slatton, S. Tasiran, J. Van Geffen, and A. Warfield. Using lightweight formal methods to validate a key-value storage node in Amazon S3. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, SOSP '21, pages 836–850, New York, NY, USA, 2021. Association for Computing Machinery.
[7] C. Constantinescu, I. Parulkar, R. Harper, and S. Michalak. Silent data corruption—myth or reality? In 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN), pages 108–109. IEEE, 2008.
[8] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 143–154, 2010.
[9] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. SIGOPS Oper. Syst. Rev., 41(6):205–220, Oct 2007.
[10] T. Hauer, P. Hoffmann, J. Lunney, D. Ardelean, and A. Diwan. Meaningful availability. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 545–557, 2020.
[11] P. Huang, C. Guo, L. Zhou, J. R. Lorch, Y. Dang, M. Chintalapati, and R. Yao. Gray failure: The Achilles' heel of cloud-scale systems. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems, pages 150–155, 2017.
[12] L. Lamport. Specifying Systems, volume 388. Addison-Wesley, Boston, 2002.
[13] L. Lamport. The PlusCal algorithm language. In International Colloquium on Theoretical Aspects of Computing, pages 36–60. Springer, 2009.
[14] L. Lamport et al. Paxos made simple. ACM SIGACT News, 32(4):18–25, 2001.
[15] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz. ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems (TODS), 17(1):94–162, 1992.
[16] C. Newcombe, T. Rath, F. Zhang, B. Munteanu, M. Brooker, and M. Deardeuff. How Amazon Web Services uses formal methods. Communications of the ACM, 58(4):66–73, 2015.
[17] K. Sklower. A tree-based packet routing table for Berkeley Unix. In USENIX Winter, volume 1991, pages 93–99. Citeseer, 1991.
[18] B. Weiss and M. Furr. Static stability using Availability Zones. https://aws.amazon.com/builders-library/static-stability-using-availability-zones/.