Building the Real-Time Big Data Database: Seven Design Principles behind Scylla [paper]

This paper is structured more like a whitepaper than a research paper and gives really nice insight into engineering a high-performance DB. Scylla is a drop-in replacement for Cassandra and promises better performance. It claims to do so by leveraging modern hardware features and automation. The paper explores seven key design decisions that guided Scylla's development.

  1. Using C++ instead of Java meant avoiding JVM garbage-collection pauses, while gaining precise control over memory and access to advanced kernel features.

  2. Scylla chose to be compatible with Cassandra's drivers, query language, and ecosystem, and was therefore able to market itself to the vast existing Cassandra community.

  3. An async architecture is used for all I/O and CPU operations in Scylla. This not only avoids traditional concurrency problems but also means performance is bounded by system resources rather than by the application framework. See Seastar for more info on the async engine used in Scylla.

  4. A shared-nothing (one shard per core) architecture is used to avoid locks. This lets the application developer avoid context switches, cache invalidation, and locking. Instead, shared-memory queues are used to communicate between shards. So, like Cassandra, the entire dataset of the cluster is sharded across nodes (machines), but additionally, the data on each node is sharded per core (see the sketch after this list).

  5. Cassandra uses a key cache and a row cache in addition to the general-purpose Linux page cache. This adds complexity with unpredictable performance. Scylla chose to implement a single unified cache that tunes itself to the current workload. It also uses direct memory access (DMA) to read from disk on a cache miss; DMA operates independently of the CPU.

  6. Scylla utilizes I/O schedulers to self-tune and balance between foreground and background I/O operations. The scylla_io_setup tool is used to find the maximum useful disk concurrency while maintaining good latency and not overloading the drives.

  7. The self-tuning capabilities allow Scylla to make informed decisions without manual intervention or the need to hire an expert. This design is used in compaction, caching, and many other parts of the system.
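A minimal sketch of the shard-per-core idea from point 4. This is not Scylla's actual routing (Scylla shards by token ranges and uses Murmur3); the hash function, node count, and core count here are all made up for illustration.

```python
import hashlib

NUM_NODES = 3        # nodes (machines) in the cluster -- made-up size
CORES_PER_NODE = 8   # shards (cores) per node -- made-up size

def token(partition_key: str) -> int:
    """Hash the partition key to a numeric token (md5 as a stand-in hash)."""
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)

def route(partition_key: str) -> tuple[int, int]:
    """First map the key to a node, then to a shard (core) on that node."""
    t = token(partition_key)
    node = t % NUM_NODES                        # cluster-level placement (simplified)
    shard = (t // NUM_NODES) % CORES_PER_NODE   # per-core placement within the node
    return node, shard

print(route("user:42"))  # every key deterministically lands on exactly one shard
```

Because a key always lands on exactly one shard, that shard can own the data without locks; cross-shard work goes over message queues instead of shared state.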

Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web [paper] [code]

In distributed systems there exists the problem of how to spread load across many servers in a scalable manner. Consistent hashing is a hashing scheme that allows load to be distributed with minimal coordination between machines.


A first solution might be to hash the requests across the different servers. In the context of caching, this means taking an item (i), hashing it (with good distribution), and assigning it to one of the available caches (C), perhaps with modulo: hash_fn(i) % C. There are two problems with this approach. First, the distribution is a function of the number of servers, so as we add or remove servers, we need to recalculate the modulo and therefore the distribution. The second problem, which follows from the first, is that each server node now needs to be updated - said differently, each node needs to be aware of every other node. Consensus is hard, so this is not ideal.
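A quick illustration of the first problem, assuming a generic well-distributed hash (md5 here is arbitrary): when the server count changes from 4 to 5, most keys map to a different server under hash_fn(i) % C.

```python
import hashlib

def h(item: str) -> int:
    # Any well-distributed hash works; md5 is used only for illustration.
    return int(hashlib.md5(item.encode()).hexdigest(), 16)

items = [f"item-{n}" for n in range(10_000)]

before = {i: h(i) % 4 for i in items}   # 4 cache servers
after  = {i: h(i) % 5 for i in items}   # add a 5th server

moved = sum(1 for i in items if before[i] != after[i])
print(f"{moved / len(items):.0%} of items changed servers")  # typically ~80%
```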

Section 4 contains the proofs and a discussion of consistent hashing.

The authors define four properties that capture their notion of consistency: balance requires that the hash function distribute objects evenly amongst the buckets. monotonicity says that if items are initially assigned to a set of buckets and new buckets are then added, an item may move from an old bucket to a new bucket, but not from one old bucket to another. spread implies that references to a given object are directed to only a small number of caching machines. load implies that no one cache is assigned an unreasonable number of objects.

Next they define some properties of a good ranged hash function. Given Rb (a hash for buckets) and Ri (a hash for items), an item i should be assigned to the closest bucket. Given a maximum of C buckets, for each bucket create k*log(C) virtual buckets (for some constant k) and map them using the hash function Rb. To save space, have Rb and Ri map into the range [0,1]. To differentiate one point from another, we only need log(number of points) random bits (decimal precision) to identify a point.
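A minimal sketch of this ranged hash function, with made-up constants and md5 standing in for Rb and Ri: each bucket gets k*log(C) virtual points in [0,1], and an item is assigned to the bucket whose point is circularly closest (naive linear scan for now).

```python
import hashlib
import math

def point(label: str) -> float:
    """Map a label to a pseudo-random point in [0, 1) (stand-in for Rb / Ri)."""
    return int(hashlib.md5(label.encode()).hexdigest(), 16) / 2**128

def virtual_points(bucket: str, C: int, k: int = 2) -> list[tuple[float, str]]:
    """k * log(C) copies of each bucket, spread over the unit interval."""
    copies = max(1, round(k * math.log2(C)))
    return [(point(f"{bucket}#{j}"), bucket) for j in range(copies)]

def assign(item: str, ring: list[tuple[float, str]]) -> str:
    """Naive scan: pick the circularly closest virtual bucket."""
    p = point(item)
    return min(ring, key=lambda vb: min(abs(vb[0] - p), 1 - abs(vb[0] - p)))[1]

buckets = ["cache-A", "cache-B", "cache-C", "cache-D"]
ring = [vp for b in buckets for vp in virtual_points(b, C=len(buckets))]
print(assign("object-123", ring))
```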

Lastly they share an efficient implementation of a possible consistent hash function: use a balanced binary search tree to store the buckets and their positions in the range. If there are C buckets, there will be k*C*log(C) entries, which gives a worst-case tree depth of log(C) and a worst-case lookup cost of log(C). The time for adding or removing a bucket is log^2(C), since we need to insert or remove k*log(C) points, each at a cost of log(C). To reduce lookup to constant time, the authors suggest dividing the range into k*C*log(C) equal-length segments and keeping a separate tree for each one. This allows us to more predictably have about one bucket point per segment. The problem with segments is that they must shrink as more buckets are added. The amortized approach recommended by the authors is to choose intervals of size 1/2^x such that 1/2^x < 1/(k*C*log(C)), and then, as new buckets are added, gradually bisect each segment.
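A sketch of the efficient lookup, using a sorted list with bisect in place of the balanced BST the authors describe (same O(log) search); md5 and the constants are arbitrary, and the closest-point rule from above is kept. It also shows the low-disruption property: adding a bucket only remaps items near the new virtual points.

```python
import bisect
import hashlib
import math

def point(label: str) -> float:
    return int(hashlib.md5(label.encode()).hexdigest(), 16) / 2**128

class ConsistentHash:
    """Sorted-list stand-in for the balanced BST described in the paper."""

    def __init__(self, k: int = 2):
        self.k = k
        self.points: list[float] = []      # sorted virtual-bucket positions
        self.owner: dict[float, str] = {}  # position -> bucket name

    def add_bucket(self, name: str, C_max: int = 64):
        copies = max(1, round(self.k * math.log2(C_max)))
        for j in range(copies):
            p = point(f"{name}#{j}")
            bisect.insort(self.points, p)  # O(log) search (plus list shift)
            self.owner[p] = name

    def lookup(self, item: str) -> str:
        """Closest virtual bucket on the circle, found with O(log n) probes."""
        p = point(item)
        i = bisect.bisect_left(self.points, p)
        lo = self.points[i - 1]                 # predecessor (wraps at i == 0)
        hi = self.points[i % len(self.points)]  # successor (wraps at the end)
        dist = lambda q: min(abs(q - p), 1 - abs(q - p))
        return self.owner[lo if dist(lo) <= dist(hi) else hi]

ch = ConsistentHash()
for b in ["cache-A", "cache-B", "cache-C", "cache-D"]:
    ch.add_bucket(b)

items = [f"obj-{n}" for n in range(10_000)]
before = {i: ch.lookup(i) for i in items}
ch.add_bucket("cache-E")                  # add one bucket
moved = sum(1 for i in items if ch.lookup(i) != before[i])
print(f"{moved / len(items):.0%} of items moved")  # only items near the new points move
```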

Impossibility of Distributed Consensus with One Faulty Process [paper]

The consensus problem involves a system of asynchronous processes, some of which may be unreliable/faulty (they may die or crash). This paper proves that every solution, even with a single faulty process, has the possibility of non-termination. The important takeaway: we can design our systems to make the possibility of a livelock small, but the probability is non-zero.


Despite being a short paper, the proofs themselves were pretty confusing. This post was critical for my understanding.

The proof comprises two lemmas. The first lemma shows that there must be some initial configuration where consensus is not yet determined (caused by errors or message delays); think of this as a bivalent (having two possible outcomes) state. The second lemma shows that it is always possible to remain in a bivalent state by delaying a message.
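For reference, the valence definitions and the two lemmas can be restated a bit more formally; this is only a restatement of the paper's definitions in standard notation, nothing new.

```latex
% V(C) is the set of decision values reachable from configuration C.
\[
  V(C) = \{\, v \mid \text{some configuration reachable from } C
               \text{ has decision value } v \,\}
\]
\[
  C \text{ is bivalent} \iff V(C) = \{0,1\}, \qquad
  C \text{ is } v\text{-valent} \iff V(C) = \{v\}.
\]
% First lemma: the protocol has a bivalent initial configuration.
% Second lemma: if C is bivalent and the event e = (p, m) is applicable to C,
% then there is a configuration reachable from C without applying e from which
% applying e still leads to a bivalent configuration.
```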

Let's note the assumptions made by the proof: the underlying transport protocol is reliable; messages are delivered correctly and exactly once. There is no synchronized clock, and it is not possible to distinguish a slow process from a dead one. Additionally, the proof relaxes the termination requirement and requires only that some process eventually decide on a value (weak consensus). Put another way, a message will always be delivered, but it can be delayed and can arrive out of order.

For strong consensus these three properties must hold: termination: eventually, every correct process decides some value. agreement: all processes that decide do so on the same value. validity: the decided value must have been proposed by some process.

By proving their result under the assumptions of weak consensus, the authors are able to extend the result to strong consensus systems.