Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web [paper] [code]

In distributed systems there exists a problem of how to spread load across many servers in a scalable manner. Consistent hashing is a hashing design that allows for distribution with minimal coordination between the machines.


A first solution might be to hash the requests across the different servers. In terms of a caching, this means to take item (i), hash it (good distribution) and then assign it to the available caches (C), maybe with modulo: hash\_fn(i) % C. There are two problems with this approach. First is that the distribution is a function of the servers, so as we add or remove servers, we now need to recalculate the modulo and therefore the distribution. The second problem, which derives from the first, is that now each server node now needs to be updated - said differently, each server node now needs to be aware of every node. Consensus is hard, so this is not ideal.

Section 4 contained proof and discussion of Consistent Hashing.

The authors define 4 properties that define their notion of consistency: balance requires that the hash function distribute the objects in a balanced fashion amongst the buckets. monotonicity says if items are initially assigned to a set of buckets and then some new buckets are added, then an item may move from an old bucket to a new bucket, but not from one old bucket to another. spread implies that references for a given object are only directed to a small number of caching machines. load implies that no one cache is assigned an unreasonable number of objects.objects.

Next they define some properties of a good ranged hash function: Given Rb(hash for buckets) and Ri(hash for items). An item i should be assigned to the closest bucket. Given max of C buckets. For each bucket create k*log(C), some constant k virtual buckets and map them using the hash function Rb. To save space, have Rb and Ri map to the range [0,1]. To differentiate a point from another, we only needs to have log(number of points) random bits (decimal precision) identifying a point.

Lastly they share an efficient implementation of a possible consistent hash function: Use a balanced binary search tree to store the buckets and their assignment in the range. If there are C buckets, then there will be k*C*log(C) entries, which gives a worse tree depth of log(C) and a worse possible calculation of log(C). The time for adding and removing a bucket is log^2(C) since we would need to remove k*log(C) points. To reduce this time to constant lookup, the authors suggest dividing the range into k*C*log(C) equal length segments and have a separate tree for each one. This allows us to more predictably have one bucket per segment. The problem with segments is that it requires smaller segments as more buckets are added. An amortized way recommended by the authors is to choose intervals of size 1/2^x such that 1/2^x < 1/k*C*log(C). Then as new buckets are added, gradually bisect each section.

Impossibility of Distributed Consensus with One Faulty Process [paper]

The consensus problem involves a system of asynchronous processes, some of which may be unreliable/faulty (die or crash). This paper proves that every solution, with even a single faulty process, has the possibility of non-termination. The important takeaway: we can design our systems to make the possibility of a livelock small, but the probability is non-zero.


Despite being a short paper, the proofs themselves were pretty confusing. This post was critical for my understanding.

The proof is comprised of two lemmas. The first lemma showed that there must be some initial configuration where consensus is not yet determined (caused by errors or message delays). Think of this as bivalent (having two possible truths) state. The second lemma shows that it is always possible to remain in a bivalent state by delaying a message.

Lets note the assumptions made by the proof: The underlying transport protocol is reliable; messages are delivered correctly and exactly once. There is no synchronized clock, and it is not possible to detect a slow process vs a dead process. Additionally, the proof relaxes the termination requirement and requires that only some process eventually decide on a value (weak consensus). Put another way, a message will always be delivered but can be delayed and can be delivered out-of-order.

For Strong Consensus these 3 properties must hold: termination: eventually, every correct process decides some value. agreement: all processes that decide do so on the same value. validity: the value must have been proposed by some process.

By proving their results under assumptions of weak concensus, the authors are able to extend the result to strong concensus systems.