original rfc: https://www.ietf.org/rfc/rfc4122.txt
original rfc: https://www.ietf.org/rfc/rfc4122.txt
Raft is a concensus protocol that aims to improve upon Paxos by being more understandable. It claims to be as efficient as Paxos. The paper outlines an implementation in great detail. To quote: 'The greatest difference between Raft and Paxos is Raft’s strong leadership: Raft uses leader election as an essential part of the consensus protocol, and it concentrates as much functionality as possible in the leader.' By basing actions on a 'consistent leader', subsequent actions and state space is simplified
Note: It is not possible to summarize the details of the Raft implementation so I will instead cover some high-level points/decision/tradeoffs by by the authors(take a look at the paper which highlights important content). Also sections (6)Log Compaction and (7)Cluster membership changes are currently left out.
It should be noted that there is distinction between the entry
state machine. The log is a mechanism used to make progress while an update to the state machine(an index pointing to the log) precludes that a entry and all previous entries have been
committed(a decision was made).
Raft seperates the protocol into 3 different modes (leader election, log replication and safety). It first elects a leader(choose a new one if current one fails) and gives it complete control for managing the replicated log(accept entries from client, replicating entries and committing entries). The safety property guarantees that if an entry is committed to the state machine, all members of the cluster agreed upon the value.
A member of the cluster can only be in one of 3 states: leader, follower, candidate. A
term is used as a logical clock in Raft(clock skew makes this a good choice). There are two RPC in Raft (RequestVote and AppendEntries) and they are idempotent.
Leader Election: A heartbeat mechanism is used to track the liveliness of the leader. There is only ever 1 leader (term is used to decide the which leader is valid). A failure to elect a leader(split vote, packet loss) will cause members to timout (150-300ms) and issue a RequestVote of their own and thus starting a new election.
Log Replication: A leader is the only one that can modify the logs. It services client request and sends AppendEntries RPC in parallel to members. Once an entry has been safely replicated, it will answer the client. At the start of leadership, the leader will force all members to match its logs and therefore delete log entries from followers if necessary. The log on the leader is only ever append only.
Safety: A restriction for voting for a leader is that the RequestVote RPC includes log information and the voter denies the vote if its own log is more up-to-date than that of the candidate. This ensures that the winning candidate must contain all committed entries since a majority vote is needed.
A caveat: clients send all interactions to the leader(a follower will reject and provide information about the current leader). Therefore the leader is responsible for handling all reads and writes which could cause hot-spot on a single server. Additionally, even a read-only operation requires the leader to verify it has the latest information and thus contact a majority of the members.
A caveat: Paxos is a paper and implementations vary out in the field. Egalitarian Paxos (EPaxos) can achieve higher performance under some conditions with a leaderless approach; however it is more complex.
This paper is structured more like a whitepaper than a research paper and gives really nice insight into engineering a high performant DB. Scylla is a drop-in replacement for Cassandra and promises to provide better performance. It claims to do so by leveraging modern harware features and automation. The paper explores 7 key design decisions that helped guide Scylla's development.
Using C++ instead of Java meant avoiding GC latencies in JVM, while gaining precise control over memory and being able to access advance kernal features.
Scylla chose to be compatible with Cassandra drivers, query language and ecosystem and was therefore able to market to the existing vast Cassandra community.
Async architecture was used for all I/O and CPU operations in Scylla. This not only avoided traditional concurrency problems and also capped the performance to system resources rather than the application framework. See Seastar for more info on the async engine used in Scylla.
A shared-nothing (one shard/core)architecture was used to avoid using locks. This lets the application developer avoid context, cache invalidation, and locks. Instead shared memory queues are used to communicated between shards. So like Cassandra, the entire dataset of the cluster is sharded to nodes(machines), but additionally, the data on the node is sharded per core.
Cassandra uses a key cache and row cache in addition to the general purpose Linux page cache. This adds complixity with unpredictable performance. Scylla chose to implement a singe unified cache which can tune itself depending on the current workload. It also uses things like direct memory access(DMA) to access disk when there is a cache miss. DMA operates independent of CPU.
Scylla utilizes I/O schedulers to self tune and balance between foreground and background I/O operations. The tool
scylla_ io_setup is used to ensure maximum useful disk concurrency while maintaining good latency and not overloading drives.
The self-tuning capabilities allow Scylla to make informed decisions without manual intervention or needing to hire an expert. This design is used in compaction, caches and many other parts of the system.
In distributed systems there exists a problem of how to spread load across many servers in a scalable manner. Consistent hashing is a hashing design that allows for distribution with minimal coordination between the machines.
A first solution might be to hash the requests across the different servers. In terms of a caching, this means to take item (
i), hash it (good distribution) and then assign it to the available caches (
C), maybe with modulo:
hash\_fn(i) % C. There are two problems with this approach. First is that the distribution is a function of the servers, so as we add or remove servers, we now need to recalculate the modulo and therefore the distribution. The second problem, which derives from the first, is that now each server node now needs to be updated - said differently, each server node now needs to be aware of every node. Consensus is hard, so this is not ideal.
Section 4 contained proof and discussion of Consistent Hashing.
The authors define 4 properties that define their notion of consistency: balance requires that the hash function distribute the objects in a balanced fashion amongst the buckets. monotonicity says if items are initially assigned to a set of buckets and then some new buckets are added, then an item may move from an old bucket to a new bucket, but not from one old bucket to another. spread implies that references for a given object are only directed to a small number of caching machines. load implies that no one cache is assigned an unreasonable number of objects.objects.
Next they define some properties of a good ranged hash function: Given
Rb(hash for buckets) and
Ri(hash for items). An item
i should be assigned to the closest bucket. Given max of
C buckets. For each bucket create
k*log(C), some constant
k virtual buckets and map them using the hash function
Rb. To save space, have
Ri map to the range
[0,1]. To differentiate a point from another, we only needs to have
log(number of points) random bits (decimal precision) identifying a point.
Lastly they share an efficient implementation of a possible consistent hash function: Use a balanced binary search tree to store the buckets and their assignment in the range. If there are
C buckets, then there will be
k*C*log(C) entries, which gives a worse tree depth of
log(C) and a worse possible calculation of
log(C). The time for adding and removing a bucket is
log^2(C) since we would need to remove
k*log(C) points. To reduce this time to constant lookup, the authors suggest dividing the range into
k*C*log(C) equal length segments and have a separate tree for each one. This allows us to more predictably have one bucket per segment. The problem with segments is that it requires smaller segments as more buckets are added. An amortized way recommended by the authors is to choose intervals of size
1/2^x such that
1/2^x < 1/k*C*log(C). Then as new buckets are added, gradually bisect each section.
The consensus problem involves a system of asynchronous processes, some of which may be unreliable/faulty (die or crash). This paper proves that every solution, with even a single faulty process, has the possibility of non-termination. The important takeaway: we can design our systems to make the possibility of a livelock small, but the probability is non-zero.
Despite being a short paper, the proofs themselves were pretty confusing. This post was critical for my understanding.
The proof is comprised of two lemmas. The first lemma showed that there must be some initial configuration where consensus is not yet determined (caused by errors or message delays). Think of this as bivalent (having two possible truths) state. The second lemma shows that it is always possible to remain in a bivalent state by delaying a message.
Lets note the assumptions made by the proof: The underlying transport protocol is reliable; messages are delivered correctly and exactly once. There is no synchronized clock, and it is not possible to detect a slow process vs a dead process. Additionally, the proof relaxes the termination requirement and requires that only some process eventually decide on a value (weak consensus). Put another way, a message will always be delivered but can be delayed and can be delivered out-of-order.
For Strong Consensus these 3 properties must hold: termination: eventually, every correct process decides some value. agreement: all processes that decide do so on the same value. validity: the value must have been proposed by some process.
By proving their results under assumptions of weak concensus, the authors are able to extend the result to strong concensus systems.