Making QUIC Quicker With NIC Offload [paper]

The authors measured different components of the QUIC protocol. They identified kernel-to-userspace communication (data copy), crypto, and packet reordering as CPU-hungry tasks that could be offloaded to a NIC.


Kernel-userspace data copy accounts for 40-50% of total CPU usage. When kernel bypass is used, crypto operations become the bottleneck and consume up to 40% of CPU per connection. Packet re-ordering accounts for 5-20% of total CPU.

The authors also measured CPU usage in the multi-connection scenario and saw that the major difference compared to the single-connection scenario is that the CPU time spent on packet sending (e.g., packet formatting, crypto, packet I/O) increased by about 10%.

The cost of going from a single connection to multiple connections is ~10% CPU.

For the crypto task, enc_aead() and dec_aead() calls were responsible for approx 90% of CPU time.
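
To make the per-packet crypto cost concrete, here is a minimal sketch of AEAD encryption and decryption (AES-128-GCM, a common QUIC cipher) using Python's cryptography package; the key, nonce, header, and payload values are made up for illustration, and this is not code from any of the measured implementations.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)  # per-connection key (illustrative)
aead = AESGCM(key)

nonce = os.urandom(12)          # in QUIC this is derived from the IV and packet number
header = b"quic-header-bytes"   # authenticated but not encrypted (AAD)
payload = b"stream frame data"  # encrypted payload

ciphertext = aead.encrypt(nonce, payload, header)    # roughly the per-packet work of enc_aead()
plaintext = aead.decrypt(nonce, ciphertext, header)  # and of dec_aead() on the receive side
assert plaintext == payload
```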

The QUIC implementations Quant, Quicly, Picoquic and Facebook's Mvfst were used for the analysis.

  • Mvfst seems to be the most balanced in terms of functionality and performance.
  • Quicly seems to lack performance compared to the other implementations.
  • Quant, however, is a research tool and lacks a lot of real functionality.
  • Picoquic's congestion control seems to ignore packet loss?? or is this its packet re-ordering functionality?
    • Picoquic implements a packet re-ordering engine which could help it maintain throughput when faced with packet loss/re-ordering. It has two different algorithms for the re-ordering engine (splay tree and linear search).

Netmap is an efficient packet I/O framework used here for network emulation. TLEM is a utility built on top of Netmap which allows for controlling traffic perturbations, i.e., loss, delay, re-ordering, etc.

Network Interface Cards (NICs) with programmable hardware components, e.g., a network processor or FPGA, can help by relieving the host CPU of expensive computation tasks.

It seems like the industry is looking into programmable NICs.

Carousel: Scalable Traffic Shaping at End Hosts [paper]

Traffic shaping is a critical feature for datacenters. It is used to apply policy-based bandwidth allocation, for packet pacing, and by rate-based congestion control. Typical traffic shaping mechanisms can be CPU expensive while not being very accurate. Carousel improves upon the state of the art (8% CPU savings) with: i) a single queue shaper; ii) fine-grained, just-in-time freeing of resources coupled to packet departure; iii) one shaper per CPU core with lock-free coordination.


Network bandwidth is an expensive resource to overprovision, and bursty links can lead to packet loss, less accurate bandwidth estimation, and longer RTTs. Deep buffers have typically been used but lead to poor latencies. Traffic shaping covers both pacing (injecting inter-packet gaps to smooth traffic) and rate limiting (enforcing a rate on flow aggregates). Within datacenters the need to shape traffic is critical since multiple customer VMs all compete for network bandwidth.

traditional shapers

HTB (Hierarchical Token Bucket) is good for enforcing rate limits on flow aggregates, but it scales poorly with the number of rate-limited aggregates. FQ/pacing is used for pacing individual TCP connections, but this solution does not support flow aggregates. Policers are used in hypervisor switch deployments where HTB or FQ/pacing are too CPU intensive.

Policers: a token bucket mechanism to assign resources to a flow, with zero buffering and packet dropping as a side effect.
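
A minimal token-bucket policer sketch (my own illustration, not code from the paper): tokens accrue at the configured rate, and a packet that arrives when the bucket is empty is dropped rather than queued.

```python
import time

class Policer:
    """Token-bucket policer: zero buffering, non-conforming packets are dropped."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0       # refill rate in bytes per second
        self.burst = burst_bytes         # bucket depth
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, pkt_len):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_len:
            self.tokens -= pkt_len
            return True                  # forward the packet
        return False                     # drop it; a policer never queues
```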

HTB: a complex buffer/token bucket/tree structure to support advanced traffic management. Backpressure is needed to avoid unbounded queue growth. Can group flows into individual or aggregate groups and thus apply complex logic to traffic shaping.

FQ/pacing: FQ tracks per-flow state in an array of Red-Black trees indexed on flow hash IDs. A deficit round robin scheduler is used for outgoing packets from active flows. A garbage collector deletes inactive flows. Pacing is based on packet length and current pacing rate. Future packets are kept in a separate RB tree index. This all sounds very CPU intensive.
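
The pacing arithmetic itself is simple; here is a sketch of the per-packet release-time computation (the CPU cost comes from the trees and scheduling around it, not from this calculation):

```python
def next_release_time(now_s, pkt_len_bytes, pacing_rate_bps):
    """Space packets so the flow averages the pacing rate: gap = packet length / rate."""
    gap_s = (pkt_len_bytes * 8) / pacing_rate_bps
    return now_s + gap_s

# A 1500-byte packet at 100 Mbit/s is released ~120 microseconds after the previous one.
print(next_release_time(0.0, 1500, 100e6))  # -> 0.00012
```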

cost of shaping ...

carousel design principles ...

QUIC: A UDP-Based Multiplexed and Secure Transport (rfc) [paper]

QUIC is a UDP-based network protocol that will be the basis for HTTP/3. It moves much of the transport logic that TCP implements in the kernel up into the application layer. The implications of this are massive since kernels and network hardware are very difficult to replace/upgrade compared with application code. It is also secure by default.


main rfc: https://datatracker.ietf.org/doc/html/rfc9000
tls rfc: https://datatracker.ietf.org/doc/html/rfc9001
recovery rfc: https://datatracker.ietf.org/doc/html/rfc9002

HMAC: Keyed-Hashing for Message Authentication (rfc 2104) [paper]

A MAC (message authentication code) is a mechanism to check the integrity of information transmitted over an unreliable medium. HMAC is an implementation of a MAC using a cryptographic hash function. SHA-3 is probably a good algorithm to use. HMAC is designed to allow interchanging hash functions seamlessly, to not incur a performance penalty on top of the hash function, to have simple key usage, and to have cryptographic strength based on the strength of the hash function.


original rfc: https://www.ietf.org/rfc/rfc2104.txt

introduction

The paper cites MD5 and SHA-1 (SHA-3 should probably be used today) as viable cryptographic hash functions. Along with a cryptographic hash function, HMAC also requires a shared secret key.

HMAC has the following goals:

  • use existing cryptographic function, which can be interchanged
  • does not incur performance degradation on top of the cryptographic function
  • uses and handles keys in a simple way
  • cryptographic strength of the auth mechanism is based on the strength of the underlying hash function

definition of HMAC

H = hash function

K = secret key (it is recommended that the length of K be at minimum L)

B = length of the hash function's input block in bytes (e.g., 64 for MD5 and SHA-1)

L = length of the hash output in bytes (e.g., 16 for MD5, 20 for SHA-1)

ipad = the byte 0x36 repeated B times

opad = the byte 0x5C repeated B times

HMAC is calculated as H(K XOR opad, H(K XOR ipad, text)), computed by the following steps (a code sketch follows the list):

  • 1: append zeros to the end of K to create a B byte string (e.g., if K is of length 20 bytes and B=64, then K will be appended with 44 zero bytes 0x00)
  • 2: XOR (bitwise exclusive-OR) the B byte string computed in step (1) with ipad
  • 3: append the stream of data ’text’ to the B byte string resulting from step (2)
  • 4: apply H to the stream generated in step (3)
  • 5: XOR (bitwise exclusive-OR) the B byte string computed in step (1) with opad
  • 6: append the H result from step (4) to the B byte string resulting from step (5)
  • 7: apply H to the stream generated in step (6) and output the result
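
The seven steps map directly onto code. A short sketch following them, using SHA-256 (B = 64 bytes) for illustration and checked against Python's standard hmac module:

```python
import hashlib
import hmac

def hmac_sha256(key: bytes, text: bytes) -> bytes:
    B = 64                                          # input block length of SHA-256 in bytes
    if len(key) > B:                                # RFC 2104: keys longer than B are hashed first
        key = hashlib.sha256(key).digest()
    k = key.ljust(B, b"\x00")                       # step 1: pad K with zero bytes to B bytes
    k_ipad = bytes(b ^ 0x36 for b in k)             # step 2: K XOR ipad
    inner = hashlib.sha256(k_ipad + text).digest()  # steps 3-4: H(K XOR ipad, text)
    k_opad = bytes(b ^ 0x5C for b in k)             # step 5: K XOR opad
    return hashlib.sha256(k_opad + inner).digest()  # steps 6-7: H(K XOR opad, inner)

# sanity check against the standard library implementation
assert hmac_sha256(b"key", b"data") == hmac.new(b"key", b"data", hashlib.sha256).digest()
```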

keys

It is advised that the length of K be at least L, otherwise security is decreased. Keys need to be chosen at random using a cryptographically strong pseudo-random generator and periodically refreshed.

truncated output

Truncating the output MAC is a known practice since the entire length is not needed to verify that the hash was computed. This can save on transmission bandwidth, I imagine, but I'm not sure how useful it is. The recommendation is that the truncation not be less than half the length of the output and not less than 80 bits (might be more in modern days).

security

  • the construction of HMAC is independent of the details of a particular hash function
  • message authentication, as opposed to encryption, has a "transient" effect: breaking the hash function today means we must change the function to protect future MACs, but past messages remain verified; the present break doesn't affect past verification.

In Search of an Understandable Consensus Algorithm (Extended Raft) [paper]

Raft is a consensus protocol that aims to improve upon Paxos by being more understandable. It claims to be as efficient as Paxos. The paper outlines an implementation in great detail. To quote: 'The greatest difference between Raft and Paxos is Raft’s strong leadership: Raft uses leader election as an essential part of the consensus protocol, and it concentrates as much functionality as possible in the leader.' By basing actions on a 'consistent leader', subsequent actions and the state space are simplified.


Note: It is not possible to summarize all the details of the Raft implementation, so I will instead cover some high-level points/decisions/tradeoffs made by the authors (take a look at the paper, which highlights the important content). Also, sections 6 (Log Compaction) and 7 (Cluster Membership Changes) are currently left out.

It should be noted that there is a distinction between the entry log and the state machine. The log is the mechanism used to make progress, while an update to the state machine (tracked by an index pointing into the log) presupposes that an entry and all previous entries have been committed (a decision was made).

Raft separates the protocol into 3 different subproblems (leader election, log replication and safety). It first elects a leader (choosing a new one if the current one fails) and gives it complete control over managing the replicated log (accepting entries from clients, replicating entries and committing entries). The safety property guarantees that if an entry has been applied to a state machine, no other member of the cluster will ever apply a different entry for that log index.

A member of the cluster can only be in one of 3 states: leader, follower, candidate. A term is used as a logical clock in Raft (clock skew makes physical time a poor choice). There are two RPCs in Raft (RequestVote and AppendEntries) and they are idempotent.

Leader Election: A heartbeat mechanism is used to track the liveness of the leader. There is only ever one leader (the term is used to decide which leader is valid). A failure to elect a leader (split vote, packet loss) will cause members to time out (150-300ms) and issue a RequestVote of their own, thus starting a new election.

Log Replication: The leader is the only one that can modify the logs. It services client requests and sends AppendEntries RPCs in parallel to members. Once an entry has been safely replicated, it answers the client. At the start of its leadership, the leader will force all members to match its log and therefore delete log entries from followers if necessary. The leader's own log is append-only.

Safety: A restriction for voting for a leader is that the RequestVote RPC includes log information and the voter denies the vote if its own log is more up-to-date than that of the candidate. This ensures that the winning candidate must contain all committed entries since a majority vote is needed.
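
A sketch of that voting restriction in illustrative Python (my own naming, not code from the paper): compare last log terms, break ties by last log index, and only grant the vote if the candidate's log is at least as up-to-date.

```python
class Follower:
    """Minimal state needed for the RequestVote check; term bookkeeping
    (e.g., stepping down on seeing a higher term) is omitted."""

    def __init__(self):
        self.current_term = 0
        self.voted_for = None
        self.log = []  # list of (term, command) entries

    def candidate_log_ok(self, cand_last_term, cand_last_index):
        my_last_index = len(self.log)
        my_last_term = self.log[-1][0] if self.log else 0
        if cand_last_term != my_last_term:
            return cand_last_term > my_last_term   # higher last term is more up-to-date
        return cand_last_index >= my_last_index    # same term: the longer log wins

    def handle_request_vote(self, term, candidate_id, last_log_index, last_log_term):
        if term < self.current_term:
            return False                           # stale candidate
        if self.voted_for not in (None, candidate_id):
            return False                           # already voted this term
        if not self.candidate_log_ok(last_log_term, last_log_index):
            return False                           # our own log is more up-to-date
        self.voted_for = candidate_id
        return True
```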

A caveat: clients send all interactions to the leader (a follower will reject requests and provide information about the current leader). Therefore the leader is responsible for handling all reads and writes, which could create a hot spot on a single server. Additionally, even a read-only operation requires the leader to verify it has the latest information and thus contact a majority of the members.

A caveat: Paxos is a paper, and implementations in the field vary. Egalitarian Paxos (EPaxos) can achieve higher performance under some conditions with a leaderless approach; however, it is more complex.

Building the Real-Time Big Data Database: Seven Design Principles behind Scylla [paper]

This paper is structured more like a whitepaper than a research paper and gives really nice insight into engineering a high-performance DB. Scylla is a drop-in replacement for Cassandra and promises to provide better performance. It claims to do so by leveraging modern hardware features and automation. The paper explores 7 key design decisions that helped guide Scylla's development.

  1. Using C++ instead of Java meant avoiding GC latencies in the JVM, while gaining precise control over memory and being able to access advanced kernel features.

  2. Scylla chose to be compatible with Cassandra drivers, query language and ecosystem and was therefore able to market to the existing vast Cassandra community.

  3. An async architecture was used for all I/O and CPU operations in Scylla. This not only avoided traditional concurrency problems but also capped performance at the system's resources rather than the application framework. See Seastar for more info on the async engine used in Scylla.

  4. A shared-nothing (one shard per core) architecture was used to avoid using locks. This lets the application developer avoid context switches, cache invalidation, and locks. Instead, shared-memory queues are used to communicate between shards. So like Cassandra, the entire dataset of the cluster is sharded across nodes (machines), but additionally, the data on each node is sharded per core (see the sketch after this list).

  5. Cassandra uses a key cache and row cache in addition to the general-purpose Linux page cache. This adds complexity with unpredictable performance. Scylla chose to implement a single unified cache which can tune itself depending on the current workload. It also uses direct memory access (DMA) to reach disk when there is a cache miss; DMA operates independently of the CPU.

  6. Scylla utilizes I/O schedulers to self-tune and balance between foreground and background I/O operations. The tool scylla_io_setup is used to ensure maximum useful disk concurrency while maintaining good latency and not overloading drives.

  7. The self-tuning capabilities allow Scylla to make informed decisions without manual intervention or needing to hire an expert. This design is used in compaction, caches and many other parts of the system.
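
To make the two-level sharding from point 4 concrete, here is a hypothetical routing sketch (names and hash function are mine; Scylla/Cassandra actually use a Murmur3-based partitioner and token ranges): the partition key picks a node, and a second deterministic function picks the owning core on that node.

```python
import hashlib

def token(partition_key: bytes) -> int:
    # Hypothetical token function for illustration only.
    return int.from_bytes(hashlib.sha256(partition_key).digest()[:8], "big")

def owning_node(partition_key: bytes, nodes: list) -> str:
    # Cluster level: shard the dataset across nodes (simplified; no vnodes or replication).
    return nodes[token(partition_key) % len(nodes)]

def owning_core(partition_key: bytes, cores: int) -> int:
    # Node level: shard-per-core, so only one core ever touches this partition's data.
    return token(partition_key) % cores

nodes = ["node-a", "node-b", "node-c"]
key = b"user:42"
print(owning_node(key, nodes), owning_core(key, cores=8))
```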

Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web [paper] [code]

In distributed systems there exists a problem of how to spread load across many servers in a scalable manner. Consistent hashing is a hashing design that allows for distribution with minimal coordination between the machines.

expand...

A first solution might be to hash the requests across the different servers. In terms of caching, this means taking an item i, hashing it (good distribution) and then assigning it to one of the available caches C, maybe with modulo: hash_fn(i) % C. There are two problems with this approach. The first is that the distribution is a function of the number of servers, so as we add or remove servers we need to recalculate the modulo and therefore the distribution. The second problem, which derives from the first, is that each server node now needs to be updated; said differently, each server node needs to be aware of every other node. Consensus is hard, so this is not ideal.

Section 4 contains the proofs and discussion of consistent hashing.

The authors define 4 properties that capture their notion of consistency: balance requires that the hash function distribute the objects in a balanced fashion amongst the buckets. monotonicity says that if items are initially assigned to a set of buckets and then some new buckets are added, an item may move from an old bucket to a new bucket, but not from one old bucket to another. spread implies that references for a given object are only directed to a small number of caching machines. load implies that no one cache is assigned an unreasonable number of objects.

Next they define some properties of a good ranged hash function. Given Rb (hash for buckets) and Ri (hash for items), an item i should be assigned to the closest bucket. Given a maximum of C buckets, for each bucket create k*log(C) virtual buckets (for some constant k) and map them using the hash function Rb. To save space, have Rb and Ri map into the range [0,1]. To differentiate one point from another, we only need log(number of points) random bits (decimal precision) to identify a point.

Lastly they share an efficient implementation of a possible consistent hash function: use a balanced binary search tree to store the buckets and their assignments in the range. If there are C buckets, then there will be k*C*log(C) entries, which gives a worst-case tree depth of log(C) and a worst-case lookup of log(C). The time for adding or removing a bucket is log^2(C) since we would need to insert or remove k*log(C) points. To reduce lookup to constant time, the authors suggest dividing the range into k*C*log(C) equal-length segments and keeping a separate tree for each one. This allows us to more predictably have one bucket per segment. The problem with segments is that smaller segments are required as more buckets are added. An amortized approach recommended by the authors is to choose intervals of size 1/2^x such that 1/2^x < 1/(k*C*log(C)), and then, as new buckets are added, gradually bisect each segment.
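
A compact sketch of the ring idea using a sorted list instead of the paper's balanced trees (bisect gives the same O(log n) lookup; the names and the choice of SHA-256 are mine): each bucket gets k virtual points on the ring, and an item belongs to the first point at or after its own hash, wrapping around.

```python
import bisect
import hashlib

def _point(label: str) -> int:
    # Map a label to a position on the ring (illustrative choice of hash).
    return int.from_bytes(hashlib.sha256(label.encode()).digest()[:8], "big")

class ConsistentHash:
    def __init__(self, k=100):
        self.k = k          # virtual points per bucket (plays the role of k*log(C))
        self.points = []    # sorted ring positions
        self.owner = {}     # ring position -> bucket name

    def add_bucket(self, name: str):
        for i in range(self.k):
            p = _point(f"{name}#{i}")
            bisect.insort(self.points, p)
            self.owner[p] = name

    def remove_bucket(self, name: str):
        for i in range(self.k):
            p = _point(f"{name}#{i}")
            self.points.remove(p)
            del self.owner[p]

    def lookup(self, item: str) -> str:
        p = _point(item)
        idx = bisect.bisect(self.points, p) % len(self.points)  # wrap around the ring
        return self.owner[self.points[idx]]

ring = ConsistentHash()
for b in ("cache-1", "cache-2", "cache-3"):
    ring.add_bucket(b)
print(ring.lookup("object-123"))
# Adding "cache-4" later only moves the items whose hashes now fall closest to its points.
```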

Impossibility of Distributed Consensus with One Faulty Process [paper]

The consensus problem involves a system of asynchronous processes, some of which may be unreliable/faulty (die or crash). This paper proves that every solution, with even a single faulty process, has the possibility of non-termination. The important takeaway: we can design our systems to make the possibility of a livelock small, but the probability is non-zero.


Despite being a short paper, the proofs themselves were pretty confusing. This post was critical for my understanding.

The proof comprises two lemmas. The first lemma shows that there must be some initial configuration where consensus is not yet determined (caused by errors or message delays). Think of this as a bivalent state (one having two possible outcomes). The second lemma shows that it is always possible to remain in a bivalent state by delaying a message.

Let's note the assumptions made by the proof: the underlying transport protocol is reliable; messages are delivered correctly and exactly once. There is no synchronized clock, and it is not possible to distinguish a slow process from a dead one. Additionally, the proof relaxes the termination requirement and requires only that some process eventually decide on a value (weak consensus). Put another way, a message will always be delivered but can be delayed and can arrive out of order.

For Strong Consensus these 3 properties must hold: termination: eventually, every correct process decides some value. agreement: all processes that decide do so on the same value. validity: the value must have been proposed by some process.

By proving their result under the assumptions of weak consensus, the authors are able to extend it to strong consensus systems.