Near-Optimal Latency Versus Cost Tradeoffs in Geo-Distributed Storage
NSDI, 2020
Summary
This paper shows that existing strongly consistent distributed storage systems are far from the optimal latency-cost tradeoff. It presents PANDO, a strongly consistent geo-distributed storage system that achieves lower latency by reducing the number of wide-area round trips needed for read/write operations under the Paxos consensus protocol, and that uses Reed-Solomon erasure coding to reduce storage cost instead of the replication adopted by prior work. Under the assumption that read/write conflicts are rare, experiments show that PANDO comes close to the lower bound on read latency versus storage cost, as well as on read versus write latency.
Details
- Key ideas:
- Scenario: hot data in applications such as Google Docs and ShareLaTeX that require strong consistency over a wide-area, geo-distributed storage deployment, while also providing fault tolerance and keeping costs low
- Main issues in previous works:
- SOTA: EPaxos with replication; both its latency and its storage cost are far from the theoretical lower bounds derived in the paper.
- note: the lower bounds are computed by solving a set of integer programs (see the lower-bound sketch after this list)
- EPaxos requires only one round of communication between the frontend and the nearest majority of replicas
- problems: storage overhead from full replication, and, at a fixed cost, read latency significantly higher than the achievable lower bound
- Why not Raft or another leader-based design? The cost of leader migration and lease acquisition is too high.
- Why erasure coding? With ISA-L, the CPU cost of erasure coding is negligible compared to wide-area communication cost, while it cuts storage overhead substantially; this is the core reason to use EC rather than replication (see Figure 14 in the paper and the overhead sketch after this list).
- Design space: either extend one of the one-round variants of Paxos to work on erasure-coded data, or take classical two-round Paxos and address its latency directly; PANDO takes the latter approach.
- existing work: RS-Paxos
- problem: its read/write latency is roughly twice that of EPaxos because of the larger quorums erasure coding requires.
- Solution:
- Shrink Phase-1 quorums by reusing read quorums (see the quorum sketch after this list)
- Partially delegate write logic: offload part of Paxos's write logic to a delegate, exploiting heterogeneous inter-site latencies
- Use smaller quorums in the common case, falling back to larger ones on failures or conflicts
- Fall back to a leader when writes conflict
- Evaluation: compare with EPaxos, RS-Paxos and their extensions
- Deployment: Geo-distributed over Azure
- Metrics: latency, storage overhead, throughput, and GapVolume, a metric that quantifies how far each approach's latency-cost tradeoff is from the lower bound (see the GapVolume sketch after this list)
- They also evaluate an application-level workload using GitLab operations.
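
To make the integer-program note above concrete, here is a minimal brute-force sketch, not the paper's formulation: given round-trip times from a single frontend to each site (the numbers below are made up) and a storage budget, it enumerates (n, k) erasure-coding configurations and quorum sizes that satisfy simple overlap and fault-tolerance constraints, and reports the lowest achievable read latency.

```python
# Simplified lower-bound search (illustrative only, not the paper's integer program).
# Assumed inputs: RTTs (ms) from one frontend to each candidate storage site.
rtt = sorted([20, 45, 60, 85, 110, 140])   # made-up latencies
storage_budget = 2.0                        # max storage overhead (x raw object size)
f = 1                                       # number of site failures to tolerate

best = None
for n in range(1, len(rtt) + 1):            # place chunks at the n nearest sites
    for k in range(1, n + 1):               # (n, k) Reed-Solomon style encoding
        if n / k > storage_budget:
            continue
        for qr in range(k, n + 1):          # read quorum size
            for qw in range(k, n + 1):      # write quorum size
                # Reads must overlap writes in >= k chunks to reconstruct the
                # latest value, and quorums must survive f site failures.
                if qr + qw - n < k or qr > n - f or qw > n - f:
                    continue
                read_lat = rtt[qr - 1]      # wait for the qr nearest sites
                cand = (read_lat, n / k, n, k, qr, qw)
                if best is None or cand < best:
                    best = cand

print("best (read latency ms, overhead, n, k, qr, qw):", best)
```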
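
The "Why EC" point above is essentially a storage-overhead argument: r-way replication stores r full copies, while (n, k) erasure coding stores n chunks that are each 1/k of the object. A tiny sketch of the arithmetic, using example parameters of my choosing:

```python
# Storage overhead of replication vs. (n, k) erasure coding, for configurations
# that both tolerate two site failures (example parameters, not from the paper).
def replication_overhead(r):
    return float(r)          # r full copies

def erasure_overhead(n, k):
    return n / k             # n chunks, each 1/k of the object size

print("3-way replication:", replication_overhead(3), "x raw data")   # 3.0x
print("RS(n=5, k=3):     ", erasure_overhead(5, 3), "x raw data")    # ~1.67x
```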
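
The "shrink Phase-1 quorums" point can be illustrated with a toy quorum-size checker. The constraints encoded below are my reading, not a verbatim spec: RS-Paxos needs Phase-1 and Phase-2 quorums to overlap in at least k acceptors so the proposer can reconstruct a possibly chosen value, whereas PANDO only needs Phase 1 to learn whether some value may have been chosen (overlap of at least one), relying on read quorums, which overlap every Phase-2 quorum in at least k, to fetch the actual chunks.

```python
# Toy comparison of the smallest admissible Phase-1 quorum under (n, k) erasure
# coding. Worst-case overlap of quorums of sizes a and b out of n sites is a + b - n.
def smallest_q1(n, q2, needed_overlap):
    """Smallest Phase-1 quorum whose worst-case overlap with a Phase-2
    quorum of size q2 is at least `needed_overlap`."""
    for q1 in range(1, n + 1):
        if q1 + q2 - n >= needed_overlap:
            return q1
    return None

n, k, q2 = 5, 2, 4
print("RS-Paxos min |Q1|:", smallest_q1(n, q2, needed_overlap=k))  # -> 3 (must recover k chunks)
print("PANDO    min |Q1|:", smallest_q1(n, q2, needed_overlap=1))  # -> 2 (only needs metadata)
```

A smaller Phase-1 quorum means the proposer can wait on fewer, closer sites, which is where the latency reduction comes from.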
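
GapVolume is only named above; the paper uses it to quantify how close each system's latency-cost tradeoff comes to the lower bound. The sketch below uses a simplified stand-in of my own: the area between an achieved latency-vs-cost curve and the lower-bound curve, normalized by the area under the lower bound. All numbers are made up.

```python
# Simplified gap metric (my stand-in for GapVolume, not the paper's exact definition).
def trapezoid_area(ys, xs):
    """Area under a piecewise-linear curve via the trapezoid rule."""
    return sum((ys[i] + ys[i + 1]) / 2 * (xs[i + 1] - xs[i]) for i in range(len(xs) - 1))

cost        = [1.5, 2.0, 2.5, 3.0]   # storage overhead (x raw size), made-up points
lower_bound = [120, 95, 80, 70]      # hypothetical optimal read latency (ms) at each cost
achieved    = [150, 110, 90, 75]     # hypothetical system's read latency (ms) at each cost

gap   = trapezoid_area([a - b for a, b in zip(achieved, lower_bound)], cost)
bound = trapezoid_area(lower_bound, cost)
print(f"normalized gap: {gap / bound:.3f}")   # 0 would mean matching the lower bound
```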
Strength
- Combines consensus with erasure coding for hot storage, and is shown to come close to the theoretical lower bounds on latency and storage overhead
Weakness
- The paper does not discuss recovery when a node fails, or node maintenance.
- The paper says little about metadata handling.
- (Question) Besides Reed-Solomon, are other erasure codes compatible with the system? What about larger n and k?