Reading Notes: ATC’21 RepairBoost
Title: Boosting Full-Node Repair in Erasure-Coded Storage
Conference: USENIX ATC ’21
Summary
This paper proposes a repair scheduling scheme called RepairBoost to solve the
single full-node repair problem. RepairBoost (1) load-balances the upload /
download traffic across different storage nodes, and (2) parallelizes the
repair of multiple independent blocks from different stripes to fully utilize
the network bandwidth in each time slot. It further proposes extensions for
multi-node repair and heterogeneous network settings. Experiments on Amazon
EC2 show the effectiveness of the proposed repair scheme.
Main Contributions
- Modeling and algorithm
- Step 1: Characterize repair solutions (algorithms) via Directed Acyclic
Graphs (DAGs), called RDAGs
- See the example of RDAG for PPR in Figure 6.
- Difference from ECDAG (FAST’19 OpenEC): the vertices model the
nodes, and the edges model the dependencies of linear combinations
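To make the RDAG idea concrete, here is a minimal sketch in Python (the class and vertex names are my own, not from the paper); it encodes a PPR-style two-level aggregation like the one in Figure 6, where pairs of surviving blocks are combined into partial sums that are then merged into the repaired block:

```python
# Hypothetical sketch of an RDAG: vertices are blocks (leaves = surviving
# blocks, intermediates = partial sums, root = repaired block); a directed
# edge u -> v means v is a linear combination that depends on u.
class RDAG:
    def __init__(self):
        self.edges = {}                    # vertex -> list of children

    def add_edge(self, src, dst):
        self.edges.setdefault(src, []).append(dst)
        self.edges.setdefault(dst, [])

    def indegree(self, v):
        return sum(v in children for children in self.edges.values())

    def leaves(self):
        # vertices with no incoming edges: surviving blocks read from disk
        return [v for v in self.edges if self.indegree(v) == 0]

# PPR-style partial repair: B1..B4 feed two partial sums, merged into R.
g = RDAG()
g.add_edge("B1", "P12"); g.add_edge("B2", "P12")
g.add_edge("B3", "P34"); g.add_edge("B4", "P34")
g.add_edge("P12", "R");  g.add_edge("P34", "R")
print(sorted(g.leaves()))  # ['B1', 'B2', 'B3', 'B4']
```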
- Step 2: Load balancing by assigning the vertices of multiple RDAGs
(corresponding to multiple blocks from independent stripes) to physical
nodes
- Core of the assignment algorithm: sort the intermediate vertices
by the number of blocks to upload / download, and perform a balanced
assignment
- Basically, it is a greedy algorithm
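The greedy balancing step could be sketched like this (function names and the traffic-demand model are my assumptions, not the paper's exact formulation):

```python
# Hypothetical sketch of the greedy load-balancing step: intermediate
# vertices (partial-sum computations) are sorted by their traffic demand
# (blocks to download + upload), then each is placed on the candidate
# node with the lightest accumulated load.
def assign_vertices(vertices, nodes):
    """vertices: list of (name, download_cnt, upload_cnt); nodes: node ids."""
    load = {n: 0 for n in nodes}
    placement = {}
    # heaviest vertices first, so a big demand never lands on a busy node
    for name, dl, ul in sorted(vertices, key=lambda v: -(v[1] + v[2])):
        target = min(load, key=load.get)      # least-loaded node so far
        placement[name] = target
        load[target] += dl + ul
    return placement, load

placement, load = assign_vertices(
    [("P12", 2, 1), ("P34", 2, 1), ("R", 2, 0)], ["N1", "N2", "N3"])
```

Sorting heavy vertices first is the usual longest-processing-time heuristic for balanced assignment; a lighter vertex placed later can always fill in the least-loaded node.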
- Step 3: Scheduling the repair operations
- Construct multiple maximum flows, reducing the sizes of
multiple RDAGs simultaneously
- Repeat the process until all vertices are handled (i.e., all leaf,
intermediate, and final blocks are assigned to the corresponding nodes
in a scheduled way)
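A toy version of the slot-by-slot scheduling, with a simple greedy pass standing in for the paper's maximum-flow computation (all names here are hypothetical): in each time slot every node sends and receives at most one block, and a transfer becomes eligible only once all inputs of its source vertex have arrived.

```python
# Hypothetical sketch: schedule RDAG transfers round by round under a
# one-upload / one-download-per-node-per-slot constraint.
def schedule(edges, placement, deps):
    """edges: (src_vertex, dst_vertex) transfers; placement: vertex -> node;
    deps: vertex -> set of input vertices that must arrive first."""
    done = {v for v in placement if not deps.get(v)}   # leaf blocks are ready
    pending, slots = list(edges), []
    while pending:
        busy_up, busy_down, this_slot = set(), set(), []
        for src, dst in list(pending):
            if src in done and placement[src] not in busy_up \
                    and placement[dst] not in busy_down:
                busy_up.add(placement[src]); busy_down.add(placement[dst])
                this_slot.append((src, dst)); pending.remove((src, dst))
        for src, dst in this_slot:
            deps[dst].discard(src)
            if not deps[dst]:                          # all inputs arrived
                done.add(dst)
        slots.append(this_slot)
    return slots

# PPR-style stripe: B1..B4 -> partial sums P12, P34 -> repaired block R.
edges = [("B1", "P12"), ("B2", "P12"), ("B3", "P34"), ("B4", "P34"),
         ("P12", "R"), ("P34", "R")]
placement = {"B1": "N1", "B2": "N2", "B3": "N3", "B4": "N4",
             "P12": "N2", "P34": "N4", "R": "N5"}
deps = {"P12": {"B1", "B2"}, "P34": {"B3", "B4"}, "R": {"P12", "P34"}}
slots = schedule(edges, placement, deps)
```

This greedy pass does not guarantee the maximum number of transfers per slot the way a max-flow solution would (a local transfer still occupies a slot here), but it shows the shape of the iteration: shrink all RDAGs a little in every slot until everything is repaired.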
- Extensions
- Multi-node repair scheduling
- Repair each node individually
- Prioritize the stripes with more blocks to repair, or with popularly
accessed chunks
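The prioritization rule above could be sketched as follows (the exact weighting and tie-breaking are my guess, not from the paper):

```python
# Hypothetical sketch of stripe prioritization for multi-node repair:
# stripes missing more blocks (closer to data loss) come first, and among
# ties, stripes holding popularly accessed (hot) chunks come first.
def prioritize(stripes):
    """stripes: list of (stripe_id, lost_block_count, has_hot_chunk)."""
    return sorted(stripes, key=lambda s: (-s[1], not s[2]))

order = prioritize([("S1", 1, False), ("S2", 2, False), ("S3", 1, True)])
print([s[0] for s in order])  # ['S2', 'S3', 'S1']
```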
- Heterogeneous environment
- Deduce the time to upload / download a chunk first
- Make sure that the upload / download speed for a chunk is roughly
the same across links (this is a little weird)
- Sort the links by the available bandwidth and choose the best
ones first
- Adaptation to network conditions
- Monitor the run-time network bandwidth, and proactively adjust the
repair solutions of the subsequent sub-processes
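A rough sketch of the heterogeneous-network estimation: derive the per-chunk transfer time from measured link bandwidth, then greedily prefer the link with the most available bandwidth (the chunk size, bandwidth figures, and function names are made up for illustration):

```python
# Hypothetical sketch: estimate chunk transfer time from link bandwidth
# and pick the source node whose link currently has the most spare capacity.
CHUNK_MB = 64

def transfer_time(bw_mbps):
    # a chunk moves at the speed of its (bottleneck) link; MB -> megabits
    return CHUNK_MB * 8 / bw_mbps          # seconds

def pick_source(candidates):
    """candidates: {node: available bandwidth in Mb/s} -> fastest node."""
    return max(candidates, key=candidates.get)

links = {"N1": 400.0, "N2": 950.0, "N3": 120.0}
best = pick_source(links)                  # 'N2'
print(best, round(transfer_time(links[best]), 3))
```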
- Implementation
- Standalone + HDFS integration (similar to OpenEC)
- Evaluation
- AWS EC2 deployment: similar to OpenEC, with 17 instances
- Coding schemes: RS code, LRC and Butterfly codes
- Repair algorithms: conventional, PPR and ECPipe
Details
Strength
- A novel modeling approach to characterize repair solutions for full-node
repair, which load-balances the repair load and effectively parallelizes
the repair of single blocks from multiple stripes
- Implementation and extensive evaluation
Weakness
- I have doubts about the complexity when it models codes with
sub-packetization, which by nature have more complicated linear
dependencies in the repair (decoding) operations (e.g., Clay codes). From
my understanding, the modeling would become much more complicated.
- I cannot find the repair performance of regenerating codes (Butterfly
codes, for example); the paper only shows the load-balancing degree.
- What’s the overhead of scheduling the repair operations?