Reading Notes: ICDE’20 BFT-Store
Title: BFT-Store: Storage Partition for Permissioned Blockchain via Erasure Coding
Conference (ICDE’20): Link
Summary
This paper proposes BFT-Store, an erasure-coding (EC)-based permissioned
blockchain storage system that reduces the storage overhead of traditional
replication-based blockchains. It integrates EC with the PBFT (Practical
Byzantine Fault Tolerance) consensus algorithm used in permissioned
blockchains, and it supports node joins and leaves via a re-encoding-based
scaling approach.
Main Contributions
Details
- Challenges
- When n = 3f + 1, how to ensure at least 2f + 1 honest nodes will receive
valid EC chunks?
- How to scale the blockchain when nodes join or leave?
- How to ensure data read performance when data are erasure-coded?
- Design
- Cross-block coding with RS(k,m) = (n - 2f, 2f): each set of n - 2f data
blocks is grouped to generate 2f parity blocks, which collectively form
a coding group;
- Reason to choose (n - 2f, 2f): to guarantee Byzantine fault
tolerance (the system builds on the PBFT protocol)
- BFT-Store combines EC with replication to preserve data read
performance, at the cost of higher storage overhead (though still
lower than full replication)
- Each block is replicated into c copies and distributed to a
specific set of nodes
- For N nodes, the overall storage overhead is reduced from O(N) to
O(c) (the paper says the storage overhead is reduced to O(1), but I
think this is wrong)
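To make the overhead claim concrete, here is a back-of-envelope sketch (my own, not from the paper) comparing per-block storage cost under full replication versus BFT-Store's hybrid scheme; the function name and parameters are illustrative:

```python
# Hypothetical per-block storage cost comparison, assuming n = 3f + 1
# nodes and c extra systematic replicas per block (illustrative only).
def storage_per_block(n: int, f: int, c: int) -> dict:
    k = n - 2 * f          # data blocks per coding group
    m = 2 * f              # parity blocks per coding group
    # Full replication: every node stores every block -> n copies.
    full_replication = n
    # BFT-Store: each group of k blocks yields k + m coded blocks,
    # i.e. (k + m) / k coded copies per data block, plus the c
    # systematic replicas kept for fast reads.
    ec_copies = (k + m) / k
    bft_store = ec_copies + c
    return {"full_replication": full_replication, "bft_store": bft_store}

# Example: n = 100 nodes (f = 33), c = 2 replicas per block.
cost = storage_per_block(n=100, f=33, c=2)
```

Since k = n - 2f is roughly n/3 when n = 3f + 1, the coded term stays near 3 regardless of n, so the per-block cost is about 3 + c: constant in n, which matches the O(c) reading above rather than the paper's O(1) claim in spirit (both are constants, but c dominates).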
- Encoding:
- On each node:
- Collect n - 2f data blocks
- Encode to generate 2f parity blocks
- It ensures that each block is replicated c times within its coding
group
- Read:
- Step 1: Check if a local systematic read is feasible (the block is
replicated in its local storage)
- Step 2: Check if other honest (available) nodes store the
(systematic) block
- Step 3: Perform decoding: broadcast a read request to n - f nodes and
retrieve any n - 2f blocks (tolerating up to f Byzantine nodes) to
decode the target block
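The encode/decode mechanics above can be illustrated with a minimal Reed-Solomon-style codec over a small prime field. This is my own sketch, not BFT-Store's implementation: real deployments typically use systematic codes over GF(2^8), while this evaluation-based toy only demonstrates the key property that any k of the k + m coded symbols suffice to decode:

```python
# Toy (k, m) Reed-Solomon-style codec over GF(257); illustrative only.
P = 257  # prime modulus; each symbol is an integer in [0, P)

def rs_encode(data: list[int], m: int) -> list[int]:
    """Treat the k data symbols as polynomial coefficients and emit
    k + m evaluations; any k of them determine the polynomial."""
    k = len(data)
    return [sum(d * pow(x, i, P) for i, d in enumerate(data)) % P
            for x in range(k + m)]

def rs_decode(points: list[tuple[int, int]], k: int) -> list[int]:
    """Recover the k coefficients from any k (x, y) evaluations by
    Lagrange interpolation over GF(P)."""
    xs = [p[0] for p in points[:k]]
    ys = [p[1] for p in points[:k]]
    coeffs = [0] * k
    for j in range(k):
        basis = [1]  # coefficients of prod_{i != j} (x - x_i), low degree first
        denom = 1
        for i in range(k):
            if i == j:
                continue
            nxt = [0] * (len(basis) + 1)
            for t, b in enumerate(basis):
                nxt[t] = (nxt[t] - xs[i] * b) % P
                nxt[t + 1] = (nxt[t + 1] + b) % P
            basis = nxt
            denom = (denom * (xs[j] - xs[i])) % P
        inv = pow(denom, P - 2, P)  # modular inverse (P is prime)
        for t in range(k):
            coeffs[t] = (coeffs[t] + ys[j] * basis[t] * inv) % P
    return coeffs

# Any k = 3 of the 5 coded symbols recover the original data.
data = [10, 20, 30]
coded = rs_encode(data, m=2)
```

This mirrors Step 3 of the read path: collect any n - 2f surviving symbols and interpolate, no matter which f nodes are Byzantine or silent.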
- Scaling
- Note: the workflow described in the paper is unclear and has
technical errors; below I describe the flow based on my own understanding
- Overall: BFT-Store performs a re-encoding based approach for scaling
- Assume additional nodes are added to the storage system, so that there
are n’ nodes in total and the system tolerates f’ Byzantine failures.
- Step 1: Leader node (selected in a round-robin fashion) sends a
decode message to all other backup nodes storing EC data
- Step 2: Backup nodes send chunk-set messages to the Leader node for
decoding. The chunk-set messages include ALL the locally stored
chunks.
- Step 3:
- 3-1: For each coding group, the Leader node collects n - 2f
blocks and performs decoding to recover the original n - 2f data
blocks.
- 3-2: It sends each new group of n’ - 2f’ data blocks to all n’ - 1
backup nodes to perform (n’ - 2f’, 2f’) encoding
- Step 4:
- 4-1: Leader sends re-encode message to all Backup nodes (which
chunks to collect)
- 4-2: Each Backup node performs encoding based on the new set of
blocks and deletes its originally stored data from the original
coding scheme.
- Step 5: Each Backup node sends re-encoding-finish message to Leader
- It ensures that the blocks remain available during scaling operations
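My reading of Steps 1-5 can be sketched as the following leader-driven flow. This is my own reconstruction (the paper's description is ambiguous); the codec is a toy systematic stand-in for RS, and all names are mine:

```python
from dataclasses import dataclass, field

def toy_encode(data: list[int], m: int) -> list[int]:
    # Stand-in for RS(k, m): systematic data chunks plus m XOR parities.
    parity = 0
    for d in data:
        parity ^= d
    return list(data) + [parity] * m

def toy_decode(systematic_chunks: list[int]) -> list[int]:
    # Stand-in: assumes the first k collected chunks are the systematic ones.
    return list(systematic_chunks)

@dataclass
class Node:
    node_id: int
    chunks: dict = field(default_factory=dict)  # group_id -> local chunk list

def leader_scale(backups: list[Node], old_k: int, new_k: int, new_m: int):
    # Steps 1-2: the leader requests and gathers every node's local chunks.
    collected: dict[int, list[int]] = {}
    for node in backups:
        for gid, chunks in node.chunks.items():
            collected.setdefault(gid, []).extend(chunks)
    # Step 3-1: decode each old coding group back into its data blocks.
    data_blocks: list[int] = []
    for gid in sorted(collected):
        data_blocks.extend(toy_decode(collected[gid][:old_k]))
    # Steps 3-2 to 4-1: regroup into new_k-sized groups and re-encode under
    # the new (n' - 2f', 2f') scheme; backups would then install the new
    # chunks and delete the old ones (Steps 4-2 and 5).
    return [toy_encode(data_blocks[i:i + new_k], new_m)
            for i in range(0, len(data_blocks), new_k)]

# Example: three old groups of k = 2 data blocks, re-encoded as
# groups of k' = 3 data blocks with m' = 2 parities.
a = Node(0, {0: [1, 2], 1: [3, 4], 2: [5, 6]})   # holds data chunks
b = Node(1, {0: [3], 1: [7], 2: [3]})            # holds XOR parities
new_groups = leader_scale([a, b], old_k=2, new_k=3, new_m=2)
```

Even in this simplified form, the centralization is visible: every chunk of every group funnels through one function at the leader, which is exactly the bottleneck criticized in the Weakness section below.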
- Implementation
- It implements over Tendermint (PBFT-based protocol for consensus)
- Evaluation
- For each data point, it assumes a fixed set of nodes (the number of
nodes is not varied at run-time)
- It compares storage savings with full-replication based scheme
- It compares read latency with full-replication based scheme
Strength
- First work to integrate EC with the PBFT consensus protocol for
permissioned blockchains
- The encoding and decoding operations are clearly described, and the
scheme reduces the storage overhead from O(n) to O(c)
Weakness
- The design of scaling is unclear and has technical errors
- First, the decoding of ALL stripes is performed on one elected
leader node. This is a centralized approach and is almost certainly
the performance bottleneck.
- All backup nodes send their locally stored blocks to the leader
node for decoding. Since each block is replicated c times, the
leader node receives c copies of every stripe, so performance
must suffer
- BFT-Store does NOT evaluate the scaling performance (e.g., the
overhead incurred when the number of nodes is increased/decreased
at run-time)
- Second, the paper does not describe how the data blocks of the newly
generated (n’ - 2f’, 2f’) stripes are placed. It appears to assume that
all blocks of the previous coding scheme are removed and all new stripes
are generated from scratch, meaning that after scaling ALL data are
rewritten to BFT-Store from scratch, which must perform badly.