Reading Notes: FAST’24 COLE
Title: COLE: A Column-based Learned Storage for Blockchain Systems
Conference: FAST’24
Summary
COLE is a storage engine for blockchain systems that leverages a learned index
and an LSM-tree to reduce the storage overhead introduced by the Merkle
Patricia Trie (MPT). It stores the values of blockchain states in a
column-based manner, where values with the same address (sorted by block
height) are stored sequentially to reduce write overhead. Compared with an
MPT-based, Ethereum-style blockchain with a KV-store storage backend, COLE
achieves significant storage savings while delivering better throughput.
Main Contributions
- It leverages a learned index and an LSM-tree to reduce the indexing storage
overhead incurred by MPT.
- It observes that directly applying learned indexes to a Merkle tree
introduces even higher storage overhead and degrades write performance.
Details
- Merkle Patricia Trie
- A combination of a Merkle tree (a hash-based tree for integrity
verification) and a Patricia trie (very similar to a radix tree, which
further reduces storage overhead)
- The problem is that MPT has high storage overhead due to the duplicated
index structure (Fig. 1) across multiple block states.
- The index structure cannot simply be removed, as the blockchain needs to
support provenance queries
- It is the data structure for indexing Ethereum transactions and states.
- Supported query types:
- Get(): get the latest state for the input address
- Put(): insert the state value for the input address into the current
block
- ProvQuery(): get history results for an input range of blocks
- VerifyProv(): verify through hashing that the results returned by
ProvQuery() are valid.
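The four query types above can be mocked with a toy in-memory model. This is only an illustration of the semantics described in the notes, not COLE's actual API (class and method names are invented):

```python
from collections import defaultdict

class ToyStateStore:
    """Toy model of Get/Put/ProvQuery semantics (VerifyProv omitted)."""
    def __init__(self):
        # address -> list of (block_height, value), appended in block order
        self.versions = defaultdict(list)

    def put(self, addr, block, value):
        # insert the state of `addr` for the current block
        self.versions[addr].append((block, value))

    def get(self, addr):
        # latest state for the address, or None if never written
        hist = self.versions[addr]
        return hist[-1][1] if hist else None

    def prov_query(self, addr, lo, hi):
        # all historical versions of `addr` in block range [lo, hi]
        return [(b, v) for b, v in self.versions[addr] if lo <= b <= hi]

store = ToyStateStore()
store.put("0xabc", 1, 10)
store.put("0xabc", 5, 42)
assert store.get("0xabc") == 42              # latest version wins
assert store.prov_query("0xabc", 1, 4) == [(1, 10)]
```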
- Idea:
- A learned index can reduce storage cost and query time in database
storage, compared with a conventional B+-tree that has O(log_f n) depth
and O(n/f) nodes for indexing (f is the fan-out).
- The LSM-tree is used to reduce the write overhead of the learned index.
Following a column-based DB design, COLE sequentially appends the latest
state values for the same state address, for faster provenance query
results.
- The first level (with the latest block states) is kept as the first
level of the LSM-tree; the later levels are compacted to disk (called
disk levels) through asynchronous compaction to reduce write latency.
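A quick back-of-envelope check of the B+-tree costs mentioned above (numbers are illustrative, not from the paper):

```python
import math

# A B+-tree with fan-out f over n keys has depth ~ log_f(n) and ~ n/f
# internal entries; a learned index replaces those nodes with a handful of
# linear-model parameters per segment.
n, f = 10**9, 256
depth = math.ceil(math.log(n, f))
assert depth == 4  # 256^4 = 2^32 > 10^9, so four levels suffice
```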
- Column-based storage
- Keys are compound keys (address, block_height); compound keys are
represented as 64-bit big numbers
- Values for the same address are stored contiguously in the same level
(sorted by block height); hence, MPT’s storage overhead from internal
nodes is reduced
- To retrieve a value, COLE searches the MBTree first and then goes to the
disk levels with learned indexes
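The compound-key idea can be sketched as fixed-width big-endian encoding, so that plain byte-order sorting groups entries by address and orders each address's versions by block height. Field widths here are illustrative, not COLE's actual layout:

```python
# Encode (address, block_height) so lexicographic byte order equals
# (address, block) order: address bytes first, block height second.
def compound_key(addr: int, block: int) -> bytes:
    return addr.to_bytes(8, "big") + block.to_bytes(8, "big")

keys = [compound_key(2, 1), compound_key(1, 9), compound_key(1, 3)]
keys.sort()
# same address grouped together, versions ordered by block height
assert keys == [compound_key(1, 3), compound_key(1, 9), compound_key(2, 1)]
```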
- LSM-tree structure
- First level: Merkle B+Tree
- Next levels: on Disk with learned index
- Write flow
- Insertion of a state key-value pair for the latest block
- The write flow is very similar to an LSM-tree write: the compound
key (address, block height) is first written to the in-memory L0
level (MBTree), then flushed to subsequent on-disk levels when the
previous levels are full (refer to Fig. 4)
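The flush-when-full flow above can be sketched as a minimal size-tiered merge. Capacities and growth factor are made up for illustration; real COLE merges sorted runs on disk rather than Python dicts:

```python
# Writes land in level 0; when a level exceeds its capacity, its contents
# are merged into the next (larger) level, newest entries taking precedence.
def lsm_put(levels, key, value, l0_cap=2, growth=2):
    levels[0][key] = value
    i = 0
    while len(levels[i]) > l0_cap * growth ** i:  # level i is full
        if i + 1 == len(levels):
            levels.append({})                     # open a new, larger level
        levels[i + 1].update(levels[i])           # newer level i overrides i+1
        levels[i] = {}
        i += 1

levels = [{}]
for k in range(7):
    lsm_put(levels, k, k * 10)
assert sum(len(lvl) for lvl in levels) == 7  # nothing lost across merges
```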
- Index construction
- Uses a learned model adapted from the PGM-index
- It is a linear model based on convex hulls, where the disk page
size is a model parameter (so that the models are generated in a
disk-friendly manner)
- Given a linear model, COLE can predict the position of a
compound key in a file
- The state’s compound key and position are treated as a point
on a 2-D plane
- The model updates the convex hull on the 2-D plane in a
streaming manner; the learned model builds a minimal
parallelogram over the convex hull for prediction
- For each level (memory and disk), an index file based on the learned
index is created
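The model can be illustrated with an error-bounded piecewise-linear fit in the spirit of the PGM-index. This is a simplified greedy offline fit over sorted keys, not COLE's streaming convex-hull construction; keys are assumed strictly increasing:

```python
# Fit greedy linear segments over (key, position) pairs so that every
# prediction is within `eps` slots of the true position.
def fit_segments(keys, eps=1.0):
    segs, s, n = [], 0, len(keys)
    while s < n:
        e = s
        while e + 1 < n:
            # candidate line through (keys[s], s) and (keys[e+1], e+1)
            slope = (e + 1 - s) / (keys[e + 1] - keys[s])
            if all(abs(s + slope * (keys[j] - keys[s]) - j) <= eps
                   for j in range(s, e + 2)):
                e += 1
            else:
                break
        slope = (e - s) / (keys[e] - keys[s]) if e > s else 0.0
        segs.append((keys[s], s, slope))  # (first key, first pos, slope)
        s = e + 1
    return segs

def predict(segs, key):
    # pick the last segment starting at or before `key`, apply its line
    k0, p0, slope = max((t for t in segs if t[0] <= key), key=lambda t: t[0])
    return p0 + slope * (key - k0)

keys = [2, 4, 6, 8, 100, 101, 102]
segs = fit_segments(keys, eps=1.0)
# every predicted position is within eps of the true position
assert all(abs(predict(segs, k) - i) <= 1.0 for i, k in enumerate(keys))
```

In the real structure the predicted position only narrows the search to a small disk range; a final local search within +/- eps pages finds the exact entry.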
- Merkle file construction
- The Merkle file covers only the value files, not the index files
- The Merkle files are generated in parallel for all levels and stored as
multiple Merkle files to increase write throughput
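The per-file Merkle construction can be sketched as hashing a value file's pages into a binary hash tree. Page layout and the odd-count rule below are illustrative choices, not COLE's exact scheme:

```python
import hashlib

# Build a Merkle root over the pages of a value file: hash each page into a
# leaf, then repeatedly hash adjacent pairs until one root remains.
def merkle_root(pages):
    level = [hashlib.sha256(p).digest() for p in pages]
    while len(level) > 1:
        if len(level) % 2:                    # duplicate last node on odd count
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

pages = [b"page-0", b"page-1", b"page-2", b"page-3"]
root = merkle_root(pages)
assert len(root) == 32
# any change to a page changes the root, which is what verification relies on
assert root != merkle_root([b"page-X", b"page-1", b"page-2", b"page-3"])
```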
- A discussion
- COLE does not support blockchain forking and is designed to work with
non-forking blockchains
- Ethereum by default adopts a fork-based design, where multiple
miners may mine over the same set of transactions and broadcast
their blocks over the network
- To address the write-stall problem for write operations (due to the
recursive merge operation in Algorithm 1), the paper proposes an
asynchronous merge algorithm
- Reason: some straggler nodes can be out of sync
- High-level idea: two stages of checkpoints, (1) start and (2) commit
- Ensure that (1) is synchronized across nodes
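The two-checkpoint idea can be sketched as follows: a "start" checkpoint snapshots the merge inputs (agreed across nodes), the merge then runs in the background, and a "commit" checkpoint installs the result atomically. Class and field names are invented for illustration:

```python
# Toy two-checkpoint asynchronous merge of one level into the next.
class AsyncMerge:
    def __init__(self, upper, lower):
        self.upper, self.lower = dict(upper), dict(lower)
        self.pending = None

    def start(self):
        # checkpoint 1: snapshot inputs; reads keep using upper/lower
        self.pending = {**self.lower, **self.upper}  # upper (newer) wins

    def commit(self):
        # checkpoint 2: atomically switch reads to the merged level
        self.lower, self.upper = self.pending, {}
        self.pending = None

m = AsyncMerge({"a": 2}, {"a": 1, "b": 1})
m.start()
assert m.upper == {"a": 2}        # still visible until commit
m.commit()
assert m.lower == {"a": 2, "b": 1} and m.upper == {}
```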
- Read flow
- Optimization technique: apply a Bloom filter to both the MBTree (memory)
and the learned indexes (disk) before searching for keys
- Get query: very similar to an LSM-tree get (in memory with the MBTree ->
on disk with the learned indexes)
- Provenance query: similar to MPT, but without the learned index (as the
relevant files are not constructed with index files)
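The level-by-level Get with the filter optimization can be sketched as below. A Python set stands in for a real Bloom filter (sets have no false positives; real Bloom filters do, so the inner lookup is still required):

```python
# Consult each level's membership filter first; only search levels that
# may contain the key, newest level first.
def get(levels, key):
    for filt, data in levels:
        if key in filt:            # filter says "maybe present"
            if key in data:
                return data[key]
    return None

levels = [
    ({"a"}, {"a": 3}),              # in-memory MBTree level (newest)
    ({"a", "b"}, {"a": 1, "b": 2})  # older on-disk level
]
assert get(levels, "a") == 3        # newest version wins
assert get(levels, "b") == 2
assert get(levels, "z") is None     # filters skip both levels
```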
- Evaluation
- Single machine; SmallBank for account transfers; YCSB for KV-store
get/put operations
Strength
- A natural combination of Learned index + LSM-Tree + Blockchain states
Weakness