Reading Notes: FAST’19 HeART
Title: Cluster storage systems gotta have HeART: improving storage efficiency
by exploiting disk-reliability heterogeneity
Conference: FAST ’19
Summary
This paper introduces HeART, an online tuning tool that transitions redundancy schemes based on observed disk reliability. At the same reliability target (MTTDL), HeART significantly reduces the number of disks needed (storage savings) compared with one-scheme-for-all erasure coding and replication.
Main Contributions
Details
- Goal: reduce storage overhead by replacing one-size-fits-all redundancy schemes with adaptive redundancy rates
- Why AFR (annualized failure rate) comes into play
- Analysis of AFR
- Bathtub curves of AFR for six disk models are shown (a sketch of how AFR is computed from drive-days appears after this list)
- How to use AFRs?
- HeART groups disks with similar AFRs and applies a tailor-made redundancy scheme to each group
- How to measure disk reliability? MTTDL (mean time to data loss)
- One interesting observation: multiple redundancy schemes can achieve the same MTTDL, but wider schemes (high k) have lower storage overhead at the cost of higher reconstruction costs, so there is a tradeoff in selecting redundancy schemes (see the MTTDL sketch after this list)
- Challenges that HeART tries to address:
- Online redundancy transition for different groups
- the time and I/O required for transitions (not considered in the paper)
- Accurately detecting AFRs for different groups
- How to filter out abnormal AFRs?
- Designs
- Online change-point detection (detecting points where the bathtub curve changes significantly)
- Uses prior information about HDDs at the start
- Standard sliding-window change-point detector (a generic sketch appears after this list)
- Change-point detection is triggered periodically
- Anomaly detection on the reliability data stream, to filter out abnormal data: uses the RRCF algorithm (I didn’t look into the details; a usage sketch appears after this list)
- Evaluation
- Over the Backblaze dataset, covering 6 disk groups
- Mostly focuses on the functionality
- Identifies the useful-life periods -> MTTDL
- How does HeART suggest the coding schemes with the best storage savings?
- Starts from (9,6) -> depending on the disk AFRs of each group -> transitions to wider codes (like (24,20) and (21,17)); the scheme-selection sketch after this list illustrates this
- Sensitivity analysis
- How does the system judge the flat period of a disk group?
- AFR buffer: the padding (as the paper calls it) added after the infancy phase of a disk’s life (on the bathtub curve)
- A larger buffer is more conservative in declaring the low (flat) part of the AFR curve
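As referenced above, here is a minimal sketch of how AFR is computed from Backblaze-style drive-day telemetry. This is my own illustration, not code from the paper; the `failures` and `drive_days` inputs are assumed to be aggregates per disk group.

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """AFR (%) = failures per accumulated drive-year of operation.

    `drive_days` is the total number of disk-days observed for the
    disk group (e.g., summed over Backblaze daily snapshot rows).
    """
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years

# Example: 60 failures over 1,200,000 drive-days -> ~1.83% AFR
print(annualized_failure_rate(60, 1_200_000))
```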
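To make the MTTDL tradeoff and scheme selection concrete, below is a minimal sketch using the textbook birth-death (Markov-chain) approximation for an (n,k) MDS stripe under independent exponential failures. The paper’s exact reliability model may differ, and the AFR and repair-time values here are illustrative assumptions.

```python
from math import prod

def mttdl_hours(n: int, k: int, afr_pct: float, repair_hours: float) -> float:
    """Approximate MTTDL of one (n,k) MDS stripe: data is lost once
    n-k+1 failures overlap; lam/mu are per-hour failure/repair rates."""
    lam = (afr_pct / 100.0) / (365 * 24)
    mu = 1.0 / repair_hours
    r = n - k  # failures tolerated
    return mu ** r / (prod(n - i for i in range(r + 1)) * lam ** (r + 1))

def cheapest_scheme(schemes, afr_pct, target_mttdl_h, repair_hours=24.0):
    """Pick the lowest-overhead (n,k) that still meets the MTTDL target."""
    ok = [s for s in schemes if mttdl_hours(*s, afr_pct, repair_hours) >= target_mttdl_h]
    return min(ok, key=lambda s: s[0] / s[1], default=None)

# Compare the default (9,6) with the wider codes mentioned in the notes:
for n, k in [(9, 6), (24, 20), (21, 17)]:
    print(f"({n},{k}): overhead={n/k:.2f}x, "
          f"MTTDL={mttdl_hours(n, k, 2.0, 24.0):.3g} h")
```

At a low observed AFR, wider codes like (24,20) can meet the same MTTDL target with less storage overhead, which is the intuition behind the transitions HeART suggests.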
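The notes mention a standard sliding-window change-point detector; the sketch below is a generic two-window mean-difference detector, my assumption of the general technique rather than HeART’s actual implementation (the window and threshold values are made-up parameters).

```python
def change_points(afr_stream, window=30, threshold=0.5):
    """Flag indices where the mean AFR of the window after index i
    differs from the window before it by more than `threshold` (AFR %)."""
    points = []
    for i in range(window, len(afr_stream) - window):
        before = sum(afr_stream[i - window:i]) / window
        after = sum(afr_stream[i:i + window]) / window
        if abs(after - before) > threshold:
            points.append(i)
    return points
```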
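For the RRCF-based anomaly filtering, the open-source `rrcf` Python package implements the algorithm; this is a sketch of typical streaming usage (my assumption, not from the paper), where samples with a high collusive-displacement (CoDisp) score are treated as outliers and filtered out.

```python
import rrcf  # pip install rrcf

def codisp_scores(afr_stream, num_trees=40, tree_size=256):
    """Per-sample anomaly score via a Robust Random Cut Forest."""
    forest = [rrcf.RCTree() for _ in range(num_trees)]
    scores = []
    for i, x in enumerate(afr_stream):
        total = 0.0
        for tree in forest:
            if len(tree.leaves) > tree_size:
                tree.forget_point(min(tree.leaves))  # drop oldest point (FIFO)
            tree.insert_point((x,), index=i)
            total += tree.codisp(i)
        scores.append(total / num_trees)  # high score => likely outlier
    return scores
```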
Strengths
- Probably the first systems work on disk AFR analysis (I haven’t seen others so far)
- It gives strong evidence for the motivation of redundancy transition: disk failure rates vary widely, and the bathtub curve suggests that redundancy rates can be adapted with convertible codes
- HeART can suggest how to adapt the redundancy scheme, starting from the initial code (9,6)
- It’s based on analysis of disks’ real status (measured AFR) instead of man-made assumptions
- The designs are mostly analytical rather than systems-level
- Based on reliability analysis (MTTDL)
Weakness
- What’s the role of HeART in real systems? The paper claims the approach is online, but it doesn’t show the actual performance of HeART running in a real system as a reliability adaptor.