Keyun Cheng

Toward Adaptive Disk Failure Prediction via Stream Mining


Summary

This paper presents StreamDFP, a general stream mining framework for disk failure prediction with concept-drift adaptation.

Problems to solve:

Online labeling
Concept-drift aware training. The system should detect and adapt to concept drift in training.
General prediction. Regression: the likelihood that a new disk will fail. Classification: whether a new disk will fail.
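To make the online labeling item above more concrete, here is a minimal sketch in Python. It is not taken from the paper: it assumes a per-disk buffer that holds the most recent `horizon_days` daily samples, releases a sample as negative once the disk has survived that long after it, and releases everything still buffered as positive when the disk fails. The names `OnlineLabeler`, `observe`, and `horizon_days` are hypothetical.

```python
from collections import defaultdict, deque


class OnlineLabeler:
    """Hypothetical online labeler: labels samples as the stream advances."""

    def __init__(self, horizon_days=7):
        # Assumption: a sample counts as positive if the disk fails
        # while the sample is still among the last `horizon_days` ones.
        self.horizon_days = horizon_days
        self.buffers = defaultdict(deque)

    def observe(self, disk_id, features, failed):
        """Consume one daily sample; return newly labeled (features, label) pairs."""
        buf = self.buffers[disk_id]
        buf.append(features)
        if failed:
            # The disk failed: every sample still buffered becomes positive.
            labeled = [(f, 1) for f in buf]
            del self.buffers[disk_id]
            return labeled
        labeled = []
        # Samples that fall out of the buffer outlived the horizon: negative.
        while len(buf) > self.horizon_days:
            labeled.append((buf.popleft(), 0))
        return labeled
```

With this scheme, labels become available with a delay of at most `horizon_days`, so the learner never has to wait for the full dataset before training.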

Details

Treats disk failure prediction as an online stream processing/mining problem.
Datasets: SMART datasets, including the Backblaze dataset and the Alibaba Cloud dataset.
Complete design, from disk logs to prediction. Python + Java, around 2,000 LoC.

Concept drift: the relationship between the input and the output, i.e., the conditional distribution p(y_t | x_t), continuously changes over time. Solution: change detection.
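As an illustration of what change detection on a stream can look like, the following is a minimal sketch of the Page-Hinkley test, a classic detector for an upward shift in the mean of a signal such as a model's error rate. These notes do not say which detector StreamDFP uses, so this is a stand-in for the idea rather than the paper's method; the class name and the `delta` and `threshold` defaults are assumptions.

```python
class PageHinkley:
    """Page-Hinkley test for an increase in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm level for the test statistic
        self.reset()

    def reset(self):
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0      # cumulative deviation from the running mean
        self.cum_min = 0.0  # smallest cumulative deviation seen so far

    def update(self, x):
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


detector = PageHinkley()
stream = [0.0] * 200 + [1.0] * 200  # error signal whose mean shifts at index 200
first_alarm = next(i for i, x in enumerate(stream) if detector.update(x))
```

In this toy stream the alarm fires shortly after the shift. In a learner, such an alarm would typically trigger resetting or replacing the affected model and calling `reset()` on the detector.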

Learning algorithms: commonly used decision tree and ensemble learning algorithms.
The paper studies the concept drift in p(y_t | x_t) by measuring p(x_t) and p(y_t). Conclusion: concept drift likely exists.
Architecture:
- Python: feature extraction, buffering, online labeling, first-phase downsampling. The processed data is stored in the local file system.
- Java: second-phase downsampling, prediction model (incremental learning).
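The notes above do not describe how either downsampling phase works. Since failed disks are rare compared with healthy ones, one common approach is random undersampling of the negative (healthy) samples, sketched below. This is an assumption for illustration only: the function `downsample_negatives` and its `keep_ratio` and `seed` parameters are hypothetical and not taken from StreamDFP.

```python
import random


def downsample_negatives(samples, keep_ratio, seed=0):
    """Keep every positive sample and a random fraction of the negatives.

    `samples` is an iterable of (features, label) pairs, with label 1 for
    a failing disk and 0 for a healthy one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    kept = []
    for features, label in samples:
        if label == 1 or rng.random() < keep_ratio:
            kept.append((features, label))
    return kept
```

Because each sample is decided independently as it arrives, the same logic also works on an unbounded stream without buffering the whole dataset.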

Strength

Enabling concept-drift adaptation increases classification accuracy for different learning algorithms.
Online labeling improves the overall accuracy.
Supports both regression and classification.
Fast enough for practical stream processing.
Validated on the Alibaba Cloud dataset, which is large.

Weakness

What is the advantage of StreamDFP over the related work ORF [43]? How do the two compare in speed and accuracy? ORF focuses on the model aging issue in online learning, whereas this paper changes the perspective and views the workflow as a data stream. What is the practical difference between the two?
What about other datasets? Are they available? (Needs to be figured out.)