Although many hard drive failure prediction methods rely on Self-Monitoring, Analysis and Reporting Technology (SMART) features, these features are not collected in IT systems with demanding performance requirements, so that the systems can meet their specifications. We present a novel data-driven failure prediction method that instead uses disk-level performance metrics collected by Redundant Array of Independent Disks (RAID) controllers. The proposed method computes relational anomaly scores that exploit the logical relationships among hard disk drives (HDDs) defined by the RAID configuration, which improves failure prediction. It also uses error codes from the HDDs to filter out false positives. We evaluate the method on a real-world dataset collected for this study from 881 disks in RAID-6 arrays and 1660 disks in RAID-10 arrays in a data center. The results show consistent performance improvements from both the logical relationships and the error-code-based filtering. In addition, seven of the nine failures are predicted at least one day before they occur, which suggests that the proposed method leaves ample time to replace an HDD before it fails.
Machine learning, Anomaly detection, Hard disk drives, RAID
This work is licensed under a Creative Commons Attribution 3.0 Unported License.