Failure Prediction of Hard Disk Drives in Redundant Arrays Using Disk-Level Performance Metrics

Although many hard drive failure prediction methods utilize Self-Monitoring Analysis and Reporting Technology (SMART) features, these features are not always available in enterprise IT systems due to demanding performance requirements.


INTRODUCTION
A massive amount of data is generated in various systems. The total amount of new data generated in 2025 is forecast to reach 175 ZB, five times larger than that in 2018 (Gantz, Reinsel, & Rydning, 2019). Storage devices are the key components that accommodate this large volume of data, and Hard Disk Drives (HDDs) will remain the primary storage devices because of their cost advantage.
In order to prevent data loss, a disk array consisting of multiple HDDs is virtualized with a Redundant Array of Independent Disks (RAID) in enterprise IT systems. There are different levels of RAID to fulfill different needs of storage systems. For example, RAID-10 yields higher durability than RAID-6, but RAID-10 results in lower capacity than RAID-6. Although RAID typically increases durability, scenarios resulting in data loss remain: if the number of failed HDDs exceeds the limit for recovery, data loss occurs. Due to data security concerns, when a failure occurs in a disk array, the other still-functioning HDDs in the same disk array are replaced with new ones as a preventive measure.

Masanao Natsumeda et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
HDD failure prediction is a promising technology to reduce the running costs of IT systems and the environmental load of such preventive measures, since it allows us to replace HDDs only when necessary. Most HDD failure prediction methods (Aussel et al., 2017; Featherstun & Fulp, 2010; Ganguly et al., 2016; Hamerly, Elkan, et al., 2001; Li et al., 2014; Murray, Hughes, & Kreutz-Delgado, 2003; Yang et al., 2020; Zhang, Huang, Zhou, Xie, & Schelter, 2020) rely on Self-Monitoring Analysis and Reporting Technology (SMART) features. However, due to demanding performance requirements, SMART features are not always available in enterprise IT systems. On the other hand, two types of data representing an HDD's status, i.e., disk-level performance metrics collected by RAID controllers and error codes from HDDs, do not require additional workload on disks for collection. Therefore, HDD failure prediction utilizing them is applicable to a wide range of enterprise IT systems. An example of the error code sequence from an HDD is shown in Table 1. It shows that error codes were issued before the HDD failed, indicating that they capture disk degradation.
HDDs in the same disk array have logical relationships since they are virtualized as a single logical unit. These relationships appear as similar patterns in some of the disk-level performance metrics. For example, both RAID-1 and RAID-10 write to a pair of HDDs simultaneously, which leads to similar patterns in their metrics related to writing data. Time series of average write response time, one of the disk-level performance metrics, from two HDDs in the same disk array are shown in Fig. 1. The HDD of the upper chart failed at the right-most point of the chart, while the other remained normal. Both series increase similarly at the same time.

4th Asia Pacific Conference of the Prognostics and Health Management, Tokyo, Japan, September 11-14, 2023 (R06-01)

In order to predict HDD failures without SMART features, this study proposes a two-step unsupervised anomaly detection method utilizing the logical relationships of disk-level performance metrics collected by RAID controllers and error codes from HDDs. The proposed method first computes anomaly scores for each HDD. Then it adjusts the scores of HDDs in the same disk array relative to one another, suppressing false positives caused by sudden changes in disk-level performance metrics.
In addition, the proposed method only detects an anomaly in an HDD when error codes from the HDD have been issued in advance. This error-code-based filtering further reduces false positives since the error codes provide conclusive evidence of disk degradation. When the proposed method detects an anomaly in an HDD, it is regarded as an early indication of a failure, and the replacement of the HDD is recommended.

RELATED WORK
HDD failure prediction has been studied for decades, and most studies rely on SMART features. Hamerly et al. (2001) proposed two Bayesian approaches to the prediction, viewing the problem as anomaly detection. In contrast, this study utilizes error codes and logical relationships instead of physical relationships, i.e., location, for better prediction.

METHODOLOGY
The proposed method consists of RAID-configuration-based anomaly detection and error-code-based filtering. Its entire pipeline is shown in Fig. 2. RAID-configuration-based anomaly detection computes relational anomaly scores indicating the abnormality of each HDD and detects an anomaly when a score exceeds its pre-determined threshold. When an anomaly in an HDD is detected, the error-code-based filtering checks whether any error code has been issued to the HDD before. If no error code has been issued, the detection is regarded as negative, and the proposed method continues monitoring the HDD.
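As a rough sketch, the two-step decision above can be expressed as follows; the helper name and its arguments are hypothetical, not the authors' implementation:

```python
from datetime import datetime

def predict_failure(score, threshold, error_code_times, now):
    """Two-step decision (sketch): RAID-configuration-based anomaly
    detection followed by error-code-based filtering."""
    # Step 1: detect an anomaly only when the relational anomaly score
    # exceeds its pre-determined threshold.
    if score <= threshold:
        return False
    # Step 2: keep the detection only if at least one error code has
    # already been issued to this HDD; otherwise regard it as negative.
    return any(t <= now for t in error_code_times)
```

An HDD with a high score but no prior error codes is thus still treated as normal and simply continues to be monitored.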
Since the main concern of this study is whether the logical relationships of RAID help improve HDD failure prediction, kNN is used as the individual scoring function due to its simplicity; however, any unsupervised anomaly detection method is applicable.
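For illustration, a minimal kNN-based individual scoring function might look like the following; scoring each sample by the mean distance to its K nearest neighbors in normal training data is one common variant, and the exact formulation here is an assumption:

```python
import numpy as np

def knn_individual_scores(train, queries, k=10):
    """Individual anomaly score of each query vector: the mean distance
    to its k nearest neighbors among vectors from normal HDDs."""
    # Pairwise Euclidean distances: queries (M, d) vs. train (N, d).
    d = np.linalg.norm(queries[:, None, :] - train[None, :, :], axis=-1)
    nearest = np.sort(d, axis=1)[:, :k]   # k smallest distances per query
    return nearest.mean(axis=1)           # higher score = more anomalous
```

A query far from all normal training vectors receives a large score, matching the intuition that degraded HDDs deviate from normal behavior.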
The relational scoring function computes the distance of each individual score from the mean of the scores at the same time stamp as the relational anomaly score. Given a disk array, let N be the number of HDDs in the disk array and a_In be the individual score of the n-th HDD. The relational anomaly score a_Rn is defined as:

a_Rn = a_In − (1/N) Σ_{i=1}^{N} a_Ii . (2)
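Assuming Eq. (2) is the signed deviation of each individual score from the mean over the disk array (a reconstruction from the surrounding text), the relational scores could be computed as:

```python
import numpy as np

def relational_scores(individual_scores):
    """Relational anomaly scores for one disk array at one time stamp:
    each HDD's individual score minus the mean over the N HDDs
    (assumed reading of Eq. (2))."""
    a = np.asarray(individual_scores, dtype=float)
    return a - a.mean()
```

A degradation seen by a single HDD stands out, while a workload change affecting every HDD in the array shifts all scores equally and is cancelled out, which is how false positives from sudden metric changes are suppressed.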

EXPERIMENTS
We evaluate the proposed method on a private dataset and compare it to three variants to examine the benefits of adjusting individual scores based on the RAID configuration and of error-code-based filtering. The first variant does not employ error-code-based filtering. The second variant relies only on individual scores computed by kNN. The third variant relies only on error codes. The first two variants predict HDD failure based on anomaly scores, while the third variant predicts HDD failure when an error code is issued. The third variant does not consider the category and occurrence frequency of the codes for prediction, since the proposed method does not consider them for filtering.

Dataset
The dataset consists of time series of disk-level performance metrics and error code sequences of HDDs used in a data center. All the HDDs are Small Computer System Interface (SCSI) devices. The disk-level performance metrics used in this experiment are shown in Table 2; the total number of metrics is 10. Their readings were collected every three minutes. The training data consist of time series of the metrics from 19 normal HDDs, sampled from different disk arrays of RAID-6. The data collection period for each HDD ranges from 13 to 15 days. The testing data consist of time series of the metrics and error code sequences from 43 disk arrays of RAID-6 and 106 disk arrays of RAID-10. Since HDD failures rarely occur, the testing data were collected over a longer period than the training data: 4.5 months. The total numbers of HDDs used in the disk arrays of RAID-6 and RAID-10 are 862 and 1660, respectively. The numbers of failed HDDs in the disk arrays of RAID-6 and RAID-10 are two and seven, respectively.

Experimental setting
The HDD failure prediction is evaluated on a segment basis rather than a time-stamp basis. The time series of the metrics are divided into segments at the times of failure, as shown in Fig. 4. Each segment is regarded as a sample for evaluation.
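The segmentation step can be sketched as follows; the boundary convention (a failure time closes its segment) is an assumption about Fig. 4:

```python
def split_into_segments(timestamps, failure_times):
    """Divide a monitoring period into evaluation segments, cutting
    at each failure time (segment-basis evaluation sketch)."""
    segments, start = [], 0
    for ft in sorted(failure_times):
        # The first index strictly after the failure time closes the segment.
        end = next((i for i, t in enumerate(timestamps) if t > ft),
                   len(timestamps))
        segments.append(timestamps[start:end])
        start = end
    if start < len(timestamps):          # trailing failure-free segment
        segments.append(timestamps[start:])
    return segments
```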
The number of nearest neighbors K is set to 10. Area Under the Curve (AUC), precision, recall, and F1-score are used as evaluation metrics. Let TP be the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives. The true positive rate TPR is defined by:

TPR = TP / (TP + FN) . (3)

It is the same as recall r. The false positive rate FPR is defined by:

FPR = FP / (FP + TN) . (4)

Precision p is defined by:

p = TP / (TP + FP) . (5)

F1-score f1 is defined as:

f1 = 2pr / (p + r) . (6)

AUC is the area underneath the entire Receiver Operating Characteristic (ROC) curve. Suppose TPR is a function of FPR; AUC is computed as:

AUC = ∫_0^1 TPR(FPR) dFPR . (7)

The same threshold value is used for each of the HDDs. Due to the small number of failures, it is difficult to determine an appropriate threshold. Thus, we report the theoretical best values of the evaluation metrics, adjusting the threshold for each evaluation metric on the testing data. The remaining time from the detection time until the failure is also evaluated for each failed HDD. For this evaluation, the threshold is set to give the maximum precision subject to 100% recall.

Results
A performance comparison of different methods of computing anomaly scores is shown in Table 3. The proposed method is denoted by 'relational + log'. Its variant without error-code-based filtering is denoted by 'relational'. The method relying only on individual scores computed by kNN is denoted by 'individual'. The proposed method shows the best results for all the evaluation metrics, and the performance improvement by its unique components is consistent across the different evaluation metrics. The proposed method performs better than its variant without error-code-based filtering, which in turn performs better than the method relying only on individual scores.
To better illustrate the consistent performance improvement, ROC curves of the different methods on RAID-10 and RAID-6 are shown in Fig. 5 and Fig. 6, respectively.

The performance comparison of the proposed method and its variant relying only on error codes is shown in Table 4. This table also indicates the effect of disk-level performance metrics on failure prediction. Since error codes appear before failures in all the cases, precision at 100% recall is compared. The proposed method shows the best results for both RAID levels, and its precision is more than twice that of the variant, indicating the positive effect of the disk-level performance metrics.

The distribution of remaining time from the detection time until the failure is shown in Table 5. Seven out of nine failures are predicted at least one day before the failure. This result suggests that the proposed method provides plenty of time for HDD replacement before a failure occurs.
Table 5. Distribution of remaining time from the detection time until the failure.

CONCLUSION
This study has proposed a two-step unsupervised anomaly detection method utilizing the logical relationships of disk-level performance metrics collected by RAID controllers and error codes from disks instead of SMART features. Evaluation of the proposed method on a private dataset has verified the benefits of incorporating the RAID configuration and error codes from HDDs into HDD failure prediction. The proposed method leaves plenty of room for improvement: for example, a more sophisticated method could be applied to compute individual scores instead of kNN, and the error-code-based filtering could incorporate the category and occurrence frequency of the codes into the prediction. Although further evaluation of the proposed method with a larger amount of data is necessary due to the small volume of data used for evaluation, we hope that our observations facilitate a new research direction for HDD failure prediction that does not rely on SMART features.

Figure 2 .
Figure 2. The entire pipeline of the proposed method. Detection results by RAID-configuration-based anomaly detection are updated by error-code-based filtering.

Figure 4 .
Figure 4. Sample generation for the segment-basis evaluation. The horizontal axis shows the entire period of testing data collection. Each time series segment, divided at a time of failure, forms a sample. This example generates four samples.

Table 1 .
An example of the error code sequence from an HDD.
Murray, Hughes, and Kreutz-Delgado (2003) compared the performance of Support Vector Machines (SVMs), unsupervised clustering, and non-parametric statistical tests. Li et al. (2014) employed Classification and Regression Trees for accurate, stable, and interpretable prediction of drive failure; their method was evaluated on a real-world dataset containing 25,792 drives. Zhang et al. (2020) addressed unsatisfactory results caused by a small amount of training data or by disks that have not appeared in training, employing a Long Short-Term Memory (LSTM) based siamese network. Featherstun and Fulp (2010) used all the data collected by syslog, a standard Unix logging facility; the data contain SMART features as well. Ganguly et al. (2016) combined multiple data sources, i.e., SMART features and Windows performance counters. Aussel et al. (2017) evaluated machine learning models on a large-scale and heterogeneous dataset from over 47,000 HDDs with 81 models from 5 manufacturers. Lu et al. (2020) performed a large-scale disk failure analysis based on 380,000 HDDs distributed across 64 data center sites. They collected SMART features, disk-level performance metrics, server-level performance metrics, and disk spatial location data. The analysis was conducted with supervised methods for HDD failure prediction, and the results show that the supervised methods yield competitive performance without SMART features. Unlike the above studies, this study explores the feasibility of unsupervised HDD failure prediction without SMART features.

Table 2 .
The disk-level performance metrics used in this experiment.

Table 3 .
Performance comparison of different methods of computing anomaly scores. P, R, and F1 represent precision, recall, and F1-score, respectively. The bold letters indicate the best results.

Table 4 .
The effect of disk-level performance metrics. Precision at 100% recall is compared. The bold letters indicate the best results.