An Introduction to 2023 PHM Data Challenge: The Elephant in the Room and an Analysis of Competition Results

The trend in diagnostics and prognostics for PHM is shifting toward explainable data-driven models. However, entirely explainable models are typically challenging to develop for complex engineered systems, whether the models are grounded in physics or in data-driven techniques. Consequently, the development of machine learning models, including hybrid variants capable of both interpolation and extrapolation, holds significant promise for enhancing the practicality of system simulation, analysis, modeling, and control in industry. The primary objective of this data challenge is to encourage contributions that extend model generalization beyond the training domain. The second aim is to encourage methods that quantify model uncertainty and incorporate it into predictions. For most PHM tasks, clear guidance on the required action is ideal. To issue definitive guidance to end users, it is useful to quantify the uncertainty of the whole model. This data challenge addresses both estimation and uncertainty.


OBJECTIVE
This year's data challenge focuses on estimating degradation levels in a gearbox operated under a variety of conditions. Participants are scored on both accuracy and the confidence derived from the uncertainties of their estimates.

DATA CHALLENGE TASK
Although this is a challenging task for many data-driven approaches, a model should generalize to previously unseen operational conditions and fault levels to be practical for real-world applications. Participants are also required to express measures of confidence in model predictions. Such confidence measures might be used to determine whether these predictions can be trusted before taking any downstream actions.
The overall data challenge task is to develop a fault severity estimate using the data provided. The training dataset includes measurements under varied operating conditions from a healthy state as well as six known fault levels. The testing and validation datasets contain data from eleven health levels, which include a healthy state and ten degradation/fault levels. Data from some fault levels and operating conditions are excluded from the training dataset to mirror real-world conditions where data may only be available from a subset of the operating envelope. The training data are collected from a range of operating conditions under 15 different rotational speeds and six different torque levels, while the test and validation operating conditions span 18 different rotational speeds and six different torque levels.
The data challenge requires fault level estimation for three regimes of the operational envelope:
1. Samples from conditions seen in the training dataset.
2. Samples from conditions not seen in the training dataset, but within the range of operational conditions and fault levels seen in the training set, i.e., interpolation.
3. Samples from conditions not seen in the training dataset and outside the training range for fault levels, i.e., extrapolation.
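The three regimes can be illustrated along the fault level axis alone. The following is a minimal sketch, assuming only what the Data Description states (levels 5, 7, 9, and 10 are withheld from training); the function name and the string labels are hypothetical, and the operating condition axis (speed and torque) is not modeled here.

```python
# Fault levels available in training: levels 5, 7, 9, and 10 are withheld,
# leaving the healthy state (0) and six fault levels.
TRAIN_LEVELS = frozenset({0, 1, 2, 3, 4, 6, 8})


def fault_level_regime(level: int) -> str:
    """Classify a fault level as 'seen', 'interpolation', or 'extrapolation'.

    Hypothetical helper: it considers the fault level only, not the
    rotational speed or torque of the sample.
    """
    if not 0 <= level <= 10:
        raise ValueError("fault level must be between 0 and 10")
    if level in TRAIN_LEVELS:
        return "seen"
    # Unseen level that lies strictly inside the range covered by training.
    if min(TRAIN_LEVELS) < level < max(TRAIN_LEVELS):
        return "interpolation"
    # Unseen level beyond the largest training level.
    return "extrapolation"


regimes = {level: fault_level_regime(level) for level in range(11)}
```

Under this reading, levels 5 and 7 fall in the interpolation regime and levels 9 and 10 in the extrapolation regime.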
Both a fault level estimate and a corresponding confidence level are required from the model. Such confidence may be used to decide whether a prediction should lead to an action (reconfiguration, inspection, repair, etc.) or to no action if the confidence is below a pre-decided acceptable threshold. In real settings such thresholds would be determined based on operational risks and business models. This challenge, however, requires participants to focus on developing methods to assess confidence in their models and to implicitly learn thresholds such that overall accuracy is maximized. The accuracy calculation that incorporates confidence is explained in Section 4.
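One simple way to turn a probability vector into the required binary confidence is to threshold the largest probability. This is only an illustrative sketch: the function name and the threshold value of 0.6 are assumptions, not part of the challenge rules, and participants were expected to learn their own thresholds.

```python
def confidence_flag(probs, threshold=0.6):
    """Return (predicted_label, flag) for one sample.

    probs     : probabilities for fault levels 0..10, summing to at most 1
    threshold : hypothetical cut-off; flag is 1 (high confidence) when the
                largest probability reaches it, otherwise 0 (low confidence)
    """
    if any(p < 0 for p in probs) or sum(probs) > 1.0 + 1e-9:
        raise ValueError("probabilities must be non-negative and sum to <= 1")
    # Index of the most probable fault level.
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, int(probs[predicted] >= threshold)


label, flag = confidence_flag([0.1, 0.8, 0.1])
```

In practice the threshold would be tuned on held-out data so that the confidence-weighted score, rather than plain accuracy, is maximized.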

Experimental setup
A brief overview of the data collection process is provided here. Full details are provided in the referenced papers (Li, Qu, and Nichifor, et al., 2018-2022).
The gear pitting experiments were performed on a one-stage gearbox installed in an electronically closed transmission test rig. The gearbox test rig includes two 45 kW Siemens servo motors. One of the motors can act as the driving motor while the other can be configured as the load motor. Motor 1 is the driving motor in this experiment. The overall gearbox test rig, excluding the control system, is shown in Fig. 1. The testing gearbox is a one-stage gearbox with spur gears. The gearbox has a speed reduction ratio of 1.8:1. The input driving gear has 40 teeth, and the driven gear has 72 teeth. Detailed gear parameters are provided in Table 1.
A tri-axial accelerometer was attached to the gearbox case close to the bearing housing on the output end, as shown in Figure 3. The X, Y, and Z axes are horizontal, axial, and vertical, respectively. One or more gear teeth were manually degraded using a drill bit through the lube oil cover, without any disassembly and reassembly of the gearbox or test rig. Degradation severity increases in levels from 0 to 10.
To sample enough data points in terms of revolutions, longer time series data are collected for lower rotational speed conditions.
For 100-200 rpm, the sampling time is about 12 s; for 300-1000 rpm, the sampling time is about 6 s; for 1200 rpm and above, the sampling time is about 3 s.
All data are sampled at a rate of 20,480 Hz. The horizontal, axial, and vertical accelerometer channels were sampled separately, along with a tachometer signal. All signals were time synchronized.
The tachometer (a laser reflective tachometer) outputs one pulse per revolution. It is mounted on the output shaft, which rotates at 5/9 of the input shaft speed. For example, for an input shaft speed of 35 Hz, the output shaft speed is 19.44 Hz.
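The relationships among shaft speed, sampling rate, and recording length can be checked with a few lines of arithmetic. The sketch below uses the gear tooth counts and sampling rate given above; the function name is hypothetical, the 3 s duration in the example assumes the quoted speed ranges refer to the input shaft, and the gear mesh frequency uses the standard relation (tooth count times shaft rotation frequency), which the text does not state explicitly.

```python
SAMPLING_RATE_HZ = 20_480   # sampling rate of all channels
DRIVING_GEAR_TEETH = 40     # input (driving) gear
DRIVEN_GEAR_TEETH = 72      # output (driven) gear


def recording_summary(input_shaft_hz: float, duration_s: float) -> dict:
    """Summarize one recording from the input shaft speed and its duration."""
    # Output shaft turns at 40/72 = 5/9 of the input shaft speed.
    output_shaft_hz = input_shaft_hz * DRIVING_GEAR_TEETH / DRIVEN_GEAR_TEETH
    return {
        "samples_per_channel": round(SAMPLING_RATE_HZ * duration_s),
        "output_shaft_hz": output_shaft_hz,
        # One tachometer pulse per output shaft revolution.
        "tach_pulses": output_shaft_hz * duration_s,
        # Standard gear mesh frequency: teeth times shaft frequency.
        "gear_mesh_hz": input_shaft_hz * DRIVING_GEAR_TEETH,
    }


summary = recording_summary(input_shaft_hz=35.0, duration_s=3.0)
```

For the 35 Hz example, this gives an output shaft speed of about 19.44 Hz, which matches the figure quoted above, and 61,440 samples per channel for a 3 s recording.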

Data Description
Of all the data marked in black (78 operating conditions across 7 health levels), an average of 3.69 repetitions per operating condition and fault level is included in the training dataset. A total of 2016 data files are included in the training set. Pitting degradation levels 5, 7, 9, and 10 are omitted from the training dataset.
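The counts above are mutually consistent, as a quick check shows. This is a small sketch that assumes the 78 operating conditions apply to each of the 7 training health levels; the variable names are illustrative.

```python
n_operating_conditions = 78   # conditions available for training
n_health_levels = 7           # healthy state plus six fault levels
n_training_files = 2016       # total training files

# Number of (operating condition, health level) combinations in training.
n_combinations = n_operating_conditions * n_health_levels

# Average number of repeated recordings per combination.
avg_repetitions = n_training_files / n_combinations
```

The result is 546 combinations with about 3.69 repetitions each, in agreement with the stated average.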

EVALUATION METRICS
For the submission of data challenge results, a probability-based prediction is required for each predicted label. The probability can be distributed across multiple labels, with the probabilities summing to one or less. A binary confidence level must also be included with the label prediction for each sample, where 0 and 1 mean low and high confidence, respectively. The exact reward or penalty also depends on how far the predicted label is from the underlying true label, as specified in the following:
$$S_{total} = \sum_{i=1}^{n} S_i, \qquad S_i = \sum_{d} s_{i,d}\, p_{i,d},$$
where $S_{total}$ is the total score, $n$ is the total number of testing/validation samples, $s_{i,d}$ is the score for a prediction of sample $i$ at distance $d$ from the true label, as shown in Table 4, and $p_{i,d}$ is the reported probability of the label at distance $d$. For example, the ideal prediction has $p_{i,d} = 100\%$ at distance 0, which means $p_{i,d} = 1$ for $d = 0$ and $p_{i,d} = 0$ for $d \neq 0$. In this case $S_i = 1$; otherwise, $S_i < 1$. Accordingly, the highest total score is $n$, which depends on the testing and validation data size. In this data challenge, the highest possible testing score is 800, and the highest possible validation score is 812.
A prediction with a high confidence level is scored with a higher weight in the final summed score, while a prediction with a low confidence level is scored with a lower weight. Similarly, a wrong prediction with a high confidence level is graded with a higher penalty.
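The per-sample score can be sketched as a probability-weighted sum over label distances. The values in Table 4 are not reproduced here, so the score table below is a placeholder: only the entry for distance 0 (a score of 1 for an ideal prediction) follows from the text, and every other entry is hypothetical. The confidence-dependent weighting is likewise not modeled; in the challenge, the applicable scores would depend on the reported confidence flag.

```python
# Placeholder scores by distance d = |label - true label| for d = 0..10.
# Only s_0 = 1.0 is implied by the text; the rest are HYPOTHETICAL values
# standing in for Table 4.
PLACEHOLDER_SCORES = [1.0, 0.8, 0.4, 0.0, -0.4, -0.8,
                      -1.0, -1.0, -1.0, -1.0, -1.0]


def sample_score(probs, true_label, score_by_distance=PLACEHOLDER_SCORES):
    """Score one sample: sum over labels of s_d * p, with d the label distance."""
    if sum(probs) > 1.0 + 1e-9:
        raise ValueError("probabilities must sum to at most 1")
    return sum(
        score_by_distance[abs(label - true_label)] * p
        for label, p in enumerate(probs)
    )


def total_score(all_probs, true_labels, score_by_distance=PLACEHOLDER_SCORES):
    """Total score: sum of the per-sample scores over all n samples."""
    return sum(
        sample_score(p, t, score_by_distance)
        for p, t in zip(all_probs, true_labels)
    )


# Ideal prediction: all probability mass on the true label (level 3).
ideal = [0.0] * 11
ideal[3] = 1.0
ideal_score = sample_score(ideal, true_label=3)
```

With all probability on the true label the sample scores 1, so a perfect submission over n samples reaches the maximum total of n regardless of the placeholder values chosen for the other distances.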

SUMMARY
A total of 52 teams registered and 20 teams completed the data challenge. The final scores of all twenty are summarized in Table 5. The top ten teams were invited to submit a brief description of the technical approach taken. A panel of experts evaluated the summaries independently on the criteria of data preprocessing steps, algorithmic novelty, treatment of uncertainty, and creativity. A final score incorporating test set performance and method scores was used to identify the top five finalists. Winners will be chosen from the finalists based on conference presentations of their detailed approaches and the ensuing discussions.
For readers' reference, the following list indicates the approaches adopted by five teams out of the top ten finalists. These are the teams that registered to present at the conference and submitted their summaries to the conference proceedings. Further details of the methodologies can be found in the data challenge summary papers included in the PHM 2023 conference proceedings:
1. nivic: Gear Pitting Fault Diagnosis using Domain Generalizations and Specialization Techniques (Chu et al., 2023)
2. Thumper: Interpolate and Extrapolate Machine Learning Models using An Unsupervised Method (Liu, 2023)
3. KUL: Predicting pitting severity in gearboxes under unseen operating conditions and fault severities using convolutional neural networks with power spectral density inputs (Vaerenberg et al., 2023)
4. Amitory: Anomaly Detection and Fault Classification in Multivariate Time Series Using Multimodal Deep Models (Ryu et al., 2023)
5. zwang1916: Gearbox Degradation Prediction through Deep CNN and Bayesian Optimization (Shen et al., 2023)
At the time the data challenge closed, the highest testing score was 463.5, and the highest validation score was 472. Further analysis of the validation score of 472/812 reveals that it corresponds to the following performance in terms of the machine learning metric precision at k: precision at 1 = 66.38%, precision at 2 = 86.70%, and precision at 3 = 98.15%. That means the top-performing team predicted a label within an error distance of 2 for 98.15% of the samples, which is a very impressive result.
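The precision figures can be read as the fraction of samples whose predicted label lies within a given distance of the true label. The sketch below follows that reading, which is inferred from the statement that precision at 3 corresponds to an error distance of 2; the function name and the toy labels are illustrative and are not taken from the challenge data.

```python
def precision_at_k(true_labels, predicted_labels, k):
    """Fraction of samples with |predicted - true| <= k - 1.

    Assumed reading: 'precision at k' counts a prediction as correct when
    it is at most k - 1 fault levels away from the true label.
    """
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    hits = sum(
        abs(pred - true) <= k - 1
        for true, pred in zip(true_labels, predicted_labels)
    )
    return hits / len(true_labels)


# Toy example with label distances 0, 1, 2, and 4.
toy_true = [0, 5, 10, 3]
toy_pred = [0, 6, 8, 7]
toy_results = {k: precision_at_k(toy_true, toy_pred, k) for k in (1, 2, 3)}
```

For the toy labels the values are 0.25, 0.5, and 0.75 for k = 1, 2, and 3, since each increase in k admits one more of the near misses.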

Figure 1. Experiment test rig for gearbox dynamic meshing stiffness analysis.

Both healthy and gradually pitted gears were tested under various operating conditions, and the vibration signals were collected. Five sets of data were collected. The symbol '•' indicates that the data samples for this setting are provided for training, while '∘' indicates that the data are hidden from training but appear in testing and validation.

Table 1. List of gear parameters for the tested gearbox

Table 2. Operating conditions of the experiments (low speed)

Table 3. Operating conditions of the experiments (medium to high speed)

Table 4. Prediction score for each sample based on the distance from the true label

Table 5. PHM 2023 Conference Data Challenge Final Scores