Condition Monitoring of Slow-speed Gear Wear using a Transmission Error-based Approach with Automated Feature Selection

Gear flank changes caused by wear do not only affect the dynamic behavior of gear systems, but they can also compromise the load-carrying capacity of gear teeth up to critical failure. To help avoid unintended consequences like downtime or safety risks, a condition monitoring system needs to be able to estimate the current wear during operation based on available sensor measurements. While many condition monitoring approaches in research rely on vibrational analysis with manual feature engineering, gearboxes running at slow speed do not reveal much excitation information for this purpose. We therefore introduce an approach for slow-speed gear wear monitoring that is based on the dynamic gear transmission error and that contains an automated feature selection process. For this purpose, we extract a large set of features from the preprocessed transmission error samples. Applying combined filter and embedded feature selection methods enables us to automatically identify and remove features with low relevance. The selection process consists of filtering features with no statistical dependence on the target wear value, removing redundant features with a correlation analysis and a recursive feature elimination process with cross-validation based on a random forest regressor. The remaining relevant set of features is the basis for model training and subsequent wear estimation. For this, the present research employed two independent ensemble models, random forest regression and gradient boosted regression trees. To train and test the proposed approach, we conducted slow-speed gear experiments with developing gear wear on a single-stage spur gear test rig setup. The results of both models show good gear wear estimation performance compared to the actual wear mass loss, even for small quantities. 
Hence, the proposed transmission error-based approach with automated feature selection is able to quantify the degree of slow-speed wear and offers a possible way for condition monitoring and fault diagnosis.


INTRODUCTION
Due to the immense advances in information and sensor technology within recent years, machine learning algorithms together with growing computational power have improved numerous applications and offer many new possibilities. Applied to the field of machine condition monitoring, such approaches seem promising. A reliable condition monitoring system with high accuracy is the key for fault diagnosis and predictive maintenance strategies. It not only increases economic efficiency, but also ensures high safety for humans and machines. Therefore, it is necessary to acquire appropriate data, process and evaluate the data, and derive the machine's condition within a certain time and precision. When it comes to gears, slow-speed wear is usually undesirable once it reaches a larger extent. Gradual material abrasion from the tooth flank can change the dynamic behavior of the entire system and can affect the gears' load-carrying capacity up to critical failure. Therefore, a working condition monitoring system with regard to slow-speed gear wear can prevent such failures and help to prepare adequate measures.
With the present work, we address condition monitoring possibilities with a focus on gear wear at slow rotational speeds. This research aims at correlating the dynamic transmission error with the corresponding degree of accumulated slow-speed wear of pinion and wheel. Therefore, the idea is to combine a feature-based condition monitoring approach built on the dynamic gear transmission error with an automated feature selection process. While many current condition monitoring systems and research efforts rely on vibration-based analysis, low vibration excitation at slow operating speeds tends not to provide enough information for accurate slow-speed wear diagnosis. However, the latest sensor technology makes angular measurements of high quality possible. Consequently, we instead use the transmission error, which is very accurate at slow speeds and has a rapid reaction time to gear changes, as a basis for analysis. On the one hand, the present approach with a feature elimination process as part of feature selection allows for interpretability, in contrast to various deep learning approaches. On the other hand, it still draws on an enormous number of features while requiring only a minimum amount of domain knowledge. The approach is validated on multiple gear tests with actually occurring slow-speed wear and detailed "run-to-failure" documentation.
With the present paper, we first provide a review of previous research. Secondly, we elucidate the methods used that contain the experiment setup, the data acquisition, data preprocessing, and the condition monitoring model building and deployment including feature extraction and selection. We then give an insight into the achieved results and discuss them in terms of slow-speed wear condition monitoring.

REVIEW OF PREVIOUS RESEARCH
Estimation and prognosis of machine health has seen extensive research in academia and industry. Approaches that include machine learning methods have especially gained traction in this past decade, as the application of these advanced methods is made viable with the release of open-source and commercially available software libraries. In specialist and academic literature, two different types of approaches utilizing machine learning for gear condition monitoring systems exist:
• Methodologies with feature engineering
• Methodologies with feature learning (deep learning)
Typically, feature engineering approaches extract features from gear vibration signals. Many different types of feature candidates for this specific purpose have been proposed. These feature candidates can be classified mainly as time domain features, frequency domain features or time-frequency domain features. For gearbox fault diagnosis, for example, statistical features and higher order moments can be calculated from the raw, the derivative and the integral vibration signal as time domain features (Samanta, 2004). Examples for features extracted from the frequency domain of vibration signals are statistical metrics from individual frequency bands (Cerrada et al., 2016). Finally, time-frequency domain features can be calculated, for instance, with the continuous wavelet transform, the discrete wavelet transform, and the wavelet packet transform (Yan, Gao, & Chen, 2014). Other popular sources of time-frequency domain features are the short-time Fourier transform, the Wigner-Ville distribution and the Hilbert-Huang transform (Soualhi et al., 2018). The features calculated by these signal processing techniques are usually hand-picked and require domain-specific knowledge. Generally, these engineered features attempt to capture changes in frequency and amplitude modulation of the vibration signal emitted by the meshing gears.
McFadden (1986) developed a mathematical representation of this behavior, and in the course of the last few decades other authors, e.g. Wu, Zuo, and Parey (2008) and Feng and Zuo (2012), refined this representation for different types of gearboxes. The extracted features may be used in combination with machine learning methods, for example, with artificial neural networks (Pacheco et al., 2016), support vector machines (Shen, Wang, Kong, & Tse, 2013), random forests and others (Han, Jiang, Zhao, Wang, & Yin, 2018) to diagnose faults in gears and rotating machinery. In some cases, the set of features and the hyperparameters for the machine learning algorithm are selected with the support of metaheuristics, e.g. genetic algorithms (Cerrada et al., 2016; Samanta, 2004).
Deep learning models have seen overwhelming success and enabled performance improvements for notoriously difficult problems in the fields of computer vision and natural language processing. These models are mostly used in an end-to-end manner, thus making manual feature engineering unnecessary. Inspired by this kind of success, many researchers have introduced a deep learning workflow to various applications of condition monitoring. For example, Janssens et al. (2016) used a convolutional neural network to extract features automatically from vibration signals caused by rolling bearings to identify multiple different faulty bearing conditions. In this study, the authors show that the convolutional neural network performs better than other machine learning methods that require feature engineering, e.g. support vector machines and random forests. Zhao, Yan, Wang, and Mao (2017) took this idea further by introducing bi-directional LSTM (long short-term memory) layers in addition to convolutional layers. LSTM layers are used specifically for temporal modeling in artificial neural network architectures. The authors use the proposed network to predict the wear condition of machine tools in milling machines. Lee, Kim, Kim, Hur, and Kim (2018) and Yang, Huang, Lu, and Zhong (2018) developed similar methods that incorporate LSTM layers. Approaches also exist that differ from the common combination of convolutional and LSTM layers, for example, by using sparse filtering for feature learning (Lei, Jia, Lin, Xing, & Ding, 2016).
Methodologies to specifically model, simulate and determine wear and wear progression of gears exist both within and outside the context of condition monitoring and fault diagnosis. Choy, Polyshchuk, Zakrajsek, Handschuh, and Townsend (1996) presented an analytical approach that models the influence of pitting, gear wear and tooth fracture on the vibration signal of gearboxes.
The authors imply that this model may also be applicable for gear fault diagnosis. Kuang and Lin (2001) and Ding and Kahraman (2007) use a similar approach. Hu, Smith, Randall, and Peng (2016) developed specialized condition indicators to specifically quantify and assess gear wear based on vibration measurements. According to the authors, these condition indicators may be used in combination with traditional condition indicators and particle analysis. Furthermore, Fromberger et al. (2016) presented different condition monitoring methods for gear fault diagnosis with special focus on utilizing position encoders and transmission error determination. In a study conducted by Chin, Smith, Borghesani, Randall, and Peng (2021), gear wear is determined by assessing changes of the DC component in the transmission error signal. This approach therefore makes use of typically omitted components of the transmission error signal.
The current state of the art offers various approaches to monitor the condition of rotating machines and gears in particular. While the majority of research is based on vibrational data, only a few studies consider the transmission error of gears, which seems especially suitable at slow speeds where the vibration excitation of transmissions is low. In addition, many datasets originate from analytical models for estimating gear wear, whereas actual lubricated slow-speed wear data is rarely used for analysis, although it is very relevant in practice. In terms of algorithms, feature engineering methods, particularly with automated feature selection, have proven to be suitable for various condition monitoring tasks, but their use for gear transmissions, especially in combination with slow-speed wear, lacks groundwork.
We therefore see the need to evaluate transmission error data from actual lubricated slow-speed wear tests with low amounts of wear in combination with state-of-the-art automated feature engineering and machine learning methods for slow-speed wear condition monitoring. While both feature engineering and feature learning approaches seem to be promising, we will focus on automated feature engineering, due to interpretability advantages and the properties of the dataset available.

METHODS
With the goal in mind to identify and quantify gear damage, particularly slow-speed wear, we developed an approach and validated it with gear tests. The following sections contain the experiment setup, data acquisition, data processing and condition monitoring methods.

Experiments and Slow-speed Wear Data
For validation, we conducted gear experiments and measured vibrations together with the angular positions of the gear shafts. This allows us to calculate the dynamic transmission error, which is the basis for further processing and for methods presented in this paper in terms of condition monitoring and health evaluation.

Test Rig
For the experiments of slow-speed wear investigation, we used a test rig based on a back-to-back principle according to DIN ISO 14635-1 (2006) (see Figure 1).
The test rig contains a single-stage test gearbox with two mating gears, a test pinion (1) and a test wheel (2). The electric motor (5) drives the gearbox via the input shaft (4). Moreover, the test gearbox is part of a mechanical power loop, including a further shaft (7) and a slave gearbox (8) with the same transmission ratio as the pair of test gears. The load-carrying clutch (6) enables twisting and subsequent fixing of the shaft (7). Thus, it is possible to apply a torque to the shafts, thereby loading the gears with a predefined moment. To make sure that damage occurs predominantly on the pair of test gears, the slave gearbox has a higher load-carrying capacity than the test gearbox, e.g. by designing the slave gears with a larger face width.
In addition to the FZG standard back-to-back test rig (see Figure 1), the test rig used here contains a reduction gearbox interconnected between the electric motor and the slave gearbox. This way, slow operating speeds of the test gears are possible. Figure 2 shows the pair of test spur gears used for the slow-speed wear testing. Further gear data are listed in Table 1.
Figure 2. Test gear set.

Experiments
In total, we conducted three gear experiments, named experiment A, experiment B, and experiment C. The slow-speed wear experiments lasted for approximately 120 h each. Occasional pauses with disassembly and assembly after a defined number of load cycles were necessary to quantify wear. The amount of wear is the accumulated mass loss, determined by the weighing of pinion and wheel. The relative gear position of pinion and wheel did not change due to the weighing process and the lubricant remained in the gearbox. The operating pinion shaft speed for testing was 13.4 rpm, the corresponding pinion shaft load 627 Nm.

Slow-speed Wear Results - Ground Truth
Occurring slow-speed wear not only resulted in characteristic marks (see Figure 3), but also led to a mass loss of the gears. The accumulated wear for the three experiments (A, B, and C) with regard to their runtime is displayed in Figure 4 and serves in the following as the ground truth for the condition monitoring approaches. The distinct differences in the severity of the wear behavior of experiments A, B, and C fundamentally result from the different oils or greases used for lubrication. We will not further investigate this lubrication influence in the present work, but refer to Siewerin et al. (2020) for more details.
With this paper, we instead focus on investigating the condition monitoring possibilities with regard to wear.

Condition Monitoring Workflow
The fundamental workflow we used to develop condition monitoring systems is depicted in Figure 5a). A crucial step in the condition monitoring workflow is model building; hence, it is further specified in Figure 5b). The methodology we developed uses a feature-based model building step, which requires feature engineering. In order to reduce the required amount of domain knowledge and human interaction typically associated with feature engineering, we used a selection process that automatically discovers promising feature candidates, as depicted in Figure 5c). This section aims to provide insights into each individual step and provides background information on the design choices that we made.

Data Acquisition
The first step in a condition monitoring workflow is data acquisition. Typical methods to acquire data from machines in operation are debris analysis, thermography, performance analysis and vibration analysis (Randall, 2011). The methodology of data acquisition described in the following captures torsional vibration and angular movement. Data acquisition culminates in a dataset consisting of time series that each contain a transmission error signal for a specific slow-speed wear condition of the meshing gears. For this task, we defined a time series containing the transmission error as the predictor variable, while the corresponding wear condition is used as the target variable. Wear conditions were quantified as the combined weight loss of both pinion and wheel (see Section 3.1.3), which is reasonable because both gears share the same material and heat treatment.
For research in terms of condition monitoring, we added additional measurement equipment to the test rig described in Section 3.1.1. The input and output shaft of the test gearbox each contain an angle measurement sensor (see Figure 6). Therewith, it is possible to measure the absolute angle position of the input shaft as well as the angle position of the output shaft synchronously. The angular sensors work contactless based on optical acquisition. Furthermore, they come with a resolution of 32 bit. The sample rate used in the present experiments was 8.67 kHz.
The angular measurement took place in an intermittent manner every 20 minutes for 60 seconds throughout the entire experiment runtime.
Figure 6. Test gearbox and measurement equipment.

Data Preprocessing
Subsequently, the acquired data is preprocessed. Section 3.1.3 contains the determination of the slow-speed wear target variable. We carried out five weight measurements during each experiment, which makes it possible to approximate the wear condition by linear interpolation. The continuous lines in Figure 4 show an approximation of how slow-speed gear wear evolves over time. From this point onwards, we used the relative combined weight loss instead of the absolute combined weight loss to improve comparability among experiments A, B and C. The unit that we used for relative combined weight loss is parts per million (ppm).
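The interpolation of the wear target between two weighings can be sketched as follows. This is only an illustration: the weighing times, mass losses and the initial combined gear mass are made-up placeholder values (not measured data), the inclusion of a zero-wear starting point is an assumption, and `interpolate_wear` and `to_ppm` are hypothetical helper names.

```python
from bisect import bisect_right


def interpolate_wear(t, times, losses):
    """Piecewise-linear interpolation of the accumulated mass loss at runtime t."""
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the weighed interval")
    # Index of the segment [times[i], times[i + 1]] that contains t.
    i = min(bisect_right(times, t) - 1, len(times) - 2)
    t0, t1 = times[i], times[i + 1]
    w0, w1 = losses[i], losses[i + 1]
    return w0 + (w1 - w0) * (t - t0) / (t1 - t0)


def to_ppm(mass_loss, initial_mass):
    """Relative combined weight loss in parts per million."""
    return 1e6 * mass_loss / initial_mass


# Placeholder values (assumptions): zero-wear start plus five weighings.
weigh_times_h = [0.0, 24.0, 48.0, 72.0, 96.0, 120.0]
mass_loss_mg = [0.0, 10.0, 18.0, 30.0, 45.0, 60.0]
initial_mass_mg = 5.0e6  # assumed combined mass of pinion and wheel

wear_at_36h_mg = interpolate_wear(36.0, weigh_times_h, mass_loss_mg)
wear_at_36h_ppm = to_ppm(wear_at_36h_mg, initial_mass_mg)
```

Each transmission error measurement can then be labeled with the interpolated value at its recording time, which yields a continuous target variable from only a handful of weighings.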
In a first preprocessing step, we used the synchronously measured angular data of the shafts to calculate the transmission error $\Delta\varphi(t)$ of the pair of test gears (see Eq. (1); Brecher, Gorgels, Hesse, & Hellmann, 2011; Niemann & Winter, 2003) with $z_{1,2}$ as the teeth numbers and $\Delta\varphi_{1,2}(t)$ as the rotation angles of pinion and wheel.
Because of the measurement location outside of the test gearbox, the transmission error includes the transfer path of the sensor to the corresponding gear. However, due to the high stiffness of the shafts compared to the gear mesh and the proximity of the sensors and gears' positions on the shafts, we do not see a considerable influence.
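As a minimal sketch of this step, the snippet below computes an angular transmission error from two synchronously sampled shaft angles. It assumes one common definition, namely the deviation of the wheel angle from its ideal value given by the pinion angle and the gear ratio, with both angles counted positive in their own direction of rotation. The exact form and sign convention of Eq. (1), the tooth numbers and the sample values are assumptions, and `transmission_error` is a hypothetical helper name.

```python
def transmission_error(phi_pinion, phi_wheel, z_pinion, z_wheel):
    """Angular transmission error referred to the wheel shaft (assumed form).

    TE(t) = phi_wheel(t) - (z_pinion / z_wheel) * phi_pinion(t)
    """
    ratio = z_pinion / z_wheel
    return [p2 - ratio * p1 for p1, p2 in zip(phi_pinion, phi_wheel)]


# Illustrative values only; the tooth numbers are assumed, not taken from Table 1.
z1, z2 = 16, 24
phi1 = [0.0, 0.3, 0.6, 0.9]                # pinion angle in rad
ideal_phi2 = [z1 / z2 * p for p in phi1]    # wheel angle of a perfect gear pair
phi2 = [ideal_phi2[0], ideal_phi2[1] + 1e-4, ideal_phi2[2] - 2e-4, ideal_phi2[3]]

te = transmission_error(phi1, phi2, z1, z2)
```

For a perfectly conjugate and rigid gear pair the result would be zero at every sample; deviations reflect flank geometry changes and mesh deflection.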
Preprocessing also includes outlier detection, removal and replacement (Aggarwal, 2017). The approach we employed to identify outliers is a modified version of the commonly used z-value test. These so-called robust z-values are calculated using Eq. (2) (Pedregosa et al., 2011, 2020):

$$ z_{\mathrm{robust}}(x_i) = \frac{x_i - \operatorname{median}(x)}{Q_3 - Q_1} \tag{2} $$

Compared to the standard z-value, Eq. (2) uses the median instead of the mean in the numerator and the interquartile range (IQR) between the third quartile $Q_3$ and the first quartile $Q_1$ instead of the standard deviation in the denominator. These modifications ensure robustness towards outliers with values that differ from the rest of the signal by several orders of magnitude. Hence, individual signal values within a single signal that correspond to a robust z-value higher than 10 are unlikely to be caused by nominal test rig operation. Therefore, these values are subject to either removal or replacement, depending on the temporal location within the signal. This outlier removal procedure is only viable because we assume the underlying data generation process to be approximately stationary.
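The robust z-value test can be sketched with the Python standard library as shown below. The example signal is made up, the quartile interpolation method is an assumption, and the use of the absolute value for thresholding is also an assumption (the text only states "higher than 10"). The subsequent removal or replacement step is not shown.

```python
import statistics


def robust_z_values(signal):
    """(x - median) / IQR for every sample of the signal."""
    q1, _, q3 = statistics.quantiles(signal, n=4, method="inclusive")
    median = statistics.median(signal)
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; robust z-value undefined")
    return [(x - median) / iqr for x in signal]


def flag_outliers(signal, threshold=10.0):
    """True for samples whose absolute robust z-value exceeds the threshold."""
    return [abs(z) > threshold for z in robust_z_values(signal)]


# Made-up transmission error samples with one gross outlier at index 7.
signal = [0.1, -0.2, 0.0, 0.2, -0.1, 0.15, -0.15, 50.0, 0.05, -0.05]
flags = flag_outliers(signal)
```

Because median and IQR barely move when a single extreme value is present, the outlier does not mask itself, which is the weakness of the mean and standard deviation in the ordinary z-value.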
An exemplary amplitude spectrum of the unfiltered transmission error signal with annotations for selected characteristic frequencies is shown in Figure 7.
Table 3. Filter design parameters to remove low frequency components from the transmission error signal.
The exact values for these characteristic frequencies at nominal test rig speed are listed in Table 2. Based on the assumption that low frequency components in the transmission error signal do not contain relevant information about the wear condition of the meshing gears, we removed these frequencies with a digital high-pass filter. Table 3 contains appropriate filter design parameters to accomplish filtering. Both passband and stopband edge frequencies are dynamically adapted to accommodate minor speed fluctuations of the test rig in order to preserve critical information in the signal. Figure 8 shows an exemplary excerpt of the high-pass filtered transmission error from the recorded data for one of the tests. Due to the occurring slow-speed wear, the transmission error changes from the beginning to the end of the test. This change is the basis for the fault diagnosis approach presented in this paper.
After filtering, we split each individual measurement signal into multiple sub-series. For meshing gears, several characteristic frequencies lend themselves to dividing the signal. Typical split frequencies coincide with those mentioned in Table 2 and their respective harmonics. The fundamental frequency of the transmission error signal matches the hunting tooth frequency $f_h$, which is calculated according to Eq. (3) (Mark, 2015; Scheffer & Girdhar, 2004), with $f_z$ as the gear mesh frequency:

$$ f_h = f_z \cdot \frac{\gcd(z_1, z_2)}{z_1 \cdot z_2} \tag{3} $$

For this, the greatest common divisor of the number of teeth on the pinion $z_1$ and the wheel $z_2$ needs to be determined. Subsequently, each measurement is split into single hunting tooth cycles in order to capture every possible tooth contact in a sub-series of a measurement.
Applied to the existing data, this step yields a total of 1,372 sub-series for experiment A, 1,384 sub-series for
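The splitting into hunting tooth cycles can be illustrated with a short calculation. The pinion speed (13.4 rpm), sample rate (8.67 kHz) and measurement duration (60 s) are taken from the text; the tooth numbers 16 and 24 are an assumption made only for this sketch, and the function names are hypothetical. Speed fluctuations are ignored, so the cycle length is a fixed number of samples.

```python
import math


def gear_mesh_frequency(pinion_speed_rpm, z_pinion):
    """Gear mesh frequency in Hz from pinion speed and pinion tooth number."""
    return z_pinion * pinion_speed_rpm / 60.0


def hunting_tooth_frequency(f_mesh, z_pinion, z_wheel):
    """Frequency at which the same pair of teeth meets again."""
    return f_mesh * math.gcd(z_pinion, z_wheel) / (z_pinion * z_wheel)


def split_into_cycles(signal, samples_per_cycle):
    """Cut a signal into complete cycles; an incomplete tail is discarded."""
    n_cycles = len(signal) // samples_per_cycle
    return [
        signal[i * samples_per_cycle:(i + 1) * samples_per_cycle]
        for i in range(n_cycles)
    ]


Z_PINION, Z_WHEEL = 16, 24      # assumed tooth numbers
PINION_SPEED_RPM = 13.4         # from the experiment description
SAMPLE_RATE_HZ = 8670.0         # 8.67 kHz
MEASUREMENT_S = 60.0            # one intermittent measurement

f_mesh = gear_mesh_frequency(PINION_SPEED_RPM, Z_PINION)
f_hunt = hunting_tooth_frequency(f_mesh, Z_PINION, Z_WHEEL)
samples_per_cycle = round(SAMPLE_RATE_HZ / f_hunt)

n_samples = int(SAMPLE_RATE_HZ * MEASUREMENT_S)
measurement = list(range(n_samples))  # placeholder for one filtered TE signal
cycles = split_into_cycles(measurement, samples_per_cycle)
```

Under these assumptions one hunting tooth cycle corresponds to three pinion revolutions, so each 60 s measurement contains only a few complete cycles.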

Model Building and Model Deployment
We then used the preprocessed data for model building, which is arguably the most important step in the condition monitoring workflow. For this research project, we defined a model as an input-output mapping between the preprocessed time series containing the transmission error of the meshing gears and the corresponding wear. Generally, there are two different types of models to accomplish this:
• Physics-based models
• Data-driven models
The key difference between the two is that physics-based models derive the target variable from a predefined set of rules, while data-driven approaches derive the rules themselves from the presented data. These "learnt" rules can then be applied to new, "unseen" data. At this point, it is important to mention that a mixture of both model types is possible. The scope of our research is limited to data-driven models. Within this scope, we identified three different implementations of data-driven models for time series analysis: instance-based approaches that use dynamic time warping, feature-based approaches and feature learning approaches. Feature-based approaches in particular have the benefit of creating interpretable insight into the data and therefore are the preferred approach. The workflow to create feature-based models that we used is depicted in Figure 5b).

Feature Extraction and Selection
The proposed method uses massive feature extraction, a concept introduced by Fulcher and Jones (2014), to create a large set of features from measurement signals. This approach is universally applicable for time series and helps to understand the underlying processes that generate the data. Using features has multiple benefits. It is possible to calculate features from time series with different lengths and, after feature calculation, the amount of data is greatly reduced. This approach of feature extraction does not require extensive domain knowledge, although if useful features are known, they should certainly be included. Typical feature sources that can be used to extract features from time series are (Fulcher, 2017):
• Statistical features such as mean, median, mode, standard deviation, quantiles, skewness and variance
• Features resulting from a discrete Fourier transform and similar concepts, e.g. discrete wavelet transform
• Measures of stationarity
• Parameters from statistical time series models
• Autocorrelation functions
This listing is not exhaustive. The informed reader may have noticed that features that have proven to be useful for gearbox damage evaluation have not been mentioned in particular. Many of these domain-specific statistical features, such as FM0, FM4, and M6A, use a signal from which the gear mesh and shaft frequencies as well as their respective harmonics have been removed. For the transmission error signal in particular, removing the gear mesh frequency harmonics would eliminate a significant amount of information. For more information on statistical features for gear vibration signals, refer to the work of Samuel and Pines (2005), and more recently the work of Zhu, Nostrand, Spiegel, and Morton (2014), and Sharma and Parey (2016). The result of feature extraction is a design matrix where each row represents a sub-series and each column contains the values for a specific feature. This process yields a design matrix with 5,650 columns and therefore 5,650 features. The features are based on the efficient features of the tsfresh Python package v0.15.1 (Christ, Braun, Neuffer, & Kempa-Liehr, 2018).
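To make the structure of the design matrix concrete, the following sketch computes a handful of statistical features per sub-series with the Python standard library. It is a small stand-in for the tsfresh feature set and not the authors' extraction code: only five features are calculated, the feature names are chosen freely, the skewness is the simple population version, and the sub-series values are made up.

```python
import statistics


def skewness(x):
    """Population skewness (third standardized moment)."""
    mu = statistics.fmean(x)
    sigma = statistics.pstdev(x)
    if sigma == 0:
        return 0.0
    return sum(((v - mu) / sigma) ** 3 for v in x) / len(x)


def ratio_beyond_r_sigma(x, r):
    """Share of values further than r standard deviations from the mean."""
    mu = statistics.fmean(x)
    sigma = statistics.pstdev(x)
    return sum(abs(v - mu) > r * sigma for v in x) / len(x)


# Feature name -> function; a tiny stand-in for a large feature catalogue.
FEATURES = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "std": statistics.pstdev,
    "skewness": skewness,
    "ratio_beyond_0.5_sigma": lambda x: ratio_beyond_r_sigma(x, 0.5),
}


def build_design_matrix(sub_series):
    """One row (dict) per sub-series, one column per feature."""
    return [{name: fn(s) for name, fn in FEATURES.items()} for s in sub_series]


# Made-up sub-series of different lengths.
sub_series = [
    [0.0, 1.0, 2.0, 3.0, 4.0],
    [1.0, 1.0, 1.0, 1.0, 9.0, 2.0],
]
design_matrix = build_design_matrix(sub_series)
```

Every sub-series is reduced to the same fixed set of columns regardless of its length, which is what allows a standard tabular regressor to be trained on the result.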
Since this type of feature extraction requires little domain knowledge, the design matrix likely contains features that do not contain meaningful information that contributes to the prediction of gear wear condition. In fact, these useless features can hamper predictive performance and may increase model-training time. Consequently, feature selection is a crucial step in the proposed workflow. Feature selection methods attempt to retain features with the highest predictive performance while removing all others. Generally, three different approaches to feature selection exist (Guyon & Elisseeff, 2000): wrapper methods, embedded methods and filter methods. With numerous features, wrapper methods quickly become computationally infeasible, which is why they are not further explored in the approach we propose. Both filter and embedded feature selection techniques are used in this proposal.
Only data from experiment C is part of feature selection. First, we used the so-called FRESH algorithm developed by Christ et al. (2018) to filter features from the design matrix that show no statistical dependence on the target variable. This was done by means of hypothesis testing: each feature was checked individually for statistical relevance. For continuous features with continuous target variables, the authors of the FRESH algorithm suggest using the Kendall rank test. Rejecting the null hypothesis that a feature is independent of the target marks that feature as relevant for target prediction. The rejection threshold $\mathrm{THOLD}_k$ of a p-value is determined by the Benjamini-Yekutieli procedure with Eq. (4) (Benjamini & Yekutieli, 2001):

$$ \mathrm{THOLD}_k = \frac{k \cdot q}{m \cdot \sum_{i=1}^{m} \frac{1}{i}} \tag{4} $$

In this equation, $k$ is the rank of a p-value after sorting all p-values in ascending order, $m$ is the overall number of hypothesis tests conducted to yield the p-values, and $q$ is the chosen false discovery rate. The procedure searches for the largest rank $k$ at which the p-value does not exceed the respective threshold $\mathrm{THOLD}_k$; the hypotheses belonging to this and all lower-ranked p-values are rejected. The Benjamini-Yekutieli procedure attempts to control the so-called false discovery rate, i.e. the expected share of type I errors among all rejected hypotheses: "discoveries" of dependencies that do not actually exist. We used a false discovery rate of 0.05 for the procedure. Completing this procedure reduces the number of feature columns in the design matrix from 5,650 to 3,038.
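The thresholding step can be sketched as follows. The snippet implements the standard step-up form of the Benjamini-Yekutieli procedure: with p-values sorted in ascending order, it finds the largest rank k whose p-value is at most k·q / (m·Σ 1/i) and rejects the hypotheses at ranks 1 to k. The p-values are made up, the Kendall rank test that would produce them is not shown, and `benjamini_yekutieli` is a hypothetical helper name rather than the tsfresh API.

```python
def benjamini_yekutieli(p_values, fdr=0.05):
    """Return one flag per p-value: True if its null hypothesis is rejected."""
    m = len(p_values)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction term
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank whose p-value stays at or below its threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / (m * c_m):
            k_max = rank

    relevant = [False] * m
    for idx in order[:k_max]:
        relevant[idx] = True
    return relevant


# Made-up p-values of six features tested against the wear target.
p_values = [0.0001, 0.5, 0.003, 0.04, 0.9, 0.0005]
relevant = benjamini_yekutieli(p_values, fdr=0.05)
```

In this toy example the feature with p = 0.04 would pass an uncorrected 0.05 test but is filtered out once the number of simultaneous tests is taken into account.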
Similar information may be contained among the remaining features, e.g. mean and median are typical examples for features that in practice often have the same value (Christ, Kempa-Liehr, & Feindt, 2017). Removing highly correlated features therefore serves the purpose of removing redundant information. Typically, this problem is addressed with a principal component analysis; however, in order to keep intuitive interpretability high, we propose the use of a simple correlation analysis as suggested by Albon (2018). This method uses the Pearson correlation coefficient to quantify the linear correlation of a pair of features. Highly correlated features are defined as having an absolute correlation coefficient higher than 0.95. If a pair of features is highly correlated, one of the two is removed. We made this decision despite the known shortcomings of the Pearson correlation coefficient and descriptive statistics. In our dataset, we found and removed 92 correlated features in this step, reducing the feature count to 2,946.
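A correlation filter of this kind might look like the sketch below. It is one possible greedy variant and not necessarily identical to the procedure in Albon (2018): features are visited in column order and a feature is dropped if its absolute Pearson correlation with any already kept feature exceeds 0.95. The column values are made up and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.95):
    """Keep the first feature of every highly correlated group."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Made-up feature columns; "mean" and "median" are nearly identical.
columns = {
    "mean": [1.0, 2.0, 3.0, 4.0, 5.0],
    "median": [1.1, 2.0, 3.1, 3.9, 5.0],
    "skewness": [0.3, -0.2, 0.5, -0.4, 0.1],
}
kept = drop_correlated(columns)
```

Which member of a correlated pair survives depends on the column order, so the order should be fixed if the selection is meant to be reproducible.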
Finally, feature selection concludes with recursive feature elimination, which Guyon, Weston, Barnhill, and Vapnik (2002) first developed and applied in combination with support vector machines. Recursive feature elimination is applicable with every machine learning algorithm that creates an internal feature ranking.
Figure 9. Depiction of recursive feature elimination paired with cross-validation.
The algorithm we used for this proposal is random forest regression; we elucidate more details and benefits of tree-based methods in the paragraph "Model Selection, Training and Evaluation". The process of recursive feature elimination with cross-validation is depicted in Figure 9.
Since the optimal number of features is not known, a cross-validation score is calculated at every step of recursive feature elimination. This helps to evaluate at which point there are enough features to describe the system.
First, we passed an initial dataset with all features to the random forest regressor and a cross-validation score was calculated. Subsequently, the random forest regression model was retrained after removing the features with the lowest feature importance from the current set of features. This process was repeated until all features were removed. The cross-validation scores were used to monitor model performance during the recursive feature elimination process. In every iteration, 0.001 % of the initial number of features was removed. We obtained the final number of features by selecting the smallest number of features before the model performance decreased significantly. By observing the plot in Figure 10, we selected a total of 15 features (see Table 4).
Figure 10. Cross-validated RMSE during recursive feature elimination.
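The elimination loop itself is independent of the model, so it can be sketched with the model-specific parts passed in as functions. In the workflow described above these would be the random forest feature importances and the cross-validated RMSE; here they are replaced by deterministic toy stand-ins with made-up feature names. The tolerance-based selection rule is also an assumption, since the final feature count in the text was chosen by inspecting a plot.

```python
def recursive_feature_elimination(features, importance_fn, cv_rmse_fn, step=1):
    """Return {n_features: (cv_rmse, feature_subset)} for every elimination step."""
    remaining = list(features)
    history = {}
    while remaining:
        history[len(remaining)] = (cv_rmse_fn(remaining), list(remaining))
        importances = importance_fn(remaining)
        # Sort by importance (descending) and drop the `step` weakest features.
        remaining.sort(key=lambda name: importances[name], reverse=True)
        del remaining[max(0, len(remaining) - step):]
    return history


def smallest_adequate_subset(history, tolerance=0.05):
    """Smallest subset whose score is within `tolerance` of the best score."""
    best = min(score for score, _ in history.values())
    for n in sorted(history):
        score, subset = history[n]
        if score <= best * (1.0 + tolerance):
            return subset
    raise ValueError("history is empty")


# Toy stand-ins (assumptions): fixed importances and a synthetic CV score.
TOY_IMPORTANCE = {"gmf_1": 0.5, "skewness": 0.3, "median": 0.15,
                  "noise_a": 0.03, "noise_b": 0.02}
INFORMATIVE = {"gmf_1", "skewness", "median"}


def toy_importance(subset):
    return {name: TOY_IMPORTANCE[name] for name in subset}


def toy_cv_rmse(subset):
    missing = len(INFORMATIVE - set(subset))   # losing signal hurts a lot
    noise = len(set(subset) - INFORMATIVE)     # carrying noise hurts a little
    return 1.0 + 2.0 * missing + 0.01 * noise


history = recursive_feature_elimination(
    list(TOY_IMPORTANCE), toy_importance, toy_cv_rmse
)
selected = smallest_adequate_subset(history)
```

The recorded history corresponds to the curve of cross-validated RMSE over the number of features, and the selection rule picks the point just before the score deteriorates.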

Model Selection, Training and Evaluation
Finally, we used the selected features to train a machine learning model. For the sake of brevity, we cover model selection, training and evaluation in a single section. Selecting a machine learning algorithm is no simple task. Many different approaches exist, ranging from support vector machines, neural networks and linear regression to decision trees and more advanced deep learning techniques. Machine learning algorithms based on decision trees are popular for multiple reasons (Louppe, 2014):
• Decision trees are interpretable (white box models)
• They work with numerical and categorical variables as input
• Decision trees select features as a part of the tree growing process
• They are robust to errors in the training data
• They are non-parametric
Decision trees also have downsides; their main one is their lack of accuracy compared to other machine learning methods, as decision trees are generally known as high variance estimators (Murphy, 2012). Decision trees are the basis for more advanced methods such as random forests (Breiman, 2001) and gradient boosting methods (Friedman, 2001). Both methods are classified as ensemble learning and perform well in comparative studies, e.g. those carried out by Caruana and Niculescu-Mizil (2006), Fernández-Delgado, Cernadas, Barro, and Amorim (2014), and more recently Olson, La Cava, Mustahsan, Varik, and Moore (2018). In fact, random forests and gradient boosted trees are among the most popular algorithms used by data scientists and machine learning practitioners (Kaggle, 2019). Because of their widespread use and their generally good performance, we used both random forest regression (RFR) and gradient boosted regression trees (GBRT) as models in our proposed workflow and compared the results. Parameter selection and model evaluation were carried out using the data from experiment C, while the test data came from experiments A and B.
For this analysis, we treated experiment C as a "run-to-failure" experiment that contains all gear wear states we are trying to detect. Parameter selection was supported by a 5-fold cross-validation, which we applied after shuffling the samples. The metric we used to score each fold is the RMSE. The models use the implementations provided by the scikit-learn library (Pedregosa et al., 2011). We kept the parametrization simple, since model performance tends to increase with model size and therefore with tree size. Instead of extensive algorithmic hyperparameter tuning, we therefore selected the parameter values by hand. The most important parameter values are discussed in the following; the remaining parameters are unchanged from the default values provided by scikit-learn.
Both the RFR and the GBRT model use an ensemble of 500 decision trees, each of unlimited depth. To control overfitting and support generalization, the minimum number of samples required for a node split in a decision tree was set to 5 for both models. The maximum number of features for the growth of each decision tree was capped at 80 % of all available features. Similarly, each decision tree was grown with regard to only 80 % of all available training samples. The gradient boosting method also requires a learning rate that defines how much information each additional tree contributes to the entire model. A typical value for the learning rate is 0.1, which coincides with the value we chose. The cross-validation results of the final models are listed in Table 5. A regression plot for both models displays the ground truth and the predicted values against each other (see Figure 12).
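The parametrization described above could be expressed in scikit-learn roughly as follows. The mapping of the stated settings onto the arguments `max_features`, `max_samples` and `subsample` is our assumption, as is the synthetic dataset; note that scikit-learn applies `max_features` at every split rather than once per tree. The helper name `build_models` is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score


def build_models(n_estimators=500, random_state=0):
    """Return both ensemble models with the hand-picked parameter values."""
    shared = dict(
        n_estimators=n_estimators,  # 500 trees per ensemble
        max_depth=None,             # unlimited tree depth
        min_samples_split=5,        # minimum samples required to split a node
        max_features=0.8,           # 80 % of the features (assumed mapping)
        random_state=random_state,
    )
    return {
        # max_samples: each tree sees 80 % of the training samples (assumed mapping)
        "RFR": RandomForestRegressor(max_samples=0.8, **shared),
        # subsample: 80 % of the samples per boosting stage (assumed mapping)
        "GBRT": GradientBoostingRegressor(subsample=0.8, learning_rate=0.1, **shared),
    }


# Synthetic stand-in for the 15 selected features and the wear target (hypothetical).
X, y = make_regression(n_samples=150, n_features=15, noise=0.1, random_state=0)

# 5-fold cross-validation on shuffled samples, scored with the RMSE.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_rmse = {}
for name, model in build_models(n_estimators=100).items():  # fewer trees to keep the demo fast
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = float(-scores.mean())

print(cv_rmse)
```

Calling `build_models()` without arguments yields the 500-tree ensembles; the demonstration uses 100 trees only to shorten the run time.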

RESULTS
The results of the proposed workflow follow the model deployment step in Figure 5a). The feature selection process resulted in 15 relevant features. In Figure 11, the six highest-rated features according to feature importance are plotted against wear. To improve comparability, we applied min-max normalization to the feature values. The features depicted make up over 90 % of the feature importance determined by the recursive feature elimination algorithm. Three of the top six features are statistical features. While "skewness" and "median" are self-explanatory, "ratio beyond 0.5" quantifies the ratio of signal values that are further than half a standard deviation away from the signal mean. The remaining three features are amplitudes at specific frequencies in the frequency spectrum. All high-ranking frequencies selected by the feature selection algorithm are harmonics of the gear mesh frequency, namely its first, fourth and fifteenth order. As previously mentioned, we trained and tuned both the RFR and the GBRT model only with data from experiment C, which we considered a "run-to-failure" experiment. We used the data from experiments A and B to simulate deployment of the models. The features extracted from experiments A and B correspond to those selected considering only experiment C. Figure 12 depicts the resulting regression plots from the simulated deployment. The vertical axis of each plot displays the slow-speed wear predicted by the model, while the horizontal axis shows the ground truth. An ideal model would produce the dashed line in every plot.
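The "ratio beyond 0.5" feature and the min-max normalization mentioned above are simple enough to write out. The following is a minimal sketch; the function names are hypothetical, and the use of the population standard deviation and a strict inequality are assumptions.

```python
import numpy as np


def ratio_beyond_sigma(x, r=0.5):
    """Share of samples further than r standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.abs(x - x.mean()) > r * x.std()))


def min_max_normalize(values):
    """Scale values linearly to the interval [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0.0:
        return np.zeros_like(values)  # constant input carries no spread
    return (values - values.min()) / span


# Half of the samples lie further than 0.5 standard deviations from the mean.
signal = [-1.0, -1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(ratio_beyond_sigma(signal, r=0.5))   # 0.5
print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.  0.5 1. ]
```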
Overall, both models performed well. The goodness of fit, which we quantified with the coefficient of determination R² and the RMSE for both models, is at an acceptable level. These performance metrics also reveal that the GBRT model slightly outperforms the RFR model. The plots additionally resemble an approximately piecewise constant function, which is typical for methods based on decision trees.
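Both goodness-of-fit metrics can be computed from the predicted and the measured wear values with scikit-learn. The sketch below uses made-up numbers purely for illustration; the helper name `evaluate` is hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def evaluate(y_true, y_pred):
    """Return the RMSE and the coefficient of determination R²."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = float(r2_score(y_true, y_pred))
    return {"RMSE": rmse, "R2": r2}


# Made-up wear values (arbitrary units): every prediction is off by exactly 1.
y_true = [0.0, 10.0, 20.0, 30.0]
y_pred = [1.0, 9.0, 21.0, 29.0]
metrics = evaluate(y_true, y_pred)
print(metrics)  # RMSE = 1.0, R2 = 1 - 4/500 = 0.992
```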

DISCUSSION
With this paper, we provide an implementation of our proposed approach to condition monitoring and predictive maintenance tasks with a continuous target variable. Without going into further detail, we would like to note that adapting this approach to classification problems with categorical target variables is straightforward. The present approach is mainly data-driven, as domain knowledge is required only in the data acquisition and preprocessing steps. Because gear transmission systems in slow operation do not provide sufficient vibration excitation, we chose the dynamic transmission error as the input signal to the condition monitoring workflow in the presented use case.
To demonstrate the viability of this approach, we used a dataset containing transmission error data from spur gears and the respective combined slow-speed wear of both pinion and wheel. First, the data were high-pass filtered and thereby centered on zero. Subsequently, each signal was split into individual hunting tooth cycles. We chose this particular way of splitting because the hunting tooth cycle contains all meshing combinations and is therefore a universal approach. However, other signal sub-series lengths, e.g. one gear shaft period or even one gear mesh period, would likely influence the overall model performance. Subsequently, a large number of features was extracted from each individual sub-series, and only the most important features were retained through algorithmic feature selection. Compared to manual feature selection, this approach significantly reduces the human effort and involvement in feature engineering. Compared to feature or representation learning approaches, which are typically implemented with deep neural networks, our proposed approach returns features that can easily be understood and interpreted by humans. However, models with feature engineering, whether carried out in an automated, large-scale way or manually, are only as good as the available feature set. Although we extracted a large set of features, additional and perhaps more suitable features may well exist. In particular, expanding the feature set with features specific to gearboxes could help to improve wear estimation capabilities further. If this approach is used to extract and select features from gear vibration data, the addition of features specific to gear vibration is crucial. The parameters of the implemented algorithmic feature selection were chosen based on best practices, with the additional aim of keeping the required computational resources reasonable.
Ideally, the extracted set of features and the progression of feature values allow conclusions about the underlying physical principles generating the data. We would like to encourage experimentation with and usage of the proposed workflow for condition monitoring tasks with other machinery and sensor arrangements. Finally, we emphasize that data acquisition, preprocessing, feature extraction and feature selection are more important than the other steps in the condition monitoring workflow, especially model building.
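The preprocessing discussed above (high-pass filtering, then splitting into hunting tooth cycles) can be sketched as follows. The sketch assumes a transmission error signal that is sampled uniformly over the rotation angle with an integer number of samples per tooth mesh; the tooth counts, the filter order, the cut-off and both function names are hypothetical. One hunting tooth cycle spans lcm(z_pinion, z_wheel) tooth meshes, after which the same tooth pair meets again.

```python
from math import gcd

import numpy as np
from scipy.signal import butter, sosfiltfilt


def highpass(signal, cutoff, fs, order=4):
    """Zero-phase Butterworth high-pass filter; removes offset and slow drift."""
    sos = butter(order, cutoff, btype="highpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)


def split_hunting_tooth_cycles(signal, samples_per_mesh, z_pinion, z_wheel):
    """Cut the signal into complete hunting tooth cycles (one per row)."""
    meshes_per_cycle = z_pinion * z_wheel // gcd(z_pinion, z_wheel)  # lcm
    cycle_len = meshes_per_cycle * samples_per_mesh
    n_cycles = len(signal) // cycle_len
    # Samples after the last complete cycle are discarded.
    return np.reshape(signal[: n_cycles * cycle_len], (n_cycles, cycle_len))


# Hypothetical gear pair and angular sampling.
z_pinion, z_wheel, samples_per_mesh = 17, 31, 8
samples_per_rev = z_pinion * samples_per_mesh  # samples per pinion revolution

# Synthetic transmission error: constant offset plus a gear mesh component.
n = 3 * 17 * 31 * samples_per_mesh + 100
angle_index = np.arange(n)
raw = 5.0 + np.sin(2 * np.pi * angle_index / samples_per_mesh)

# Cut-off at 2 orders (cycles per pinion revolution), far below the mesh order 17.
filtered = highpass(raw, cutoff=2.0, fs=samples_per_rev)
cycles = split_hunting_tooth_cycles(filtered, samples_per_mesh, z_pinion, z_wheel)
print(cycles.shape)  # (3, 4216)
```

Each row of `cycles` would then be passed to the feature extraction step as one sub-series.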
Our approach and implementation are not a universal solution for all condition monitoring tasks. For example, the models we used are based on decision trees, which are known not to extrapolate well. For our dataset and use case, however, this is not a concern, since the available "run-to-failure" data make extrapolation unnecessary. Examples of model types that are better suited for extrapolation are linear models and neural networks. Most of the parameters of the ensemble models GBRT and RFR have values that are similar to the defaults provided by scikit-learn. Because slight modifications to the baseline parameters already yield good results, we did not attempt an additional, extensive algorithmic hyperparameter optimization. Despite these downsides, methods based on decision trees are white box models. Decision boundaries and the decision-making process of decision trees are therefore traceable without extensive effort. This attribute is likely relevant for applications that require transparency, such as condition monitoring of components that are crucial for the safety of a system.
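The limited extrapolation capability of tree-based models can be illustrated with a small synthetic example. The data below are hypothetical and unrelated to the experiments: a random forest and a linear model are trained on a linear trend within [0, 1] and then queried outside that range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical training data: a linear trend observed only on [0, 1].
x_train = rng.uniform(0.0, 1.0, size=(200, 1))
y_train = 10.0 * x_train.ravel()

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)
linear = LinearRegression().fit(x_train, y_train)

# Query point far outside the training range.
x_outside = np.array([[2.0]])
forest_pred = float(forest.predict(x_outside)[0])
linear_pred = float(linear.predict(x_outside)[0])

# The forest averages training targets stored in its leaves, so its prediction
# cannot exceed the largest target seen during training.
print(forest_pred, "<=", y_train.max())
print(linear_pred)  # close to 20, the continuation of the linear trend
```

This is why training data that span all wear states of interest matter for the models used here.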
Overall, the proposed approach shows good slow-speed wear estimation results throughout the whole experiment range. Even at the beginning of lifetime, for small quantities of wear that are intuitively harder to estimate, the results are acceptable. This applies especially to the GBRT model. The basis for the calculations is a set of three experiments, one of which was used for training. The obtained results are even more promising considering the rather low number of experiments used for training and testing. However, this also raises the question of how valid the results are for general gear shapes and setups. All experiments use the same experimental setup with one spur gear stage. Because the transmission error and the vibrational behavior are sensitive to gear shapes and gear system design, additional experiments are nearly imperative. The same holds for variations in operating conditions such as speed or load. Beyond the physical influences, higher rotational speeds also reduce the angle measurement quality due to a limited sample rate and hence reduce the accuracy of the transmission error determination.

CONCLUSION
Monitoring of slow-speed gear wear is an important contribution to realizing a predictive maintenance strategy to avoid critical failures of gear systems.
With this paper, we show that an approach based on the dynamic transmission error that contains an automated feature selection process is a suitable way to estimate slow-speed wear. The approach includes extracting a large set of features from the preprocessed transmission error in order to find and select the most relevant features. For this purpose, the feature selection process combines filter and embedded feature selection methods, including recursive feature elimination. As a result, only a small amount of domain knowledge is necessary. Nevertheless, because the features are calculated explicitly, they still allow for interpretation if required. Of the two models used for subsequent training, GBRT slightly outperforms RFR, with both performing at a high overall level.
Overall, the presented approach offers good estimates of slow-speed wear in the course of condition monitoring. It also seems promising for higher speeds in combination with vibrational analysis, which, however, needs to be addressed in further research.