Learning an Optimal Operational Strategy for Service Life Extension of Gear Wheels with Double Deep Q Networks

One failure mechanism of gear wheels is pitting. If the gear wheel is case hardened, pitting degradation normally dominates at one tooth only. All the other teeth are still intact at the standardized end of life criterion of 4 % pitting area based on the total tooth area. Using an adaptive operational strategy that was developed at the Institute of Machine Components, the service life of gear wheels can be extended by a local stress reduction at the weakest tooth. This is accomplished by applying an adapted torque at the transmission input that places the torque minimum in the area of the pre-damaged, and thus weakest, tooth. Consequently, all remaining teeth with higher load bearing capacity are subjected to a higher torque. A prerequisite for the described theoretical operational strategy is knowledge of the pitting size and position. The detection of these properties in operation is not state of the art yet. In this work, only the gearbox vibration signal is known, without explicit knowledge about the pitting inside. The challenge is to determine the health of each individual tooth and to choose an optimal adapted torque based on this. This is especially difficult due to differing growth rates of pittings on one individual gear wheel. Hence, different pittings dominate over the service life, which results in the need for a continuous optimization of the torque control. Algorithms of Reinforcement Learning (RL) are particularly suitable for this challenge. In this branch of Machine Learning (ML), an agent interacts with an environment and learns by getting rewards for taking actions at given states. In this study, the environment is a gearbox simulation model, the state is the current vibration signal, and the action is the chosen adapted torque. Thus, it is possible to let the algorithm learn the whole operational strategy, from online failure detection to an adapted torque at the transmission input.
The results of this study show the theoretical feasibility of the operational strategy using Double Deep Q Networks as the RL Algorithm. The algorithm is able to learn a suitable reaction to pittings that increase linearly or progressively at an early stage and therefore delays their growth within the defined limits. Thus, the lifetime of the gearbox is extended while maintaining the same total power of the gearbox. As an outlook, the sensitivity of the results to several influencing factors will be examined in a further study. The longer-term goal is to apply the simulation results on a test rig and validate them.

INTRODUCTION
Is it possible to increase the remaining useful life (RUL) of gear wheels? In fact, pitting degradation at gear wheels normally occurs at one tooth only. So, all the other intact teeth carry unused potential for the service life of gear wheels. To be precise, pitting is one of the most common and critical failure modes of gear wheels besides tooth root breakage. Pitting failure normally occurs below the pitch circle and describes spalling of the flank due to cracks underneath the case-hardened surface (Naunheimer, Bertsche, Ryborz, and Novak, 2011).
According to the 2006 International Organization for Standardization [ISO] report, the failure criterion is a 4 % pitting area at one working tooth flank, and therefore this single tooth is responsible for the failure of the gear wheel. With regard to this failure criterion, all the other teeth are still intact at the time of failure, although some teeth may have smaller pittings. So, the pitting damage around the circumference is uneven, and the question arises whether a more even distribution of the pitting damage is possible with an uneven distribution of the load. Previous studies, based on real data and statistical analysis, have shown that the potential of a local stress reduction at the weakest tooth is considerable. Under certain conditions, a service life extension between 10 and 45 % is possible, although this study used a conservative assumption (Gretzinger, Lucan, Stoll, and Bertsche, 2020).
The detection of pitting failure in gearboxes is state of the art, but not the localization of the pitting on the gear wheel. The main objective of this work is to extend the service life of gear wheels by applying an optimal adapted load. Therefore, each failure has to be detected, the corresponding tooth position has to be localized and the pitting size has to be evaluated. Moreover, this evaluation has to be done under operating conditions. Based on this information, the optimal adapted load over all teeth states has to be determined. Finally, the stress at the weakest tooth can be reduced.
The tooth health information is built upon a vibration signal, measured continuously at the housing during operation. In order to tackle the complexity of detection, localization, evaluation and optimal load adaption, Reinforcement Learning (RL) is used in this paper, which is particularly suitable for this challenge. Hereby, the greatest possible flexibility of the task can be attained. The aim is to develop an automated, self-learning RL Algorithm that adapts the load by using the vibration signals of the gearbox.
In a first step, the RL Algorithm is applied to a gearbox simulation model which was built specifically for this purpose. This digital twin of a real test bench is used to test and evaluate algorithms prior to deploying them in the real world, which is a state-of-the-art procedure in RL. This makes it possible to simulate and evaluate different scenarios and, in addition, the effect of errors is manageable in contrast to a real application. In future work, the RL Algorithm can be applied to a test bench setup.

Motivation and Goal
It is assumed that a positive effect on the service lifetime of gear wheels can be achieved with an RL Algorithm. This publication thus makes a valuable contribution to sustainability and conservation of resources, as the service life of gears can be extended or, with the same service life, oversizing can be minimized and thus the tooth width reduced. The novelty lies in the theoretical applicability of reinforcement learning to extend gearbox life without reducing the overall power of the gearbox and without the need for design changes. In particular, the ability to detect the exact location of the weakest tooth of a gear wheel is an advantage over the current state of the art. This paper addresses developing an RL Algorithm …
• that is able to learn from vibration data only,
• that is able to detect the weakest tooth of a gear wheel automatically using only vibration data,
• that controls the input torque in a manner that increases the service life of the gear wheel,
• that can identify the weakest tooth reliably and output the service life extension using an adapted torque.

FUNDAMENTALS
In this chapter, we introduce the core concepts that are needed to fully understand the contents of this paper. These core concepts form the basis for the gearbox toolbox, the stress-strength interference for lifecycle determination of the gear wheel, the needed vibration data, and the RL agent.

Health Detection, Localization and Evaluation
In current applications of gear wheels, different methods of damage detection are possible, e.g. visual inspection of the machine during maintenance, analysis of particles in the oil, and the measurement of structure-borne noise/vibrations of the housing. As vibration measurement is the only permanent and continuous one, it is the established method for damage detection (Nguyen, 2002; Bartelmus, 2011; Randall, 2010). Each gear component emits structure-borne noise during operation. Depending on the transmitting structure between emitter and sensor, signal patterns can be assigned to individual components. The investigation of structure-borne noise for early detection and diagnosis of damage is an established procedure, see also the VDI guideline (VDI 3832, 2013).

Vibration Modeling
For the main components within the gearboxes, e.g. gear wheels, bearings and shafts, some equations for the occurring vibrations are available. According to Heider (2012), the decisive variable influencing the tooth mesh frequency is the number of teeth. The following equation describes the tooth mesh frequency:
$f_z = f_n \cdot z$
Here, $f_n$ represents the rotational frequency of the shaft and $z$ the number of teeth. For the bearings, different frequencies are relevant, as they consist of an inner and an outer ring with the rolling elements in between. For the estimation of the frequencies without the geometric dimensions of the individual bearing components, VDI 3832 (2013) gives the following equations: The frequency of the rolling elements is contained in the frequencies of the inner ring and outer ring. In addition, the number of rolling elements is relevant.
In the frequency spectrum, the harmonics must also be considered, as well as different factors that influence the vibration signal. These factors can be divided into primary and secondary factors. The primary factors comprise the design parameters, such as the geometry of the component and the material, and the production technology. The secondary factors include the current operation and changes in the operating conditions (Bartelmus, 2011).
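The relations in this section can be illustrated with a minimal Python sketch. The tooth mesh frequency follows the relation stated above (rotational frequency times number of teeth). The bearing functions use the widely quoted rule-of-thumb factors 0.4 and 0.6 for the outer and inner ring when the bearing geometry is unknown; whether these are exactly the equations given in VDI 3832 is an assumption here. All function names and numerical values are hypothetical and chosen for illustration only.

```python
def tooth_mesh_frequency(shaft_speed_rpm, num_teeth):
    """Tooth mesh frequency in Hz: rotational frequency times number of teeth."""
    rotational_frequency_hz = shaft_speed_rpm / 60.0
    return rotational_frequency_hz * num_teeth


def harmonics(base_frequency_hz, count):
    """First `count` harmonics (integer multiples) of a base frequency."""
    return [base_frequency_hz * k for k in range(1, count + 1)]


def bearing_outer_ring_frequency(shaft_speed_rpm, num_rolling_elements):
    """Rule-of-thumb estimate without bearing geometry (assumed factor 0.4)."""
    return 0.4 * num_rolling_elements * shaft_speed_rpm / 60.0


def bearing_inner_ring_frequency(shaft_speed_rpm, num_rolling_elements):
    """Rule-of-thumb estimate without bearing geometry (assumed factor 0.6)."""
    return 0.6 * num_rolling_elements * shaft_speed_rpm / 60.0


# Hypothetical example: shaft at 1500 rpm, 20 teeth, bearing with 8 rolling elements
mesh_hz = tooth_mesh_frequency(1500.0, 20)
mesh_harmonics_hz = harmonics(mesh_hz, 3)
outer_hz = bearing_outer_ring_frequency(1500.0, 8)
inner_hz = bearing_inner_ring_frequency(1500.0, 8)
```

In a measured spectrum, peaks would be expected near these frequencies and their harmonics, shifted and modulated by the primary and secondary factors described above.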

Woehler Curve
The calculation of fatigue strength is often done by using a Woehler curve. Here, the applied stress amplitude and the tolerable load cycles are correlated. Therefore, the Woehler curve is also called SN-Curve (Bertsche & Bullinger, 2007).
The characteristics of the double logarithmic curve are shown in figure 1 (Bertsche & Bullinger, 2007). Three zones are visible: the horizontal line of the static strength up to 10,000 load cycles, the fatigue strength with the sloped line, and the endurance strength with the horizontal line from approximately $10^{6}$ load cycles. The fatigue strength can be described with the following equation (Bertsche & Bullinger, 2007).
In transmission development, a second representation of the Woehler curve is established. Instead of the tolerable stress, the input torque of the transmission is plotted over the load cycles to failure. The fatigue strength can be described equivalently.
For gear wheel pitting, the stress depends on the torque via a root function, resulting in the following relationship (Naunheimer et al., 2019):
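For orientation, the Woehler relations described in this section are commonly written in the form below. The notation is an assumption and may differ from the cited references: $N$ denotes the tolerable load cycles, $N_D$ the load cycles at the transition to the endurance strength, $\sigma_a$ the stress amplitude, $\sigma_D$ the endurance strength, $T$ the input torque, $T_D$ the corresponding endurance torque, and $k_\sigma$, $k_T$ the slope exponents of the stress and torque representation.

```latex
% Fatigue strength region, stress representation
N = N_D \left( \frac{\sigma_a}{\sigma_D} \right)^{-k_\sigma}

% Fatigue strength region, torque representation
N = N_D \left( \frac{T}{T_D} \right)^{-k_T}

% Pitting: contact stress grows with the square root of the torque
\sigma_H \propto \sqrt{T}
\quad\Rightarrow\quad
\frac{\sigma_H}{\sigma_D} = \left( \frac{T}{T_D} \right)^{1/2}
\quad\Rightarrow\quad
N = N_D \left( \frac{T}{T_D} \right)^{-k_\sigma / 2},
\qquad k_T = \frac{k_\sigma}{2}
```

Under this form, the torque-based curve is half as steep in the exponent as the stress-based curve, which is why a moderate local torque reduction can already yield a noticeable gain in tolerable load cycles.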

Reinforcement Learning
Reinforcement Learning (RL) can be seen as a subdomain of Machine Learning, like Supervised Learning. While in supervised learning the ground truth output is known, this is not the case for RL. RL is concerned with sequential decision making and control tasks, where the correct output is often unknown (e.g. in teaching a robot to walk, where the optimal actions are unknown). Instead, a reward is provided to the algorithm, giving feedback about a good or bad decision. That way, a specific problem is addressed by automatic learning of optimal decisions over time (Lapan, 2018). In this paper, we use this characteristic of RL algorithms to achieve the goals declared in chapter 1.
Figure 2 shows the communication between the entities of the RL algorithm. In short, the algorithm consists of two entities (agent and environment), which communicate through three interfaces (reward, state, and actions). The agent interacts with the environment through actions. In this case, the agent is learning how to adapt the input torque on our test bench (or the digital twin: gearbox toolbox) in order to maximize the reward, and hence, the service life of the gear wheel. A detailed introduction to our test bench and its digital twin is given in chapter 3.
The environment communicates with the agent through states and rewards. The state is the vibration signal coming from the test bench. It is used to identify the weakest gear wheel tooth and to determine when to adapt the torque in order to reduce the load. The reward communicates how well the agent's actions contributed to the extension of the service life of the gear wheel. As can be seen from this configuration, the agent and environment interact in a loop, where the agent takes an action, and the environment responds with a new state and a reward. The goal is to maximize the accumulated reward until the End-of-Life (EoL) of the gear wheel is reached (Lapan, 2018).
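The interaction loop between agent and environment can be sketched in a few lines of Python. The environment below is a hypothetical toy stand-in, not the gearbox toolbox: the "vibration signal" is random noise, the reward values are placeholders, and the class and function names are invented for this illustration.

```python
import random


class ToyGearboxEnv:
    """Hypothetical toy environment; NOT the authors' gearbox toolbox."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.step_count = 0

    def _observe(self):
        # Placeholder state: a short window of random "vibration" samples.
        return [self.rng.gauss(0.0, 1.0) for _ in range(8)]

    def reset(self):
        self.step_count = 0
        return self._observe()

    def step(self, action):
        self.step_count += 1
        # Placeholder reward: action 0 is arbitrarily defined as the better one.
        reward = 1.0 if action == 0 else 0.5
        done = self.step_count >= self.episode_length  # stands in for EoL
        return self._observe(), reward, done


def run_episode(env, policy):
    """Run one episode and return the accumulated reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                  # agent acts on the current state
        state, reward, done = env.step(action)  # environment answers
        total_reward += reward
    return total_reward


accumulated = run_episode(ToyGearboxEnv(), policy=lambda state: 0)
```

A learning agent would replace the fixed `policy` with one that is updated from the observed states and rewards; the loop structure itself stays the same.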

Double Deep Q Networks (Double DQN) and Signal Energy
Choosing the action with the highest estimated Q-value does not always result in an optimal decision, because standard Q-learning tends to overestimate action values: the same value estimates are used both to select and to evaluate the action in the update target. In order to avoid such unrealistically high action values, Double Deep Q Networks were proposed (Hasselt, Guez, and Silver, 2015). This algorithm uses a Neural Network architecture and reduces the overestimation problem mentioned above by decoupling action selection from action evaluation. For more details regarding the Double DQN, see Hasselt et al. (2015).
The success of an RL Algorithm strongly depends on how well the reward signal frames the goal. It is possible that the agent learns how to reach a high reward in a way that is dangerous or harmful to the environment. While a positive reward encourages the agent to keep going in order to accumulate as much positive reward as possible and to avoid terminal states, a negative reward encourages it to end an episode as fast as possible.
There are multiple ways to shape the reward with respect to real-world applicability. On the one hand, reward shaping can be based on high-end diagnostic models detecting, locating, and assessing the real teeth degradation. Building and transferring these models to other gearbox designs is often quite challenging. On the other hand, the reward signal can be shaped based on the gearbox lifetime. The lifetime is relatively easy to access without a high-end measuring system. Additionally, the lifetime is a direct measure of the learning goal "service life extension". In this case, the learning ability of the RL Algorithm is used to detect, locate, and assess the degradation.
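The difference between the standard DQN target and the Double DQN target described by Hasselt et al. (2015) can be made concrete with a minimal sketch. The Q-values are passed in as plain lists instead of network outputs, and the numbers in the example are hypothetical; the function names are chosen for this illustration.

```python
def dqn_target(reward, done, q_target_next, gamma=0.99):
    """Standard DQN target: the target network selects AND evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: the online network selects the action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Hypothetical Q-values for the next state with two possible actions.
q_online_next = [1.0, 3.0]   # online network prefers action 1
q_target_next = [2.5, 2.0]   # target network rates action 0 higher

standard = dqn_target(1.0, False, q_target_next, gamma=0.5)
double = double_dqn_target(1.0, False, q_online_next, q_target_next, gamma=0.5)
```

With these numbers the standard target is 2.25 and the Double DQN target is 2.0: the Double DQN value is lower because the maximum over noisy target estimates is no longer taken directly.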
In detail, the reward $r$ corresponds to the achieved difference in lifetime $\Delta L$ between the current and the previous measurement point (step of the learning algorithm). This is scaled by a factor of $10^{-6}$, so that $1 \cdot 10^{6}$ cycles correspond to a reward of 1. The reward function is given by equation 8. In addition to the direct measurement of the target variable, this reward signal offers simple interpretability.
$r = \Delta L \cdot 10^{-6}$ (8)
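A minimal sketch of this reward function in Python is given below. The function and variable names are hypothetical, and the lifetime values in the example are invented for illustration.

```python
REWARD_SCALE = 1e-6  # 1e6 load cycles correspond to a reward of 1


def lifetime_reward(lifetime_now_cycles, lifetime_prev_cycles):
    """Reward for one step: scaled lifetime gained since the previous step."""
    return (lifetime_now_cycles - lifetime_prev_cycles) * REWARD_SCALE


# Hypothetical lifetimes (in load cycles) at consecutive measurement points.
measurement_points = [0.0, 2.0e6, 4.5e6, 6.0e6]
step_rewards = [
    lifetime_reward(now, prev)
    for prev, now in zip(measurement_points, measurement_points[1:])
]
accumulated_reward = sum(step_rewards)
```

Because the step rewards are differences between consecutive measurement points, their sum telescopes: the accumulated reward of an episode equals the total lifetime reached, expressed in millions of load cycles.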

STRATEGY FOR LIFETIME EXTENSION OF GEAR WHEELS
This chapter introduces the adaptive operational strategy that was developed using data from testing gear wheels on a custom test bench (see chapter 3.1). Also, the data from the test bench was used to program and validate the gearbox toolbox.
In practice, a constant torque is currently applied to a gearbox. It can change due to a load spectrum, but it remains nearly constant concerning one revolution. The gearbox is operating in this specific load spectrum until a system failure occurs that is detected by a condition monitoring system. This case is shown in figure 3 on the left side. One flank of the gear wheel degrades until the End-of-Life criterion of 4% pitting area is reached.
Preceding studies on the extension of the service life of gear wheels use online damage accumulation to estimate the health of the gearing on the basis of torque and speed. If a certain predefined accumulated damage is reached, the input torque of the transmission is reduced in order to increase the remaining useful life (Foulard, 2015). However, this reduces the transmission power for all teeth. Hence, all teeth undergo a reduced stress although only one tooth (the weakest) leads to a failure of the gear wheel. The scattering in the load capacity of the individual teeth is therefore not considered.
In contrast to the state of research, the scattering in the load capacity is considered in the present work. The strategy for lifetime extension of gear wheels is shown in figure 3 on the right side.
Here, the information about a progressing pitting is used to start a PHM (Prognostics and Health Management) control loop. The existence and localization of a pitting influences the control of the input torque, and a local stress reduction at the pre-damaged tooth is carried out. Due to this local stress reduction, the growth of the pitting slows down and an increase in lifetime is possible. All the other more loadable and therefore intact teeth undergo a higher stress in the form of a higher input torque, so that the power over one revolution can stay the same. This is a great advantage over the state of research.
The adapted input torque can have different shapes. The sinusoidal oscillation is a very simple possibility of local stress reduction. Only the amplitude of the oscillation must be specified. The frequency results directly from the speed of the gear wheel and the phase from the position of the weakest tooth. With this signal, a constant average power is guaranteed. Furthermore, there are two options for a rectangular function as adapted input signal. First, the area of the weakest tooth can get a reduced stress while all the other teeth receive an unchanged torque. However, the disadvantage is that the overall power output of the gearbox is reduced. For this reason, a second rectangular function was developed. The area around the weakest tooth is treated the same as in the first rectangular variant, but here the other more loadable teeth receive a higher torque, so that there is no loss in overall power over one revolution. The advantage over the sinusoidal oscillation is that the maximum torque, and therefore the maximum stress, is lower, as all the intact teeth compensate for the reduced stress at the weakest tooth. With the sinusoidal oscillation, the tooth opposite the weakest one receives a very high stress. If this tooth is the second weakest, the increase in lifetime is clearly smaller.
The optimum adapted torque could be achieved if the health of each tooth were known. Each tooth could then receive its own stress level, and the damage around the circumference could be very even. In this case, the failure of the system no longer depends on the weakest tooth only, and the second failure criterion of ISO 6336 applies: the gear wheel fails if 0.5 % of the total working flank area is damaged by pitting.
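The sinusoidal and the power-neutral rectangular torque shapes can be compared with a small sketch. It is a simplification that assigns one torque value per tooth of the pinion; the tooth count of 21 and the 200 Nm level are taken from the test bench description, while the amplitude of 20 Nm, the weakest tooth index, the relief width of three teeth, and all function names are assumptions for illustration.

```python
import math


def sinusoidal_torque(mean_torque, amplitude, num_teeth, weakest_tooth):
    """One torque value per tooth; the minimum lies at the weakest tooth."""
    return [
        mean_torque
        - amplitude * math.cos(2.0 * math.pi * (k - weakest_tooth) / num_teeth)
        for k in range(num_teeth)
    ]


def rectangular_torque_compensated(mean_torque, reduction, num_teeth,
                                   weakest_tooth, half_width=1):
    """Reduced torque around the weakest tooth; all other teeth are raised
    just enough to keep the mean torque over one revolution unchanged."""
    relieved = {(weakest_tooth + d) % num_teeth
                for d in range(-half_width, half_width + 1)}
    boost = reduction * len(relieved) / (num_teeth - len(relieved))
    return [
        mean_torque - reduction if k in relieved else mean_torque + boost
        for k in range(num_teeth)
    ]


# 21 teeth and 200 Nm from the test bench; 20 Nm and tooth index 5 are assumed.
sine = sinusoidal_torque(200.0, 20.0, 21, weakest_tooth=5)
rect = rectangular_torque_compensated(200.0, 20.0, 21, weakest_tooth=5)

mean_sine = sum(sine) / len(sine)
mean_rect = sum(rect) / len(rect)
```

Both profiles keep the mean torque, and at constant speed therefore the mean power, at the original level. The peak torque of the rectangular variant stays close to the mean because the compensation is spread over all intact teeth, whereas the sinusoidal variant concentrates the highest torque opposite the weakest tooth.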
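The two failure criteria mentioned in the text (4 % pitting area on a single working flank, 0.5 % of the total working flank area) can be checked as sketched below. The sketch assumes that all working flanks have the same area, so the total damaged fraction equals the mean of the per-tooth fractions; the function name and the example values are hypothetical.

```python
def end_of_life_criteria(pitting_fraction_per_tooth,
                         single_tooth_limit=0.04, total_limit=0.005):
    """Return (single_tooth_reached, total_area_reached).

    Assumes equal flank area for every tooth, so the damaged share of the
    total working flank area is the mean of the per-tooth fractions.
    """
    worst = max(pitting_fraction_per_tooth)
    total = sum(pitting_fraction_per_tooth) / len(pitting_fraction_per_tooth)
    return worst >= single_tooth_limit, total >= total_limit


# Hypothetical damage states of a 21-tooth pinion.
one_dominant_pitting = [0.04] + [0.0] * 20   # one tooth at 4 %, rest intact
evenly_spread_damage = [0.006] * 21          # 0.6 % on every tooth
early_stage = [0.002] * 21                   # 0.2 % on every tooth

dominant_result = end_of_life_criteria(one_dominant_pitting)
even_result = end_of_life_criteria(evenly_spread_damage)
early_result = end_of_life_criteria(early_stage)
```

The example illustrates the point made above: with one dominant pitting, the single-tooth criterion ends the life while the total damaged area is still below 0.5 %; with evenly spread damage, the total-area criterion becomes the governing one.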

Test Bench
At the Institute of Machine Components, an electrical stress test bench is used to investigate the degradation of gear wheel pitting. This test bench is shown in figure 4. Two electrical machines can be seen in the picture: the left one is the drive unit and the right one the brake unit. In the middle, a single-stage test transmission with a serial gearing is mounted. The number of teeth is 21 for the pinion at the input and 41 for the gear wheel at the output. Accordingly, the ratio of this helical gearing is 1.952.
With this test setup degradation tests are run with the aim of getting the Woehler Curve of this gearing at different pitting sizes. Two constant stress levels were tested, 175 Nm and 200 Nm at a constant output speed of 1300 rpm and oil temperature of 90 °C.
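The kinematic quantities of this setup follow directly from the stated tooth numbers and speed. The short calculation below is an illustration, assuming that the output speed of 1300 rpm refers to the shaft of the 41-tooth gear wheel; the symbols $z_1$, $z_2$, $i$, $n_{in}$, $n_{out}$ and $f_z$ are assumed notation.

```latex
i = \frac{z_2}{z_1} = \frac{41}{21} \approx 1.952

n_{in} = i \cdot n_{out} = \frac{41}{21} \cdot 1300\,\mathrm{rpm} \approx 2538\,\mathrm{rpm}

f_z = z_2 \cdot \frac{n_{out}}{60} = 41 \cdot \frac{1300}{60}\,\mathrm{Hz} \approx 888.3\,\mathrm{Hz}
```

The same tooth mesh frequency results from the pinion side, since $z_1 \cdot n_{in} = z_2 \cdot n_{out}$.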
The investigation focused on the degradation measurement of the pinion, which is why the test was interrupted every 0.25 million load cycles and negative imprints of the biggest pittings were made. The sizes of these pittings were then analyzed in 3D with a Keyence laser microscope at 20x magnification. The result of this measurement is shown in figure 5. The test was terminated when the working flank of one single tooth reached the standardized end of life criterion of 4 % pitting area. Furthermore, the vibrations of the gearbox housing were measured continuously.

Life Data Analysis of the Degradation Tests
The results of the above-mentioned tests were analyzed in different ways. First of all, the degradation paths of the pittings were investigated. This is necessary to model the growth of the pittings and also to fit distribution functions of the load cycles at different pitting sizes. The following generic mathematical model can be used for all pittings, as validated in Gretzinger (2020): As an example, a degradation path for one gear pitting is shown in figure 7. The data points in the diagram represent the measurement points and the grey line the regression of the degradation path. The test was stopped at 11 million load cycles because another pitting at this pinion reached the 4 % criterion. With this regression model, however, all pittings around the circumference could be extrapolated to the end of life and the data could be used to generate a distribution function. This distribution function therefore represents all the pittings at the pinion and not only the failure of the whole system, as is done in the state of the art (Beslic, Mueller, Yan, and Bertsche, 2017). The investigated distribution functions for different pitting sizes were published in .

Gearbox Toolbox
Like the real gearbox, the simulation toolbox consists of four bearings, two gears, and the shafts, as shown in figure 9. For all components, the vibration behavior is modeled. The degradation behavior is only modeled for the gears, as observed in the experimental study. There are various ways to model system vibration or system degradation. For example, finite element methods (FEM) can be used in both cases. While the results might be quite accurate, the computational cost is high. Simulating hundreds of episodes would take a couple of weeks, which is not feasible. Other methods to model vibrations are mechanical system models based on mass, stiffness, damping and various other parameters. These parameters are unknown for the given test bench and their determination is time and cost intensive. In this case, neither the computational cost nor the detailed analyses of stiffness and damping are feasible. Therefore, the simulation model is based on the statistical analysis of the real test bench along with the general assumptions for vibration modeling. All factors are modeled in a holistic approach, which provides wide flexibility for adaptation and sensitivity studies. For a detailed explanation of how the toolbox was developed and validated, see Henss (2020).

APPLYING THE RL ALGORITHM
In order to tune the hyperparameters of the RL algorithm, we explore thousands of model architectures and select the hyperparameter set that maximizes the accumulated rewards until the EoL of the gear wheel. Figure 8 shows several high-performing hyperparameter sets that extend the service life of the gear wheel in comparison to the initial gear wheel lifetime without any adaptive operational strategy. It is evident that a tuned and trained RL algorithm achieves significantly higher rewards than the non-adaptive operational strategy. The RL algorithm enables automatic detection of the weakest tooth and application of the adaptive strategy explained in chapter 3. The accumulated reward can only be maximized if the true weakest tooth is chosen by the RL algorithm. If the wrong tooth is chosen, the weakest tooth receives an increased load, which would result in a shorter service life of the gear wheel. In this paper, however, we solely model the weakest tooth and not the other teeth on the gear wheel. Therefore, choosing the wrong tooth does not result in a shorter lifetime in our simulation setup, and the reward represents the initial lifetime in $10^{6}$ cycles. Table 1 compares the RL algorithm to the ideal operational strategy introduced in chapter 3. Both approaches significantly extend the lifetime of gear wheels.

CONCLUSION AND OUTLOOK
This paper is the first to prove the theoretical applicability of reinforcement learning to extend gearbox life. Specifically, the service life is extended by 8.24 %, which saves costs and resources. The service life extension is achieved without any change in design or utilization, but solely based on an intelligent operating strategy. The operating strategy is learned independently by a reinforcement learning algorithm, based on a reward signal that favors a long gearbox service life.
The basis of the training is a gearbox simulation environment to enable the high demand of training episodes. The simulation environment is a replica of a real test bench to make the conditions as realistic as possible.
There are two limitations in this paper. The first is that the simulation code can currently only consider the lifetime of the weakest tooth; the lifetime of all other teeth is not yet considered. The second is that choosing the wrong tooth as the weakest tooth does not result in a penalty. An update to address both limitations is currently in the works.
A next step is the further optimization of the RL algorithm based on fast test series in the simulation environment.
Another step is the transfer of the pre-trained algorithm to the real test bench.
The use of an RL algorithm shown in this work to extend service life by optimizing the operating strategy can theoretically be transferred to a variety of other systems, products or processes.