Domain Adversarial Transfer Learning for Generalized Tool Wear Prediction

Given its demonstrated ability to analyze and reveal patterns underlying data, Deep Learning (DL) has been increasingly investigated to complement physics-based models in various aspects of smart manufacturing, such as machine condition monitoring and fault diagnosis, complex manufacturing process modeling, and quality inspection. However, successful implementation of DL techniques relies significantly on the amount, variety, and veracity of data for robust network training. In addition, the data used for network training and for application should follow the same distribution to avoid the covariate shift problem, which degrades network performance and limits applicability. As a promising solution to these challenges, Transfer Learning (TL) enables DL networks trained on a source domain and task to be applied to a separate target domain and task. This paper presents a domain adversarial TL approach based on the concepts of generative adversarial networks. In this method, the optimizer seeks to minimize the prediction loss (i.e., regression or classification error) across the labeled training examples from the source domain while maximizing the loss of the domain classifier across the source and target data sets (i.e., maximizing the similarity of source and target features). The developed domain adversarial TL method has been implemented on a 1D CNN backbone network and evaluated for prediction of tool wear propagation using NASA's milling dataset. The experimental results indicate that domain adversarial TL allows DL models trained under one set of operating conditions to be applied to another.


INTRODUCTION
In manufacturing processes, certain quantities and measurements can reveal critical information about the health condition of manufacturing machine tools, process efficiency, or product quality. The relevant data collection and analysis, known as condition monitoring, contribute the telemetry needed to transform traditional manufacturing into smart, data-driven manufacturing. Deep Learning (DL) provides the necessary tools for handling big manufacturing data and translating raw data into information and knowledge that can facilitate process-level and system-level decision making, such as machine tool predictive maintenance and process optimization (Lee, Jin, Bagheri, & Chao, 2016). For example, one DL variant, the convolutional neural network (CNN), has been investigated for machine operating condition identification, fault detection and diagnosis, tool wear and remaining useful life (RUL) prediction, and product assessment (Wang, Ma, Zhang, Gao, & Wu, 2018; Li, Ota, & Dong, 2018). However, the performance of DL models primarily relies on the amount and variety of training data; DL models are not inherently generalizable. In manufacturing applications, the robustness of DL models is not guaranteed due to process-to-process variation and changes in operating conditions. With enough training examples under certain operating conditions, DL networks can achieve satisfactory performance, but their performance degrades significantly when they are used outside the original conditions. As a promising solution to overcome this limitation, Transfer Learning (TL) allows DL models trained on a source task or domain to be transferred to a second, related task or input domain (Pan & Yang, 2010).
Within the scope of TL, domain adaptation addresses performing a single task (e.g., RUL prediction) across input domains with differences caused by changing operating conditions or different types of manufacturing machine tools. The change in input data distribution is known as the covariate shift problem, which limits the applicability of DL models (Shimodaira, 2000). To fundamentally address the problem, DL models must adapt to compensate for the differences across the domains through learning common latent features shared by the domains. Several methods have been proposed for feature generalization of DL models across the source domain and the target domain. Some methods achieve feature generalization by adding an explicit loss term to the DL network loss function, to penalize the network for generating different feature distributions from source and target inputs. To measure the feature distribution distance, metrics including Correlation Alignment (CORAL), Kullback-Leibler Divergence (KLD), and Maximum Mean Discrepancy (MMD) have been defined or leveraged (Sun & Saenko, 2016; Pan, Kwok, & Yang, 2008; Tzeng, Hoffman, Zhang, Saenko, & Darrell, 2014). Alternative methods for coalescing feature distributions opt for an adversarial approach, in which a domain classifier attempts to identify which input domain, source or target, an input feature came from, while the feature extraction layers compete with it to develop a feature mapping that prevents distinguishability. Several variants of this method include the ReverseGrad approach from Ganin & Lempitsky (2015) and Ganin et al. (2016), Adversarial Discriminative Domain Adaptation (ADDA) introduced by Tzeng et al. (2017), and conditional adversarial domain adaptation discussed by Long et al. (2018) and related work by Hoffman et al. (2017). Domain adversarial transfer learning will be investigated in this paper because of its ability to generalize features without the need for a separately-defined measurement metric.
Matthew Russell et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Recent work in smart manufacturing has explored TL via domain adaptation in connection to condition monitoring. Sun et al. (2018) investigated KLD to promote feature similarity for RUL prediction through an autoencoder DL architecture. Similarly, Wen et al. (2019) examined the benefits of MMD as a feature discrepancy metric for TL in bearing fault classification scenarios. Adversarial approaches have also received attention and demonstrated promising results. Liu et al. (2019) investigated the advantages of applying a GAN loss-based approach in keeping with ADDA from Tzeng et al. (2017) to transfer knowledge from a pretrained source LSTM network to a target domain with a different cutting tool. Guo et al. (2019) combined MMD-based feature discrepancy loss with domain adversarial GAN-style loss to classify machine health conditions, via features extracted from vibration data by 1D CNNs. Zhang et al. (2019) applied the ADDA approach to generalized feature generation for a 2D CNN-based characterization of vibration frequency spectra. More recently, da Costa et al. (2020) successfully applied the ReverseGrad approach initially developed by Ganin et al. (2016) to predict the RUL of aircraft turbofan engines operating under different altitudes and throttle levels.
This study develops a Domain Adversarial Transfer Learning (DATL) method through ReverseGrad for domain adaptation of tool wear regression and prediction during a milling process. Vibration signals are processed by a 1D CNN, and the extracted features are then correlated to tool wear through fully-connected layers. A domain classifier attempts to distinguish between features generated from the source domain and target domain data, which were collected under different operating conditions. The entire network can be divided into three parts according to their functions: feature extraction, domain classification, and wear prediction, which are optimized simultaneously with a gradient reversal layer between the feature extractor and the domain classifier. Through the gradient reversal layer, the 1D CNN is forced to extract features that confuse the concurrently optimized domain classifier. For performance evaluation, the improvement achieved by the developed DATL is demonstrated as a function of the amount of available target domain data.
The remainder of this paper is organized as follows: Section 2 provides the theoretical background required by the approach, Section 3 outlines the experimental dataset and architecture, Section 4 presents results and analysis, and Section 5 concludes with a discussion of future work.

Deep Learning and Convolutional Neural Networks
DL represents a hierarchical learning structure, as a combination of multilayer artificial neural networks and specialized network architectures. Analogous to progressive stages of abstraction, early layers in the networks learn intermediate representations of the input data (i.e., low-level features), while later layers fuse these low-level features into high-level features that better reveal the properties of the objects to be analyzed. Due to the large numbers of free parameters inherent to DL models, several specialized architectures have been developed to leverage known underlying structures in the input data. One highly successful architecture is the CNN. Instead of a fully-connected layer that simultaneously learns weights for each input neuron, CNNs take advantage of underlying spatial connections in each input example. For example, images often consist of building blocks such as edges and shapes that constitute high-level motifs. Thus, instead of attempting to directly apply a dense, fully-connected network to processing image pixels, CNNs search for smaller pattern blocks within the larger input by convolving kernels with the input. Each separate kernel produces a new channel of the input image, representing an extracted pattern. The convolutional output can then be compressed by pooling values in adjacent regions, summarizing these regions to suppress local variations in the pixels while at the same time reducing the number of network parameters to be tuned and improving the computational efficiency. Hierarchical stacks of convolutional and pooling layers are employed as feature extraction layers in DL models, which end with a fully-connected network for classification or regression on the learned features.
CNNs operating on 2-dimensional input have demonstrated excellent performance in image recognition problems, and 1D CNNs have proved successful on sequential inputs such as vibration signals (Krizhevsky, Sutskever, & Hinton, 2012; Abdeljaber, Avci, Kiranyaz, Gabbouj, & Inman, 2017). For 1D inputs and kernels, the convolution operation can be expressed as
$$y(i) = \sum_{j=0}^{m-1} k(j)\, x(i \cdot s + j)$$
where $k$ is the convolutional kernel of length $m$, $x$ is the input signal of length $n$, and $s$ is the stride, or the number of data points that the kernel moves per step across the input. The bounds of $i$ can be set to stay within the original signal length (or to include zero-padded regions on the sides of the input). The convolutional output $y(i)$ can then be passed through an activation function such as the rectified linear unit (ReLU) to introduce nonlinearity:
$$a(i) = \max\bigl(0,\, y(i)\bigr)$$
Next, this activation result $a(i)$ is pooled. Max pooling is often used and selects the maximum value in an $r$-length sliding window across the input:
$$p(i) = \max_{0 \le j < r} a(i \cdot t + j)$$
Here, $t$ is the stride of the pooling window as it steps across the features obtained from the convolutional operations. One or more convolution and pooling layers may be stacked in sequence to construct the feature extraction network. The outputs are then fed into a fully-connected, multilayer network for classification or regression. CNNs can be trained using traditional backpropagation techniques, including stochastic gradient descent (SGD) and the RMSProp-based Adam algorithm.
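The convolution, ReLU, and max-pooling steps described above can be illustrated with a minimal pure-Python sketch. The function names, the toy signal, and the two-tap kernel are hypothetical choices made for illustration only; they are not the layers or parameters used in this study, and no zero padding is applied.

```python
def conv1d(x, k, stride=1):
    """Slide kernel k across signal x without padding.

    Output element o is the sum over j of k[j] * x[o * stride + j].
    """
    m = len(k)
    return [
        sum(k[j] * x[i + j] for j in range(m))
        for i in range(0, len(x) - m + 1, stride)
    ]


def relu(y):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in y]


def max_pool1d(a, window, stride):
    """Keep the maximum of each sliding window of the given length."""
    return [
        max(a[i:i + window])
        for i in range(0, len(a) - window + 1, stride)
    ]


# Toy vibration-like signal and a hypothetical difference kernel.
signal = [0.0, 1.0, 3.0, 2.0, -1.0, -2.0, 0.0, 4.0]
kernel = [1.0, -1.0]

conv_out = conv1d(signal, kernel, stride=1)
features = max_pool1d(relu(conv_out), window=2, stride=2)
print(conv_out)   # [-1.0, -2.0, 1.0, 3.0, 1.0, -2.0, -4.0]
print(features)   # [0.0, 3.0, 1.0]
```

In a trained network the kernel values are learned rather than fixed, and several kernels run in parallel to produce multiple feature channels.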

Transfer Learning
With potentially millions of parameters, DL models are capable of fitting very complex problems when equally large amounts of training data are available for the desired task. However, due to complexities or physical limitations in some application scenarios, collecting data with proper annotations may not be physically or economically feasible, which creates a challenge for robust model training. With these considerations in mind, Transfer Learning (TL) seeks to transfer knowledge learned on a well-posed source domain problem with sufficient training data to a related target domain task. Pan & Yang (2010) outlined several TL variants, which are developed to address different application scenarios, including 1) transferring knowledge obtained from one task to a new task in the same data domain (e.g., transferring vibration analysis from tool wear monitoring to chattering control); 2) transferring from one data domain to another data domain for the same task (e.g., machine tool RUL prediction w.r.t. different types of machine tools under different operating conditions); and 3) a mix of the first two scenarios. The second application is also known as domain adaptation and is of particular interest to machine condition monitoring and health management, since common prognosis tasks may encounter challenges from data differences due to process-to-process variation or varying operating conditions. Thus, it is desirable to train a model on a source domain while maintaining its performance on other domains in the presence of data variation.
With notation from Pan & Yang (2010), the related tasks share a common input set $\mathcal{X}$ and output domain $\mathcal{Y}$. The source domain $\mathcal{D}_S$ and the target domain $\mathcal{D}_T$ can be written with their respective marginal probabilities:
$$\mathcal{D}_S = \{\mathcal{X}, P_S(X)\}, \qquad \mathcal{D}_T = \{\mathcal{X}, P_T(X)\}$$
Although the domain inputs are drawn from the same set $\mathcal{X}$, they do not follow the same distribution due to the variance in operating conditions and other factors, i.e., $P_S(X) \neq P_T(X)$, and the distribution of the covariate of the desired output has shifted between the two scenarios. Since the DL model learns the conditional probability of the output given the input, this creates an inherent difficulty termed the covariate shift problem (Shimodaira, 2000). Domain adaptation seeks to eliminate the difference in marginal probabilities by developing a feature mapping that generates a consistent feature distribution across both domains. That is, domain adaptation searches for a nonlinear feature mapping $\phi(\cdot)$ such that $P_S(\phi(X)) = P_T(\phi(X))$.
One way to obtain such a mapping is to add a discrepancy metric such as CORAL, KLD, or MMD to the network loss function. These metrics measure the distance or discrepancy between feature distributions statistically or probabilistically. During the network training phase, this distance penalizes feature mappings which separate the source and target features. However, these metrics must be designed and tuned to adequately capture the similarity between the two feature distributions, and characteristics of the input data may affect how well each metric performs.
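As a simple illustration of a distribution distance of this kind, the sketch below computes the squared Maximum Mean Discrepancy with a linear kernel, which reduces to the squared Euclidean distance between the mean feature vectors of the two domains. This is a deliberately minimal case chosen for clarity; practical MMD losses typically use nonlinear kernels, and the feature values and function names here are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def linear_mmd2(source_feats, target_feats):
    """Squared MMD with a linear kernel.

    With the identity feature map, the (biased) estimate equals
    ||mean(source) - mean(target)||^2.
    """
    mu_s = mean_vector(source_feats)
    mu_t = mean_vector(target_feats)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


# Hypothetical 2-D features from each domain.
source = [[0.0, 1.0], [2.0, 3.0]]
target = [[1.0, 1.0], [3.0, 5.0]]

shifted = linear_mmd2(source, target)   # domains with different means
aligned = linear_mmd2(source, source)   # identical feature sets
print(shifted, aligned)                 # 2.0 0.0
```

Added to the training loss, a term of this kind grows as the source and target features drift apart and vanishes when their means coincide. A linear kernel only compares means, which is one reason the choice of metric and kernel needs tuning.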

Principles of DATL
Eliminating dependency on these metrics, Ganin & Lempitsky (2015) introduced, and Ganin et al. (2016) further developed, a different approach for encouraging similar source and target feature distributions based on adversarial networks known as DATL. Instead of a discrepancy metric in the network loss function, a new domain classifier network is added in parallel to the original fully connected layers for regression (see Fig. 1). The goal of this domain classifier is to determine which input distribution a given feature came from. Optimizing this classifier improves its ability to differentiate these source and target features.

Figure 1. A 1D CNN ReverseGrad domain adversarial approach for tool wear regression with transfer learning
It should be mentioned that once gradients have been backpropagated through the classifier, they are reversed before continuing back through the feature extraction layers (i.e., the 1D CNN in Fig. 1). This forces the domain classifier to seek the best parameters to distinguish the source and target domain features, while simultaneously driving the feature extraction layers of the 1D CNN towards mappings that make the features of the two domains indistinguishable. In this adversarial manner, the network is encouraged to converge to feature extraction layer parameters that map the features of both the source and target domains to the same distribution. The overall loss function can be summarized as
$$L = L_y - \lambda L_d$$
where $L$ is the total loss, $L_y$ is the loss of the output classifier or regressor, and $L_d$ is the loss of the domain classifier, with a regularization parameter $\lambda$. The feature extraction and output layers are optimized to minimize $L$, while the domain classifier is optimized to minimize $L_d$. During feedforward prediction, the gradient reversal layer between the 1D CNN and the domain classifier acts as the identity function, but during backpropagation, the gradient from the domain classifier is reversed as:
$$g_{-1} = -\lambda\, g_{0}$$
where $g_{0}$ is the gradient at the input of the domain classifier, $g_{-1}$ is the gradient at the last layer of the feature extractor, and $\lambda$ is a hyperparameter that controls the impact of the domain classifier on the optimization of the feature extraction layers.
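The effect of the gradient reversal layer can be checked on a toy example with hand-derived gradients. The sketch below is a simplification made for illustration, not the network of this study: the feature extractor is a single weight `w`, the domain classifier is a single weight `v` followed by a sigmoid (instead of two softmax outputs), and all names and numerical values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss(x, d, w, v):
    """Binary cross-entropy of the toy domain classifier for one example."""
    p = sigmoid(v * (w * x))
    return -(d * math.log(p) + (1.0 - d) * math.log(1.0 - p))


def grl_forward(f):
    """Gradient reversal layer, forward pass: identity."""
    return f


def grl_backward(grad, lam):
    """Gradient reversal layer, backward pass: flip sign, scale by lam."""
    return -lam * grad


def domain_branch_grads(x, d, w, v, lam):
    """Gradients of the domain loss for classifier weight v and extractor weight w."""
    f = w * x                          # feature produced by the extractor
    p = sigmoid(v * grl_forward(f))    # predicted probability of domain 1
    dz = p - d                         # d(loss)/d(logit) for cross-entropy
    grad_v = dz * f                    # classifier sees the true gradient
    g0 = dz * v                        # gradient at the classifier input
    grad_w = grl_backward(g0, lam) * x # extractor sees the reversed gradient
    return grad_v, grad_w


x, d = 2.0, 1.0                  # one input and its domain label
w, v, lam, lr = 0.5, 1.0, 0.5, 0.1

grad_v, grad_w = domain_branch_grads(x, d, w, v, lam)
loss_before = domain_loss(x, d, w, v)
loss_after_v_step = domain_loss(x, d, w, v - lr * grad_v)
loss_after_w_step = domain_loss(x, d, w - lr * grad_w, v)

print(loss_after_v_step < loss_before)  # True: classifier gets better
print(loss_after_w_step > loss_before)  # True: extractor confuses it
```

A single descent step therefore moves the classifier and the feature extractor in opposite directions with respect to the domain loss, which is the adversarial behavior described above. In an automatic differentiation framework the same effect is usually obtained with a custom backward function rather than hand-written gradients.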
Additional variants of the adversarial approach have been proposed, most notably Adversarial Discriminative Domain Adaptation (ADDA), which modifies the ReverseGrad approach by removing the gradient reversal layer and using two separate, competing optimization processes to pit the domain classifier against the feature extraction layers, more similar to a traditional generative adversarial network (Goodfellow et al., 2014). Both ADDA and ReverseGrad have been investigated in manufacturing (Zhang, Li, Wen, Gao, & Gao, 2019; da Costa, Akçay, Zhang, & Kaymak, 2020), and this study presents a novel application of ReverseGrad to realize DATL upon a 1D CNN for vibration signal analysis and tool wear regression under two operating conditions.

Milling Data Set and Data Preparation
The Milling Data Set published by NASA Ames Research Center is analyzed in this study (Agogino & Goebel, 2007). The data set was collected across 16 milling process test cases with varying materials, feed rates, and cut depths. As these cuts are performed, the rotating tool head experiences progressive flank wear, which was recorded throughout the runs. Each configuration of material, depth of cut, and feed rate was tested twice. Vibration data were collected from two locations, the spindle and the table. Table vibration data were used for this study. For transfer learning, the source data domain was chosen to be the cast iron material with a 0.25 mm/s feed rate, and the target domain was cast iron with a 0.5 mm/s feed rate, across all cutting depths. After excluding the eight of the 16 data set cases that used steel, this choice resulted in using four of the remaining eight cases as source data (i.e., all cast iron runs at 0.25 mm/s) and the final four cases as target data (i.e., all cast iron runs at 0.5 mm/s).
Each case had individual runs consisting of approximately 9000 data points with a 250-Hz sampling rate of the vibration sensor (i.e., a total of 36 seconds). The beginning and ending portions of the run differ significantly from the rest of the signal as the milling process starts and ends, respectively. Hence, the first and last 10 seconds of the signal were removed. To augment the dataset, these cropped 16-second blocks were further split into 2-second (500-sample) sections and given the same flank wear label as the run from which they were taken.
The source and target data sets contained the 2-second windows from the four source and four target cases, respectively. Within each source and target data set, the collection of 2-second windows was split into 70% train, 15% validation, and 15% test sets.
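The cropping, windowing, and splitting steps described above can be sketched as follows. The sampling rate, trim length, window length, and split fractions are taken from the text. The function names, the random shuffle before splitting, and the fixed seed are assumptions made for illustration, since the text does not state how windows were assigned to the three sets.

```python
import random

FS = 250        # sampling rate in Hz (from the text)
TRIM_S = 10     # seconds removed at the start and at the end of each run
WIN_S = 2       # window length in seconds (500 samples at 250 Hz)


def window_run(signal, wear, fs=FS, trim_s=TRIM_S, win_s=WIN_S):
    """Crop one run and cut it into labeled, non-overlapping windows."""
    trimmed = signal[trim_s * fs: len(signal) - trim_s * fs]
    win = win_s * fs
    n_windows = len(trimmed) // win
    # Every window inherits the flank wear label of its parent run.
    return [(trimmed[i * win:(i + 1) * win], wear) for i in range(n_windows)]


def split(windows, train=0.70, val=0.15, seed=0):
    """Shuffle and split into train, validation, and test sets."""
    items = list(windows)
    random.Random(seed).shuffle(items)   # assumption: random assignment
    n_train = round(train * len(items))
    n_val = round(val * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


# A placeholder run of 9000 samples (36 s) with a hypothetical wear label.
run = [0.0] * 9000
windows = window_run(run, wear=0.2)
print(len(windows), len(windows[0][0]))   # 8 500

train_set, val_set, test_set = split(range(100))
print(len(train_set), len(val_set), len(test_set))   # 70 15 15
```

Each 36-second run thus yields eight 2-second training examples, which is the augmentation effect mentioned above.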

DATL Network Architecture
The developed 1D CNN-based DATL network architecture is shown in Fig. 2. A 1D CNN performs feature extraction using three pairs of 1D convolutional and max-pooling layers with ReLU activation functions. This set of feature extraction layers is shared between the source and target domains to generate consistent, similar features from both input distributions.
The flank wear regression output is generated by three fully-connected layers terminating in a sigmoid output neuron. Dropout is used throughout the feature extraction and output layers to combat the overfitting problem. Regression loss is calculated via mean squared error. Domain classification is performed using three fully-connected layers closely resembling the regression layers. However, the number of hidden layer neurons is reduced, and the output is two softmax neurons representing a one-hot vector encoding of the input domain, either source or target. A cross-entropy loss function is used. In addition, a gradient reversal layer is included to invert the backpropagated error as it exits the domain classifier. Thus, while backpropagation drives the domain classifier itself to minimize its loss, the gradient reversal layer causes the feature extraction layers to seek to maximize the domain classifier's loss, thereby increasing the likelihood that source and target features will be indistinguishable.
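The two loss terms named above can be written out in a few lines. The sketch below is illustrative only: the helper names and the numerical values are hypothetical, and the wear label is assumed to be scaled to the unit interval so that it matches the range of a sigmoid output.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mse(preds, targets):
    """Mean squared error used for the wear regression output."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def cross_entropy(probs, one_hot):
    """Cross-entropy between softmax outputs and a one-hot domain label."""
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot) if t > 0)


# Regression head: sigmoid output compared with a (scaled) wear label.
wear_loss = mse([sigmoid(0.0)], [0.25])

# Domain head: equal logits mean the classifier cannot tell the domains apart.
confused = softmax([0.0, 0.0])
domain_loss = cross_entropy(confused, [1.0, 0.0])   # label: source domain

print(wear_loss)                            # 0.0625
print(abs(domain_loss - math.log(2)) < 1e-12)   # True
```

A domain classifier that outputs equal probabilities for both domains incurs a cross-entropy of ln 2 for every example, which is the level the domain loss approaches when the extracted features carry no domain information.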

Training
In each test case, the network was trained for 5000 epochs with early stopping on the source or target validation set performance metric (the coefficient of determination, denoted $R^2$). The optimizer was Adam with a learning rate of 0.0001. The minibatch size was 20. Experiments were run on an NVIDIA P100 GPU and required between 8 and 15 minutes each to complete.
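For reference, the coefficient of determination and a simple early stopping rule can be sketched as below. The patience-based stopping criterion and its value are assumptions made for illustration, since the text does not specify how early stopping was configured, and the validation scores are made-up numbers.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def early_stop(val_scores, patience):
    """Return (best epoch, epoch at which training stops).

    Training stops once the validation score has not improved
    for `patience` consecutive epochs.
    """
    best, best_epoch = float("-inf"), -1
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_scores) - 1


print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # 1.0 (perfect fit)
print(r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))   # 0.0 (predicting the mean)

# Hypothetical validation curve that peaks and then declines.
scores = [0.20, 0.40, 0.43, 0.41, 0.38, 0.30]
print(early_stop(scores, patience=2))               # (2, 4)
```

An $R^2$ of 1 indicates a perfect fit, 0 corresponds to always predicting the mean wear, and negative values indicate predictions worse than the mean.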
During transfer learning, the parameter $\lambda$ controls the infusion of domain classifier loss into the optimization of the feature extraction layers. As recommended by the literature, this parameter was defined as
$$\lambda = \frac{2}{1 + \exp(-\gamma p)} - 1$$
where $p \in [0,1)$ is the training progress and $\gamma$ is a constant that sets how quickly $\lambda$ rises ($\gamma = 10$ in Ganin et al., 2016) (Guo, Lei, Xing, Yan, & Li, 2019). With the parameter initially close to zero, the optimization process focuses on developing initial features via the labeled source examples. As the parameter increases, the domain classifier loss is gradually incorporated to prevent the features from becoming overly specific to the source domain.
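A schedule with the behavior described above can be sketched as a sigmoid-shaped ramp of the form 2 / (1 + exp(-gamma * p)) - 1, which is the schedule used by Ganin et al. (2016) with gamma = 10. That this exact form and constant were used in the present study is an assumption; the function name and the choice of epoch over total epochs as the progress measure are likewise illustrative.

```python
import math


def grl_lambda(progress, gamma=10.0):
    """Gradient reversal weight as a function of training progress in [0, 1).

    Starts at 0 and saturates towards 1. gamma sets the steepness;
    gamma=10 follows Ganin et al. (2016) and is an assumption here.
    """
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


TOTAL_EPOCHS = 5000   # number of training epochs stated in the text

for epoch in (0, 500, 2500, 4999):
    p = epoch / TOTAL_EPOCHS
    print(epoch, round(grl_lambda(p), 4))
```

The weight is exactly 0 at the start of training, reaches roughly 0.46 after 10% of the epochs, and is close to 1 for the second half of training, so the adversarial term is phased in only after the regression features have begun to form.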

RESULTS AND DISCUSSION
Several training cases were performed to evaluate the performance of the developed DATL on the flank wear prediction problem. First, the accuracy (quantified by $R^2$) on the target domain data is compared with and without DATL for a 1D CNN trained on labeled data from the source domain only. The comparison is then extended to scenarios in which different amounts of labeled target domain data are available for network training, as illustrated in Fig. 3. Without transfer learning, an $R^2$ of 0.97 was achieved on the source task. Fig. 3 shows that DATL greatly improves the target domain performance of the 1D CNN when no labeled target domain data are available to train the network jointly with the source domain data. As more target domain data become available, the performance without transfer learning gradually catches up with the performance with transfer learning.

Figure 3. Target domain test set $R^2$ performance
The validation $R^2$ plots for trials without transfer learning demonstrate that as training continues, the target $R^2$ peaks and begins to decrease (see Fig. 4 left), indicating that the network has begun learning features specific to the source domain instead of those general to the flank wear problem across both operating scenarios. This $R^2$ divergence in later epochs vanishes when the domain adversarial transfer approach is applied (see Fig. 4b). Without any labeled target domain data, the DATL algorithm immediately increases the target $R^2$ from 0.43 to 0.69 and achieves a performance of 0.88 when 10% of the labeled target training data is included.
To illustrate how DATL helps with extracting common features from different domains, distributions of features extracted by the 1D CNN on the target domain data (features of the last 1D CNN layer) are shown in Fig. 5 using t-SNE. The distributions are plotted against the severity of the flank wear (mild wear: 0-0.3 mm and severe wear: 0.3-0.8 mm). As shown in Fig. 5, DATL reduces the overlap between the feature distributions and makes the boundary between the points of the two wear severities clearer. Although some mixing of wear points remains, due to either imperfect differentiation by the model or the errors introduced by the artificial threshold between wear severities, the results confirm the effectiveness of DATL in generalizing DL models regardless of the application domain.

CONCLUSION
This study demonstrates the application of domain adversarial transfer learning (DATL) to predict tool flank wear in a milling process. The ReverseGrad model was successfully trained on a source task with one feed rate and transferred to a second set of process parameters with a different feed rate. DATL enables the network to be simultaneously trained on the source and target tasks, encouraging the feature extraction layers of the network to generate features common to both input domains. The approach significantly improved the network performance on the target task and demonstrated the ability of DATL to compensate for limited labeled target examples. Future work includes investigating different ways of generating and fusing the gradients of domain classification and regression to further improve performance.