Data Augmentation of Sensor Time Series using Time-varying Autoregressive Processes

This work presents a novel data-centric solution for fault diagnostics and failure prognostics consisting of a data-augmentation method which is well suited for non-stationary multivariate time-series data. The method, based on time-varying autoregressive processes, can be employed to extract key information from a limited number of samples and generate new artificial samples in a way that benefits the development of diagnostics and prognostics solutions. The proposed approach is tested on three real-world datasets associated with failure diagnostics problems using two types of machine learning methods. Results indicate the proposed method improves performance in all tested cases.


INTRODUCTION
Fault diagnostics and failure prognostics of equipment and systems, a.k.a. PHM (prognostics and health management), predictive maintenance, etc., have been the focus of active investigation and intense real-world solution development during recent decades. The motivation is clear, as avoiding the occurrence of failures or reducing their consequences in general translates into benefits such as reduced downtime and improved yield and safety. A large variety of methods have been proposed over the years, ranging from reliability methods based on population statistics to data-driven solutions employing cutting-edge machine learning methods, or physics-based approaches incorporating advanced failure mechanism models. However, it can be argued that the most important factors limiting the successful development and application of PHM solutions are not in the methods themselves. Those limitations are in general related to the availability of good-quality historical data which can be used for the development and validation of such solutions (Biggio & Kastanis, 2020; Kim, Choi, & Kim, 2021). Failure events can be very rare, but their impact can be so relevant that it is worth investing in the development of related PHM solutions. Many times the failure events of interest may have happened a reasonable number of times in the past, such that available data from those events would be enough for the development of related PHM solutions. However, many times, relevant sensor data associated with the historical events has not been collected or, if collected, corresponding labels are not available, or if available they are not reliable or not detailed enough, e.g., defining the actual failure mode.

Douglas Baptista de Souza et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Such data limitations naturally affect data-driven methods most directly, but reliability-based and even physics-based methods may also be impacted. The former because reliability solutions are also based on historical data, despite simplifications such as the assumption of parametric statistical models or the combination of data from different but similar equipment. The latter is impacted to a lesser extent, as physics-based diagnostics and prognostics solutions must also be validated on sufficient real historical data before they can be deployed to the field. Whenever enough data is not available, despite the extensive literature on PHM methods, the development and validation of diagnostics or prognostics solutions may result in poor performance or may not even be feasible. In such cases, the possibilities for PHM solution development are limited to anomaly detection solutions, which are usually not as prescriptive or actionable as fault diagnostics and failure prognostics.
Considering the points above, data-centric approaches to PHM can be valuable in terms of addressing the relevant data-related challenges faced in the real world, both for diagnostics (Leao, Fradkin, Lan, & Wang, 2021) and prognostics (Garan, Tidriri, & Kovalenko, 2022). However, focusing on a data-centric PHM still represents a broad set of possibilities, ranging from improved data collection and labeling to approaches for making the most out of available data. Among those possibilities, data augmentation methods have gained ever-increasing attention over recent years, also considering both diagnostics (Matei, Zhenirovskyy, de Kleer, & Feldman, 2018; Kwak & Lee, 2023) and prognostics (Kim, Kim, & Choi, 2020) applications. In particular, data augmentation methods have improved the performance of diagnostics approaches with limited data in different PHM use cases (Yang et al., 2023; D. Wang, Dong, Wang, & Tang, 2023; Li, Zhang, & Zhang, 2023; Jiang & Ge, 2021; Shen, Yao, Jiang, Yang, & Zeng, 2023; de Oliveira, Niemi, García-Ortiz, & Torres, 2023; Taghiyarrenani & Berenji, 2022). This work proposes a novel method for augmentation of multivariate time-series data based on time-varying autoregressive (TVAR) models. The goal is to extract information from scarce data and use it to create additional samples in a way that can improve the quality of diagnostics and prognostics solutions. One characteristic which makes the method especially suited for failure diagnostics and prognostics problems is the fact that it can directly deal with non-stationary time series.
The remainder of this paper is organized as follows: Section 2 contains the technical background related to TVAR; in Section 3, the proposed application of TVAR for data augmentation is presented; experiments and results are described and discussed in Section 4; Section 5 concludes the paper.

Considerations About The Time Series Data
This paper addresses the problem of augmenting sensor time series from the viewpoint of time-series groups or classes. To this end, we consider that the data to be analyzed are divided into classes like "anomalous" and "normal". Furthermore, the following assumptions are considered for the time-series datasets to be augmented by the proposed method: A1) Time series of the same dataset are sampled equally. A2) Time series of the same class can be modeled by the same multivariate, potentially non-stationary stochastic process.
The potential non-stationary behavior means that the properties of the stochastic process are allowed to vary over time. The proposed data augmentation derives from assumption A2) and consists of finding a suitable stochastic process to model the time series of a given class, then using it to create new time series as new realizations of that stochastic process.

Considerations About The Time Series Models
In this work, we use parametric time series models to represent the stochastic processes characterizing the signal classes. These models describe mathematically how the samples and stochastic moments of the time series vary over time. Examples of well-known models are Autoregressive (AR), Autoregressive Moving Average (ARMA), and Autoregressive Integrated Moving Average (ARIMA) (Deistler & Scherrer, 2022). Such models often require the data to be stationary, or the potential nonstationarity in the data to follow specific forms (e.g., trend and seasonality), so it can be extracted from the data through subtraction, differencing, or other transformations (Box, Jenkins, Reinsel, & Ljung, 2015). However, real-world time series often show a variety of non-stationary behaviors that cannot be modeled by traditional approaches (de Souza, Chanussot, Favre, & Borgnat, 2012, 2014, 2018, 2019; de Souza, Chanussot, & Favre, 2014).
Unlike classical approaches, time-varying autoregressive (TVAR) models allow their parameters to change over time to better capture the nonstationarities in the data (Kay, 2008). This change is controlled by a set of TVAR basis functions, weights, and the model order. Based on how these quantities are chosen, different non-stationary behaviors can be modeled (Niedzwiecki, 2000). Owing to these powerful modeling capabilities, we choose to characterize the underlying stochastic process of the time series with a TVAR model, meaning that time series belonging to the same class will share the same TVAR expression (see A2). Different customizations of the TVAR basis functions and parameters have been proposed for modeling various types of non-stationary data (e.g., acoustic signals (Sodsri, 2003), electroencephalography (EEG) (Pachori & Sircar, 2008), and radar clutter data (Abramovich, Spencer, & Turley, 2007)). Here, we choose the customization proposed in (de Souza, Kuhn, & Seara, 2019) and use it as a starting point to derive the data-augmentation procedure.

TVAR Model
In (de Souza, Kuhn, & Seara, 2019), a TVAR process of first order has been proposed to model non-stationary processes whose mean and covariance vary over time following an arbitrary functional form defined by the user. The non-stationary processes characterized by the TVAR model are M-dimensional vectors x(n) (multivariate time series) varying in time according to the following regressive expression:

x(n + 1) = a(n) x(n) + b(n) v(n),  (1)

with n as the time variable, v(n) ∈ R^M as the perturbation noise, and a(n) and b(n) as time-varying parameters given by the weighted sum of the scalar basis functions f_0(n), ..., f_Q(n), i.e.,

a(n) = Σ_{q=0}^{Q} α_q f_q(n)  and  b(n) = Σ_{q=0}^{Q} β_q f_q(n),  (2)

with α_0, ..., α_Q and β_0, ..., β_Q as the weights. Also, two assumptions are considered for the TVAR model in Eq. (1): A3) The initial value of the process x(0) is an arbitrary free parameter defined by the user. A4) The noise vector v(n) is stationary and temporally white, with zero mean and correlation matrix Φ = E[v(n) vᵀ(n)]. Considering the assumptions above and Eq. (2), one can obtain the following general expressions for the mean vector and covariance matrix of x(n) in Eq. (1), respectively:

μ_x(n) = [∏_{k=0}^{n−1} a(k)] x(0)  (3)

and

C_x(n) = Σ_{j=0}^{n−1} [∏_{k=j+1}^{n−1} a²(k)] b²(j) Φ,  (4)

whereas the non-stationary behavior of x(n) can be verified by the fact that Eqs. (3) and (4) are functions of n. The basis functions and constants in Eq. (2) are often customized so the TVAR model can describe different non-stationary evolutions. One customization for Eq. (2) proposed in (de Souza, Kuhn, & Seara, 2019) enables the mean and covariance in Eqs. (3) and (4) to converge at different rates to a predetermined functional form. This customization is obtained by making Q = α_0 = β_1 = 1 and α_1 = β_0 = 0 in Eq.
(2), and choosing expressions for the TVAR basis functions f_0(n) and f_1(n) (Eqs. (5) and (6)), where p(n) is an arbitrary functional form characterizing the time-varying behavior of the mean vector and covariance matrix. Here, p(n) is called the TVAR interpolation function. Moreover, r_1, r_2, and λ are real constants meeting the following requirements: R1) 0 < {r_1, r_2} < 1 and R2) λ ≠ 0. By substituting the values of Q, α_0, α_1, β_0, and β_1, and the expressions for f_0(n) and f_1(n), into Eqs. (2) and (1), we obtain the process expression in Eq. (7). By using the same substitutions in Eqs. (3) and (4), it can be shown that the corresponding expressions for the mean vector and covariance matrix of Eq. (7) are given by Eqs. (8) and (9), respectively. Since the value of x(0) in Eq. (8) is arbitrary (see A3), one can define it as x(0) = γ x̄(0), where γ is an arbitrary real constant and x̄(0) is a basis vector. Then, Eq. (8) becomes Eq. (10), where γ is assumed to meet requirement R3) γ ≠ 0. By looking at Eqs. (10) and (9), it can be seen that the parameters γ and λ are simply constant gains, while r_1 and r_2 control the convergence of the mean and covariance towards their steady-state expressions (i.e., for large n), given by Eqs. (11) and (12). Note that Eqs. (11) and (12) both depend on the TVAR interpolation function p(n), which is the most important quantity to be found. Hence, in this paper, we focus on obtaining p(n) and consider γ, λ, r_1, and r_2 as fine-tuning parameters. Details on how to calculate the TVAR interpolation function and the proposed data augmentation method are given next.
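As a concrete illustration, the first-order TVAR recursion of Eq. (1) can be simulated directly once the coefficient sequences a(n) and b(n) of Eq. (2) are fixed. The sketch below is a minimal Python implementation; the particular basis functions, weights, and noise covariance chosen in the example are illustrative placeholders, not the customization of (de Souza, Kuhn, & Seara, 2019):

```python
import numpy as np

def simulate_tvar1(a, b, x0, Phi, rng):
    """Simulate the first-order TVAR recursion x(n+1) = a(n)x(n) + b(n)v(n),
    with v(n) zero-mean white noise with correlation matrix Phi (see A4)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    L_chol = np.linalg.cholesky(Phi)      # draw v(n) with covariance Phi
    x = np.empty((len(a) + 1, len(x0)))
    x[0] = x0                             # x(0): free parameter (see A3)
    for n in range(len(a)):
        v = L_chol @ rng.standard_normal(len(x0))
        x[n + 1] = a[n] * x[n] + b[n] * v
    return x

# Example: a(n), b(n) built as weighted sums of two basis functions (Eq. (2)),
# with placeholder choices for f_0(n), f_1(n) and the weights.
N = 200
f0, f1 = np.ones(N), 0.99 ** np.arange(N)
a = 0.9 * f0 + 0.05 * f1      # alpha_0 f_0(n) + alpha_1 f_1(n)
b = 0.0 * f0 + 1.0 * f1       # beta_0 f_0(n) + beta_1 f_1(n)
x = simulate_tvar1(a, b, x0=np.array([1.0, -1.0]), Phi=np.eye(2),
                   rng=np.random.default_rng(0))
```

With b(n) shrinking over time, the realizations become progressively less noisy, one simple instance of the non-stationary behavior the model family can express.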

Overview of the Proposed Method
Here, we use the TVAR model to represent the underlying stochastic process of the time series belonging to a given class (see A2). We consider that such a stochastic process can be described by empirical statistics computed from the data. Thus, fitting the TVAR model amounts to finding parameters and interpolation functions that make the expressions for the first- and second-order moments of the TVAR process match the empirical statistics calculated from the data. Having found suitable TVAR parameters and interpolation functions, we plug them back into the TVAR process formula (see Eq. (7)) to create randomized augmented data for a given signal class.

TVAR Sub-Models for the Mean and Covariance
As p(n) appears both in the mean and in the covariance expressions (see Eqs. (10) and (9)), we cannot resort to a single TVAR model (based on a single p(n)) to describe time series in which the mean and covariance change in distinct forms over time. To account for these cases, we propose to use two TVAR sub-models to characterize the underlying stochastic process of the signal class, one for the mean (TVAR_m) and another for the covariance (TVAR_c). We let each sub-model have its own interpolation function, p_m(n) and p_c(n), as well as its own parameters {γ_m, λ_m, r_1m, r_2m} and {γ_c, λ_c, r_1c, r_2c}, which should be initialized with the sub-models and meet R1) to R3). These constants can be fine-tuned with Eqs. (10) and (9) to match the non-stationary statistics of the data (see Section 3.3.1). We consider that the TVAR_m and TVAR_c sub-models capture the first- and second-order dynamics of the time series of a given class, respectively. The final augmented signals are obtained by using p_m(n) and p_c(n) and the parameters of TVAR_m and TVAR_c in Eq. (7), which leads to Eqs. (13) and (14), respectively, both of which are used to generate the augmented (new) synthetic time series.
The TVAR_m and TVAR_c processes in Eqs. (13) and (14) are thus expected to represent the behaviors captured by the mean and covariance, respectively. The augmented signals have length N and are computed L times, with N and L being free parameters of the method. The mean vector and covariance matrix of x_m(n) are shown in Eqs. (15) and (16), while Eqs. (17) and (18) show the same quantities for x_c(n), respectively.
Since we seek analytical expressions for the augmented time series (like Eq. (7)), finding the TVAR interpolation functions amounts to fitting regression expressions for p_m(n) and p_c(n) to the empirical statistics calculated from the data. Next, we address the calculation of the empirical statistics from the data and of the TVAR interpolation functions.

Computing the Empirical Statistics
The interpolation functions p_m(n) and p_c(n) are fitted to the empirical first- and second-order statistics computed from the data. As we assume the time series to be equally sampled (see A1), we propose to compute the empirical mean vector and covariance matrix of the signals by means of ensemble averaging (Manolakis, Ingle, & Kogon, 2005). More precisely, let us define the collection of equally-sampled, multivariate time series Y_j^(g) belonging to a given class or group g as follows:

C_g = {Y_j^(g)}, j = 1, ..., J,  (19)

where each time series Y_j^(g) is an M-variate sequence with N time samples (see Eq. (20)). Based on C_g, the empirical mean vector and covariance matrix at time n can be computed via ensemble averaging as

m^(g)(n) = (1/J) Σ_{j=1}^{J} y_j^(g)(n)  (21)

and

C^(g)(n) = (1/J) Σ_{j=1}^{J} [y_j^(g)(n) − m^(g)(n)][y_j^(g)(n) − m^(g)(n)]ᵀ,  (22)

where y_j^(g)(n) denotes the nth sample (an M-dimensional vector) of Y_j^(g).
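Because assumption A1 aligns the series sample-by-sample, the ensemble estimates reduce to a couple of array reductions. A minimal sketch (the (J, N, M) array layout is an assumption of this illustration):

```python
import numpy as np

def ensemble_statistics(series):
    """Ensemble-average estimates of the mean vector and covariance matrix
    at each time step n, over the J realizations of one class.
    `series` has shape (J, N, M): J series, N time steps, M channels."""
    J = series.shape[0]
    mean = series.mean(axis=0)                     # m(n), shape (N, M)
    dev = series - mean                            # deviations from the mean
    cov = np.einsum('jnm,jnk->nmk', dev, dev) / J  # C(n), shape (N, M, M)
    return mean, cov
```

Note that both statistics are functions of the time index n, so non-stationary behavior is preserved rather than averaged out over time.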

Finding the Interpolation Functions
For the sake of simplicity, to present the interpolation method, we consider the time-series datasets to be univariate (i.e., we make M = 1 in Eq. (20)). Also, we drop the superscript (g) characterizing the group or class of the time series. By doing so, Eqs. (21) and (22) become the univariate sequences m(n) and c(n), respectively. The results of this section can be extended to the multivariate case without loss of generality.
As explained in Section 3.2, the sub-models TVAR_m and TVAR_c characterize the first- and second-order dynamics of the data separately. Therefore, p_m(n) is fitted solely to m(n), and p_c(n) solely to c(n). Ahead, we address the calculation of the interpolation functions p_m(n) and p_c(n) by considering an approach based on sinusoidal regression via the Discrete Fourier Transform (DFT), which allows modeling a wide range of sensor time series. However, other curve-fitting techniques could be considered, like splines (de Boor, 2001) and other forms of linear (Montgomery, Peck, & Vining, 2006) and nonlinear regression (Bates & Watts, 2007), as long as they return an analytical expression for the fitted curves.

Sinusoidal Regression
Many PHM time series have an oscillatory nature (e.g., vibration and electrical signals), which can be better characterized by a sinusoidal model. A simple sinusoidal regression for p_m(n) and p_c(n) can be obtained by computing a DFT-based decomposition of m(n) and c(n). Ahead, we show the steps to compute the regression expression only for p_m(n), as the derivation for p_c(n) is identical (the only difference being replacing m(n) by c(n) in the calculations). Let the DFT of m(n) at discrete frequency ω_k = 2πk/N be

F_m(ω_k) = Σ_{n=0}^{N−1} m(n) e^{−iω_k n}.  (23)

By computing Eq. (23) for ω_0, ..., ω_{N−1} (i.e., for all the N available time points), we can build the vector f_m with the magnitude and phase spectra of m(n) for the frequency values ω_0, ..., ω_{N−1} (Eq. (24)). Note that we can reconstruct m(n) from Eq. (24) by computing the inverse DFT, i.e.,

m(n) = (1/N) Σ_{k=0}^{N−1} F_m(ω_k) e^{iω_k n}.  (25)

Let us sort the vector f_m in descending order according to the values of magnitude, obtaining the sorted vector f′_m (Eq. (26)), where ω′_k is the sorted frequency variable, so that the largest and smallest values of magnitude in f′_m are at ω′_0 and ω′_{N−1}, respectively. If we recompute Eq. (25) by using the values of the sorted vector in Eq. (26) up to the Pth element, we get

m̂(n) = (1/N) Σ_{k=0}^{P−1} F_m(ω′_k) e^{iω′_k n}  (27)

as an approximation of m(n) with the P most significant frequencies of the spectrum (here, described by the P largest values of magnitude). By expressing Eq. (27) in terms of cosines and sines and taking the real part of the resulting expression, the final sinusoidal regression formula for p_m(n) is obtained:

p_m(n) = (1/N) Σ_{k=0}^{P−1} |F_m(ω′_k)| cos(ω′_k n + ∠F_m(ω′_k)).  (28)

Finally, if we repeat the steps above for c(n), with |F_c(ω′_k)| being its sorted magnitude, we get the analogous expression for p_c(n) (Eq. (29)).
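Computationally, the truncation in Eqs. (26)-(28) amounts to keeping the P largest-magnitude DFT bins and inverse-transforming. A minimal Python sketch of that step:

```python
import numpy as np

def sinusoidal_fit(m, P):
    """Fit p_m(n) (or p_c(n)) as in Eqs. (23)-(28): keep the P DFT
    components of largest magnitude and take the real part of the
    inverse transform."""
    F = np.fft.fft(m)                     # Eq. (23) for all omega_k
    order = np.argsort(np.abs(F))[::-1]   # descending magnitude (Eq. (26))
    F_trunc = np.zeros_like(F)
    F_trunc[order[:P]] = F[order[:P]]     # P most significant frequencies
    return np.real(np.fft.ifft(F_trunc))  # Eqs. (27)-(28)
```

Since real-valued input yields a conjugate-symmetric spectrum, the retained components come in ± frequency pairs; choosing P even keeps such pairs together and the reconstruction balanced.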
The proposed data augmentation procedure is given in Algorithm 1, which can be repeated for every class the user wants to augment. In the next section, we discuss the experiments to evaluate the data augmentation in anomaly detection settings.
Algorithm 1: Proposed TVAR-based data augmentation.
Input: Collection C_g = {Y_j^(g)}_{j=1,...,J} of time series of a given class g to augment (see Eq. (19)).
Output: i) Collection C_{g,aug} of augmented time series for class g; ii) expressions for x_m(n+1) and x_c(n+1) (see Eqs. (13) and (14)).
Step 1: Initialize the TVAR_m and TVAR_c parameters {γ_m, λ_m, r_1m, r_2m} and {γ_c, λ_c, r_1c, r_2c} (see R1 to R3), and the perturbation noise correlation matrix Φ (see A4); initialize the time series length N and the number of augmented samples to create L; initialize the curve interpolation order P.
Step 2: For C_g, compute the empirical statistics m^(g)(n) and c^(g)(n) (see Eqs. (21) and (22)), and obtain one pair of scalar functions p_m(n) and p_c(n) per element of the array (see Section 3.3); store the pair(s) of p_m(n) and p_c(n).
Step 3: Use the stored p_m(n) and p_c(n) and the initialized parameters in Eqs. (13) and (14) to generate L augmented time series of length N, forming C_{g,aug}.
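Putting the steps together for the univariate case, the sketch below computes the empirical statistics, interpolates them with the DFT approach of Section 3.3.1, and then generates new series. Because Eqs. (13) and (14) are not reproduced here, the generation step is a simplified moment-matching surrogate (fitted mean curve plus noise scaled by the fitted variance curve), not the exact TVAR_m/TVAR_c recursions:

```python
import numpy as np

def fit_dft(curve, P):
    """Keep the P largest-magnitude DFT bins (Section 3.3.1)."""
    F = np.fft.fft(curve)
    idx = np.argsort(np.abs(F))[::-1][:P]
    F_trunc = np.zeros_like(F)
    F_trunc[idx] = F[idx]
    return np.real(np.fft.ifft(F_trunc))

def augment_class(series, P=14, L=5, rng=None):
    """series: (J, N) array of univariate signals of one class.
    Returns L synthetic signals matching the fitted mean/variance curves
    (a stand-in for the exact generation step of Algorithm 1)."""
    rng = rng or np.random.default_rng()
    m = series.mean(axis=0)                  # empirical mean m(n), Eq. (21)
    c = series.var(axis=0)                   # empirical variance c(n), Eq. (22)
    p_m, p_c = fit_dft(m, P), fit_dft(c, P)  # interpolation functions
    scale = np.sqrt(np.clip(p_c, 0.0, None)) # fitted curve may dip below zero
    return p_m + scale * rng.standard_normal((L, len(m)))
```

The clipping guards against the truncated Fourier fit of c(n) oscillating slightly below zero, which would otherwise break the square root.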

EXPERIMENTAL STUDY
In this section, we evaluate the ability of the proposed data augmentation method to improve the classification performance of typical ML methods in anomaly detection settings. We have tested two ML model architectures on three public sensor time-series datasets, and we have compared the proposed technique against a competing data augmentation approach. All the simulations have been carried out in Python. More details about the experiment design choices are given ahead.

Datasets
We have tested three public datasets of univariate time series that can be considered for anomaly detection studies. These are the collection of bearing vibration signals released by Case Western Reserve University (CWRU dataset) (Case School of Engineering Bearing Data Center, 2023), the set of measurements from piezoelectric (PZT) sensors from the 2019 Prognostics Health Management Data Challenge (PHMDC2019 dataset) (He et al., 2013; Peng et al., 2015), and the Ford engine noise data (Ford A dataset) (Chen et al., n.d.).
The PHMDC2019 dataset contains Lamb waves from fatigue experiments on aluminum lap joints. For more details on the experiment, please check (He et al., 2013; Peng et al., 2015). The selected Lamb wave signals have been measured for eight specimens (specimens T1 to T8). For each specimen, different loads have been applied to the testing material, and a pair of signals from two sensors has been recorded (each signal with 4000 samples). Then, the crack lengths have been measured for each specimen, applied load, and pair of signals. Here, we have considered that a given signal sample corresponds to a "damaged" sample (Class 1) if the observed crack length exceeds 4 mm. Otherwise, the sample is considered "normal" (Class 0). We have selected specimens T3 to T8 to build the train data and the remaining specimens to build the test data. By doing so, we could guarantee that both the train and test partitions contain data from Class 1. The sizes of the train and test splits for this dataset are shown in Table 1. Note that PHMDC2019 is the smallest dataset considered in this experiment. Using such a dataset allows the performance evaluation of the proposed method in small-data settings.
The CWRU dataset contains vibration signals collected from a drive-end bearing of an electrical motor operating under different loads. The selected vibration signals have been sampled at 12 kHz for a motor speed of 1797 RPM (Case School of Engineering Bearing Data Center, 2023). The chosen signals are from a normal bearing (Class 0), a bearing with a crack in the inner race (Class 1), and another one with a crack in the ball (Class 2). The cracks of the faulty bearings have a diameter of 0.007 inches. We have created signal snippets of 0.1 second from the vibration data of the bearings. To simulate challenging real-world scenarios, training samples from the faulty classes have been under-sampled to make the train data imbalanced. The numbers of samples obtained for the different partitions (i.e., train and test) and classes are shown in Table 1.
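For reference, 0.1 s at 12 kHz corresponds to snippets of 1200 samples. The windowing can be sketched as below (non-overlapping windows are an assumption of this illustration; the text only states the snippet duration):

```python
import numpy as np

FS = 12_000                  # CWRU drive-end sampling rate (Hz)
SNIPPET_LEN = int(0.1 * FS)  # 0.1 s -> 1200 samples per snippet

def make_snippets(signal, snippet_len=SNIPPET_LEN):
    """Cut one long vibration record into non-overlapping
    fixed-length snippets, dropping the leftover tail."""
    n = len(signal) // snippet_len
    return np.asarray(signal[:n * snippet_len]).reshape(n, snippet_len)
```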
The Ford A dataset is a collection of signals for anomaly detection. The data was originally proposed as part of the Ford Classification Challenge, which appeared in the competition program of the 2008 edition of the IEEE World Congress on Computational Intelligence (IEEE WCCI, 2008), and is currently made available by the UCR Time Series Classification Archive (Chen et al., n.d.). The Ford A dataset consists of noise signals collected from sensors installed on automotive engines that are either normal (Class 0) or present failure symptoms (Class 1). The numbers of signal samples available for each class and data partition are shown in Table 1. The signals have 500 points and are indexed time series (i.e., no information was provided about the sampling frequency).

Machine Learning Models
The augmentation methods have been evaluated in the context of anomaly detection via time series classification by two machine learning models, namely a convolutional neural network (CNN) and a random forest (RF). The CNN architecture has been originally proposed in (Z. Wang, Yan, & Oates, 2017) and made available in (Fawaz, 2020) as a TensorFlow/Keras tutorial on neural networks for time series classification. The model hyperparameters (number of filters, batch and kernel sizes, etc.) have been determined with the KerasTuner module (Keras, 2022), and the resulting architecture has been trained for 20 epochs. More details on the model structure and its hyperparameters can be found in (Fawaz, 2020).
The RF model is the standard random forest implementation offered by the Python scikit-learn library (Buitinck et al., 2013). The chosen RF architecture has 50 estimators and a max depth of 5. The RF model is fitted to a tabular feature representation extracted from the time series data with the Python package tsfresh (Christ, Braun, Neuffer, & Kempa-Liehr, 2018). This module allows for extracting typical features from time series via a systematic framework. Further information on the available features and calculations implemented by tsfresh can be obtained in (Christ, Braun, Neuffer, & Kempa-Liehr, 2023).
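The RF setup can be reproduced in a few lines. In this sketch, the tsfresh features are replaced by a handful of hand-rolled summary statistics so the example stays self-contained, and the toy two-class data is synthetic; only the classifier configuration (50 estimators, max depth 5) follows the text:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def summary_features(X):
    """Stand-in for tsfresh: a few basic per-series statistics
    (tsfresh extracts a far richer, systematic feature set)."""
    return np.column_stack([X.mean(axis=1), X.std(axis=1),
                            X.min(axis=1), X.max(axis=1),
                            np.abs(np.diff(X, axis=1)).mean(axis=1)])

rng = np.random.default_rng(0)
normal = rng.standard_normal((40, 200))        # Class 0: unit-variance noise
faulty = 3.0 * rng.standard_normal((40, 200))  # Class 1: higher-energy noise
X = summary_features(np.vstack([normal, faulty]))
y = np.repeat([0, 1], 40)

rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
rf.fit(X, y)
```

Converting each series into a fixed-length feature vector is what makes tree-based models applicable to raw time series in the first place.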

Data-Augmentation Procedures
In this study, we have compared the classification performance obtained by using the proposed augmentation method against an alternative approach available in the literature.
The competing data augmentation approach is given by the Python library TSAug (Arundo Analytics, 2023), which offers a suite of typical augmentation transformations for time series. For the simulations, we have chosen some of the default transformation effects selected by the TSAug authors in (Arundo Analytics, 2023) to exemplify the capabilities of the library. These are i) addition of white Gaussian noise (WGN) to the time series with varying values of standard deviation (here, set to vary within ±10% of the standard deviation estimated from the data), ii) random drop of 10% of the data points followed by filling with zeros, iii) random drift of a sequence of data points up and down, and iv) reduction of the temporal resolution (sampling) of some sequences of data points.
For iii) and iv), we have considered sequences of up to five data points. These transformations simulate random fluctuations and artifacts that commonly take place when processing batches of sensor data. Given an input time series, the TSAug library randomly applies transformations i) to iv) to create a new synthetic time series with the random artifacts.
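For clarity, the four effects can be mimicked with plain numpy. This is a rough re-implementation for illustration only; the actual TSAug library applies its own randomized, configurable versions of each transform:

```python
import numpy as np

def tsaug_like(x, rng, seg=5):
    """Apply rough equivalents of the four effects: i) additive WGN,
    ii) random 10% dropout filled with zeros, iii) drift of a short
    segment up/down, iv) locally reduced temporal resolution."""
    y = np.asarray(x, float).copy()
    sd = y.std()
    y += rng.normal(0.0, 0.1 * sd, size=y.shape)              # i) WGN
    drop = rng.choice(len(y), size=len(y) // 10, replace=False)
    y[drop] = 0.0                                             # ii) dropout
    s = int(rng.integers(0, len(y) - seg))
    y[s:s + seg] += rng.uniform(-1.0, 1.0) * sd               # iii) drift
    s = int(rng.integers(0, len(y) - seg))
    y[s:s + seg] = y[s]                                       # iv) hold value
    return y
```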
The proposed augmentation (TvarAug) has been computed by fitting the sinusoidal interpolation functions p_m(n) and p_c(n) for different interpolation orders P (see Eqs. (28) and (29)). For each interpolated curve, we have created the TVAR_m and TVAR_c expressions (see Eqs. (13) and (14)) by considering the sets of parameter values given in Table 2. The noise of the TVAR expressions has been set to a WGN(0, 1) process. Examples of the augmented time series obtained by using these parameter values for the TVAR_m and TVAR_c sub-models are shown in Figure 2. Notice that, overall, the augmented data can capture relevant patterns of the dynamics of the original time series, while introducing the level of stochasticity necessary for generating new synthetic signal segments.

Test Results
The TVAR augmentation (TvarAug) and the TSAug approach have been employed to create new synthetic faulty samples for the training partitions of the three considered datasets (see Section 4.1). These signal samples have been augmented considering the following augmentation fractions: 0.1, 0.25, 0.5, and 1. The CNN and RF models have been trained on the original and augmented train data partitions, and tested on the available test data. The evaluation on the test data has been performed by computing the accuracy score of classifying the time series samples into the considered Faulty/Normal categories. For the TvarAug method, the average test accuracy has been computed over the different values of P considered. The obtained accuracy values are shown in Table 3, with the best results for each case highlighted in bold. The use of TvarAug has improved the classification when compared to the original data (NoAug) in all cases except the CWRU dataset with RF, where the original data already yields very good performance. As expected, the application of data augmentation has improved the overall detection performance of the ML models in all of the more challenging cases. The proposed method outperforms the competing approach for all considered datasets and augmentation fractions, which evidences its superior performance.

CONCLUSION
Table 3. Results as average values of test accuracy computed over different values of P, where "NoAug" stands for no data augmentation (original dataset), "TSAug" is the augmentation given by the TSAug library, and "TvarAug" is the proposed method. Results are shown for different datasets, augmentation fractions, and machine learning models.

This paper presented a novel data-augmentation method which can be effectively employed for fault diagnostics or failure prognostics applications, as it can adequately be used for the generation of artificial samples of multivariate non-stationary time-series data. The method leverages time-varying autoregressive models for this purpose, and artificial samples are generated based on mean and covariance estimates obtained from the available real samples. The method was successfully tested on failure diagnostics applications. Three publicly available real-world datasets and two different machine learning methods were employed in the experiments, and the proposed methodology provided improved results in almost all cases.
As the observed diagnostics results are promising, the proposed method could be further tested in more complex PHM tasks.Future work will include the application of the methodology for failure prognostics problems, as for this kind of application the scarcity of data is in general even more limiting compared to diagnostics.

Figure 1. Example of sinusoidal interpolation p_m(n) and p_c(n) of the empirical mean and variance for the Ford A dataset. The interpolation orders considered are (a) and (b) P = 4, (c) and (d) P = 10, and (e) and (f) P = 14.

Figure 1 illustrates the application of the sinusoidal interpolation to the empirical mean and variance curves computed for the time series belonging to the Ford A dataset (see Section 4.1 for details on this dataset). Three different orders have been considered for p_m(n) and p_c(n), namely P = 4 (Figures 1(a) and (b)), P = 10 (Figures 1(c) and (d)), and P = 14 (Figures 1(e) and (f)). Notice that the sinusoidal regression can fit the empirical statistic curves reasonably well, especially for higher orders.

Figure 2. Examples of original and augmented time series for the datasets evaluated in this study. The augmented signals are generated by the fitted TVAR_m and TVAR_c sub-models with sinusoidal interpolation of order P = 14. (a) PHMDC2019, (b) CWRU, and (c) Ford A datasets.

Table 1. Number of samples per dataset and per class in the train and test splits used in the experiment of Section 4.

Table 2. Parameters of the TVAR_m and TVAR_c sub-models employed to create the synthetic signals used in the experimental study.