Text Classification and Tagging of United States Army Ground Vehicle Fault Descriptions in Support of Data-Driven Prognostics

The manner in which a prognostics problem is framed is critical for enabling its solution by the proper method. Recently, data-driven prognostics techniques have demonstrated enormous potential when used alone, or as part of a hybrid solution in conjunction with physics-based models. Historical maintenance data constitutes a critical element for the use of a data-driven approach to prognostics, such as supervised machine learning. The historical data is used to create training and testing data sets to develop the machine learning model. Categorical classes for prediction are required for machine learning methods; however, faults of interest in US Army Ground Vehicle Maintenance Records appear as natural language text descriptions rather than a finite set of discrete labels. Transforming linguistically complex data into a set of prognostics classes is necessary for utilizing supervised machine learning approaches for prognostics. Manually labeling fault description instances is effective, but extremely time-consuming; thus, an automated approach to labelling is preferred. The approach described in this paper examines key aspects of the fault text relevant to enabling automatic labeling. A method was developed based on the hypothesis that a given fault description could be generalized into a category. This method uses various natural language processing (NLP) techniques and a priori knowledge of ground vehicle faults to assign classes to the maintenance fault descriptions. The core component of the method used in this paper is a Word2Vec word-embedding model. Word embeddings are used in conjunction with a token-oriented rule-based data structure for document classification. This methodology tags text with user-provided classes using a corpus of similar text fields as its training set. 
With classes of faults reliably assigned to a given description, supervised machine learning with these classes can be applied using related maintenance information that preceded the fault. This method was developed for labeling US Army Ground Vehicle Maintenance Records, but is general enough to be applied to any natural language data set accompanied by a priori knowledge of its contents for consistent labeling. In addition to applications in machine learning, generated labels are also conducive to general summarization and case-by-case analysis of faults. The maintenance components of interest for this current application are alternators and gaskets, with future development directed towards determining the remaining useful life (RUL) of these components based on the labeled data.


INTRODUCTION
The primary method of prediction in prognostics is the use of physics-based models, which formally model the mechanical system in question from a priori knowledge of the system (Batzel & Swanson, 2009; Yang, Ito, Yang, & Liu, 2016). These approaches are often designed for a specific component or maintenance event in question, and thus do not always generalize to new prognostic problems (Aivaliotis, 2013; Mccollom & Worth, 2011; Terrissa, Meraghni, Bouzidi, & Zerhouni, 2016). Data-driven approaches, such as supervised machine learning, offer an alternative. One caveat to the application of machine learning methods is that labelled example instances are required to create a model. Furthermore, the quality of the example instances directly affects the accuracy of the model.
In order to perform prognostics, a point of instance has to be determined for prediction (Batzel & Swanson, 2009). A data-driven modeling approach to prognostics necessitates previous examples of these instances to infer a set of rules and parameters to use for making predictions on new instances (Qu, Liu, Ma, & Fan, 2019). Having these instances in a consistent and discrete format is more useful than having them in a variable and fuzzy format. The mechanical faults in US Army Ground Vehicle Maintenance Records are recorded in natural language text, which, while more descriptive, does not lend itself to discrete prediction in a data-driven approach. A means for determining labels from the descriptions is needed, and this paper presents a method for accomplishing that.
Vehicle fault descriptions for maintenance events employ different terms and jargon when recorded. The terminology used by technicians and mechanics varies, but has a consistent structure, with given terms used either in conjunction or as synonyms to other terms. Terms in the description are associated with the faults described; hypothetically, each fault within a class of faults has a number of terms used to identify it. Determining these sets of terms allows for the identification of a fault label from a fault description by comparing the terms in the description to the set of terms correlated with the fault.
With this understanding, we propose a method of identifying these terms and using them to identify a description as belonging to a class of zero or more faults. The methodology for this uses a multi-step process integrating various techniques in text analysis and NLP. The core element used in this approach is the Word2Vec (W2V) word-embedding algorithm (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013), which transforms the fault text corpus into a numeric vector space by which contextually similar terms identify a rule set around a single term that is used to identify the class. The rule set then serves as the data structure to compare sets of tokens found in the fault descriptions. The association of a fault description with a class depends on how the tokens compare to each other based on the rule set.
Applications for this classification function include creating labels for training samples for use by machine-learning algorithms and for general text summarization. The labels created in the context of machine learning can serve as either the class for prediction or as a feature. For future research on text classification and tagging, the primary goal is to use these models for creating classes for prediction by a prognostics model. The labels also serve as a good means for summarizing an entry, that is, for understanding the theme of an overall description and its underlying semantics. The generalizability of the approach also allows it to apply in circumstances other than fault labeling. All of the applications listed above generalize to any task requiring discrete class labeling or textual summarization based on a defined corpus.

BACKGROUND
The modern precursor of prognostics dates to the 1930s and 1940s as corrective maintenance activities. This means simply that when something broke, someone fixed it. In the early 1960s the concept of preventative maintenance was introduced with the first-generation aircraft health management system used in the B727 and B737 Classic (Wheeler, Kurtoglu, & Poll, 2010). Approximately fifteen years later, in the mid-1980s, predictive practices made an appearance. Furthermore, it was not until the early 2000s that condition-based maintenance and prognostics began to be used in predicting the remaining useful life (RUL) of specific parts of a system. Clearly, the field of prognostics and health management (PHM) is still very young, and there is much room for more research.
On the other hand, there has been a notable amount of research performed in the field of PHM that is focused on issues like crack growth of wind turbines and other rotary vehicles (Corbetta, Sbarufatti, Manes, & Giglio, 2015), lithium-ion battery life (Tang, Hettler, Zhang, & DeCastro, 2014; Xing, Williard, Tsui, & Pecht, 2011), gearbox health management (Jiang et al., 2019), and even human health and performance (Ahmetov et al., 2008). A plethora of different methods of prediction have been used to accurately describe the RUL of sensor systems, engineered resilient systems, and others. The most common methods or models of prediction fall under just a few categories: physics-based models, data-driven models, and a hybrid approach of these. Some tools used to drive these methods include particle filtering, regression-based models, neural networks, NLP, Kalman filtering, Markovian process-based models, and threshold regression models.

The Importance of PHM
Using these tools with real time data in physics-based and/or data-driven models allows us to make predictions with greater accuracy, providing the time necessary to make wise decisions in the maintenance of a ground vehicle, aircraft, naval vessel, or sensor system, etc. Time is paramount when making decisions about saving money and maintaining reliability and safety of a vehicle. Critical issues for consideration include the vehicle availability, cost-benefit balance, when and where to replace or maintain components, and best approaches to maintenance.
As the drive to produce innovative technology continues, the cost and complexity of the technology increases, giving rise to the importance of PHM. There are now military weapon platforms that have built-in PHM capabilities (Goebel et al., 2017). There are numerous other systems that are being designed to have similar capabilities onboard to aid in maintaining a healthy balance between cost and benefit. In fact, research has been performed on considerations of sensor system selection that will drive these capabilities (Cheng, Azarian, & Pecht, 2010).
In the realm of Army Ground Vehicles (GV), PHM has great application in predicting the RUL of transmission gearboxes, power steering units, batteries, suspensions, and more. Conducting PHM on Army Ground Vehicles gives the Department of Defense (DOD) the opportunity to save money, increase availability of ground vehicles, and better ensure the reliability and safety of these vehicles. In the following sections, we will describe a process for creating a training data set from US Army GV maintenance logs to create a machine learning algorithm that will predict the RUL of GV alternators. This endeavor supports the DOD's implementation of Condition Based Maintenance (CBM) under the DOD instruction 4151.22 (Bell, 2008).
According to this instruction, CBM should be a principal consideration to implement proper maintenance practices for US military systems. Based on the implementation of CBM to US military systems, the DOD intends to reduce the maintenance costs and improve the management of their assets. A complete CBM system consists of eight infrastructure areas: sensors, data management, condition monitoring, health assessment, analytics, decision support, human interfaces, and communications (Bell, 2008).
PHM serves as a part of the analytics aspect of CBM through prognostics assessment. Therefore, properly categorizing the US Army GV maintenance logs is critical for our downstream goal of utilizing the health and usage monitoring system (HUMS) data to support CBM.

Previous Works
In this section we examine previous work performed in the area of using HUMS data to develop machine learning algorithms to predict RUL of GV systems. These works helped guide the early stages of planning and provide information on available tools, the current state of technology and research in this field, and techniques that have been used to solve a given problem.
Alternator components have received considerable research attention in the field of PHM. Oh, Azarian, Pecht, White, Sohaney, and Rhem (2010) propose a physics-of-failure (PoF) approach for fan PHM in electronics applications. Cui, Shi, and Zhang (2017) present a method of fault detection for the rotating rectifier (RR) of aircraft generators. Nadarajan, Panda, Bhangu, and Gupta (2015) developed a hybrid model for a wound rotor synchronous generator to detect and diagnose faults in stator windings. These various model-based approaches focus on detecting specific component faults. One main limitation of these approaches is that they may not be feasible to implement for a complete complex system. As an example, the vehicle power generator has received research attention. Hardware-supported experiment data was used for the prognostics of the generator (Bayba, Siegel, & Tom, 2012). However, alternator failure is complicated. It may be caused by the interaction of a series of components, and even by the ambient temperature and overload of the work environment (Puzakov, 2020). For military vehicles, these methods are not sufficient since they operate in extreme environments with large electrical loads connected to the vehicle (Banks, Reichard, Hines, & Brought, 2008). Thus, developing a robust prognostic algorithm based on a large fleet of vehicles has attracted researchers' attention (Du & Zhang, 2018). One requirement for developing these robust prognostic algorithms is the use of maintenance records containing fault descriptions to determine prominent faults within a GV system. A method for extracting this information is presented in the following sections.

METHODOLOGY
Ground Vehicle maintenance data is recorded as textual information in the form of logbook entries, where each entry is associated with a single maintenance event. This text is of two categories: (1) non-restricted, or natural language text, which provides a description of the maintenance event, and (2) restricted, or categorical, text, which consists of a limited number of possible entries representing categories of maintenance. Given that the fault descriptions in the logs are non-restricted text, the analysis of this text necessitates the use of NLP techniques. NLP is a collection of language analysis tools used for analyzing and evaluating naturally produced text that contains linguistic and grammatical structures.

Early Exploration
The first steps undertaken were data cleaning and exploration. The original GV maintenance data contained 161,864 observations of 57 variables. Included in these variables were fault descriptions, correction narratives, and vehicle families. After further exploration, it was noted that one particular family of vehicles appeared to be the most expensive to maintain; therefore, early explorations of this data focused mainly on this family and were later expanded to the entire dataset. After exploring additional variables, it was clear that the fault descriptions and correction narratives would be the two most valuable variables in the data for our purpose.
The first step of exploring these fault and correction variables was to remove non-unique entries based on their vehicle identification number, maintenance occurrence, and maintenance date. The next step was to remove unwanted punctuation, change all entries to lower case letters, and remove stop words, e.g., "a," "an," "the," "for," etc., so that a corpus of words could be created that were unique per entry. This corpus would also be void of the most commonly used words that added no "value" to the entry. For instance, the phrase "the windshield is cracked" would yield "windshield cracked". This cleaned dataset was used to create several n-grams used to find the frequency of the most common words, pairs of words, and groups of three words in the dataset. The results showed that there were indeed words more commonly used than others, implying that there were components or problems that stood out in the selected family of vehicles. By determining which components showed up most often, and comparing these components with cost and man hours, it was determined that the engine and electrical system are two areas for which improved maintenance could be very impactful. The next step was to develop a plan to logically associate the maintenance log data with the sensor data that corresponded to the maintenance event on a component.
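The de-duplication and n-gram counting described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this work: the record layout, the sample records, and the helper names `deduplicate` and `ngram_counts` are all hypothetical, and the texts are assumed to be already cleaned.

```python
from collections import Counter

# Hypothetical records: (vehicle_id, occurrence, date, cleaned fault text).
records = [
    ("V1", 1, "2019-01-05", "windshield cracked"),
    ("V1", 1, "2019-01-05", "windshield cracked"),  # duplicate of the first record
    ("V2", 1, "2019-02-11", "alternator inop windshield cracked"),
    ("V3", 2, "2019-03-02", "alternator inop"),
]

def deduplicate(rows):
    """Keep the first record for each (vehicle_id, occurrence, date) key."""
    seen, unique = set(), []
    for row in rows:
        key = row[:3]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def ngram_counts(texts, n):
    """Count n-grams (tuples of n consecutive tokens) across all texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # zip over n shifted copies of the token list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

unique = deduplicate(records)
texts = [row[3] for row in unique]
unigrams = ngram_counts(texts, 1)
bigrams = ngram_counts(texts, 2)
print(unigrams.most_common(4))
print(bigrams.most_common(3))
```

Ranking the resulting counts with `most_common` gives the kind of frequency view used to spot components that stand out in the logs.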

Text Tagging
The goal of the text tagging task is to extract entries of a particular category from the natural language text of the GV fault descriptions to create training data for a machine-learning model. The first step is to determine the set of categories that exists in the maintenance log, and then use a rule-based system to decide which categories apply to which entries based on the fault description text. Given that the fault descriptions and rule-base for a particular malfunction are mentioned plainly through the vocabulary of the text, the rule-base for a particular fault is represented as a series of tokens (words or abbreviated words); if the tokens are present in the fault description text to some degree, that record will be classified within the fault category. We used this method to identify particular types of faults from the maintenance log. Subsequently, we use the fault categories to train a model with related vehicle operational and maintenance log data to predict on the same fault category.
The next task is to find the collection of tokens to form a rule-base for the text classification. The GV fault text entries are English sentences and have a consistent structure for tokenization. The tokens themselves are the words separated by spaces in the text. After this text is converted from the raw text to a series of tokens, vectorization can be used to find numerical relations between tokens based on their relative ordering. With these relationships established, the vector synonyms are used to form a rule-base for classifying token groups in the maintenance log entries.
The preprocessing stage of NLP consists of cleaning the fault description text, putting it into a standard format, and then tokenizing it. The purpose of textual preprocessing is to reduce a text to its most basic characteristics and to remove any unnecessary components while maintaining the basic structure of the text. This basic structure of the text is comprised of tokens and their ordering. The process of cleaning the text includes removing numbers, punctuation, special characters, and stop words. The stop words used are from the Natural Language Toolkit (NLTK) library (Bird, Klein, & Loper, 2009). Format standardization includes removal of unnecessary spacing, conversion of all characters to the same case, lemmatization, and finally tokenization. After this process, the text is no longer in its original character sequence format but is a set of atomic units, i.e., tokens. This process is shown in Figure 1.
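As a rough sketch of this preprocessing stage, the function below lowercases a fault description, removes numbers, punctuation, and special characters, collapses spacing, tokenizes on whitespace, and optionally drops stop words. Several simplifications are assumptions made for illustration: `STOP_WORDS` is a small stand-in for the NLTK English stop-word list, the order of steps is condensed, lemmatization is omitted because it requires an external lemmatizer, and the name `preprocess` is hypothetical.

```python
import re

# Small stand-in stop-word set (assumption); the paper uses the NLTK list.
STOP_WORDS = {"a", "an", "the", "for", "is", "of", "and", "to"}

def preprocess(text, remove_stop_words=True):
    """Clean one fault description and return its tokens in original order."""
    text = text.lower()                     # convert all characters to the same case
    text = re.sub(r"[^a-z\s]", " ", text)   # drop numbers, punctuation, special characters
    tokens = text.split()                   # split() also collapses unnecessary spacing
    if remove_stop_words:
        tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    # Lemmatization would be applied here; it is omitted in this sketch.
    return tokens

print(preprocess("The  windshield is CRACKED!! (2 places)"))
```

The `remove_stop_words` flag is included so that the same function can produce both variants of the corpus, with and without stop words.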
The tokens of the GV fault descriptions text are vectorized to create a word embedding based on the tokens' relative ordering. The word embedding is a mapping of the tokens to a vector space; that is, the underlying meaning and relationship of the original text is captured in a numeric format. This is an essential step, because machine learning models can operate only on numeric data. W2V, a two-layer neural network that produces a vector space from a corpus of text, is the algorithm used to produce the word embedding for this work. The word embedding quantitatively maps words based on similarity. This provides the ability to discover which words in the corpus have a high relational value between each other, e.g., synonyms, or words that are related to each other through specific semantic contexts.
Figure 1. Overview of text tagging methodology, including steps for tokenization and preprocessing, vectorization, and rule base creation.
A graphical representation of a semantic vector space is shown in Figure 2. Similar tokens often appear in comparable positions within the corpus of text and can be used to identify a category of entries. For a given class believed to exist in the corpus of text, a set of related tokens is found in the space based on proximity. This set of similar tokens, along with the original term, is used to create the rule-base to find categories of faults within the text. To create a rule-base set, a group of class tokens is compiled based on the classes believed to exist in the corpus.
Each class token in the rule-base set has a subgroup of associated tokens representing synonyms of the class token. The rule-base set is then compiled using the word embeddings produced by the W2V algorithm.
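The construction of a rule-base from proximity in the vector space can be illustrated with the sketch below. It is not the implementation used in this work: the three-dimensional vectors are invented toy values standing in for trained W2V embeddings, cosine similarity is assumed as the proximity measure, and `build_rule_base` is a hypothetical helper. With a trained gensim Word2Vec model, a nearest-neighbor query such as `model.wv.most_similar(token, topn=n)` would typically play the role of the ranking step.

```python
import math

# Toy vectors invented for illustration; real vectors would come from a
# Word2Vec model trained on the fault-description corpus.
EMBEDDINGS = {
    "inoperative": [0.9, 0.1, 0.0],
    "inop":        [0.8, 0.2, 0.1],
    "broken":      [0.7, 0.3, 0.0],
    "tire":        [0.0, 0.9, 0.4],
    "flat":        [0.1, 0.8, 0.5],
    "engine":      [0.1, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_rule_base(class_tokens, n_synonyms):
    """Map each class token to itself plus its n_synonyms nearest tokens."""
    rule_base = {}
    for cls in class_tokens:
        others = [tok for tok in EMBEDDINGS if tok != cls]
        others.sort(key=lambda tok: cosine(EMBEDDINGS[cls], EMBEDDINGS[tok]),
                    reverse=True)
        rule_base[cls] = {cls, *others[:n_synonyms]}
    return rule_base

rule_base = build_rule_base(["inoperative", "tire"], n_synonyms=2)
print(rule_base)
```

Storing each rule as a set keeps the later membership and overlap checks simple.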
When tagging a section of fault tokens with a maintenance label, the criterion for whether a tag applies to the text is whether the tag's rule-base shares tokens with the token set of the fault text. For the current implementation, the rule-base and the fault token set must have one or more tokens in common for the text to be classified under that tag. This constitutes a fault text entry. A fault text entry can have zero or more tags applied to it depending on whether the tags match any of the rule-base criteria. These tag sets represent the final classification of the maintenance entries.
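The tagging criterion amounts to a set intersection between the fault tokens and each tag's rule-base. The sketch below illustrates it; the rule-base contents and the names `RULE_BASE`, `tag_entry`, and `min_shared` are hypothetical, with `min_shared=1` mirroring the one-or-more threshold of the current implementation.

```python
# Hypothetical rule-base: each class token maps to itself plus its synonyms.
RULE_BASE = {
    "alternator": {"alternator", "alt", "generator"},
    "inoperative": {"inoperative", "inop", "inoperable", "broken"},
    "tire": {"tire", "tires", "flat"},
}

def tag_entry(tokens, rule_base=RULE_BASE, min_shared=1):
    """Return every tag whose rule-base shares at least min_shared tokens."""
    token_set = set(tokens)
    return sorted(
        tag for tag, rule_tokens in rule_base.items()
        if len(token_set & rule_tokens) >= min_shared
    )

print(tag_entry(["alternator", "inop"]))     # two tags apply
print(tag_entry(["windshield", "cracked"]))  # no tag applies
```

Raising `min_shared` would make the rule stricter, at the cost of leaving more entries untagged.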

RESULTS
We performed two types of tests with the text tagging method to determine its efficacy. The first test was a general comparison of the method's performance when applied to each of the three separate text columns in the maintenance data set: Fault Description, Correction Narrative, and Remarks. Performing this test requires applying the text tagging methodology across each of the listed columns and comparing the relative frequency of tags in each column to determine if the method tags consistently with related entries for the same maintenance instance. The basic hyperparameters used in this instance include full text preprocessing, a rule base size of five tokens (not including class), and a class set of twenty (several of the most common occurrences are displayed in Table 1).
To determine how tag frequencies compare across different text columns within the same set of instances, we use correlation analysis to see whether the tag frequencies follow a similar pattern between each pair of columns. Table 2 shows the correlation matrix for the text from the Fault Description, Correction Narrative, and Remarks fields.
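A correlation matrix of this kind can be computed as sketched below. The tag counts are invented for illustration and are not results from this work, and the Pearson coefficient is an assumption, since the type of correlation used is not stated here.

```python
import math

# Invented tag counts per text column, listed in the same tag order
# (e.g., alternator, engine, tire, battery). These are not the paper's data.
COUNTS = {
    "Fault Description":    [120, 300, 80, 150],
    "Correction Narrative": [110, 280, 95, 140],
    "Remarks":              [60, 150, 40, 70],
}

def relative_frequencies(counts):
    """Convert raw tag counts to relative frequencies within one column."""
    total = sum(counts)
    return [c / total for c in counts]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)

freqs = {col: relative_frequencies(c) for col, c in COUNTS.items()}
columns = list(freqs)
matrix = {a: {b: pearson(freqs[a], freqs[b]) for b in columns} for a in columns}
for a in columns:
    print(a, [round(matrix[a][b], 3) for b in columns])
```

Values near one for every pair would indicate that the tagger assigns labels in similar proportions regardless of which text column it reads.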
For this particular test, there is a high degree of similarity between each text column pair, demonstrating that labeling is consistent across text column instances.
The second set of experiments pertains to the differentiation of results when hyper-parameters are changed; in particular, these experiments evaluate the effect on the rule base when including stop words and changing the window size of the Word2Vec model. There is a selection of six different class tokens in this comparison based on an a priori analysis of the maintenance logs and an understanding of common maintenance tasks. Eight different Word2Vec models were trained on the GV fault description data. These are divided into two sets: the set with stop words removed and the set that did not have stop words removed (Bird et al., 2009). Each set of four models has a window size ranging between two and five. For the analysis, we use a similarity matrix for comparing the different rule-bases created by the Word2Vec model. The values in each cell represent the percentage of shared tokens between the rule-bases created by each model for the particular class under examination. For each experiment a 'class' and an 'n_synonyms' is selected. The class is the keyword used to generate the rule-base using the Word2Vec models, while the n_synonyms is the number of similar tokens retrieved for the rule-base.
For example, to compare the similarity between the rule-bases for class 'inoperative' for a model using a window size of two and a model using a window size of three with an n_synonyms value of five each, we would get the two rule-base sets: ['inop', 'inoperable', 'broken', 'burnt', 'shorted'] for a window size of two, and ['inop', 'inoperable', 'broken', 'burnt', 'unserviceable'] for a window size of three. Comparing the two sets, we see they share four out of their five words, giving them a similarity score of 80%.
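The worked example above can be reproduced with a short sketch. The function names are hypothetical, and the score assumes that both rule-bases contain the same number of tokens (the n_synonyms value), so that the shared count can be divided by the size of either set.

```python
def rule_base_similarity(first, second):
    """Percentage of tokens shared by two rule-bases of equal size."""
    first_set, second_set = set(first), set(second)
    return 100.0 * len(first_set & second_set) / len(first_set)

def similarity_matrix(rule_bases):
    """Pairwise similarity between the rule-bases of several named models."""
    names = list(rule_bases)
    return {
        row: {col: rule_base_similarity(rule_bases[row], rule_bases[col])
              for col in names}
        for row in names
    }

window_2 = ["inop", "inoperable", "broken", "burnt", "shorted"]
window_3 = ["inop", "inoperable", "broken", "burnt", "unserviceable"]

print(rule_base_similarity(window_2, window_3))  # 80.0
print(similarity_matrix({"window_2": window_2, "window_3": window_3}))
```

Applying `similarity_matrix` to the rule-bases of all models for one class yields one similarity matrix per class, with 100 on the diagonal.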
This process was repeated for six different classes between all eight models, with the n_synonyms variable set to twenty. The class tokens were selected from common fault types in the GV maintenance log data. They are: 'alternator', 'engine', 'suspension', 'transmission', 'battery', and 'tire'. Similar results were obtained for the six classes. The similarity matrix shows that models trained with stop words have higher inter-model similarity than those trained without stop words. This supports the case for using the methodology with stop words intact, because having high model agreement when other parameters are controlled for suggests stronger synonym relationships for the tokens selected in the rule-base. The advantage of retaining stop words can be explained by the fact that they give better contextual clues for the tokens of interest when training the W2V model.

CONCLUSION
Prognostics for US Army Ground Vehicles is a research area of great significance. Improved maintenance practices based on advanced machine learning algorithms that leverage collected maintenance and operational data have the potential to greatly reduce maintenance costs, and improve fleet reliability and maintainability. Currently, there is no method for determining a subset of operational data that is related to a particular recorded maintenance event. Using NLP techniques on maintenance data to automatically create labelled operational data will facilitate the creation of large sets of training data that can be used to create data-driven prognostics models for vehicle components.
In this paper, we have presented a method for capturing a consistent set of maintenance labels from the maintenance logbook data. For future work, we will cross-correlate these labels with the operational data to produce a training set for prognostics algorithms.