Machine learning (ML) and artificial intelligence (AI) have widespread applications and have revolutionized many industries, driven by mature sensor technology and large-scale data collection efforts. One of the key tasks for effective ML/AI operations is extracting and identifying useful and usable data in order to uncover complex interrelationships and solve problems efficiently. The usefulness of data is its value and meaning within the desired model, while its usability refers to how easily it can be used in a model. Complex supervised and unsupervised ML models, once the domain of cutting-edge scientists and academics, can now be invoked as basic function calls in public domain packages for Python, R, MATLAB, and other languages. These functions require effective data preprocessing to overcome the unpredictable effects of real-world data quality (e.g., missing data, environmental noise, and signals sampled at different rates that must be synchronized), yet their ease of use means they are often called with little to no understanding of the underlying math or of ways to work through the data set efficiently. The approachability of these packages enables users to dive into complex problem sets with little advance preparation. The resulting lack of understanding, however, will inevitably cause problems, skew results, or force the user onto a less efficient path to a similar answer. Each package provides relatively simple examples that deal with specific public data sets, yet few provide the background knowledge and comprehensive methods required to build the inputs for extensive and effective time-series data modeling. Typically, the complex nature of time-series data requires an in-depth understanding of signal analysis and subject-matter expertise before it can be used in ML/AI predictive models.
This paper provides the reader with an overview of the problems associated with time-series data modeling, proposes a common set of preprocessing steps to follow, demonstrates a taxonomy for classifying time-series data, provides introductory reasoning regarding the underlying process, and discusses the models that would benefit from such a methodology. The goal is to equip readers who are not knowledge-domain experts with updated and approachable techniques for deciding which features to focus on when preprocessing their time-series data.
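To make the data-quality issues named above concrete (missing data and signals sampled at different rates), the following is a minimal sketch of one possible alignment step. It is not the preprocessing procedure proposed in the paper. The function names (`drop_missing`, `interpolate_at`, `align`), the sensor names, and the sample values are all hypothetical, and linear interpolation is an assumed gap-filling choice that may not suit every signal.

```python
from bisect import bisect_left
from math import isnan


def drop_missing(times, values):
    """Remove samples whose value is None or NaN (hypothetical helper)."""
    kept = [(t, v) for t, v in zip(times, values)
            if v is not None and not isnan(v)]
    return [t for t, _ in kept], [v for _, v in kept]


def interpolate_at(times, values, t):
    """Linearly interpolate a series at time t, clamping outside its range."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_left(times, t)  # first index with times[i] >= t
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def align(series, step):
    """Resample every (times, values) series onto one shared, even grid.

    Assumes each series is sorted by time and still has at least one
    valid sample after missing values are dropped.
    """
    cleaned = {name: drop_missing(t, v) for name, (t, v) in series.items()}
    # Only the time span covered by every series is kept.
    start = max(t[0] for t, _ in cleaned.values())
    stop = min(t[-1] for t, _ in cleaned.values())
    n = int((stop - start) / step + 1e-9) + 1
    grid = [start + k * step for k in range(n)]
    resampled = {name: [interpolate_at(t, v, g) for g in grid]
                 for name, (t, v) in cleaned.items()}
    return grid, resampled


# Illustrative data: a 1 Hz temperature signal with one missing sample
# and a 2 Hz vibration signal, both invented for this sketch.
sensors = {
    "temperature": ([0.0, 1.0, 2.0, 3.0, 4.0],
                    [20.0, 21.0, None, 23.0, 24.0]),
    "vibration": ([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0],
                  [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]),
}
grid, aligned = align(sensors, step=1.0)
```

Here the gap in the temperature signal is filled by interpolating between its neighbors, and the faster vibration signal is reduced to the shared 1 Hz grid. Whether interpolation, forward filling, or discarding the affected window is appropriate depends on the signal and the downstream model.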
Keywords: Machine Learning (ML) / Artificial Intelligence (AI), supervised and unsupervised ML, data preprocessing, time-series data, knowledge domain, probability distribution, feature extraction and selection, data preparation
This work is licensed under a Creative Commons Attribution 3.0 Unported License.