Analysis of data quality issues in real-world industrial data



Thomas Hubauer, Steffen Lamparter, Mikhail Roshchin, Nina Solomakhina, Stuart Watson


In large industries, the use of advanced technological methods and modern equipment brings the problem of storing, interpreting and analyzing huge amounts of information. Handling this information becomes both more complicated and more important. Data quality is therefore one of the major challenges, given the rapid growth of information, the fragmentation of information systems, incorrect data formatting and other issues. The aim of this paper is to describe industrial data processing and analytics using a real-world use case. The most crucial data quality issues are described, examined and classified in terms of Data Quality Dimensions. Real industrial data supports and illustrates each encountered data deficiency. In addition, we describe methods for eliminating data quality issues and the data analysis techniques that are applied after the data cleaning procedure. Finally, an approach to addressing data quality problems in large-scale industrial datasets is proposed. The approach combines several well-known techniques from both mathematical logic and statistics to improve the data quality procedure and the cleaning results.

How to Cite

Hubauer, T., Lamparter, S., Roshchin, M., Solomakhina, N., & Watson, S. (2013). Analysis of data quality issues in real-world industrial data. Annual Conference of the PHM Society, 5(1).



Data quality in industry, Data Quality Dimensions

Poster Presentations