Analysis of data quality issues in real-world industrial data
Abstract
In large industries, the use of advanced technological methods and modern equipment brings the problem of storing, interpreting, and analyzing huge amounts of information. Handling information becomes more complicated and more important at the same time. Data quality is therefore one of the major challenges, given the rapid growth of information, the fragmentation of information systems, incorrect data formatting, and other issues. The aim of this paper is to describe industrial data processing and analytics on a real-world use case. The most crucial data quality issues are described, examined, and classified in terms of Data Quality Dimensions. Factual industrial information supports and illustrates each encountered data deficiency. In addition, we describe methods for eliminating data quality issues and the data analysis techniques that are applied after data cleaning. We also propose an approach to addressing data quality problems in large-scale industrial datasets. The approach combines several well-known techniques from both mathematical logic and statistics, improving the data quality procedure and the cleaning results.
Data quality in industry, Data Quality Dimensions
The Prognostic and Health Management Society advocates open-access to scientific data and uses a Creative Commons license for publishing and distributing any papers. A Creative Commons license does not relinquish the author’s copyright; rather it allows them to share some of their rights with any member of the public under certain conditions whilst enjoying full legal protection. By submitting an article to the International Conference of the Prognostics and Health Management Society, the authors agree to be bound by the associated terms and conditions including the following:
As the author, you retain the copyright to your Work. By submitting your Work, you are granting anybody the right to copy, distribute and transmit your Work and to adapt your Work with proper attribution under the terms of the Creative Commons Attribution 3.0 United States license. You assign rights to the Prognostics and Health Management Society to publish and disseminate your Work through electronic and print media if it is accepted for publication. A license note citing the Creative Commons Attribution 3.0 United States License as shown below needs to be placed in the footnote on the first page of the article.
First Author et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.