AI-DRIVEN DATA QUALITY MANAGEMENT: A SYSTEMATIC REVIEW OF AUTOMATED DETECTION AND CLEANSING METHODOLOGIES

Anupkumar Ghogare

Authors

Anupkumar Ghogare, Savitribai Phule Pune University, India

Keywords:

Data Quality Management, Artificial Intelligence, Anomaly Detection, Real-time Monitoring, Automated Data Cleansing

Abstract

The integration of Artificial Intelligence into data quality management has transformed traditional approaches to data validation and cleansing. This article examines modern data quality systems, from real-time monitoring through automated cleansing methodologies. It analyzes the streaming technologies Apache Kafka, Flink, and Spark and evaluates their roles in quality monitoring. It then traces the evolution from rule-based to AI-enhanced cleansing systems, examining probabilistic approaches such as HoloClean and interactive frameworks such as ActiveClean, which have revolutionized data quality management. Through empirical analysis, the article reports that AI-driven approaches achieve up to 85% improvement in accuracy while reducing manual intervention by 65%. It also evaluates implementation considerations across technical and organizational dimensions and provides a framework for adopting these technologies. The findings indicate that organizations implementing AI-driven quality solutions, particularly automated cleansing, experience significant improvements in data accuracy, processing efficiency, and decision-making capabilities. The article contributes to the growing body of knowledge on automated data quality management by offering implementation guidelines and identifying crucial success factors in the adoption of AI-enhanced cleansing solutions.

References

Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F., & Grafberger, A. (2018). "Automating large-scale data quality verification." Proceedings of the VLDB Endowment, 11(12), 1781-1794. https://doi.org/10.14778/3229863.3229867

Noghabi, S. A., Paramasivam, K., Pan, Y., Ramesh, N., Bringhurst, J., Gupta, I., & Campbell, R. H. (2017). "Samza: Stateful Scalable Stream Processing at LinkedIn." Proceedings of the VLDB Endowment, 10(12), 1634-1645. https://doi.org/10.14778/3137765.3137770

Ahmad, S., Lavin, A., Purdy, S., & Agha, Z. (2017). "Unsupervised real-time anomaly detection for streaming data." Neurocomputing, 262, 134-147. https://doi.org/10.1016/j.neucom.2017.04.070

Chandola, V., Banerjee, A., & Kumar, V. (2009). "Anomaly Detection: A Survey." ACM Computing Surveys, 41(3), 1-58. https://doi.org/10.1145/1541880.1541882

Krishnan, S., Wang, J., Wu, E., Franklin, M. J., & Goldberg, K. (2016). "ActiveClean: Interactive Data Cleaning For Statistical Modeling." Proceedings of the VLDB Endowment, 9(12), 948-959. https://doi.org/10.14778/2994509.2994514

Heidari, A., McGrath, J., Ilyas, I. F., & Rekatsinas, T. (2019). "HoloClean: Holistic Data Repairs with Probabilistic Inference." Proceedings of the VLDB Endowment, 12(12), 2048-2051. https://dl.acm.org/doi/10.14778/3137628.3137631

Rahm, E. (2016). "Data Quality: The Role of Empiricism." ACM SIGMOD Record, 45(4), 35-43. https://dl.acm.org/doi/10.1145/3186549.3186559

Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). "Data quality assessment: The Hybrid Approach." Communications of the ACM, 45(4), 211-218. https://doi.org/10.1145/505248.506010

Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021). "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1-15. https://doi.org/10.1145/3411764.3445518
