BUILDING SCALABLE DATA ARCHITECTURES FOR MACHINE LEARNING

Authors

  • Abhishek Vajpayee, Metropolis Technologies, USA
  • Rathish Mohan, Lore Health LLC, USA
  • Vishnu Vardhan Reddy Chilukoori, Amazon.com Services LLC, USA

Keywords:

Scalable Data Architectures, Machine Learning, Big Data Processing, Data Engineering Integration, Cloud-Native Technologies

Abstract

This article examines the role of scalable data architectures in machine learning, addressing the challenges posed by exponential data growth and increasing model complexity. It covers the core components of such architectures (data ingestion, storage, processing, and model deployment) and examines three key architectural patterns: Lambda, Kappa, and microservices. The article surveys the technologies and tools used to implement scalable ML infrastructure and emphasizes the importance of integrating machine learning with data engineering processes. A case study on predictive maintenance in manufacturing illustrates the practical impact of these architectures, showing significant reductions in equipment downtime and cost.

Published

2024-08-06