DATA-CENTRIC APPROACHES TO AI MODEL TRAINING IN HEALTHCARE FOR ENHANCED PREDICTIVE ACCURACY
Keywords:
Artificial Intelligence, Machine Learning, Training Data, Data Collection, Data Cleaning, Feature Engineering, Data Splitting, Model Selection, Model Training, Model Evaluation, Deployment
Abstract
This paper outlines a structured workflow for developing artificial intelligence (AI) models in healthcare, emphasizing the need for rigorous processes to ensure robustness, accuracy, and practical applicability. The workflow begins with data collection, highlighting the importance of acquiring high-quality, diverse datasets. This is followed by meticulous data cleaning to prepare the data for effective training. The discussion then turns to feature engineering, where raw data is transformed into formats suitable for models. Model development includes data splitting, which divides the data into training, validation, and testing sets to support both learning and the assessment of generalization. The paper details how model selection is tailored to specific task requirements, considering data type, complexity, and the level of interpretability required. The training phase focuses on tuning the model’s parameters to reduce prediction errors and on implementing strategies to avoid overfitting. Evaluation tests the model against unseen data to confirm its reliability and accuracy under operational conditions. Deployment is the culmination of the process: the model is integrated into real-world environments where it can have a significant impact, which calls for thoughtful deployment strategies, scalability, and ongoing maintenance. Ethical considerations, particularly data privacy and bias mitigation, are underscored as crucial to the responsible deployment of AI in healthcare. The paper aims to guide researchers and practitioners in creating effective and ethically sound AI models, thus enhancing AI’s potential to transform healthcare by improving research capabilities and clinical outcomes. This comprehensive approach is intended to ensure that AI models are scientifically robust, technically proficient, practically beneficial, and ethically responsible in real-world applications.
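As an illustration of the data splitting step described above, the following is a minimal sketch using only the Python standard library. It is not taken from the paper: the function name `stratified_split`, the 70/15/15 fractions, the fixed seed, and the choice to stratify by class label are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split sample indices into train/validation/test lists.

    Hypothetical helper for illustration: each class is shuffled and
    divided separately, so class proportions are roughly preserved
    in every subset.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        n_train = int(round(fractions[0] * n))
        n_val = int(round(fractions[1] * n))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Toy example (assumed data): 80 negative and 20 positive cases.
labels = [0] * 80 + [1] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
```

Splitting within each class is one common way to keep a rare outcome represented in all three subsets. With patient data, a split grouped by patient identifier may be more appropriate so that records from one patient do not appear in both training and test sets.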
License
Copyright (c) 2024 Aditya Gadiko (Author)
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.