ENHANCING MACHINE LEARNING MODEL PERFORMANCE WITH A/B TESTING TECHNIQUES
Keywords:
Machine Learning Optimization, A/B Testing Methodology, User Segmentation, Experimentation Pitfalls, Evaluation Metrics
Abstract
A/B testing has emerged as a powerful tool for optimizing and fine-tuning machine learning (ML) models, enabling researchers and practitioners to base decisions on empirical evidence. This article gives an overview of how A/B testing applies to ML model optimization, covering key concepts, methodologies, and best practices. We explore the fundamentals of ML models, the A/B testing process, and the formulation of effective testing hypotheses. We also discuss common pitfalls and challenges in A/B testing experiments, emphasizing the importance of well-constructed experiments, appropriate sample sizes, and suitable evaluation metrics. We then examine the role of user segmentation in A/B testing, which helps reveal how a proposed change affects different user groups. The article concludes by arguing for a rigorous and systematic approach to A/B testing in ML model optimization, which can help deliver more accurate, reliable, and impactful ML solutions. Drawing on a review of the literature and relevant examples, the article is intended as a resource for researchers and practitioners who want to apply A/B testing in their ML projects.