GENERATIVE AI AND LLM OPTIMIZING TECHNIQUES FOR DEVELOPING COST EFFECTIVE ENTERPRISE APPLICATIONS

Authors

  • Amreth Chandrasehar, Informatica, CA, USA

Keywords:

Generative AI, LLM, Enterprise Applications, AI, Hosting LLMs, LLM In Kubernetes, Quantization, LLM Optimization, Pruning, Llama, Cost Optimization

Abstract

Generative AI usage has grown exponentially since the start of 2023 and has created tremendous opportunities for organizations ranging from startups to large enterprises. As more LLMs are released for research and commercial use, it becomes complex for enterprises to adopt them, whether through a managed service offering or by hosting them in-house, as the cost is extremely high. This paper focuses on helping companies optimize LLMs, providing example use cases and solutions for fine-tuning, cost optimization, and hosting LLM models internally on Kubernetes to address data privacy, security, and governance risks.
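The quantization techniques cited in the references (e.g., GPTQ, Q8, QLoRA) share one core idea that can be shown with a minimal, self-contained sketch. The snippet below is illustrative only, not code from the paper: it implements symmetric absmax 8-bit quantization, where weights are scaled so the largest-magnitude value maps to the int8 range [-127, 127], then dequantized at inference time, trading a small rounding error for roughly a 4x memory reduction versus float32.

```python
# Symmetric absmax quantization of a toy weight vector to int8.
def quantize_absmax_int8(weights):
    # Scale so the largest-magnitude weight maps to +/-127.
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize_int8(q, scale):
    # Recover approximate float weights at inference time.
    return [v * scale for v in q]

# Toy weights standing in for one row of an LLM weight matrix.
weights = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_absmax_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Production methods such as GPTQ refine this per-layer with calibration data to minimize accuracy loss, but the storage and bandwidth savings come from the same int8 (or int4) representation sketched here.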


References

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen, A Survey of Large Language Models, https://arxiv.org/abs/2303.18223

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, QLoRA: Efficient Finetuning of Quantized LLMs, https://arxiv.org/abs/2305.14314, https://github.com/artidoro/qlora

Suresh Bhojwani, Supercharging Language Models: Strategies for Optimizing LLM and GPT, https://medium.com/@sureshbhojwani001/supercharging-language-models-strategies-for-optimizing-llm-and-gpt-f5cd59e706ca

Xinyin Ma, Gongfan Fang, Xinchao Wang, LLM-Pruner: On the Structural Pruning of Large Language Models, https://arxiv.org/abs/2305.11627

Montana Low, Smaller is Better: Q8-Chat LLM is an Efficient Generative AI Experience on Intel® Xeon® Processors, https://www.intel.com/content/www/us/en/developer/articles/case-study/q8-chat-efficient-generative-ai-experience-xeon.html

Announcing GPTQ & GGML Quantized LLM support for Huggingface Transformers, https://postgresml.org/blog/announcing-gptq-and-ggml-quantized-llm-support-for-huggingface-transformers

Ben Dickson, The complete guide to LLM fine-tuning, https://bdtechtalks.com/2023/07/10/llm-fine-tuning/

Zain Hasan, Running Large Language Models Privately - privateGPT and Beyond, https://weaviate.io/blog/private-llm

Tomaz Bratanic, Knowledge Graphs & LLMs: Fine-Tuning vs. Retrieval-Augmented Generation, https://neo4j.com/developer-blog/fine-tuning-retrieval-augmented-generation/

Vantage Team, Optimizing Large Language Models for Cost Efficiency, https://www.vantage.sh/blog/optimize-large-language-model-costs

Matt Rickard, A Hacker's Guide to LLM Optimization, https://matt-rickard.com/a-hackers-guide-to-llm-optimization

Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang, PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization, https://arxiv.org/abs/2306.05087

Sunil Ramlochan, Master Prompt Engineering: LLM Embedding and Fine-tuning, https://www.promptengineering.org/master-prompt-engineering-llm-embedding-and-fine-tuning/

Sunyan, The Economics of Large Language Models, https://sunyan.substack.com/p/the-economics-of-large-language-models

Sunal Swarkar, Understanding LLaMA-2 Architecture & its Ginormous Impact on GenAI, https://medium.com/towards-generative-ai/understanding-llama-2-architecture-its-ginormous-impact-on-genai-e278cb81bd5c

Hugo Touvron et al., Llama 2: Open Foundation and Fine-Tuned Chat Models, https://arxiv.org/abs/2307.09288

Yuxian Gu, Li Dong, Furu Wei, Minlie Huang, Knowledge Distillation of Large Language Models, https://arxiv.org/abs/2306.08543

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh, GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers, https://arxiv.org/abs/2210.17323

Elias Frantar, Dan Alistarh, SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot, https://arxiv.org/abs/2301.00774

Saverio Proto, Is possible to run Llama2 with 70B parameters on Azure Kubernetes Service with LangChain agents and tools ?, https://medium.com/microsoftazure/is-possible-to-run-llama2-with-70b-parameters-on-azure-kubernetes-service-with-langchain-agents-and-e6664ea52723

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, LoRA: Low-Rank Adaptation of Large Language Models, https://arxiv.org/abs/2106.09685

Amazon EC2 Inf2 Instances, https://aws.amazon.com/ec2/instance-types/inf2/

Published

2023-08-31

How to Cite

GENERATIVE AI AND LLM OPTIMIZING TECHNIQUES FOR DEVELOPING COST EFFECTIVE ENTERPRISE APPLICATIONS. (2023). INTERNATIONAL JOURNAL OF ARTIFICIAL INTELLIGENCE & APPLICATIONS (IJAIAP), 2(1), 70-81. https://mylib.in/index.php/IJAIAP/article/view/IJAIAP_02_01_004