GENERATIVE AI AND LLM OPTIMIZATION TECHNIQUES FOR DEVELOPING COST-EFFECTIVE ENTERPRISE APPLICATIONS
Keywords:
Generative AI, LLM, Enterprise Applications, AI, Hosting LLMs, LLM in Kubernetes, Quantization, LLM Optimization, Pruning, Llama, Cost Optimization
Abstract
Generative AI usage has grown exponentially since the start of the year, creating tremendous opportunities for organizations ranging from startups to large enterprises. As more LLMs are released for research and commercial use, adopting them becomes difficult for enterprises, whether through a managed service offering or by hosting them in-house, because the cost is extremely high. This paper focuses on helping companies optimize LLMs, and provides example use cases and solutions for fine-tuning, cost optimization, and hosting LLM models internally on Kubernetes to address data privacy, security, and governance risks.
License
Copyright (c) 2023 Amreth Chandrasehar (Author)
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.