Let’s not make the same mistakes we did 10 years ago. It is possible to deploy large language models in the cloud more cost-effectively and with less risk.

In the past two years, I’ve been involved with generative AI projects using large language models (LLMs) more than with traditional systems. Their applications range from enhancing conversational AI to providing complex analytical solutions across many industries and functions. Many enterprises deploy these models on cloud platforms because there is a ready-made ecosystem of public cloud providers and it’s the path of least resistance. It’s not cheap, but clouds also offer benefits such as scalability, efficiency, and advanced computational capabilities (GPUs on demand).

The LLM deployment process on public cloud platforms has lesser-known secrets that can significantly impact success or failure. Perhaps because there are not many AI experts who can deal with LLMs, and because we have not been doing this for very long, there are a lot of gaps in our knowledge. Let’s explore three lesser-known “tips” for deploying LLMs on clouds that perhaps even your AI engineers may not know. Considering that many of those guys and gals earn north of $300,000, maybe it’s time to quiz them on the details of doing this stuff right. I see more mistakes than ever as everyone runs to generative AI like their hair is on fire.

Managing cost efficiency and scalability

One of the primary appeals of using cloud platforms to deploy LLMs is the ability to scale resources as needed. We don’t have to be good capacity planners because the cloud platforms have resources we can allocate with a mouse click or automatically. But wait: we’re about to make the same mistakes we made when we first adopted cloud computing. Managing cost while scaling is a skill many teams have yet to master. Remember, cloud services typically charge based on the compute resources consumed; they function as a utility. The more you process, the more you pay. Because GPUs cost more (and burn more power), this is a core concern for LLMs on public cloud providers.

Make sure you use cost management tools, both those provided by the cloud platforms and those offered by solid third-party cost governance and monitoring players (finops). Examples include implementing auto-scaling and scheduling, choosing suitable instance types, and using preemptible instances to optimize costs. Also, continuously monitor the deployment and adjust resources based on actual usage rather than just the forecasted load. That means avoiding overprovisioning at all costs (see what I did there?).
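To make the scheduling idea concrete, here is a minimal sketch, assuming an AWS Auto Scaling group of GPU inference nodes named "llm-gpu-workers" (a placeholder, not a real resource). It registers scheduled actions that shrink the fleet overnight and grow it for business hours, the same pattern the cloud-native and third-party finops tools can automate for you.

```python
"""
Minimal sketch: scheduled scaling for a GPU-backed LLM inference fleet on AWS.
Assumes an existing Auto Scaling group named "llm-gpu-workers" (placeholder name)
and AWS credentials with Auto Scaling permissions. Requires boto3.
"""
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Grow the GPU fleet for weekday business hours (cron fields are in UTC).
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="llm-gpu-workers",
    ScheduledActionName="business-hours-scale-up",
    Recurrence="0 13 * * 1-5",   # 13:00 UTC, Monday through Friday
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=4,
)

# Shrink to a single warm node every night so idle GPUs don't burn money.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="llm-gpu-workers",
    ScheduledActionName="nightly-scale-down",
    Recurrence="0 1 * * *",      # 01:00 UTC, every day
    MinSize=1,
    MaxSize=2,
    DesiredCapacity=1,
)
```

The same idea applies to GPU node pools on any cloud: a fleet sized for peak load should never run at peak size around the clock, and the schedule should be revisited as the usage data comes in.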
Data privacy in multitenant environments

Deploying LLMs often involves processing vast amounts of data, and the trained models themselves, which might contain sensitive or proprietary information. The risk in using public clouds is that you have neighbors: other processing instances operating on the same physical hardware. So public clouds do come with the risk that, as data is stored and processed, it is somehow accessed by another virtual machine running on the same physical hardware in the public cloud data center. Ask a public cloud provider about this, and they will run to get their updated PowerPoint presentations showing that this is not possible.

While that is mainly true, it’s not entirely accurate. All multitenant systems come with this risk; you need to mitigate it. I’ve found that the smaller the cloud provider, such as the many that operate in just a single country, the more likely this is to be an issue. This applies to data storage and to LLMs alike. The secret is to select cloud providers that comply with stringent security standards and can prove it: at-rest and in-transit encryption, identity and access management (IAM), and isolation policies. Of course, it’s an even better idea to implement your own security strategy and security technology stack to keep the risk of multitenant LLM use on clouds low.

Handling stateful model deployment

LLMs are mostly stateful, which means they maintain information from one interaction to the next. This old trick provides a new benefit: the ability to enhance efficiency in continuous learning scenarios. However, managing the statefulness of these models is tricky in cloud environments, where instances may be ephemeral or stateless by design. Orchestration tools such as Kubernetes support stateful deployments and are helpful here. They can provide persistent storage options for the LLMs and can be configured to maintain their state across sessions, which you’ll need to support the LLM’s continuity and performance. (A minimal sketch of this pattern appears at the end of this article.)

With the explosion of generative AI, deploying LLMs on cloud platforms is a foregone conclusion. For most enterprises, it’s just too convenient not to use the cloud. My fear with this next mad rush is that we’ll miss things that are easy to address and make huge, costly mistakes that, at the end of the day, were mostly avoidable.
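To leave you with something concrete on that last point, here is a minimal sketch of the stateful pattern described above, using the Kubernetes Python client. The names, container image, and storage size are placeholders and not tied to any particular serving stack; the piece that matters is the StatefulSet with a volumeClaimTemplate, which gives each replica persistent storage that survives pod restarts and rescheduling.

```python
"""
Minimal sketch: run an LLM serving container as a Kubernetes StatefulSet so each
replica keeps its session and model state on a persistent volume across restarts.
All names and the image below are placeholders. Requires the `kubernetes` Python
client and a working kubeconfig.
"""
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster

manifest = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "llm-server"},
    "spec": {
        "serviceName": "llm-server",
        "replicas": 2,
        "selector": {"matchLabels": {"app": "llm-server"}},
        "template": {
            "metadata": {"labels": {"app": "llm-server"}},
            "spec": {
                "containers": [{
                    "name": "llm-server",
                    "image": "registry.example.com/llm-server:latest",  # placeholder image
                    "ports": [{"containerPort": 8080}],
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                    # Session context, adapters, and caches are written here.
                    "volumeMounts": [{"name": "llm-state", "mountPath": "/var/lib/llm-state"}],
                }]
            },
        },
        # Each replica gets its own PersistentVolumeClaim, so its state survives
        # being rescheduled onto a different node.
        "volumeClaimTemplates": [{
            "metadata": {"name": "llm-state"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "resources": {"requests": {"storage": "100Gi"}},
            },
        }],
    },
}

client.AppsV1Api().create_namespaced_stateful_set(namespace="default", body=manifest)
```

Pair the StatefulSet with a headless Service of the same name so each replica also gets a stable network identity; that, plus the per-replica volume, is what lets the model carry its state from one session to the next.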