The verdict is in. Most cloud computing failures can be traced back to very human mistakes. What (expensive) lessons have we learned?

I’m often taken aback by how the press frames cloud computing failures, with headlines like “The Cloud Fails to Deliver.” Those might get clicks, but they are misleading. Cloud technology has always delivered on what was promised. The issue is that human error is the core cause of cloud failures, which has not changed across generations of this technology.

As I’ve often written about here, most technology failures share a typical pattern: misunderstandings, lack of leadership, and, in many instances, lack of knowledge and experience. As we set out to drive substantial generative AI projects in the cloud, it’s time to reflect and see how we can do better.

Top reasons for failure

The reasons that failures occur vary a great deal. The top four that I see include:

Inadequate architecture. Too often, businesses migrate to the cloud without adequate planning or understanding of cloud computing. Significant performance or reliability issues can arise from this. More likely, the result is grossly underoptimized systems in the cloud that eat 5 to 10 times more money than they should. We’ve beaten those issues to death here, and I won’t dwell on them.

Poorly defined service-level agreements (SLAs). Why do expected performance standards go unmet? It’s mainly due to ill-defined SLAs between the organization and the cloud service provider. I’ve seen this kill projects where some math before deployment could have saved everyone much pain afterward. Although SLAs can be confusing, I’ve never seen an instance where a cloud provider did not live up to its end. Instead, the agreements did not align what the cloud users expected with what was delivered, mainly because people didn’t pay attention to the agreement before executing it.

Mismanagement of cloud resources and cost overruns.
Mismanaged resources can lead to budget overruns or performance bottlenecks, which are often mistaken for cloud shortcomings. This is why finops exists now. Here again, when you trace these costs back to the actual cause of the problem, it’s often a misalignment between what cloud users thought they were getting for a specific price and what was actually delivered when the resources were not managed correctly.

Inadequate security and compliance processes and supporting technology. The uninformed assume the cloud provider handles all security needs. That’s never the case, given the shared responsibility model. Cloud customers are responsible for securing their applications and data within the cloud. This requires a deep understanding of complex identity and access management (IAM), encryption, and monitoring strategies. In many instances, companies don’t have the talent to handle these issues and hope for the best. This leads to breaches that make the 24-hour news cycle.

How to do better

I’m not for putting cloud computing technology on some pedestal where it can do no wrong. However, if you look at the patterns of failures, humans are the weak link much of the time. Bad decisions are traceable to misunderstanding, lack of experience, and, the biggest problem, lack of skilled staff.

I suspect that the lack of talent is a result of the cloud computing market heading in two directions now. First, the technology is becoming far more complex; solutions are highly heterogeneous and have many moving parts. Second, the number of qualified cloud computing architects, security engineers, database engineers, etc., is growing more slowly than demand.

When businesses hire less-than-qualified candidates who make bonehead mistakes, the problems are discovered only after months, sometimes years. Most things work well enough during deployment, but the weaknesses are uncovered later. This is when you get a vast cloud computing bill or your data is breached.
So, given that this is indeed a people issue and not a technology issue, the focus needs to be on people, which is not what most of you wanted to hear. It’s time for strategic training and hiring, and for being very picky about who you trust to make major calls on how technology, including cloud technology, should be leveraged.

It can be done, but you need to be proactive and willing to spend some money. This is where most businesses fall short, especially the ones that consider IT to be just an expense. Their attempts to save money end up costing 10,000 times any money saved once you add up the true cost of the mistakes and the accumulated technical debt.

The larger issue is understanding the importance of all this. Much of what I’m listing here happens when the business does not make IT leadership a priority. You can complain about the tactical mistakes, such as not allocating enough money to hire and retain talent. However, that comes from the top, as do most of the problems and solutions. We need to do better.
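On the SLA point made earlier, the "some math" is mostly simple arithmetic. As a minimal sketch, assuming a 30-day month and independent failures (both simplifications), the following Python turns an availability percentage into the downtime it actually permits and shows how availability erodes when several services must all be up at once. The function names and the percentages are illustrative assumptions, not terms from any provider's agreement.

```python
# Illustrative sketch only: the helper names and the availability figures
# below are hypothetical and do not come from any real provider's SLA.

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the downtime an availability target permits within the period."""
    return period_minutes * (1 - availability_pct / 100)


def composite_availability_pct(*component_pcts: float) -> float:
    """Return the availability of a chain of services that must all be up.

    Assumes component failures are independent, which is a simplification.
    """
    result = 1.0
    for pct in component_pcts:
        result *= pct / 100
    return result * 100


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% availability permits {minutes:.1f} minutes of downtime per month")

    chain = composite_availability_pct(99.9, 99.9, 99.9)
    print(f"Three 99.9% services in series: {chain:.2f}% combined")
```

Under these assumptions, 99.9% still permits about 43 minutes of downtime in a 30-day month, and three such services in series land near 99.7%. Numbers like these are worth comparing with what the business expects before the agreement is executed.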