David Linthicum
Contributor

Most cloud failures have nothing to do with cloud

analysis
Jan 23, 2024, 4 mins
Cloud Architecture, Cloud Computing, Cloud Management

The verdict is in. Most cloud computing failures can be traced back to very human mistakes. What (expensive) lessons have we learned?

I’m often taken aback by how the press frames cloud computing failures, with headlines like “The Cloud Fails to Deliver.” Those might get clicks, but they are misleading. Cloud technology has always delivered on what was promised. The issue is that human error is the core cause of cloud failures, and that has not changed across generations of this technology.

As I’ve often written here, most technology failures follow the same pattern: misunderstandings, a lack of leadership, and, in many instances, a lack of knowledge and experience. As we set out to drive substantial generative AI projects in the cloud, it’s time to reflect and see how we can do better.

Top reasons for failure

The reasons these failures occur vary a great deal. The top four that I see are:

Inadequate architecture. Too often, businesses migrate to the cloud without adequate planning or understanding of cloud computing. Significant performance or reliability issues can arise from this. More likely, the result is grossly underoptimized systems in the cloud that eat 5 to 10 times more money than they should. We’ve beaten those issues to death here, and I won’t dwell on them.

Poorly defined service-level agreements (SLAs). Why do expected performance standards go unmet? It’s mainly due to ill-defined SLAs between the organization and the cloud service provider. I’ve seen this kill projects where some math before signing could have saved everyone much pain after deployment. Although SLAs can be confusing, I’ve never seen an instance where a cloud provider did not live up to its end. Instead, there was a gap between what the cloud users expected and what the agreement actually promised, mainly because people didn’t pay attention to the agreement before executing it.

Mismanagement of cloud resources and cost overruns. Mismanaged resources can lead to budget overruns or performance bottlenecks, often mistaken for cloud shortcomings. This is why finops exists now. Here again, when tracing these costs back to the actual cause of the problem, it’s often misalignment between what cloud users thought was being delivered for a specific price and what was actually delivered when the resources were not managed correctly.

Inadequate security and compliance processes and supporting technology. The uninformed assume the cloud provider must handle all security needs. That’s never the case, given the shared responsibility model. Cloud customers are responsible for securing their applications and data within the cloud. This involves deeply understanding complex identity and access management (IAM), encryption, and monitoring strategies. In many instances, companies don’t have the talent to handle these issues and hope for the best. This leads to breaches that make the 24-hour news cycle.
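
The SLA math mentioned above is not complicated. Here is a minimal sketch of it, not any provider's actual terms: the function names, the 30-day month, and the assumption that chained services fail independently are all illustrative choices rather than anything taken from a real agreement.

```python
# Hypothetical helpers for sanity-checking an SLA before signing it.
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target permits over `days`."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1 - availability_pct / 100) * days * MINUTES_PER_DAY


def composite_availability(*availability_pcts: float) -> float:
    """Availability of services chained in series, assuming independent failures."""
    result = 1.0
    for pct in availability_pcts:
        result *= pct / 100
    return result * 100


if __name__ == "__main__":
    # Compare a few common targets over an assumed 30-day month.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% allows about {minutes:.1f} minutes of downtime")
    # An application that depends on two 99.9% services in series.
    print(f"Composite: {composite_availability(99.9, 99.9):.2f}%")
```

Under these assumptions, a 99.9% target still permits roughly 43 minutes of downtime in a month, and stacking dependent services lowers the effective figure further. If the business expected less downtime than that, the gap is in the agreement, not in the cloud.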

How to do better

I’m not for putting cloud computing technology on some pedestal where it can do no wrong. However, if you look at the patterns of failures, humans are the weak link much of the time. Bad decisions are traceable to misunderstanding, lack of experience, and the biggest problem, lack of skilled staff.

I suspect that the lack of talent is a result of the cloud computing market heading in two directions now. First, the technology is becoming far more complex; solutions are highly heterogeneous and have many moving parts. Second, the number of qualified cloud computing architects, security engineers, database engineers, etc., is growing more slowly than demand.

When businesses hire less-than-qualified candidates who make bonehead mistakes, the problems are discovered after months, sometimes years. Most things work well enough during deployment, but the weaknesses are uncovered later. This is when you get a vast cloud computing bill or your data is breached.

So, given that this is indeed a people issue and not a technology issue, the focus needs to be on people, which is what most of you did not want to hear. It’s time for strategic training and hiring and being very picky about who you trust to make major calls on how technology should be leveraged, including cloud technology.

It can be done, but you need to be proactive and willing to spend some money. This is where most businesses fall short, especially the ones that consider IT to be just an expense. Their attempts to save money end up costing 10,000 times any money saved. Add up the true cost of the mistakes as well as the accumulation of technical debt.

The larger issue is understanding the importance of all this. Much of what I’m listing here happens when the business does not make IT leadership a priority. You can complain about the tactical mistakes, such as not allocating enough money to hire and maintain talent. However, that comes from the top—as do most of the problems and solutions. We need to do better.

David S. Linthicum is an internationally recognized industry expert and thought leader. Dave has authored 13 books on computing, the latest of which is An Insider’s Guide to Cloud Computing. Dave’s industry experience includes tenures as CTO and CEO of several successful software companies, and upper-level management positions in Fortune 100 companies. He keynotes leading technology conferences on cloud computing, SOA, enterprise application integration, and enterprise architecture. Dave writes the Cloud Computing blog for InfoWorld. His views are his own.
