Shipping software has always been about balancing speed and quality control. In fact, many great technology companies built their empires by mastering this skill. Their ideas and practices around striking that balance created whole new classes of tooling that are now mainstream in the software world: application containerization, CI/CD pipelines, and cloud computing are being rapidly adopted by forward-thinking software organizations worldwide.

The modern software ecosystem is slowly evolving into a set of intertwined, well-oiled, almost fully automated systems that produce the final piece of software delivered to our customers.

This “renaissance of infrastructure,” however, is not without its faults. In the middle of all this innovation, one age-old problem is actually getting much harder for developers: figuring out what exactly is going on inside these shiny artifacts our users love so much. More importantly, it’s also getting harder to understand what the developers who build these pieces of software can do to fix them when they break in production.

A vast, unknowable world of bugs

Many classes of bugs used to be easier to predict and identify back when applications were mostly designed to run on a single machine. After all, development and test environments could closely resemble the target production environment, there were fewer potential third-party services (hence fewer potential sets of third-party configurations), and the number of dependencies was significantly lower.

Today’s applications are, by and large, built to run on distributed systems. More specifically, many organizations are adopting the practice of building cloud-native applications, that is, applications specifically designed to run on modern, scalable cloud infrastructure.
Think Kubernetes, service meshes, and microservices instead of bare-metal, single-tenant monoliths.

Cloud-native technologies make production environments far more complex. One might say that the problem is not with the technologies themselves, but that the increase in “moving parts” indirectly makes it harder for developers to grasp all of the potential flaws in the system. This complexity also gives developers more opportunities to make logic mistakes in more edge cases, opens the gates for more configuration mistakes, and reduces the ability to discuss a complete system in any one conversation (without blocking a full day to talk about all the components).

We need better quality controls in order to fight this battle head-on. In practice, that means we need some sort of mechanism to alert us when our software is not living up to the standards we expect of it.

Debugging in production is a quality control

Quality controls like code reviews and proper testing are not going to disappear. They have their rightful place in the world. However, there are too many subtleties in production for us to be able to anticipate every problem in advance, that is, during development or while writing our tests. In fact, development and testing are the wrong places to apply those mind cycles.

The most practical (and fastest) way to get to the root cause of a production problem is debugging in production. Where else can you see real users, real requests, real data, real infrastructure, real… everything?

Not every bug reproduces locally. Not every application can be easily spun up in an environment that replicates the production system state at the relevant point in time. Not all data is easily extractable and available for the debugging developer’s consumption. Replicating production is easy in theory, but in practice there’s always one more thing we forgot to mimic along the way: a set of configurations, a certain load on the database, a specific outage in a third-party service.
It all comes back to balancing the speed of shipping software and quality control. Every company wants to deliver new features fast, iterate, and repeat. But in order to make “moving fast” work, teams need to push code to production as well as see how that code behaves with real usage. Developers need a safe and easy way to get a grip on their production services, from within the tools they already use.

Troubleshooting the pitch-black box

When an IT operator or a devops engineer gets that dreaded page that a service is down, they have a full suite of health metrics and a variety of buttons to press and knobs to turn in order to restore service health. They have mechanisms that allow them to get a 360-degree view of the situation.

Developers… not so much. We’re doomed to scroll through endless logs and make do with dashboards of the infrastructure metrics and whatever information we remembered to instrument.

When a developer gets that dreaded notification that there is a bug, it’s like the early stages of a crime scene investigation that has no suspects and only traces of clues. Usually the first steps are asking questions like, “Did this happen for just this user or for all users?” Or, “Was it related to a recent feature rollout or a configuration change?”

Thus begins a long process of elimination, with the developer sifting through application logs to try to draw correlations between their working theories and what actually happened.

More often than we’d like to admit, the problem lurks inside a black box: an object or path we simply did not log from the beginning. This is often followed by a series of hotfixes applied during the investigation, resulting in many precious incident minutes spent waiting for the CI/CD pipeline to finish and for releases to roll out, all to get a better peek at what’s happening inside the running production application.
This is a time-consuming, expensive, and decidedly non-agile process that also uproots developers from the tooling they know and love and drops them into the world of IT monitoring, dashboards, and GUIs. Not good.

Developers deserve developer-native observability

Thinking back for a moment on the dramatic evolution currently underway in the software world, it’s amazing to see how many state-of-the-art systems were built solely to let developers get from code to a running application in production faster and with less manual work.

Troubleshooting production code should be just as fast and easy, with the same tooling that allows for an agile, real-time, and developer-native process. Developers, not operators, should own the reliability of application logic, as well as the discovery and remediation of bugs in production. Developers need observability tooling that is integrated into the development workflows, into the IDEs, into the source control.

Unfortunately, we’re not there yet. Our tools for breaking apart code-level issues in real time are still cumbersome. Although it’s easy to ship your code to a machine on another continent, it’s difficult to understand the structure of an object or its size in memory if you didn’t explicitly log it from the get-go.

As a longtime developer and manager who has had the privilege of leading a team of excellent developers, I can say for a fact that we deserve better tooling. Tools that work where we work. Tools that are integrated into our workflow, not borrowed from other roles in the company. We need tools that allow us to “touch” the system (without affecting its actual state) and gain a better view of how our applications are behaving when they aren’t behaving as they should.

Leonid Blouvshtein is cofounder and CTO at Tel Aviv-based Lightrun.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth.
The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.