Observability is an increasingly vital consideration for software engineers looking to build better, more stable applications. Here is everything you need to know about observability.

The term “observability” started to gain serious momentum in software engineering circles around 2018, as a natural evolution of monitoring practices. By bringing together the raw outputs of metrics, events, logs, and traces, software developers could start to gain a real-time picture of how their software systems are performing and where issues might be occurring.

The concept itself, however, has deep roots in control theory, where observability is a measure of how well the internal state of a system can be inferred from its external outputs alone. Now, with the broad shift towards distributed software systems through microservices and containers, the old adage of not being able to manage what you can’t measure has never been more relevant.

Observability vs. monitoring

For many people, observability will just sound like a convenient rebranding of application monitoring, and any skepticism around the latest industry buzzword is justified. However, as my colleague David Linthicum wrote, there is a basic difference: monitoring “is something you do (a verb); observability is an attribute of a system (a noun).”

Taking things one step further, engineering manager and technical blogger Ernest Mueller wrote back in 2018 that “observability is a property of a system.
You can monitor a system using various instrumentation, but if the system doesn’t externalize its state well enough that you can figure out what’s actually going on in there, then you’re stuck.”

As developers have broken up their applications into smaller chunks called microservices, hosted them in containers across distributed cloud servers, and deployed them continuously under the all-seeing eye of the devops team, the need for true observability has become increasingly critical.

“As systems become more distributed, methods for building and operating them are rapidly evolving, and that makes visibility into your services and infrastructure more important than ever,” software developer Cindy Sridharan wrote in her book Distributed Systems Observability.

“Observability is a superset of monitoring,” Sridharan wrote. “It provides not only high-level overviews of the system’s health but also highly granular insights into the implicit failure modes of the system. In addition, an observable system furnishes ample context about its inner workings, unlocking the ability to uncover deeper, systemic issues.”

The three pillars of observability

There are three commonly agreed-upon pillars of observability: metrics, traces, and logs. Taken individually, these pillars represent a developer’s ability to instrument and monitor their systems. Once brought together and presented in as close to real time as possible, they start to make those systems observable.

That being said, the three pillars do not miraculously add up to observability. “It’s not about logs, metrics, or traces, but about being data-driven during debugging and using the feedback to iterate on and improve the product,” Sridharan wrote.

Greg Ouillon, the CTO for Europe, the Middle East, and Africa at monitoring vendor New Relic, sees observability as a confluence of the software engineering and monitoring trends that have shaped the cloud era.
“Observability addresses these challenges by rethinking monitoring and adapting to the new technology paradigm,” Ouillon said. “By providing you with a fully connected view of all software telemetry data in one place, real-time observability allows you to proactively master the performance of your digital architecture, accelerate innovation and software velocity, and reduce toil and operational costs.”

Observability tools and vendor landscape

The vendor landscape is fairly complex when it comes to observability, as makers of logging, monitoring, and application performance management (APM) software all stake claims to offering observability tools. “Observability a year ago was a useful term, but now is becoming a buzzword,” says Gartner analyst Josh Chessman.

Take log monitoring specialists like Splunk and Sumo Logic, both of which have moved further toward end-to-end observability by developing new features and making key acquisitions to round out their platforms. Splunk’s acquisitions include cloud network performance monitoring specialist Flowmill and user and application performance monitoring specialist Plumbr in 2020. Combined with the $1 billion purchase of real-time monitoring company SignalFx in 2019, it is clear that Splunk wants to be a one-stop shop for observability tools.

Vendors like Dynatrace, Datadog, New Relic, SolarWinds, Scalyr (recently acquired by security specialist SentinelOne), and newcomer Honeycomb also look to provide off-the-shelf instrumentation and observability as a service for engineering teams.

On the open source side, Grafana Labs has built a massively popular open source monitoring and observability platform. Apache SkyWalking is another open source observability tool that allows system administrators to identify issues, receive key alerts, and monitor overall system health, with or without a service mesh.

The OpenTelemetry initiative is another open source project that has rapidly grown in popularity.
The sandbox project, which came about as a merger between OpenCensus and OpenTracing, sits with the Cloud Native Computing Foundation (CNCF) and has gathered broad support as an emerging industry standard for observability.

For developers looking to build their own observability stack from scratch, open source tools like Prometheus for metrics, Logstash for logs, and Jaeger for tracing can provide the building blocks required to assemble the three pillars of observability.

The next phase of observability

The Holy Grail for users and vendors in the observability space, whether the toolkit is proprietary, open source, or even homegrown, is to automate away the fact-finding part of the process to the point where issues are automatically spotted and can be fixed before they affect users, or, better still, where the software fixes faults before the developers are even aware of the issue on their dashboard.

There is also a growing community of startups and open source projects looking at the next crop of observability challenges, such as SigNoz, an open source observability platform for Kubernetes and microservices, or Jeli, a project founded by an ex-Netflix engineer that focuses on giving developer teams the tools to map where their code is failing against the structure of their organization.

Building a culture of observability

It’s important to remember that the three pillars alone do not instantly combine to achieve observability; people and process must also be aligned around a set of shared goals.

“The process of knowing what information to expose and how to examine the evidence (observations) at hand, to deduce likely answers behind a system’s idiosyncrasies in production, still requires a good understanding of the system and domain, as well as a good sense of intuition,” Cindy Sridharan wrote.

Observability should not be the goal in and of itself, but rather viewed as a means to build and operate more reliable software for customers.
“The value of the observability of a system primarily stems from the business and organizational value derived from it,” Sridharan wrote. “Being able to debug and diagnose production issues quickly not only makes for a great end-user experience, but also paves the way toward the humane and sustainable operability of a service, including the on-call experience.”

Those dual incentives of better customer outcomes and a potentially easier life for software engineers should be enough to drive many organizations towards gaining better observability of their systems for years to come.
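
A code sketch: the three pillars in miniature

To make the three pillars described above more concrete, here is a minimal Python sketch, using only the standard library, of one request emitting a metric, a structured log line, and trace spans that all share a trace ID. Everything in it is an illustrative assumption rather than any vendor's or project's API: the in-memory `METRICS`, `LOGS`, and `SPANS` sinks, the `handle_request` and `explain` functions, and the `/broken` path are hypothetical stand-ins for a real metrics store, log pipeline, and tracing backend.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

# Hypothetical in-memory sinks; a real system would ship these to a
# metrics store, a log pipeline, and a tracing backend.
METRICS = Counter()   # metric name -> running count
LOGS = []             # structured log lines (JSON strings)
SPANS = []            # finished trace spans


def log_event(trace_id, message, **fields):
    """Append one structured log line tagged with the trace ID."""
    LOGS.append(json.dumps({"trace_id": trace_id, "message": message, **fields}))


@contextmanager
def span(trace_id, name):
    """Time a unit of work and record it as a span on the given trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(path):
    """Handle one hypothetical request, emitting all three signal types."""
    trace_id = uuid.uuid4().hex
    with span(trace_id, "handle_request"):
        METRICS["requests_total"] += 1
        with span(trace_id, "load_data"):
            ok = path != "/broken"   # stand-in for real work that may fail
        if not ok:
            METRICS["errors_total"] += 1
        log_event(trace_id, "request finished", path=path, ok=ok)
    return trace_id


def explain(trace_id):
    """Join logs and spans on the trace ID to reconstruct one request."""
    logs = [json.loads(line) for line in LOGS]
    return {
        "logs": [entry for entry in logs if entry["trace_id"] == trace_id],
        "spans": [s["name"] for s in SPANS if s["trace_id"] == trace_id],
    }
```

Calling `handle_request("/broken")` and passing the returned ID to `explain` pulls together the log line and spans for that one request, while `METRICS` holds the aggregate counts across all requests. The shared trace ID is the design choice that matters here: it is what lets separately collected signals be brought together to answer a question about a single request, rather than sitting in three unrelated silos.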