Disclaimer: The ideas expressed in this post do not necessarily reflect the opinions, attitudes, or statements of my employer or anyone associated with me.
Observability (O11y) is never perceived as an overloaded term. Everyone knows what O11y means. The most common definitions describe it as "an ability to infer the internal state of the application based on externally available data"; this wording appears in sources such as Wikipedia and in products like Dynatrace and Splunk. Other definitions are narrower, defining it as "monitoring the system or application", "collecting and visualizing metrics, events, logs, and traces", or "measuring the system's current state based on … logs, metrics and traces".
These definitions are very engineer-oriented. As the book "Observability Engineering" states, "observability has been unfortunately mischaracterized as a synonym for monitoring or system telemetry." This is why Software and DevOps engineers often find it challenging to discuss increasing product observability with product management and business leadership: the two sides mean different things by it. For engineers it usually translates into an increased volume of metrics, traces, or logs to be generated and stored (somewhere). Then come questions like "Which metrics / traces / logs need to be added?" or "What data is missing and needs to be captured?" It is sometimes easier when the request is formulated as "increase application monitoring or reliability". In this post I will try to define O11y and describe how engineers can speak the same language as product managers and business leadership. To save space, in the following sections I will refer to "product managers and business leadership" simply as business.
Observability is…
I like a different (more inclusive, hence more generic) definition of O11y that goes like this:
Observability is the ability to provide actionable insights about a product based on real-time and historical external data about it.
As with any good (𝘀𝗮𝗿𝗰𝗮𝘀𝗺) definition, it requires additional clarification of the terminology. External data is not some disconnected data that magically appears from nowhere or is collected somewhere on the internet. It is data generated and stored by the application itself (I use "application" and "product" interchangeably), by the environment where it is deployed, and by any other tools and services related to it. It can be data generated by the CI/CD pipeline, by the repositories that store the application's source code, binary packages, or container images, by third-party software that is used (e.g. an external Identity Provider), etc. It is common to call all this data "telemetry data". Telemetry data that the application and its runtime environment produce is usually divided into three distinct groups (or types):
- Traces ‒ paths taken by units of work (like methods or request handlers) as they propagate through a multi-service architecture as part of a single business transaction.
- Logs ‒ timestamped messages emitted by application components. Unlike traces, however, they are not necessarily associated with any particular unit of work or transaction.
- Metrics ‒ measurements about application components, captured at runtime.
Sometimes you will see healthchecks (or uptime checks) defined as a separate group. That holds when the checks are implemented as endpoints that monitoring software can call to determine whether an application component is responsive.
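For illustration, here is a minimal sketch of such a healthcheck endpoint, assuming a Python service built with Flask (the /healthz path is a common convention, not a requirement):

```python
# A minimal healthcheck endpoint sketch (assumes Flask is installed).
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/healthz")
def healthz():
    # Monitoring software polls this endpoint; an HTTP 200 signals
    # that the component is up and able to serve requests.
    return jsonify(status="ok"), 200

if __name__ == "__main__":
    app.run(port=8080)
```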
Now, when engineers talk to business, they want to know what external data to collect or whether specific external data is representative enough. The truth is that business does not care about external data, or how it is generated, collected, or stored. All business cares about is insights.
Business insights vs. Operational insights
Almost all O11y discussions I have participated in over the last couple of years were about operational insights. Requirements for increased product reliability, improved troubleshooting capabilities, or extended monitoring usually characterize this type of insight. The main stakeholders for these requests are Operations and Product Management. Why? I am not going to preach here about Site Reliability Engineering. The main point is that knowing more about what was happening inside the application at a certain moment in time helps to build early alerting mechanisms, to identify the root cause of a problem (or at least a way to mitigate it), and to maintain the right balance between reliability and release cadence.
Business insights are different and usually cannot be derived from the same telemetry data that is used for operational insights. OK, "usually" is ambiguous here. But even when the same telemetry data can be reused, business insights require a much more sophisticated backend. ElasticSearch, BigQuery, and Amazon OpenSearch are good examples of such backends. However, these are just tools. You will need to code the rules and queries yourself in order to implement insights such as:
- Clickstream analytics
- Productization tendencies
- Feature or UX preferences
The same backends can also be used for SIEM (Security Information and Event Management) and root-cause analysis, but there are tools designed specifically for those tasks.
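To make "code the rules and queries yourself" concrete, here is a hedged sketch of a clickstream query using the BigQuery Python client; the project, dataset, table, and column names (my-project.analytics.click_events, page, session_id, event_date) are hypothetical:

```python
# A hypothetical clickstream-analytics query against a BigQuery backend.
from google.cloud import bigquery

client = bigquery.Client()

# Top pages by views today, with distinct session counts.
query = """
    SELECT page,
           COUNT(*) AS views,
           COUNT(DISTINCT session_id) AS sessions
    FROM `my-project.analytics.click_events`  -- hypothetical table
    WHERE event_date = CURRENT_DATE()
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(f"{row.page}: {row.views} views across {row.sessions} sessions")
```

The point is not this particular query: the backend gives you storage and a query engine, but the business logic of the insight is yours to write.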
This is where engineers can speak with business on equal terms. It is up to the business to define which insights they want and how those insights are measured.
Do not talk with business about what data they need for observability.
More engineering stuff
I myself am far away from the business side. Like many other engineers, I think I know how I would run the business. But the truth is that I don't. So I want to talk here a little more about the engineering side of O11y.
The OpenTelemetry (OTel) project is slowly becoming the default SDK for implementing tracing and metrics collection and ingestion, and a de-facto acknowledged open standard for telemetry data (the logging specification is on its way). If you are considering instrumenting your services with tracing or metrics, you should strongly consider adopting the OTel library or, at least, the standard.
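As a minimal sketch of what adopting the OTel library looks like, here is tracing instrumentation with the OpenTelemetry Python SDK; the service name, span name, and attribute are made up, and in practice you would export spans to a collector rather than to the console:

```python
# Minimal OpenTelemetry tracing setup (pip install opentelemetry-sdk).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def process_order(order_id: str) -> None:
    # Each span records one unit of work; nested spans form a trace.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic goes here ...

process_order("42")
```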
Working for Google Cloud, I have learned to appreciate all these effortless, a.k.a. serverless, services and features that cloud providers offer. They include a lot of built-in telemetry data that comes from the platform and lets you observe your runtime environment as well as your application. For example, you can leverage load balancer metrics to capture reliability signals about your service. Or you can define uptime checks and get alerts fired by writing a few lines of Terraform or a CloudBuild / CloudFormation / ARM template.
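To keep the sketches in one language, here is roughly the equivalent of such an uptime check expressed with the google-cloud-monitoring Python client rather than Terraform; the project ID, host, and path are hypothetical:

```python
# Sketch: create a Cloud Monitoring uptime check
# (pip install google-cloud-monitoring).
from google.cloud import monitoring_v3

client = monitoring_v3.UptimeCheckServiceClient()

config = monitoring_v3.UptimeCheckConfig(
    display_name="frontend-healthz",  # hypothetical check name
    monitored_resource={
        "type": "uptime_url",
        "labels": {"host": "app.example.com"},  # hypothetical host
    },
    http_check=monitoring_v3.UptimeCheckConfig.HttpCheck(
        path="/healthz", port=443, use_ssl=True
    ),
    timeout={"seconds": 10},
    period={"seconds": 60},
)

new_config = client.create_uptime_check_config(
    request={
        "parent": "projects/my-project",  # hypothetical project ID
        "uptime_check_config": config,
    }
)
print(f"Created uptime check: {new_config.name}")
```

Pair the check with an alerting policy and you get paged when the endpoint stops answering, without running any monitoring infrastructure yourself.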
Delegate your work to the cloud provider.
Save development and maintenance costs.
If there is a response to this post, I will write about RUM vs. synthetic monitoring, and about instrumenting and collecting telemetry vs. storing and using application data.