What is Observability?
Definition
Observability is the ability to understand the internal state of a system by examining its external outputs: logs, metrics, and traces. The term comes from control theory, where a system is “observable” if you can determine its internal state from its outputs.
In software engineering, observability means instrumenting your systems so that when something unexpected happens, you can investigate and understand why without deploying new code or adding new instrumentation.
Your e-commerce platform is slow. Not down, just slow. Checkout takes 8 seconds instead of the usual 2. There are no errors in the logs. No alerts have fired. CPU and memory look normal. Something is wrong, but your monitoring dashboards aren’t telling you what. You need to ask questions you didn’t anticipate when you set up the dashboards, trace the request path through a dozen services, and understand behavior you’ve never seen before.
This is the difference between monitoring and observability. Monitoring tells you when something you expected to go wrong does go wrong. Observability lets you ask arbitrary questions about your system’s behavior, including questions you didn’t know you’d need to ask.
The Three Pillars of Observability
Observability in software systems is traditionally built on three data types, often called the “three pillars.”
Metrics
Metrics are numerical measurements collected over time: CPU usage, request latency, error rates, queue depths, memory consumption. They’re aggregated and typically displayed on dashboards as time-series graphs.
Strengths: Metrics are efficient to store and query. They’re great for dashboards, alerting, and trend analysis. You can monitor thousands of metrics across hundreds of services without prohibitive storage costs.
Limitations: Metrics tell you what is happening at an aggregate level but not why. A spike in your 95th-percentile latency doesn't tell you which requests are slow, which users are affected, or what those requests have in common.
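The aggregation tradeoff is easy to see in a few lines. The sketch below (with made-up latency numbers) computes a p95 the way a metrics pipeline might: the summary reveals that something is slow, but the identity of the slow requests is already gone.

```python
import statistics

# Hypothetical per-request latencies (seconds) for a checkout endpoint;
# two requests are pathologically slow.
latencies = [0.2, 0.3, 0.25, 0.22, 8.1, 0.28, 0.31, 0.24, 7.9, 0.27]

# A metric is an aggregate: one number summarizing many events.
ranked = sorted(latencies)
p95 = ranked[min(int(0.95 * len(ranked)), len(ranked) - 1)]  # nearest-rank p95
mean = statistics.mean(latencies)

print(f"p95={p95:.2f}s mean={mean:.2f}s")
# The aggregate shows a problem exists, but which users hit the slow
# requests, and what those requests had in common, is not recoverable
# from the metric alone.
```

This is why metrics are paired with logs and traces: the aggregate points you at a problem, and the event-level data tells you what the outliers have in common.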
Logs
Logs are timestamped, text-based records of discrete events: “User 12345 attempted to check out at 14:32:01 and received error 500.” They capture specific events with arbitrary detail.
Strengths: Logs provide granular, event-level detail. They can include structured data (JSON fields), stack traces, request parameters, and any other context the developer chooses to include.
Limitations: Logs are expensive at scale. A busy microservices architecture can generate terabytes of logs per day. Searching through them requires centralized log aggregation (Elasticsearch, Splunk, Datadog Logs), and costs scale with volume. The signal-to-noise ratio can be poor unless logging is carefully structured.
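Structured logging is what makes logs queryable rather than merely grep-able. A minimal sketch using only the Python standard library (the field names `user_id` and `status` are illustrative, not a standard):

```python
import json
import logging

# Emit each event as one JSON object so a log aggregator can index
# fields (user_id, status) instead of parsing free text.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Extra fields attached via logging's `extra=` argument.
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("checkout failed",
            extra={"fields": {"user_id": 12345, "status": 500}})
```

With structure in place, "show me all 500s for user 12345" becomes a field query instead of a regex over terabytes of text, which is also what keeps the signal-to-noise ratio manageable.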
Traces
Distributed traces follow a single request as it moves through multiple services. A trace shows the full journey: which services were called, in what order, how long each step took, and where failures or delays occurred.
Strengths: Traces are essential for understanding behavior in distributed systems. When a checkout request is slow, a trace shows whether the delay is in the API gateway, the inventory service, the payment processor, or the database.
Limitations: Tracing requires instrumentation across every service in the request path. Incomplete tracing (some services aren’t instrumented) creates gaps that can mislead investigation. Trace storage is expensive, so most organizations sample (collecting only a percentage of traces), which means you might not have a trace for the specific request you’re investigating.
Observability vs. Monitoring
Monitoring and observability are related but different.
Monitoring is checking known conditions against thresholds: is CPU above 80%? Is error rate above 1%? Is the health check returning 200? Monitoring answers predetermined questions. You set up a dashboard and an alert for things you expect might go wrong.
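A monitoring rule makes "predetermined questions" concrete. Here is a hedged sketch of a Prometheus alerting rule for the error-rate threshold above (the metric name `http_requests_total` is the common convention, but your instrumentation may differ):

```yaml
groups:
  - name: checkout-monitoring
    rules:
      - alert: HighErrorRate
        # Fire when more than 1% of requests failed over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for 5 minutes"
```

The rule can only ever answer the question it encodes. If checkout gets slow without erroring, as in the scenario above, this alert stays silent; that investigation is observability's job.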
Observability is the ability to ask new questions about your system’s behavior without having anticipated them in advance. Why is this specific user experiencing slow responses? What changed between the healthy state at noon and the degraded state at 3 PM? Which deployments in the last 24 hours touched the code path that’s now failing?
| | Monitoring | Observability |
|---|---|---|
| Goal | Detect known failure modes | Understand any failure, including novel ones |
| Approach | Predefined thresholds and alerts | Exploratory analysis of telemetry data |
| Questions | Answers “what is broken?” | Answers “why is it broken?” |
| Data | Metrics and basic logs | Correlated metrics, logs, traces, and events |
| Scope | Individual components | End-to-end system behavior |
| Complexity fit | Simple, predictable systems | Complex, distributed, cloud-native systems |
A well-monitored system can tell you “something is wrong.” An observable system can tell you “here’s what’s wrong, here’s why, and here’s what changed.”
In practice, most teams need both. Monitoring covers the known failure modes and triggers alerts. Observability provides the investigation capability when something unexpected happens.
The Observability Stack
Modern observability typically involves several tools working together:
Data collection: Agents and SDKs that collect metrics, logs, and traces from your applications and infrastructure. OpenTelemetry has become the de facto standard for instrumentation, providing vendor-neutral collection of all three signal types.
Data storage and aggregation: Backends that store and index telemetry data. This might be a single platform (Datadog, Splunk, Dynatrace) or a combination of specialized tools (Prometheus for metrics, Elasticsearch for logs, Jaeger for traces).
Visualization: Dashboards that display metrics and system state. Grafana is the most popular open-source option. Commercial platforms like Datadog and New Relic include built-in dashboarding.
Alerting: Rules that fire notifications when conditions warrant human attention. This connects to incident management through tools like PagerDuty and Opsgenie.
Analysis and investigation: Tools for querying, correlating, and exploring telemetry data. This is where the “ask arbitrary questions” capability of observability lives.
The Cost Challenge
Observability is expensive. The more data you collect, the more you pay, and modern distributed systems generate enormous volumes of telemetry.
Pricing models for commercial observability platforms typically charge per host (for metrics), per GB ingested (for logs), and per span (for traces). As infrastructure grows, these costs compound. Organizations commonly spend $50,000 to $500,000 annually on observability tooling, with large enterprises spending well over $1 million.
This creates a difficult tradeoff. You want comprehensive observability (instrument everything, log everything, trace everything), but costs force you to make choices: sample traces at 10%, drop debug logs, reduce metric retention to 30 days. These are exactly the gaps that come back to haunt you during an incident when the data you need was the data you chose not to keep.
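The "sample traces at 10%" choice is typically implemented as head-based probabilistic sampling. A minimal sketch (the modulo scheme is illustrative; real samplers hash the trace ID):

```python
SAMPLE_RATE = 0.10  # keep ~10% of traces to control storage cost

def should_sample(trace_id: int) -> bool:
    # Decide deterministically from the trace ID so every service in
    # the request path makes the same keep/drop decision.
    return (trace_id % 100) < SAMPLE_RATE * 100

kept = sum(should_sample(t) for t in range(10_000))
print(f"kept {kept} of 10000 traces")
```

The tradeoff described above is visible here: if the one request you later need to debug fell in the discarded 90%, its trace was never recorded, no matter how good your tooling is.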
Observability Tools at a Glance
The observability market is crowded, and most tools fit into one of five buckets. The table below groups the common ones by the role they play in a stack.
| Category | Example Tools | What They Do |
|---|---|---|
| Full-stack | Datadog, New Relic, Dynatrace, Splunk Observability Cloud, Honeycomb | Commercial SaaS platforms that unify metrics, logs, traces, APM, and dashboards in a single product. |
| Open-source | Prometheus, Grafana, Jaeger, Elastic/OpenSearch, SigNoz | Self-hostable tools, usually assembled into a custom stack. Strong on flexibility, heavy on operational overhead. |
| Cloud-native | AWS CloudWatch, Google Cloud Operations, Azure Monitor | Provider-native telemetry, tightly integrated with the cloud’s services, IAM, and billing. |
| Standards | OpenTelemetry, OpenMetrics, W3C Trace Context | Vendor-neutral specifications for instrumentation and telemetry formats. Not products, but the plumbing that lets other tools interoperate. |
| Investigation | NeuBird AI | AI-native agents that reason over live telemetry across tools, shifting consumption from dashboards to answers. |
Most mature teams run a mix: a full-stack or open-source platform for collection and storage, cloud-native tools for provider-specific signals, OpenTelemetry as the instrumentation layer, and an investigation layer on top to turn telemetry into action.
Beyond the Three Pillars: Where Observability is Headed
The three-pillar model has served the industry well, but it has a fundamental limitation: it produces data for humans to interpret. Dashboards, log search interfaces, and trace viewers all assume a human operator who knows what to look for, which tools to check, and how to connect signals across data types.
As systems grow more complex, this human-dependent model is straining. A single incident might require correlating data from metrics, logs, traces, deployment history, configuration changes, and code diffs across 20+ services. No human can hold all of that in their head simultaneously.
The emerging shift is from observability (producing data for human interpretation) to actionability (producing answers and actions). Dashboards alone provide observability; the next generation of AI-native tools aims to provide actionability.
Context engineering represents one approach to this shift. Rather than collecting all data into centralized storage and hoping humans can find the right signals, context engineering dynamically assembles the relevant data for a specific question at query time. An AI agent investigating a latency issue doesn’t need every metric from every service. It needs the metrics, logs, and traces relevant to the affected request path, correlated with recent changes.
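The idea of assembling context at query time can be sketched in a few lines. Everything here is illustrative, not any vendor's API: given an incident, select only the telemetry and changes scoped to the affected services and time window, rather than searching a central store of everything.

```python
from datetime import datetime, timedelta

def assemble_context(incident, telemetry, deploys, window_minutes=60):
    """Pull only the signals relevant to this incident's services and window."""
    since = incident["started"] - timedelta(minutes=window_minutes)
    affected = set(incident["services"])
    return {
        "signals": [t for t in telemetry
                    if t["service"] in affected and t["time"] >= since],
        "recent_changes": [d for d in deploys
                           if d["service"] in affected and d["time"] >= since],
    }

# Hypothetical incident: checkout latency, with an unrelated healthy service.
now = datetime(2024, 1, 1, 15, 0)
incident = {"started": now, "services": ["checkout", "payments"]}
telemetry = [
    {"service": "checkout", "time": now, "signal": "latency p95 8s"},
    {"service": "search", "time": now, "signal": "all normal"},
]
deploys = [{"service": "payments", "time": now - timedelta(minutes=20),
            "sha": "abc123"}]

ctx = assemble_context(incident, telemetry, deploys)
```

The unrelated `search` signal is filtered out, while the recent `payments` deploy, a likely suspect, surfaces automatically. That correlation of telemetry with changes is the step humans usually do by hand across several tools.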
This doesn’t make observability irrelevant. The underlying data (metrics, logs, traces) is still essential. But the consumption model is changing from “human looks at dashboards” to “AI reasons over data and presents findings.”
The Rise of Real-Time, Context-Driven AI Agents
The limitations of the three-pillar model aren’t solved by collecting more data. They’re solved by consuming data differently. Real-time, context-driven AI agents, like NeuBird AI, represent the most concrete realization of this shift.
Where traditional observability platforms ingest telemetry into centralized storage and ask humans to query it, a context-driven agent queries the underlying systems at investigation time. When an incident starts, the agent pulls the specific metrics, logs, traces, deployment events, and configuration changes relevant to that incident, directly from the tools where they already live.
The benefits of this approach compound across cost, speed, and coverage:
- Fresh data, not stale indexes. Queries hit live production systems, so the agent sees the deployment that rolled out five minutes ago and the config change that landed this morning. Pre-ingested stores can’t do this without expensive, always-on ingest pipelines.
- Lower observability spend. Because context is assembled at query time, you don’t pay to store every log line, span, and metric in a vendor’s platform “just in case.” Data stays in its source system and is pulled only when it matters.
- Cross-tool correlation by default. Metrics in Prometheus, logs in Elasticsearch, traces in Datadog, deployments in GitHub, alerts in PagerDuty: the agent reasons across all of them in a single investigation. No human can hold that many tabs open.
- Truly novel questions, answered in seconds. Observability’s defining promise is that you can ask questions you didn’t anticipate. An AI agent with access to every system operationalizes that promise in a way dashboards never could.
- An evidence chain, not just an answer. Good context-driven agents surface the specific log lines, trace spans, and metric changes behind a diagnosis, so engineers can verify the reasoning before approving an action.
This approach doesn’t replace metrics, logs, and traces. It replaces the assumption that humans are the right consumers of that data. The three pillars remain essential. What changes is who reads them.
Key Takeaways
- Observability is the ability to understand a system’s internal state from its external outputs (metrics, logs, traces), enabling you to investigate unexpected behavior without deploying new code.
- It differs from monitoring: monitoring checks known conditions against thresholds, while observability lets you ask new questions you didn’t anticipate.
- The three pillars (metrics, logs, traces) each have strengths and limitations. Most teams need all three, and OpenTelemetry is standardizing collection.
- Cost is a major challenge. Observability pricing scales with data volume, forcing tradeoffs between coverage and budget.
- The industry is evolving from human-interpreted dashboards toward AI-driven actionability, where the consumption model shifts from “look at data” to “receive answers.”
Related Reading
- What is Alert Fatigue? – The downstream effect when observability alerting generates too much noise.
- What is Incident Management? – The process that observability data feeds into.
- Telemetry Dashboards are Obsolete – Why the dashboard-centric model is reaching its limits.
- Tackling Observability Scale with Context Engineering – An alternative to the “store everything” approach to observability.
Frequently Asked Questions
What is observability?
Observability is the ability to understand a system’s internal state by examining its external outputs (logs, metrics, and traces). The term comes from control theory and means you can investigate unexpected behavior without deploying new code or adding new instrumentation.
What's the difference between monitoring and observability?
Monitoring checks known conditions against thresholds (is CPU above 80%?). Observability lets you ask new questions you didn’t anticipate (why is this specific user experiencing slow checkouts?). Monitoring is for known unknowns; observability is for unknown unknowns.
What are the three pillars of observability?
The three pillars are metrics (numerical measurements over time), logs (timestamped event records), and traces (end-to-end request paths through distributed systems). Most modern observability practices use all three together.
What is OpenTelemetry?
OpenTelemetry (OTel) is an open-source standard for collecting metrics, logs, and traces from applications and infrastructure. It’s vendor-neutral, meaning you can collect data once and send it to different observability backends without re-instrumenting your code.
Why is observability so expensive?
Observability platforms typically charge per host (for metrics), per GB ingested (for logs), and per span (for traces). As infrastructure grows, these costs compound. Mid-size companies commonly spend $50K-$500K annually; large enterprises often exceed $1M.
How do I reduce observability costs?
Common strategies include log filtering and sampling, reducing metric cardinality, shorter retention windows for non-critical data, tiered storage (hot/cold), and being selective about which services get full APM coverage. Each tradeoff creates potential blind spots, so balance carefully. A fundamentally different approach is context engineering, used by platforms like NeuBird AI, which queries live system state at investigation time instead of pre-ingesting everything into expensive storage. This avoids the cost-complexity spiral that comes with per-GB observability pricing.
What's the future of observability?
The industry is shifting from “collect data for human interpretation” toward “AI-driven actionability.” As systems grow too complex for humans to interpret manually, AI agents that reason over telemetry data are becoming more important than dashboards designed for human consumption. Platforms like NeuBird AI use context engineering to query the right data at investigation time, effectively replacing the dashboard-centric model with autonomous reasoning over live production state.
What are the four golden signals of monitoring?
The four golden signals, defined in the Google SRE Book, are: latency (how long requests take), traffic (request volume), errors (rate of failed requests), and saturation (how full your service is). These four metrics give you a strong baseline for understanding service health, especially for user-facing services.
Is OpenTelemetry free?
Yes. OpenTelemetry is a free, open-source project under the Cloud Native Computing Foundation (CNCF). The instrumentation libraries, collectors, and specifications are all open source. You may pay for backend storage and analytics platforms that consume OpenTelemetry data (Datadog, Honeycomb, Grafana Cloud, etc.), but the OpenTelemetry framework itself is free.
Who coined the term "observability"?
The term comes from control theory, where it was introduced by engineer Rudolf E. Kalman around 1960. In software engineering, Honeycomb began applying it to modern systems around 2016, and it gained widespread adoption over the following years. Honeycomb CTO Charity Majors is widely credited with bringing the concept to modern software practice.
Is observability just monitoring with extra steps?
No. Monitoring checks predefined conditions and fires alerts when thresholds are crossed. Observability lets you ask new questions about system behavior that you didn’t anticipate when setting up your monitoring. Monitoring tells you what you expected to break is broken; observability lets you investigate things you didn’t expect.