What is Observability
Observability is the ability to understand the internal state of a system by examining its external outputs: logs, metrics, and traces. It enables investigation of unexpected system behavior without deploying new code or adding new instrumentation.
The Three Pillars of Observability
- Metrics are numerical measurements collected over time, such as CPU usage, latency, and error rates. They are efficient to store and query, but an aggregate can show that something changed without explaining why.
- Logs are timestamped, text-based records of discrete events. They provide granular detail but become expensive at scale in distributed systems.
- Traces follow a single request as it moves through multiple services. They are essential for distributed systems but require comprehensive instrumentation.
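As a rough illustration of how the three signal types differ, the sketch below has one simulated request emit all three. It uses only the Python standard library. The function `handle_request` and the in-memory `metrics`, `logs`, and `spans` stores are hypothetical stand-ins for real backends, not part of OpenTelemetry or any vendor SDK.

```python
import json
import time
import uuid
from collections import Counter

# In-memory stand-ins for real backends (hypothetical, for illustration only).
metrics = Counter()  # (metric name, route, status) -> aggregated count
logs = []            # JSON-encoded log records, one per event
spans = []           # finished trace spans


def handle_request(route, status, duration_ms):
    """Record one simulated request as a metric, a log line, and a span."""
    trace_id = uuid.uuid4().hex

    # Metric: a cheap aggregate with no per-request detail.
    metrics[("http_requests_total", route, status)] += 1

    # Log: a timestamped record of one discrete event.
    logs.append(json.dumps({
        "ts": time.time(),
        "level": "ERROR" if status >= 500 else "INFO",
        "msg": "request handled",
        "route": route,
        "status": status,
        "trace_id": trace_id,
    }))

    # Trace span: one timed step of the request, linked to others by trace_id.
    spans.append({
        "trace_id": trace_id,
        "name": f"GET {route}",
        "duration_ms": duration_ms,
        "status": status,
    })
    return trace_id
```

The trace ID shared by the log record and the span is what lets an investigator move from an aggregate error count to the specific request behind it.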
Observability vs. Monitoring
Monitoring answers "what is broken?" Observability answers "why is it broken?" Monitoring validates known conditions, while observability supports investigating unanticipated issues. A typical observability stack includes data collection (OpenTelemetry), storage and aggregation (Datadog, Splunk, Prometheus), visualization (Grafana), alerting (PagerDuty), and analysis and investigation tools. Organizations commonly spend $50,000 to $500,000 annually on observability tooling, and enterprise spend can exceed $1M.
Beyond the Three Pillars
The industry is evolving from observability (producing data for human interpretation) to actionability (producing answers and actions). Context engineering assembles the relevant data dynamically rather than centralizing everything: AI agents query the underlying systems at investigation time instead of relying on pre-indexed copies of the data. Benefits include access to fresh data, lower costs, cross-tool correlation, and evidence chains that support each diagnosis.
What to remember
- Observability enables investigation of unexpected behavior without deploying new code
- Monitoring checks known thresholds; observability supports unanticipated questions
- The three pillars are metrics, logs, and traces; OpenTelemetry provides a vendor-neutral standard for collecting them
- Cost scales with data volume, forcing tradeoffs between coverage and budget
- The industry is evolving toward "AI-driven actionability" in place of dashboard consumption
Frequently asked questions
What is observability?
The ability to understand a system's internal state by examining its external outputs (logs, metrics, and traces), enabling investigation without code redeployment.
What's the difference between monitoring and observability?
Monitoring validates known conditions; observability addresses unanticipated questions about system behavior.
What are the three pillars of observability?
Metrics (numerical data), logs (timestamped events), and traces (end-to-end request paths).
What is OpenTelemetry?
An open-source standard for collecting metrics, logs, and traces from applications and infrastructure, vendor-neutral by design.
Why is observability so expensive?
Vendors typically charge per host, per GB ingested, or per span, so the bill compounds as infrastructure grows. Typical spend is $50K to $500K annually.
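The compounding effect can be sketched with simple arithmetic. The function below is a minimal model that assumes a bill made of a flat per-host fee plus a per-GB ingest fee. The name `monthly_cost` and every price used with it are hypothetical and do not reflect any vendor's actual pricing.

```python
def monthly_cost(hosts, gb_per_host_per_day, price_per_host, price_per_gb, days=30):
    """Estimate a monthly bill from a per-host fee plus a per-GB ingest fee.

    All prices are caller-supplied assumptions, not real vendor rates.
    """
    host_fee = hosts * price_per_host
    ingest_gb = hosts * gb_per_host_per_day * days
    return host_fee + ingest_gb * price_per_gb
```

With assumed prices of 15 per host and 0.10 per GB, 100 hosts each sending 5 GB per day come to about 3,000 per month. Doubling both the host count and the per-host volume gives about 9,000, three times the bill for twice the fleet, because ingest grows with hosts multiplied by volume.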
How do I reduce observability costs?
Use log filtering, sampling, metric cardinality reduction, shorter retention, and tiered storage. Context engineering can also help by querying live systems at investigation time instead of centralizing all data up front.
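Two of these techniques, sampling and log filtering, are simple enough to sketch. The example below shows one possible approach under stated assumptions: deterministic trace sampling that hashes the trace ID, and a severity filter applied before logs are shipped. The names `keep_trace`, `filter_logs`, and `LEVELS` are hypothetical, and production collectors offer their own configurable equivalents.

```python
import hashlib

# Hypothetical severity ranking used by the filter below.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def keep_trace(trace_id, sample_rate):
    """Decide deterministically whether to keep a trace.

    The trace ID is hashed to a 64-bit integer, and the trace is kept when
    that integer falls in the lowest `sample_rate` fraction of the range.
    Every service hashing the same ID reaches the same decision.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


def filter_logs(records, min_level="WARNING"):
    """Drop log records below `min_level` before they are shipped."""
    threshold = LEVELS[min_level]
    return [r for r in records if LEVELS[r["level"]] >= threshold]
```

Hashing the ID rather than drawing a random number means a trace is either kept in full or dropped in full, which avoids storing partial traces.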
What's the future of observability?
A shift toward "AI-driven actionability," where agents reason over telemetry instead of humans interpreting dashboards.
What are the four golden signals of monitoring?
Latency, traffic, errors, and saturation, per the Google SRE Book.
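As a minimal sketch of how the four signals might be derived from raw request data, the function below summarizes one time window. The `Request` record, the `golden_signals` function, the choice of p95 for latency, and the use of in-flight requests over capacity as a saturation proxy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one window of requests as the four golden signals."""
    n = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math for ceil(0.95 * n).
    rank = (95 * n + 99) // 100
    p95 = durations[rank - 1] if n else 0.0
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": p95,                  # latency
        "traffic_rps": n / window_s,            # traffic
        "error_rate": errors / n if n else 0.0, # errors
        "saturation": in_flight / capacity,     # saturation (assumed proxy)
    }
```

Saturation is the signal least derivable from request logs alone, which is why this sketch takes `in_flight` and `capacity` as separate inputs.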
Who coined "observability"?
The term originated in control theory, where Rudolf E. Kalman introduced it around 1960. Charity Majors brought the concept to modern software engineering.