While Datadog Throws a Party, Your Production Is Still on Fire
Why the bill keeps climbing, why your team is drowning in alerts, and why the smartest engineering orgs are ripping it out and replacing it with an agent that runs production for them.
Today in New York, Datadog opens DASH 2026, its annual celebration of observability and AI, to thousands of customers at the Javits Center. It is the perfect moment to ask a blunt question: what are all of those customers actually paying for?
Every CTO who has scaled on Datadog knows the feeling. The platform works. Engineers like it. And then the renewal lands, and the number is bigger than anyone forecast, again.
This is not an accident and not a billing error. It is the predictable result of a pricing model that grows faster than the value you get from it. You do not have to keep paying it, and there is now a demonstrably better way to run production. Understanding why the bill behaves this way is the first step to walking away from it for good.
The bill nobody can forecast
Start with the numbers that make headlines. When Coinbase's spend surfaced on a 2023 Datadog earnings call, JPMorgan analysts reverse-engineered it at roughly $65 million for a single boom year. Industry chatter now puts OpenAI's annual Datadog spend north of $100 million, and Anthropic's above $40 million. Those are outliers with the deepest pockets in computing history. The pattern underneath them is what should concern every other engineering org: a bill that climbs faster than the value it returns, and that teams routinely under-budget by 30 to 40 percent.
That under-budgeting is the real story. The problem is not that Datadog is expensive. Plenty of valuable software is expensive. The problem is that the cost is structurally unpredictable, and unpredictable cost is a line item the CTO has to answer for every quarter, in front of the CFO and the board.
It was never one price
Datadog is not one product with one meter. It is more than 20 separately priced products, each metered on its own usage dimension, billed simultaneously. Infrastructure is priced per host. APM stacks on top, per host again. Custom metrics, containers, and a dozen other SKUs each carry their own rate card. The result is a bill that is genuinely hard to model in advance and easy to under-budget by a third.
Datadog's standard defense is that AWS meters dozens of services too, and customers accept that. But that comparison misses the point. Telling a customer that complexity is normal because the hyperscalers do it as well is like waiting an hour at the doctor's office and having the receptionist tell you that you would wait an hour anywhere. It may be true. It does not make the experience good, and it does not make the cost defensible.
Two of these mechanics deserve a closer look, because they are where the surprises actually live.
Host-based pricing at scale. Most teams genuinely like the simplicity of per-host pricing. The trouble is the price points and the allotments. In a containerized, auto-scaling world the per-host rate gets expensive fast, and the included container allotments (5 per host on Pro, 10 on Enterprise) are too low for real clusters, so overages stack up. Worse is the high-water-mark mechanic: Datadog measures host count hourly, discards the top one percent of hours, and bills the whole month at the next-highest hour. A 30-day month has 720 hourly readings, so only about seven hours of peak are forgiven, and a five-day scaling event can set your rate for all 30 days. Some teams have responded by moving to larger, more expensive instances purely to reduce host count, which is a perverse outcome for a tool meant to optimize your infrastructure.
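To make the high-water-mark mechanic concrete, here is a minimal sketch of the rule as described above: sort the hourly host counts, drop the top one percent, and bill at the next-highest reading. The function name, the host counts, and the rounding choice (dropping a whole number of hours, rounded down) are illustrative assumptions, not Datadog's published implementation.

```python
def billable_hosts(hourly_counts, discard_percent=1):
    """Host count billed for the month under the rule described in the text.

    Assumption: the number of discarded hours is rounded down to a whole hour.
    """
    if not hourly_counts:
        raise ValueError("need at least one hourly sample")
    ranked = sorted(hourly_counts, reverse=True)
    # Integer arithmetic avoids floating-point surprises: 720 hours -> 7 dropped.
    discarded = len(ranked) * discard_percent // 100
    return ranked[discarded]


HOURS_IN_MONTH = 30 * 24  # 720 hourly readings
BASELINE, PEAK = 100, 400  # hypothetical host counts

# A five-day scaling event: 120 hours at peak, the rest at baseline.
five_day_spike = [PEAK] * (5 * 24) + [BASELINE] * (HOURS_IN_MONTH - 5 * 24)
# A six-hour spike fits inside the discarded top one percent.
six_hour_spike = [PEAK] * 6 + [BASELINE] * (HOURS_IN_MONTH - 6)

print(billable_hosts(five_day_spike))  # 400: the spike sets the whole month
print(billable_hosts(six_hour_spike))  # 100: the spike is forgiven
```

Under these assumptions, any spike longer than about seven hours survives the discard and becomes the billed host count for the entire month.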
Custom metrics and cardinality. This is the line item that quietly becomes 30 to 50 percent of the bill, usually unnoticed until the invoice arrives. A custom metric is billed per unique combination of metric name and tag values, so a single metric tagged across a few high-cardinality dimensions can generate hundreds of millions of billable time series. One documented case ran to $250,000 a month from a single poorly tagged metric. OpenTelemetry metrics, increasingly the industry standard, fall outside Datadog's integration list and get billed as custom metrics too. The official remedy splits one signal into two charges. The cardinality problem is real, it is large, and it is the kind of thing you find out about after the fact.
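The arithmetic behind the cardinality problem is easy to sketch. The example below assumes a hypothetical metric and made-up tag cardinalities. It computes the worst-case number of series as the product of the distinct values per tag, and counts the distinct combinations of metric name and tag values actually emitted, which is the unit the text describes as billable. No prices are modeled, because rates vary by contract.

```python
from math import prod


def worst_case_series(tag_cardinalities):
    """Upper bound on series for one metric: product of distinct values per tag."""
    return prod(tag_cardinalities.values())


def observed_series(points):
    """Count distinct (metric name, tag values) combinations actually emitted."""
    return len({(name, frozenset(tags.items())) for name, tags in points})


# Hypothetical tag cardinalities for a single latency metric.
tags = {"endpoint": 50, "status_code": 10, "region": 6, "customer_id": 20_000}
without_customer = {k: v for k, v in tags.items() if k != "customer_id"}

print(worst_case_series(tags))              # 60000000
print(worst_case_series(without_customer))  # 3000

# Hypothetical emitted points: two share a tag set, so only two series exist.
points = [
    ("checkout.latency", {"region": "us", "status_code": "200"}),
    ("checkout.latency", {"region": "us", "status_code": "200"}),
    ("checkout.latency", {"region": "eu", "status_code": "200"}),
]
print(observed_series(points))  # 2
```

In this sketch a single high-cardinality tag, here the hypothetical `customer_id`, accounts for the jump from a few thousand series to tens of millions, which is why one tagging decision can dominate the invoice.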
Then there is the contract dynamic. When new features reach general availability mid-term, they can be billed at list price unless you negotiated the discount up front, and public reviews repeatedly flag the sales pressure that comes with all of it. None of this is a footnote. These are the conversations that turn a renewal into a fight.
The cost that never shows up on the invoice
Pricing is only half the story. The other half is what the model does to your team.
A business that profits from data ingestion has every incentive to encourage more of it: more metrics, more signals, more alerts. The result is alert fatigue, and it is one of the most expensive problems in modern operations precisely because it never appears as a line item. Your best engineers spend their nights triaging pages that should never have fired, jumping between Datadog, Splunk, CloudWatch, and Slack to piece together what actually happened. Monitoring tells you something is wrong. It does not fix it. The work of correlating signals, finding root cause, and resolving the incident still lands on people, at 2am, again and again.
This is the hidden tax. You pay for the telemetry, and then you pay again in burned-out engineers, slower roadmaps, and the turnover that follows. Monitoring is not resolution. A dashboard shows you the fire. It does not put it out.
Datadog is becoming what Splunk became
There is a familiar arc here. A few years ago, Splunk was the well-liked, deeply embedded platform that quietly became too expensive and too hard to leave. Datadog is walking the same path: strong product-market fit, real engineering affection, and a cost curve and stickiness that increasingly work against the customer rather than for them.
The usual escape hatch is to build it yourself. The community rule of thumb puts the tipping point around $2 million to $3 million a year, the point where you could hire four or five senior engineers and a manager to run observability in-house. Coinbase, analysts noted, could staff ten senior engineers and still spend under $5 million.
But here is the part the build-versus-buy math leaves out. OpenAI, with effectively unlimited resources, tried to build its own replacement and gave up because it was too time-consuming. If the company at the frontier of AI decided building observability in-house was not worth the distraction, that should tell every other CTO something. For the 99.9999 percent of organizations that are not OpenAI, "just build it" is not a real option. You are left choosing between a bill that keeps climbing and a multi-year engineering project you will never fully staff.
That is a false choice, and a better option already exists.
The agent-led future
The Production Ops Agent. It keeps your production running, so your engineers don't have to.
This is the shift, and it is not incremental. Not a cheaper dashboard, not another monitoring SKU, but an agent that does the operational work itself. NeuBird AI delivers this as a platform of specialized agents orchestrated as one Prod Ops Agent, and it beats the legacy model on economics and experience at the same time, decisively. This is what you move to when you are done with Datadog, not something you run beside it.
It prevents the incident. The Prod Ops Agent detects degradation 30 to 60 minutes before failure, which means 80 percent fewer P1 war rooms. For the board, that is a prevention posture rather than a recovery story. Your biggest risk was never the incident you responded to. It was the preventable one nobody caught in time, and that is the one that costs revenue and competitive position.
It resolves what does break. When something fails, the agent is already resolving it: evidence-backed root cause in about 5 minutes at 94 percent accuracy, with incidents resolved in roughly 5 minutes on average, autonomously, across the entire stack. No piecing the story together across five tools. One investigation, one answer. Reliability stops being something your team manages around and becomes something you can put in a customer contract and lead with.
It operates between incidents. The agent keeps working when nothing is on fire: cutting cost, capturing every fix your team figures out as reusable knowledge it applies automatically next time, and getting sharper on your environment the longer it runs. That recovers 200-plus engineering hours a month and cuts incident cost by 60 percent or more. In CTO terms, one agent runs production so your best people run the roadmap.
And the economics invert the model you have been fighting. The Prod Ops Agent runs at roughly 10 percent of the cost of the legacy stack, with no per-log-line or ingestion fees, because one agent replaces the stack instead of metering you for feeding it. The thing that made your old bill unpredictable, the incentive to ingest more, simply is not there.
For the security and compliance questions that always follow: it is SOC 2 Type II certified, zero storage and read-only by design, orchestrated with guardrails and a human in the loop, with a full audit trail for every action. We can't leak what we don't store. That is not a policy. It is the architecture.
Not a second tool. A replacement.
Be clear about what this is. You do not bolt the Prod Ops Agent onto Datadog and keep paying both bills. You replace the stack with it. Here is the head-to-head.
| | Datadog | The Prod Ops Agent |
|---|---|---|
| What you get | Dashboards and alerts that tell you something is wrong | Autonomous prevention, resolution, and operation that make it right |
| The 2am page | Lands on your on-call engineer | Handled before it pages anyone, 80% fewer P1 war rooms |
| Root cause | You piece it together across five tools | ~5 min RCA at 94% accuracy across the entire stack |
| Pricing model | 20+ meters, per-host high-water marks, cardinality surprises | One agent, no per-log-line or ingestion fees |
| Cost over time | Climbs with your data volume | Roughly a tenth of the cost, and predictable |
| The longer you run it | Gets more expensive | Gets smarter, capturing every fix as reusable knowledge |
Monitoring is not resolution. Datadog shows you the fire. The Prod Ops Agent puts it out, then keeps your production running so your engineers don't have to.
You don't have to choose between the bill and the build
The question is not really which dashboard you license. It is whether your engineers should spend their careers holding production up by hand, or building the products that move your business.
You would not build your own Datadog. There is no reason to build your own AI SRE either, and even less reason to keep feeding a model that profits from your growing data volume while your team burns out on the alerts it generates. The Prod Ops Agent gives you prevention, resolution, and operation as outcomes, at roughly a tenth of the cost, with the predictability your finance team has been asking for. It is, plainly, the better way to run production.
While DASH spends this week celebrating the dashboard, a quieter and more consequential shift is already underway in the orgs that decided they would rather not staff a war room at all. The legacy era of observability was about showing you the fire and metering you for the privilege. The next one is about putting it out before you smell smoke. That era is here, it wins on every dimension that matters, and the only real question left is how long you keep paying for the old one.
Sources
JPMorgan / The New Stack and The Pragmatic Engineer on the Coinbase figure; CostBench on median contract data; Datadog's own documentation on custom-metrics billing; Better Stack and Sysdig on cardinality and container overage; BigGo and community analysis on the build-vs-buy tipping point. NeuBird AI performance figures reflect the current product benchmarks.



