What Is Production Readiness?

Definition

Production readiness is the practice of verifying that a service meets a defined set of reliability, observability, and operational standards before it’s deployed to production.

Your team has been building a new recommendation engine for three months. The code is tested, the product manager has signed off, and everyone’s ready to ship. But has anyone asked: does this service have health checks? Can we roll it back in under a minute? Is there an on-call rotation that covers it? Do the dashboards exist? Does anyone besides the original developer know how to debug it?

Production readiness is the gate between “this code works” and “this code is safe to run in front of real users, at scale, at 3 AM when the person who built it is on vacation.”

Google formalized this concept as the Production Readiness Review (PRR) in its SRE Book, and the PRR has since become standard practice at companies that take operational reliability seriously.

Why Production Readiness Matters

The gap between “works in staging” and “works in production” is where most operational pain lives. A service that passes all its tests in a controlled environment can fail in production for dozens of reasons: unexpected traffic patterns, dependency failures, resource contention, configuration differences, network partitions, and more.

Production readiness reviews catch these gaps before they become 3 AM incidents. They force teams to think about operational concerns (monitoring, alerting, rollback, capacity, documentation) that are easy to deprioritize during feature development.

Without a production readiness process, organizations tend to discover operational gaps reactively, through incidents. The on-call engineer gets paged for a new service, discovers there are no runbooks, the logs aren’t being collected, and the only person who understands the architecture is unavailable. This is expensive, stressful, and entirely preventable.

The Production Readiness Checklist

A production readiness checklist varies by organization, but most cover the same core areas. Here’s a comprehensive template:

Observability

[ ] Service emits structured logs to the centralized logging platform

[ ] Key metrics are instrumented: request rate, error rate, latency (RED metrics)

[ ] Distributed tracing is configured and spans are propagated correctly

[ ] Dashboards exist showing service health at a glance

[ ] SLOs are defined with error budgets
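
The last item above, an SLO with an error budget, comes down to simple arithmetic. The following is a minimal sketch, assuming an availability-style SLO expressed as a fraction (for example 0.999) and a 30-day window; the function names are hypothetical and not taken from any particular tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction such as 0.999")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns a negative number when the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction such as 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

With these assumptions, `error_budget_minutes(0.999)` works out to roughly 43.2 minutes per 30 days, and a service that has failed 250 of 1,000,000 requests against a 99.9% target has about three quarters of its budget left.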

Alerting

[ ] Alerts exist for SLO violations (not just raw metric thresholds)

[ ] Alert fatigue risk has been assessed: alerts are actionable and not duplicative

[ ] Alerts route to the correct on-call rotation

[ ] Escalation policies are configured
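
One common way to alert on SLO violations rather than raw thresholds is to alert on burn rate: how fast the error budget is being spent relative to a pace that would exactly exhaust it by the end of the window. The sketch below is illustrative only. The two-window check and the default threshold of 14.4 are assumptions chosen for the example (a burn rate of 14.4 sustained for one hour spends 2% of a 30-day budget), and the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    1.0 means the budget is being spent exactly on pace; higher is faster.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, which keeps the alert actionable.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

Requiring both windows to exceed the threshold is one way to address the alert fatigue item: an error spike that has already recovered no longer satisfies the short window, so nobody is paged for it.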

Reliability

[ ] Health check endpoints are implemented and registered with the load balancer

[ ] Graceful shutdown is implemented (drain connections, complete in-flight requests)

[ ] Retry logic includes backoff and circuit breakers to prevent cascade failures

[ ] Resource limits (CPU, memory) are configured for containerized services

[ ] The service degrades gracefully under partial dependency failures
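
To make the retry item concrete, here is a minimal sketch of capped exponential backoff with jitter combined with a simple circuit breaker. It is an illustration under stated assumptions, not a production implementation: the class and function names, the thresholds, and the half-open behavior are all hypothetical choices, and a real service would more likely use a maintained resilience library.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected unattempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure re-opens the breaker immediately.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def retry_with_backoff(fn, attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Retry fn with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not keep retrying a dependency the breaker gave up on
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

A caller might wrap a dependency call as `retry_with_backoff(lambda: breaker.call(fetch))`, where `fetch` is a placeholder for the real request. The backoff spreads retries out, and the breaker stops them entirely once the dependency looks down, which is what prevents a slow dependency from turning into a cascade.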

Deployment and Rollback

[ ] Deployment is automated through CI/CD

[ ] Rollback procedure is documented and tested (can be executed in under 5 minutes)

[ ] Feature flags are available for new functionality

[ ] Canary or blue-green deployment strategy is configured

[ ] Database migrations are backward-compatible (support rollback without data loss)

Capacity and Scaling

[ ] Load testing has been performed at expected peak traffic + headroom

[ ] Auto-scaling is configured with appropriate min/max bounds

[ ] Resource requests and limits reflect actual usage patterns

[ ] Dependency capacity has been validated (databases, queues, third-party APIs can handle the expected load)
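
“Expected peak traffic + headroom” can be turned into a starting instance count with a short calculation. This is a rough sketch, assuming per-instance throughput has already been measured in a load test; the 30% headroom default, the minimum of two instances, and the function name are illustrative assumptions rather than recommendations.

```python
import math


def instances_needed(peak_rps: float,
                     per_instance_rps: float,
                     headroom: float = 0.3,
                     min_instances: int = 2) -> int:
    """Instances required to serve peak traffic plus headroom.

    per_instance_rps is the sustainable throughput of one instance as
    measured under load, not a theoretical maximum.
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    target_rps = peak_rps * (1 + headroom)
    return max(min_instances, math.ceil(target_rps / per_instance_rps))
```

For an expected peak of 1,200 requests per second and instances that each sustain 150, this yields 11 instances. A number like that is a reasonable input for the auto-scaling maximum, and the same target load is what databases, queues, and third-party APIs downstream need to be validated against.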

Security

[ ] Authentication and authorization are implemented

[ ] Secrets are managed through a secrets manager (not environment variables or config files)

[ ] Data is encrypted in transit (TLS) and, where applicable, at rest

[ ] Dependency vulnerabilities have been scanned

Documentation and Operations

[ ] Architecture diagram exists and is current

[ ] Runbooks cover common failure modes and mitigation procedures

[ ] On-call rotation includes team members who understand the service

[ ] Dependencies are documented (upstream and downstream)

[ ] Contact information for the owning team is up to date

The Google Production Readiness Review

Google’s SRE team formalized the Production Readiness Review as a structured process where an SRE team evaluates a service’s operational fitness. The SRE Book describes it as the gate that determines whether SRE takes on operational responsibility for a service.

Key principles from Google’s approach:

It’s collaborative, not adversarial. The PRR isn’t an audit. It’s a collaboration between the development team and the SRE team to identify and close operational gaps together. The SRE team brings operational expertise; the development team brings system knowledge.

It’s a living process. Production readiness isn’t a one-time gate. Systems change, traffic patterns shift, and new dependencies get added. Regular re-reviews ensure that a service that was production-ready six months ago is still production-ready today.

It has clear criteria. The review isn’t based on gut feel. There’s a defined checklist of requirements, and each item has clear acceptance criteria. This makes the process repeatable and reduces arguments about whether a service is “ready enough.”

It’s proportional to risk. Not every service needs the same level of review. A customer-facing payment service requires more rigorous review than an internal batch processing job. Google uses a tiered approach based on the service’s criticality and blast radius.
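
The “clear criteria” and “proportional to risk” principles can be combined by treating the checklist as data, with each item carrying an acceptance criterion and the tiers it applies to. The sketch below is a hypothetical illustration: the tier names, the item keys, and which tier each item applies to are assumptions made for the example and are not Google’s actual scheme.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    CRITICAL = 1   # e.g. a customer-facing payment service
    STANDARD = 2
    INTERNAL = 3   # e.g. an internal batch processing job


@dataclass(frozen=True)
class ChecklistItem:
    key: str
    criterion: str         # acceptance criterion, phrased so it can be checked
    required_up_to: Tier   # least critical tier that must still satisfy it


CHECKLIST = [
    ChecklistItem("health_check", "Health endpoint registered with the load balancer", Tier.INTERNAL),
    ChecklistItem("runbook", "Runbook covers common failure modes", Tier.STANDARD),
    ChecklistItem("rollback_tested", "Rollback rehearsed in under 5 minutes", Tier.STANDARD),
    ChecklistItem("load_test", "Load test run at expected peak plus headroom", Tier.CRITICAL),
]


def required_items(tier: Tier) -> list:
    """Items a service of the given tier must satisfy."""
    return [item for item in CHECKLIST if tier <= item.required_up_to]


def open_gaps(tier: Tier, passed: set) -> list:
    """Keys of required items that have not been marked as passed."""
    return [item.key for item in required_items(tier) if item.key not in passed]
```

With this layout, an internal batch job is only held to the health check, while a critical service must satisfy every item. Because the criteria live in one place, a disagreement about whether a service is “ready enough” becomes a question about a specific item.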

Common Pitfalls

Treating it as a one-time gate. A service passes the PRR at launch and then drifts out of compliance over the next year as changes accumulate, team members leave, and documentation goes stale. Production readiness needs periodic re-evaluation.

Checkbox culture. When the checklist becomes a bureaucratic hurdle, teams check boxes to get through the review without actually addressing the underlying concerns. “Yes, we have a runbook” (it’s a blank template). “Yes, we have alerts” (they’re the defaults that ship with the framework).

Too late in the process. If teams only think about production readiness the week before launch, the gaps they find require either delaying the launch or cutting corners. Production readiness concerns should be addressed incrementally throughout development, not crammed into a final review.

Focusing only on the service, not its dependencies. A service might be perfectly instrumented and well-documented, but if it depends on a database with no monitoring or a third-party API with no fallback, it’s not production-ready. The review needs to consider the entire dependency chain.

How AI Tools Support Production Readiness

AI-driven operational platforms contribute to production readiness in two ways.

Gap identification. By continuously analyzing your production environment, AI agents can identify services that lack proper observability, have no runbooks, or show patterns that suggest operational risk. NeuBird AI continuously monitors telemetry patterns and can surface services where monitoring coverage is incomplete, alert rules are missing, or incident management processes have gaps.

Ongoing compliance. Production readiness isn’t just a launch gate. AI can continuously verify that services remain compliant: dashboards still work, alert rules haven’t been accidentally deleted, runbooks still reference valid endpoints, and on-call rotations still have the right people. This turns production readiness from a point-in-time review into a continuous process.
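
Even without an AI platform, the kind of drift described above can be approximated with simple rule-based checks run on a schedule. The following is a minimal sketch, assuming service metadata has already been collected into a snapshot; the `ServiceSnapshot` fields, the 180-day runbook age limit, and the finding messages are hypothetical and would come from your own monitoring and on-call tooling.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ServiceSnapshot:
    name: str
    alert_rules: int                        # number of alert rules currently defined
    dashboard_ok: bool                      # dashboard exists and loads
    runbook_last_reviewed: Optional[date]   # None if no runbook exists
    on_call: List[str] = field(default_factory=list)


def drift_findings(svc: ServiceSnapshot,
                   today: date,
                   max_runbook_age: timedelta = timedelta(days=180)) -> List[str]:
    """Return readiness items this service no longer satisfies."""
    findings = []
    if svc.alert_rules == 0:
        findings.append("no alert rules")
    if not svc.dashboard_ok:
        findings.append("dashboard missing or broken")
    if svc.runbook_last_reviewed is None:
        findings.append("no runbook")
    elif today - svc.runbook_last_reviewed > max_runbook_age:
        findings.append("runbook stale")
    if not svc.on_call:
        findings.append("on-call rotation empty")
    return findings
```

Running a check like this for every service and reporting any non-empty result is a basic form of the continuous verification described above: a deleted alert rule or an emptied rotation shows up as a finding rather than as a surprise during an incident.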

Key Takeaways

  • Production readiness is the practice of verifying a service meets reliability, observability, and operational standards before (and after) going live in production.
  • A comprehensive checklist covers observability, alerting, reliability, deployment/rollback, capacity, security, and documentation.
  • Google’s Production Readiness Review (PRR) established the industry standard: collaborative, criteria-based, proportional to risk, and ongoing rather than one-time.
  • Common pitfalls include treating it as a one-time gate, checkbox culture, starting too late, and ignoring dependency readiness.
  • AI tools can identify production readiness gaps automatically and verify ongoing compliance as systems evolve.

Frequently Asked Questions

What is production readiness?

Production readiness is the practice of verifying that a service meets a defined set of reliability, observability, and operational standards before being deployed to production. It’s the gate between “this code works” and “this code is safe to run in front of real users.”

What is a Production Readiness Review (PRR)?

A PRR is a structured process where an SRE or operations team evaluates a service’s operational fitness against a defined checklist. It has become standard practice at organizations that take operational reliability seriously.

What goes on a production readiness checklist?

A comprehensive checklist covers observability (logs, metrics, traces), alerting (SLO-based, actionable), reliability (health checks, graceful shutdown, retry logic), deployment and rollback procedures, capacity and scaling, security, and documentation including runbooks.

Who is responsible for production readiness?

It’s a shared responsibility between the development team (which knows the service) and the SRE or operations team (which brings operational expertise). It is a collaborative process, not an audit.

How long does a production readiness review take?

For a moderately complex service, a thorough PRR typically takes 1-3 weeks of effort spread across the development and SRE teams. The timeline depends on how mature the service is at the start of the review and how many gaps need to be closed.

Should production readiness be a one-time gate?

No. Systems change, dependencies shift, and team members leave. Production readiness needs periodic re-evaluation, ideally annually for critical services. Treating it as a one-time launch gate leads to drift and operational debt.

Can production readiness be automated?

Parts of it can. Automated checks for things like health endpoints, alert rule presence, dashboard existence, and runbook references can verify ongoing compliance. AI-driven platforms like NeuBird AI can continuously monitor for production readiness gaps as systems evolve, surfacing services where monitoring coverage is incomplete, alert rules are missing, or runbooks have gone stale. This turns production readiness from a one-time launch gate into ongoing operational intelligence.

Who does production readiness reviews at Google?

At Google, the SRE (Site Reliability Engineering) team conducts Production Readiness Reviews (PRRs) for services they take operational responsibility for. The review is a collaboration between the development team and the SRE team, not a one-sided audit. Google’s SRE Book describes the process in detail.

Is production readiness the same as go-live?

Not quite. Go-live is the act of deploying a service to production. Production readiness is the standard the service must meet before go-live (and continuously after). You can go live without being production-ready, but you’ll likely regret it. Production readiness is the gate; go-live is the moment you pass through it.

What’s the difference between dev ready and prod ready?

Dev ready means the code works in a development environment: features are built, basic tests pass, and the service runs locally or in staging. Prod ready means the service is operationally fit for real users at scale: monitoring, alerting, scaling, security, runbooks, and rollback all work. The gap between the two is often significant.

What is in the PRR checklist?

Common items include: SLOs defined, monitoring and alerting in place, on-call rotation established, runbooks documented, capacity planning completed, dependency analysis done, and disaster recovery tested. Most organizations adapt these principles to their own checklists.
