What is Toil in SRE?

Definition

Toil is a concept defined by Google’s SRE team in the SRE Book as manual, repetitive, automatable work that scales linearly with system size, is reactive rather than proactive, and doesn’t provide enduring value. It’s the operational busywork that keeps systems running but doesn’t make them better. Every hour spent on toil is an hour not spent on engineering work that would prevent the toil from being necessary in the future.

Every Monday morning, an engineer on your team spends 45 minutes manually restarting a batch processing job that failed over the weekend. They’ve been doing this for eight months. Everyone knows the fix: rewrite the job scheduler to handle retries automatically. But that project keeps getting deprioritized in favor of new features. So every Monday, the same engineer does the same manual work, producing no lasting value.

That’s toil.

Defining Toil: The Five Characteristics

Not all operational work is toil. Alongside linear scaling with service size, the Google SRE Book identifies five characteristics that distinguish toil from valuable engineering work:

1. Manual. If a human has to execute it, it’s a toil candidate. Running a script manually counts. Writing the script that automates the task does not.

2. Repetitive. If you’re doing it for the first time, it’s not toil; it’s learning. If you’re doing it for the tenth time and it looks the same every time, it’s toil.

3. Automatable. If the task requires human judgment or creative problem-solving each time, it’s not toil even if it’s repetitive. Toil is work that could be done by a machine if someone invested the time to build the automation.

4. Tactical, not strategic. Toil is reactive. It responds to the current state of the system rather than improving the system’s fundamental design. Restarting a crashed service is toil. Redesigning the service so it doesn’t crash is engineering.

5. No enduring value. After completing the work, the system is back to its previous state, not better. Clearing disk space so a service can keep running is toil. Building automated log rotation so disk space never fills up is engineering.
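The five characteristics above can be read as a checklist. The following is a minimal sketch of that checklist, not a definition from the SRE Book: the `Task` fields, the example tasks, and the rule that all five signals must match are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical description of one recurring operational task."""
    name: str
    manual: bool          # a human has to execute it
    repetitive: bool      # the same steps every time
    automatable: bool     # no fresh human judgment needed
    tactical: bool        # reacts to system state instead of improving design
    enduring_value: bool  # leaves the system better than before


def toil_signals(task: Task) -> int:
    """Count how many of the five characteristics the task matches."""
    return sum([
        task.manual,
        task.repetitive,
        task.automatable,
        task.tactical,
        not task.enduring_value,
    ])


def is_toil(task: Task) -> bool:
    # Assumption: all five signals must match. A team could pick a softer threshold.
    return toil_signals(task) == 5


restart = Task("restart crashed service", True, True, True, True, False)
postmortem = Task("write postmortem", True, False, False, False, True)

print(is_toil(restart), is_toil(postmortem))
```

A count rather than a plain yes/no is kept so that borderline tasks (say, three or four signals) can be discussed instead of silently dropped.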

Examples of Toil

  • Manually restarting services that crash periodically
  • Running deployment scripts by hand instead of through CI/CD
  • Manually scaling infrastructure in response to traffic changes
  • Rotating certificates one at a time across dozens of services
  • Copying data between environments for debugging
  • Manually triaging and routing alerts that could be auto-classified
  • Updating configuration files across multiple servers
  • Manually running data cleanup jobs on a schedule

What is NOT Toil

  • Writing postmortems after incidents (this produces lasting improvements)
  • Investigating a novel failure mode (this builds understanding)
  • Designing and building automation (this eliminates future toil)
  • On-call work that involves genuine problem-solving and judgment
  • Meetings to plan reliability improvements

Why Toil is Dangerous

Toil might seem manageable at any given moment. “It only takes 15 minutes.” But toil’s damage compounds over time in ways that aren’t immediately obvious.

Toil scales with system size. If restarting a service takes 5 minutes and you have 10 services, that’s 50 minutes of potential toil per incident. Add 20 more services and it’s 2.5 hours. Systems grow. Toil grows with them. Teams that don’t actively reduce toil eventually drown in it.
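The arithmetic in this paragraph is a simple linear model. As a rough sketch, assuming every service needs the same 5-minute manual step (the function name and the fixed per-service cost are illustrative assumptions):

```python
def toil_minutes(services: int, minutes_per_service: float = 5.0) -> float:
    """Manual effort grows in direct proportion to the number of services."""
    return services * minutes_per_service


for count in (10, 30, 100):
    hours = toil_minutes(count) / 60
    print(f"{count} services: {hours:.1f} hours of manual work")
```

Automation changes the shape of this curve: the cost of building it is paid once, while the per-service manual cost is paid every time.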

Toil crowds out engineering. There are only so many hours in a week. Every hour spent on manual operational tasks is an hour not spent on building automation, improving reliability, or developing features. Teams trapped in toil cycles fall further behind over time because they can’t invest in the improvements that would break the cycle.

Toil burns out engineers. SREs and operations engineers are typically skilled software engineers who chose operations because they find the problems interesting. Spending 80% of their time on manual, repetitive tasks that don’t challenge them intellectually leads to frustration, disengagement, and turnover.

Toil creates risk. Manual processes are error-prone. A script run by hand might work correctly 99 times and fail catastrophically on the 100th because the engineer was tired, distracted, or working from an outdated runbook. Automation executes the same way every time.

Google’s 50% Rule

The SRE Book establishes a clear guideline: SRE teams should spend no more than 50% of their time on toil and operational work. The remaining 50% should be spent on engineering work: building automation, improving systems, and reducing future toil.

If a team’s toil exceeds 50%, the SRE Book considers this a management problem that requires intervention: either hire more people, redirect engineering effort toward toil reduction, or push back on the systems generating the most toil.

This isn’t an aspirational target. It’s a hard line. Teams that consistently exceed 50% toil are in an unsustainable state. They’ll fall further behind on automation, their toil will continue growing, and their best engineers will leave.

Measuring Toil

You can’t reduce toil if you don’t measure it. Practical approaches include:

Time tracking. Have team members log how they spend their time for 2-4 weeks, categorizing activities as toil vs. engineering. This is imperfect (people estimate poorly and categories are ambiguous), but it provides a baseline.

Ticket analysis. Review your team’s operational tickets from the last quarter. How many represent repetitive, manual work? What percentage of total work hours do they consume?

On-call analysis. Review on-call incident data. How many pages triggered manual work that could have been automated? What’s the average time spent per page?

Toil per service. Track which services generate the most operational toil. This identifies where automation investment would have the highest ROI.
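Per-service tracking can start as a small aggregation over exported ticket data. This is a sketch under assumptions: the record layout `(service, minutes, is_toil)`, the service names, and the sample numbers are all hypothetical, and real data would come from whatever ticketing system the team uses.

```python
from collections import defaultdict

# Hypothetical ticket records: (service, minutes spent, was it repetitive manual work?)
tickets = [
    ("billing-api", 30, True),
    ("billing-api", 45, True),
    ("search", 20, False),
    ("search", 15, True),
    ("auth", 60, False),
]


def toil_by_service(records):
    """Sum toil minutes per service and return them sorted, worst offender first."""
    totals = defaultdict(int)
    for service, minutes, is_toil in records:
        if is_toil:
            totals[service] += minutes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = toil_by_service(tickets)
for service, minutes in ranking:
    print(f"{service}: {minutes} toil minutes")
```

Services with no toil-tagged tickets simply do not appear in the ranking, which keeps the list focused on automation candidates.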

A healthy team’s toil breakdown might look like: 30% of time on operational work (some of which is genuinely valuable, like incident response and postmortems), 70% on engineering. A team in trouble might show: 70% operational work, 30% engineering, with the operational work dominated by repetitive manual tasks.
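A time-tracking log can be reduced to a single toil fraction and compared against the 50% guideline. The sketch below assumes a log of `(category, hours)` entries with the two categories `"toil"` and `"engineering"`; the category names, the sample week, and the `TOIL_CAP` constant are illustrative assumptions.

```python
TOIL_CAP = 0.5  # the 50% guideline, expressed as a fraction


def toil_fraction(entries):
    """entries: iterable of (category, hours) with category 'toil' or 'engineering'."""
    entries = list(entries)
    total = sum(hours for _, hours in entries)
    if total == 0:
        return 0.0  # nothing logged yet
    toil = sum(hours for category, hours in entries if category == "toil")
    return toil / total


# Hypothetical week for one engineer: 14 toil hours out of 40 logged hours.
week = [("toil", 6), ("engineering", 10), ("toil", 8), ("engineering", 16)]

fraction = toil_fraction(week)
over_cap = fraction > TOIL_CAP
print(f"toil fraction: {fraction:.0%}, over cap: {over_cap}")
```

Because self-reported categories are imprecise, the trend across several weeks is likely more informative than any single week's number.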

How to Eliminate Toil

Prioritize by frequency and impact

Not all toil is equally worth automating. Prioritize based on:

  • Frequency: How often does this task occur? Daily toil is more expensive than monthly toil.
  • Duration: How long does each occurrence take? Small tasks add up: a 5-minute task done every day costs about 7.5 hours over a 90-day quarter, more than the 6 hours a 2-hour monthly task costs.
  • Risk: How error-prone is the manual process? Tasks that can cause data loss or outages when done incorrectly should be automated first.
  • Scalability: Will this toil grow as the system scales? Prioritize tasks that will get worse over time.
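One way to combine the four factors above is a multiplicative score. This is only a sketch: the `risk` and `growth` multipliers, the backlog items, and the scoring formula itself are assumptions, and a team may weight the factors differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToilItem:
    name: str
    runs_per_quarter: int   # frequency
    minutes_per_run: float  # duration
    risk: float             # assumed multiplier: 1.0 = low risk, higher = more error-prone
    growth: float           # assumed multiplier: 1.0 = flat, higher = grows with the system


def priority(item: ToilItem) -> float:
    """Quarterly minutes spent, scaled up for risky or fast-growing toil."""
    return item.runs_per_quarter * item.minutes_per_run * item.risk * item.growth


# Hypothetical backlog of toil sources.
backlog = [
    ToilItem("daily cache flush", 90, 5, 1.0, 1.0),
    ToilItem("monthly report export", 3, 120, 1.0, 1.0),
    ToilItem("manual cert rotation", 12, 30, 3.0, 1.5),
]

ranked = sorted(backlog, key=priority, reverse=True)
for item in ranked:
    print(f"{item.name}: {priority(item):.0f}")
```

In this made-up backlog the certificate rotation ranks first even though it consumes less raw time than the daily task, because its risk and growth multipliers are higher.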

Build runbook automation

For the most common toil-generating tasks, convert the manual runbook into automated workflows. Start with semi-automated (human triggers the script) and progress to fully automated (triggered by alerts or schedules).
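The progression from semi-automated to fully automated can use the same runbook code with a different trigger. In this sketch the step functions, the `ctx` dictionary, and the `confirm` callback are all hypothetical stand-ins; real steps would call the team's own tooling.

```python
from typing import Callable, Optional

Step = Callable[[dict], None]


# Hypothetical runbook steps; each one only records what it would do.
def check_disk(ctx: dict) -> None:
    ctx["log"].append("checked disk usage")


def rotate_logs(ctx: dict) -> None:
    ctx["log"].append("rotated logs")


def verify_free_space(ctx: dict) -> None:
    ctx["log"].append("verified free space")


RUNBOOK = [check_disk, rotate_logs, verify_free_space]


def run_runbook(steps, ctx: dict,
                confirm: Optional[Callable[[str], bool]] = None) -> bool:
    """Run steps in order. If confirm is given, ask before each step."""
    for step in steps:
        if confirm is not None and not confirm(step.__name__):
            ctx["log"].append(f"stopped before {step.__name__}")
            return False
        step(ctx)
    return True


# Fully automated: an alert handler or scheduler calls this with no confirmation.
auto_ctx = {"log": []}
auto_done = run_runbook(RUNBOOK, auto_ctx)

# Semi-automated: a human approves each step; here the operator declines one.
manual_ctx = {"log": []}
manual_done = run_runbook(RUNBOOK, manual_ctx,
                          confirm=lambda name: name != "rotate_logs")
```

Keeping the confirmation hook optional lets a team build trust in the steps first and remove the human from the loop later without rewriting the runbook.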

Invest in self-service tooling

Much SRE toil comes from other teams requesting help: “Can you scale this up?” “Can you rotate this certificate?” “Can you give me access to these logs?” Building self-service tools that let development teams handle these tasks themselves eliminates the SRE bottleneck.

Use error budgets to justify investment

When SLO violations are driven by toil-related errors (manual process failures, delayed responses), the error budget data provides a concrete business case for automation investment.
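To make that business case concrete, the downtime attributed to manual process failures can be expressed as a share of the error budget. The sketch below assumes an availability SLO measured over a 30-day window; the incident list, its `toil_related` flag, and the 99.9% target are made-up inputs for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


# Hypothetical incidents: (description, downtime minutes, caused by a manual process?)
incidents = [
    ("manual deploy typo", 12.0, True),
    ("upstream outage", 8.0, False),
    ("late manual failover", 9.0, True),
]

budget = error_budget_minutes(0.999)
toil_downtime = sum(minutes for _, minutes, toil_related in incidents if toil_related)
toil_share_of_budget = toil_downtime / budget

print(f"budget: {budget:.1f} min, toil-related downtime: {toil_downtime:.0f} min "
      f"({toil_share_of_budget:.0%} of budget)")
```

A figure like "nearly half of this month's error budget went to manual process failures" tends to be easier to prioritize than a general complaint about toil.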

How AI Reduces Toil

AI-driven platforms directly target the toil that’s hardest to automate with traditional scripts: the investigation and decision-making that happens during incidents.

Traditional runbook automation handles known, predictable tasks well. But a large portion of SRE toil comes from investigating alerts, triaging incidents, and performing diagnostic work that doesn’t follow a fixed script. This is where AI agents add the most value.

NeuBird AI automates the investigative toil by querying observability tools, correlating signals, and producing diagnoses without manual human effort. It also continuously identifies automation opportunities: recurring manual tasks that could be converted to automated skills, observability gaps that generate unnecessary investigation toil, and operational patterns that suggest systemic improvements.

The result is a direct reduction in the percentage of time teams spend on toil, freeing engineers to focus on the strategic, high-value work that makes systems better.

Key Takeaways

  • Toil is manual, repetitive, automatable work that scales with system size and produces no lasting value. It’s distinct from valuable operational work like incident response and system design.
  • Google SRE’s 50% rule: no more than half of an SRE’s time should go to toil. Exceeding this threshold is unsustainable and requires management intervention.
  • Toil compounds over time: it scales with system growth, crowds out engineering work, burns out engineers, and introduces risk through manual processes.
  • Measure toil through time tracking, ticket analysis, and on-call data. You can’t reduce what you don’t measure.
  • AI tools reduce the hardest-to-automate toil: investigative work, incident diagnosis, and the ongoing identification of automation opportunities.

Frequently Asked Questions

What is toil in SRE?

Toil is manual, repetitive, automatable operational work that scales linearly with system size and produces no lasting value. It was defined by the Google SRE team and includes activities like manually restarting services, running deployment scripts, and rotating certificates one at a time.

What is NOT toil?

Engineering work that produces lasting improvements is not toil. This includes writing postmortems, investigating novel failure modes, building automation, designing reliability improvements, and on-call work that requires real problem-solving and judgment.

What is Google's 50% rule for toil?

The Google SRE Book establishes that no more than 50% of an SRE’s time should go to toil and operational work. The other 50% must be engineering work that improves systems and reduces future toil. Exceeding 50% is considered unsustainable and requires intervention.

How do I measure toil?

Common methods include time tracking (engineers categorize their hours as toil vs. engineering), ticket analysis (review operational tickets for repetitive work), on-call data analysis, and per-service toil tracking to identify where automation would have the highest impact.

What's the difference between toil and useful operational work?

Toil is repetitive, automatable, and produces no lasting value. Useful operational work may be just as hands-on, but it builds knowledge, creates lasting improvements, or requires real human judgment. Restarting a crashed service is toil; investigating why it crashed is engineering.

How do I reduce toil on my team?

Start by measuring it. Identify the highest-impact toil sources (frequency × duration × risk). Build automation for the most painful items first. Invest in self-service tooling so other teams can handle their own routine tasks. Use error budgets to justify continued investment. For the investigative toil that traditional runbook automation can’t touch (incident triage, cross-tool correlation, diagnostic work), an AI-driven platform like NeuBird AI is the most effective option.

Does AI eliminate toil?

AI directly targets the hardest-to-automate toil: investigation and decision-making during incidents. Traditional runbook automation handles known patterns; AI agents handle the open-ended diagnostic work that traditionally required experienced engineers. Purpose-built platforms like NeuBird AI are specifically designed to eliminate investigative toil by autonomously querying observability data, correlating signals, and producing diagnoses. The result is a meaningful reduction in operational toil and engineers freed up for higher-value work.

Who coined the term "toil" in SRE?

The specific definition of toil used in SRE was formalized by Google’s SRE team in the SRE Book, published in 2016. The book includes a dedicated chapter (“Eliminating Toil”) that defines toil’s characteristics and explains why limiting it is critical for sustainable operations. The general concept of operational toil predates this, but Google’s framing made it a discipline-defining concept.

Is on-call considered toil?

On-call work that involves real problem-solving and judgment is not toil. On-call work that’s just running through the same scripts repeatedly because the system keeps breaking the same way is toil. The distinction is whether the work produces lasting value (engineering improvement) or just maintains the status quo.

What's an example of toil in software engineering?

Common examples include: manually restarting services that crash repeatedly, copying data between environments for debugging, manually rotating certificates, running deployment scripts by hand, manually scaling infrastructure for predictable traffic patterns, and clearing disk space when alerts fire instead of fixing the underlying log rotation issue.

Is all manual work toil?

No. Manual work that requires judgment, builds knowledge, or produces lasting improvements is not toil. Writing a postmortem is manual but valuable. Investigating a novel failure is manual but builds understanding. Designing a reliability improvement is manual but creates lasting value. Toil is specifically the manual work that’s repetitive, automatable, and produces nothing lasting.
