Building Guardrails Against Hallucinations in AI SRE Agents
Hallucinations are the failure mode that keeps AI developers up at night. For practitioners who worked in ML before the LLM boom, non-determinism has always been part of the job, expressed as error margins, confidence intervals, or false positive rates. What has changed is scale and visibility: LLM failures are now labeled as “hallucinations” and treated as exceptional, even though they are a natural outcome of probabilistic models.
The problem becomes more severe because modern applications rarely use a single LLM call. We are increasingly building agentic systems, composed of multiple LLM invocations with chaining, iteration, tool use, and self-reflection. In such systems, even a small upstream error, such as an incorrect timestamp or a malformed identifier, can propagate through subsequent steps and result in a completely incorrect final response.
This risk is particularly high in AI SRE agents. If an LLM incorrectly identifies the service name, error code, or resource to investigate, downstream reasoning can confidently converge on the wrong diagnosis. In operational systems, this makes hallucination mitigation a correctness and reliability concern, not just a quality issue.
Why hallucinations occur is outside the scope of this article, but in short they are influenced by several factors: the underlying model, token count, token composition, instruction ambiguity, decoding parameters such as temperature, and even variability in model-serving infrastructure. The important conclusion is that hallucinations cannot be fully eliminated at the model level.
That said, while individual LLM calls are non-deterministic, the overall system does not have to be. These are still software systems with probabilistic components, and established engineering practices apply.
1. Treat LLM systems as production software
Non-deterministic components do not remove the need for rigor. Write evaluation suites, unit tests, and end-to-end tests that explicitly measure system behavior under realistic failure modes. Testing remains the primary mechanism for reducing uncertainty.
The challenge with testing agentic AI applications usually lies in the input space: real-world inputs can vary far beyond the handful of cases developers exercised during development. For traditional ML models, by contrast, testing means measuring performance against a large, curated dataset. It is hard to curate a dataset that covers every possible input to the system, but it is important to start small and steadily increase the diversity of examples in your dataset.
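As a starting point, an evaluation suite can be a simple harness that scores field-level accuracy over a small curated dataset of (input, expected-output) pairs. A minimal sketch, where `extract_incident` is a hypothetical wrapper around the system under test (an LLM call in practice) that returns a dict of fields:

```python
def evaluate(extract_incident, dataset):
    """Score field-level accuracy over a dataset of (input, expected) pairs."""
    correct = total = 0
    for text, expected in dataset:
        actual = extract_incident(text)
        for field, value in expected.items():
            total += 1
            if actual.get(field) == value:
                correct += 1
    return correct / total if total else 0.0

# Start small: a few realistic incidents, then grow diversity over time.
dataset = [
    ("Service validation failing with 503 in us-west1",
     {"service_name": "validation", "error_code": "503", "region": "us-west1"}),
]

# Stand-in extractor so the sketch runs end to end; replace with your agent.
def fake_extractor(text):
    return {"service_name": "validation", "error_code": "503", "region": "us-west1"}

accuracy = evaluate(fake_extractor, dataset)
```

Tracking this accuracy metric over time, as the dataset grows, turns hallucination rate into a measurable regression signal rather than an anecdote.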
2. Use structured and typed outputs
The more structured the LLM output, the easier it is to validate. Enforcing schemas, typed fields, and constrained formats significantly reduces silent failures and makes downstream checks deterministic. Many popular model providers allow API calls to be made with detailed structured output schemas.
The key properties of structured outputs are:
- Explicit schema: Required fields, allowed values, and data types are defined upfront.
- Deterministic parsing: The response can be programmatically parsed without heuristics.
- Immediate validation: Invalid or incomplete outputs can be rejected before downstream use.
- Reduced hallucination surface: The model is constrained to fill known fields rather than invent narrative text.
In effect, structured outputs turn an LLM call into something closer to a typed function call than a text-generation task.
Example: Structured output in an API call
Below is an example using a schema-enforced response where the model must identify an SRE investigation target.
POST /v1/chat/completions
{
  "model": "gpt-4.1",
  "messages": [
    {
      "role": "system",
      "content": "You are an AI SRE assistant. Extract investigation details from the incident."
    },
    {
      "role": "user",
      "content": "Service validation is failing with error code 503 in us-west1 since 10:42 UTC."
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "incident_investigation",
      "schema": {
        "type": "object",
        "required": ["service_name", "error_code", "region", "start_time"],
        "properties": {
          "service_name": {
            "type": "string",
            "description": "Canonical service identifier"
          },
          "error_code": {
            "type": "string",
            "enum": ["500", "502", "503", "504"]
          },
          "region": {
            "type": "string",
            "pattern": "^[a-z]+-[a-z]+[0-9]+$"
          },
          "start_time": {
            "type": "string",
            "format": "date-time"
          }
        }
      }
    }
  }
}
A valid response from the model would look like:
{
  "service_name": "validation",
  "error_code": "503",
  "region": "us-west1",
  "start_time": "2026-01-22T10:42:00Z"
}
If the model:
- Omits a required field
- Returns an invalid enum value
- Produces malformed JSON
the response can be deterministically rejected and retried.
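This deterministic rejection step is straightforward to implement. Below is a minimal sketch that mirrors the schema above with plain Python checks (field names and rules are taken from the example; in practice a JSON Schema validator library could replace the manual checks):

```python
import json
import re

REQUIRED = {"service_name", "error_code", "region", "start_time"}
ALLOWED_CODES = {"500", "502", "503", "504"}
REGION_PATTERN = re.compile(r"[a-z]+-[a-z]+[0-9]+")

def validate_response(raw: str) -> dict:
    """Deterministically validate the model's structured output.

    Raises ValueError on any violation so the caller can retry the LLM call.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"malformed JSON: {e}")
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if data["error_code"] not in ALLOWED_CODES:
        raise ValueError(f"invalid error_code: {data['error_code']}")
    if not REGION_PATTERN.fullmatch(data["region"]):
        raise ValueError(f"invalid region: {data['region']}")
    return data
```

Because every rejection path raises before the data reaches downstream reasoning, an upstream hallucination fails loudly instead of propagating silently.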
3. Use models to verify models
Although LLMs are non-deterministic, techniques such as consistency checks, cross-validation, and self-reflection have proven effective in practice. Redundancy can mitigate variance.
Consistency checking: generate multiple independent responses (via re-sampling, prompt variation, or temperature sweeps) and compare them for agreement on critical fields. Disagreement is a strong signal of uncertainty and can be used to trigger retries, fallback logic, or human review.
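A minimal sketch of field-level consistency checking, assuming a hypothetical `sample_fn` wrapper that performs one LLM extraction call and returns a dict:

```python
from collections import Counter

CRITICAL_FIELDS = ("service_name", "error_code", "region")

def consistency_check(sample_fn, n=5, threshold=0.8):
    """Re-sample the model n times and measure agreement on critical fields.

    Returns (consensus, agreed); agreed=False should trigger a retry,
    fallback logic, or human review.
    """
    samples = [sample_fn() for _ in range(n)]
    consensus, agreed = {}, True
    for field in CRITICAL_FIELDS:
        counts = Counter(s.get(field) for s in samples)
        value, count = counts.most_common(1)[0]
        consensus[field] = value
        if count / n < threshold:
            agreed = False  # disagreement signals uncertainty on this field
    return consensus, agreed
```

Re-sampling multiplies cost and latency, so in practice this pattern is often reserved for the highest-stakes extraction steps rather than every call.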
Another effective technique is cross-validation using specialized prompts or models. For example, one model (or prompt) produces an answer, while another is tasked solely with verification—checking factual correctness, schema adherence, or alignment with known constraints. Importantly, the verifier’s scope should be narrower and more deterministic than the generator’s.
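A sketch of the verifier pattern, assuming a hypothetical `call_llm` wrapper around your provider's API. The verifier's scope is deliberately narrow: it answers only PASS or FAIL against explicit rules, rather than generating free-form text:

```python
def verify(candidate: dict, call_llm) -> bool:
    """Ask a separate model (or prompt) to verify a candidate extraction.

    `call_llm` is an assumed wrapper: prompt string in, response string out.
    The verifier checks constraints only; it does not regenerate the answer.
    """
    prompt = (
        "You are a verifier. Given the extracted incident below, answer with "
        "exactly PASS or FAIL.\n"
        "Rules: error_code must be one of 500, 502, 503, 504; region must "
        "match <word>-<word><digit>.\n"
        f"Candidate: {candidate}"
    )
    return call_llm(prompt).strip() == "PASS"
```

Because the verifier only emits a constrained verdict, its own output is trivially machine-checkable, which keeps the verification layer more deterministic than the generator it guards.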
Self-reflection and critique loops further improve reliability. In this pattern, the model is explicitly asked to inspect its own output for errors, missing assumptions, or violations of constraints. While not foolproof, this often catches obvious inconsistencies, incorrect identifiers, and logical gaps before results propagate downstream.
4. Control context window size aggressively
Modern LLMs support extremely large context windows, which can lead to overconfidence in passing large volumes of data. Beyond a task- and model-specific threshold, stuffing more data into the context window introduces additional non-determinism, increasing variance rather than accuracy, and raises hallucination risk.
The optimal token budget depends on the model and the task. The only reliable way to determine this limit is benchmarking: incrementally increase context size on a representative dataset and observe accuracy degradation.
Not all available information should be included. Minimizing irrelevant or weakly related context improves determinism and reduces the probability of spurious correlations.
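One way to enforce this is a greedy packing step that ranks candidate context items by relevance and stops at the benchmarked token budget. A sketch, where `score_fn` and `count_tokens` are assumptions standing in for, e.g., an embedding-similarity scorer and the model's tokenizer:

```python
def build_context(items, score_fn, token_budget, count_tokens):
    """Greedily select the most relevant context items within a token budget.

    items: candidate snippets (logs, configs, runbook excerpts, ...)
    score_fn: relevance score for an item (higher = more relevant); assumed
    count_tokens: token counter for an item; assumed (use your model's tokenizer)
    """
    selected, used = [], 0
    for item in sorted(items, key=score_fn, reverse=True):
        cost = count_tokens(item)
        if used + cost > token_budget:
            continue  # drop weakly related or oversized snippets
        selected.append(item)
        used += cost
    return selected
```

The `token_budget` here is exactly the threshold found by the benchmarking described above: the point before accuracy begins to degrade on your representative dataset.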
Hallucination rate is one of the key accuracy metrics in our AI SRE evaluation framework.
Taming Non-Determinism in Agentic AI for Production Reliability
In summary, hallucinations are not anomalies; they are an expected property of probabilistic systems. The solution is not to hope for perfect models, but to apply disciplined systems engineering: testing, validation, structure, redundancy, and controlled inputs.