
December 16, 2025 Thought Leadership

Telemetry Best Practices for AIOps

At AWS re:Invent, we shared how our AI Site Reliability Engineer (SRE) Agent, Hawkeye, can connect to telemetry services and provide automated incident response and root cause analysis. On more than one occasion in my discussions, I was asked about the kind of telemetry needed for an AI tool to effectively resolve alerts. Digging deeper, the stories I heard usually boiled down to one or more of the following:

  • They explored free tiers with tools like Datadog and New Relic
  • They use whatever comes with their cloud providers, like CloudWatch
  • They have a bundle of firing alerts that they ignore because it’s simply not a priority to address them

This is the reality for many agile, small organizations. There is one common theme:

There is never enough time for development, and telemetry takes a backseat.

For many of the hundreds of people we spoke with, setting up elaborate monitoring and telemetry is a luxury that can either be bought at a very high cost or must be forfeited at the eventual cost of a major outage and customer complaints. 

Having little to no telemetry configured in a small codebase (and organization) is okay. But as you grow, consider this: There is no replacement for fundamentally sound data. An engineer without proper telemetry during a crisis will analyze the wrong signals and waste time on irrelevant logs. This is why we take time with each of our customers to ensure that proper telemetry is set up (more on that later).

There is a middle ground between running elaborate tools with an army of engineers and letting your customers feel the full brunt of your latest commit: building a good foundation on three best practices. Log quality, business context, and standardization allow both humans and AI agents to be more effective at resolving issues.

Log Quality

Just as human operators struggle to correlate ambiguous or missing data, LLMs and AI agents cannot perform effectively when fed poor-quality telemetry. Logs are often the richest source of truth, and adding some structure to them can increase their effectiveness manyfold. Let’s take a look at some examples using CloudWatch.

1- Use Structured Logs

By structuring your logs in JSON format, you let CloudWatch automatically extract the fields and make them searchable using SQL-like queries.

Here is an example of a structured JSON log:

{
  "timestamp": "2025-12-12T20:15:42.123Z",
  "event": "request_processed",
  "user_id": "user_123",
  "client_ip": "10.0.0.5",
  "success": true,
  "latency_ms": 112
}

And an example CloudWatch Logs Insights query to view 100 results ordered by latency:

fields timestamp, event, user_id, client_ip, success, latency_ms
| sort latency_ms desc
| limit 100

Versus an unstructured log file:

[2025-12-12T20:15:42Z] request_processed user_id=user_123 client_ip=10.0.0.5 success=true latency=112
[2025-12-12T20:15:44Z] request_processed client_ip=10.0.0.6 user_id=user_124 latency=95 success=true
[2025-12-12T20:15:46Z] login_attempt user_id=user_125 client_ip=10.0.0.7 success=true latency=120

With the variability of an unstructured log, parse commands and regular expressions are needed to read every line and match the patterns:

parse @message /user_id=(?<user_id>[^\s]+)/
| parse @message /client_ip=(?<client_ip>[^\s]+)/
| parse @message /success=(?<success>[^\s]+)/
| parse @message /latency=(?<latency_ms>\d+)/
| parse @message /\[(?<timestamp>[^\]]+)\] (?<event>\w+)/
| display timestamp, event, user_id, client_ip, success, latency_ms
| sort latency_ms desc

This results in slower and more expensive queries. And while LLMs are very good at generating regular expressions, any deviation in the log format could render the query inaccurate. 
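
For completeness, here is a minimal sketch of emitting the structured log above from application code, using Python’s standard json and logging modules (the logger name and the log_request helper are illustrative, not part of any specific library):

import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("request-handler")

def log_request(event, user_id, client_ip, success, latency_ms):
    # One JSON object per line lets CloudWatch discover the fields automatically
    logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "user_id": user_id,
        "client_ip": client_ip,
        "success": success,
        "latency_ms": latency_ms,
    }))

log_request("request_processed", "user_123", "10.0.0.5", True, 112)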

2- Include Trace / Request IDs

Generate and pass along a unique ID across services to enhance tracing capabilities. 

import uuid

def service_a(user_id, client_ip, action):
    # Generate trace_id at the entry point of the request
    trace_id = str(uuid.uuid4())
    log_event("ServiceA received request", user_id, client_ip, action, trace_id)

    # Call downstream service with the same trace_id
    service_b(user_id, action, trace_id)
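
The log_event and service_b helpers above are not defined in this snippet; as a hypothetical sketch, log_event might simply attach the trace_id to a structured JSON log line, in the spirit of the previous section:

import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("service-a")

def log_event(message, user_id, client_ip, action, trace_id):
    # Carrying trace_id on every log line is what lets queries join the
    # full request journey across services
    logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
        "user_id": user_id,
        "client_ip": client_ip,
        "action": action,
        "trace_id": trace_id,
    }))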

Propagating the unique ID (trace_id) to downstream services allows AI agents to easily track the entire end-to-end request flow. Here is an example CloudWatch Logs Insights query that an agent can use to filter the full journey of a trace_id:

filter trace_id = "abc-123-def-456"
| sort @timestamp asc
| display @timestamp, service_name, message, latency_ms

3- Standardize Timestamps

Consistent timestamps enable cross-service and cross-region correlation. Using UTC with the ISO 8601 format ensures events are analyzed in the right sequence. Without standardization, logs could appear out of order across services or regions, which decreases the accuracy of root cause analysis.
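
As a small illustration using only Python’s standard library (the variable names are illustrative), every log line can be stamped in UTC and ISO 8601 like this:

from datetime import datetime, timezone

# Full ISO 8601 with an explicit UTC offset, e.g. 2025-12-12T20:15:42.123456+00:00
timestamp = datetime.now(timezone.utc).isoformat()

# Or the "Z"-suffixed millisecond form used in the log samples above
timestamp_z = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"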

Business Context

Adding business context to telemetry allows AI agents to surface decisions that previously required human judgment: prioritizing a single platinum customer’s failed transaction over hundreds of bronze-tier failures, automatically triggering executive escalations based on revenue impact rather than error count, or providing actionable insights like “payment gateway X costs 15% more but has 50% fewer failures for high-value transactions.” This is the difference between AI that simply summarizes alerts and AI that operates as an intelligent business partner.

And it’s not about massive infrastructure changes or expensive tools. It’s about adding 5-10 fields to your existing logs, identifying your business-critical code paths, and gradually expanding based on what questions you actually need answered:

logger.info("Payment processing failed", extra={
    "event_type": "payment.failed",
    "customer_tier": "platinum",
    "revenue_lost_usd": 8500,
    "churn_risk": 0.7,
    "error_type": "gateway_timeout",
    "escalation_required": True
})

Even with this simple example, you will be able to identify:

  • Total business impact of downtime
  • VIP customer crisis detection
  • Escalation queue for support
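
For instance, assuming the fields above are emitted as structured JSON, a CloudWatch Logs Insights query along these lines could surface the business impact by customer tier (a similar filter on the escalation_required field could feed the support escalation queue):

filter event_type = "payment.failed"
| stats sum(revenue_lost_usd) as total_revenue_lost_usd, count(*) as failed_payments by customer_tier
| sort total_revenue_lost_usd desc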

Standardization

Your organization likely uses a number of these key-value pairs: “env:prod”, “tier:1”, “region:west”. They are arbitrary tags that engineers choose to assign to application logs, infrastructure resources, and dashboards. You have your established toolsets, bookmarked dashboards, automated pipelines, and tribal know-how passed down on a Thursday afternoon.

AI agents don’t necessarily have that context. They thrive on structure and a consistent schema. For this reason, using an open source standard such as OpenTelemetry is recommended: it is widely supported across platforms, and its semantic conventions provide a shared vocabulary of attribute names.
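
As a minimal sketch with the OpenTelemetry Python SDK (the service name, version, and attribute values below are illustrative), resource attributes that follow the semantic conventions replace ad-hoc tags like “env:prod” with a consistent, machine-readable identity:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Semantic-convention attribute names instead of arbitrary per-team tags
resource = Resource.create({
    "service.name": "payment-service",
    "service.version": "1.4.2",
    "deployment.environment": "prod",
    "cloud.region": "us-west-2",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_payment") as span:
    # Business context travels with the span alongside the standard attributes
    span.set_attribute("customer.tier", "platinum")

The console exporter is only for illustration; in practice you would point the exporter at your collector or observability backend of choice.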

When your telemetry speaks a standardized language, AI systems can immediately understand relationships across your entire infrastructure. AI no longer interprets text; it navigates a knowledge graph where relationships, causality, and meaning are explicit. This is the difference between an AI trying to read a novel about your system versus an AI examining a structured database of system behavior.

What’s Next

Now that you have a solid foundation of telemetry, you can leverage AI agents to significantly reduce your mean time to resolution (MTTR) for incidents. NeuBird.ai offers an AI SRE agent called Hawkeye, which automatically responds to alerts, reads the relevant telemetry data, conducts a deep context analysis, and provides remediation steps within minutes. Best of all, it is available as a self-service free trial.

And, if you need a hand at configuring good telemetry for an AI agent, we can help you do that too.

Written by Andrew Lee
Technical Marketing Engineer
