June 5, 2025 Thought Leadership

Making KubeVirt Enterprise-Ready: Agentic SRE and the Future Beyond VMware

When Broadcom acquired VMware, it created more than industry headlines—it created an inflection point. For decades, VMware was the operational bedrock of enterprise IT. It wasn’t just about virtualization; it was the control plane for managing compute, bolstered by a rich ecosystem of observability, diagnostics, and IT automation tools.

Today, that control plane is shifting. Enterprises seeking a more cloud-native approach are rapidly exploring KubeVirt—an open-source extension of Kubernetes that enables VMs to run side-by-side with containers under a unified control plane. It’s elegant in theory, powerful in practice, but incomplete in one critical dimension: operability.

The Hidden Ingredient Behind VMware’s Success? Observability

VMware’s dominance was never just about hypervisors. Its real moat was its supporting ecosystem:

Telemetry tools that gave IT teams insight into what was happening
Remediation workflows that turned signals into actions
Compliance and diagnostics built into the fabric of VM management

That ecosystem meant enterprises could operate at scale and sleep at night.

But with KubeVirt, many of these layers are missing or fragmented. The Kubernetes-native world is rich with telemetry—from Prometheus to Datadog, OpenTelemetry, Splunk, New Relic, and more—but there’s no single operational glue that brings it together for virtual machine diagnostics, especially when VMs behave like legacy workloads in a modern cloud-native world.

The Problem Isn’t the Data—It’s the Noise

Modern telemetry is abundant, but context windows for reasoning (especially for GenAI agents) are narrow. Dumping metrics, logs, and traces into a dashboard or even a model doesn’t help if the signal-to-noise ratio is poor.

To make KubeVirt viable for real enterprise operations, collecting data is not enough; we need systems that can reason about it. Such systems must surgically extract the right data across time ranges, infrastructure layers, and observability sources to understand and resolve real incidents.

Enter Hawkeye: Agentic SRE for the KubeVirt Era

At Neubird, we’ve built Hawkeye—a production-grade, GenAI-powered agentic SRE system designed for Kubernetes, OpenShift, and yes, KubeVirt. Hawkeye is not just an observability overlay; it’s a reasoning engine that actively investigates and resolves incidents through a chain of thought.

Here’s how it works:

✅ Use Case 1: VM Crash or Freeze

Hawkeye receives an alert from Prometheus that a KubeVirt-managed VM is unresponsive.
It begins an iterative investigation, first checking resource pressure via kubectl top node, then digging into host-level metrics (e.g., CPU throttling, memory swapping) via Datadog or OpenTelemetry.
It queries logs in Splunk for correlated error events and examines Kubernetes events for pod eviction or node taints.
The agent surfaces the root cause: the VM is scheduled on a node under memory pressure caused by a runaway container.
It recommends (and can optionally trigger) a live migration of the VM to a healthier node using virtctl.
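
The last two steps of this flow can be sketched in Python. This is a minimal illustration under stated assumptions, not Hawkeye's actual implementation: it assumes the node object has already been fetched as JSON (for example with kubectl get node <name> -o json), and the function names, VM name, and namespace are hypothetical. The MemoryPressure node condition and the virtctl migrate command are standard Kubernetes and KubeVirt features. The sketch only builds the command; running it would be a separate, approved step.

```python
import json
from typing import List, Optional


def node_under_memory_pressure(node_json: str) -> bool:
    """Return True if the node reports the MemoryPressure condition as "True"."""
    node = json.loads(node_json)
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "MemoryPressure":
            return condition.get("status") == "True"
    return False


def plan_migration(vm_name: str, namespace: str, node_json: str) -> Optional[List[str]]:
    """Build (but do not run) a live-migration command when the host is under pressure."""
    if node_under_memory_pressure(node_json):
        return ["virtctl", "migrate", vm_name, "--namespace", namespace]
    return None


# Hypothetical node status, trimmed to the fields used above.
sample_node = json.dumps({
    "status": {"conditions": [
        {"type": "MemoryPressure", "status": "True"},
        {"type": "Ready", "status": "True"},
    ]}
})
command = plan_migration("legacy-db-vm", "prod", sample_node)
print(command)
```

Returning the command as a list rather than executing it keeps the recommendation and the action separate, which matches the "recommends (and can optionally trigger)" behavior described above.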

✅ Use Case 2: Network Connectivity Failure

A service running inside a VM suddenly becomes unreachable.
Hawkeye traces the service path—from KubeVirt network bridge to CNI plugin logs—and cross-checks against recent configuration changes using AWS Config or GitOps history.
It detects a misconfigured network policy applied via a recent Helm deployment and flags the exact commit.
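
The change-correlation step can be sketched as follows. This is a simplified illustration, assuming the configuration history has already been collected into records; the ChangeRecord type, its fields, the commit hashes, and the one-hour lookback are all hypothetical choices, not Hawkeye's data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass
class ChangeRecord:
    commit: str            # commit hash from GitOps history (hypothetical field)
    resource_kind: str     # e.g. "NetworkPolicy" or "Deployment"
    applied_at: datetime   # when the change reached the cluster


def suspect_changes(
    changes: Sequence[ChangeRecord],
    outage_start: datetime,
    lookback: timedelta = timedelta(hours=1),
    kinds: Sequence[str] = ("NetworkPolicy",),
) -> List[ChangeRecord]:
    """Return changes of the given kinds applied shortly before the outage, newest first."""
    window_start = outage_start - lookback
    hits = [
        c for c in changes
        if c.resource_kind in kinds and window_start <= c.applied_at <= outage_start
    ]
    return sorted(hits, key=lambda c: c.applied_at, reverse=True)


# Hypothetical history: only the 11:45 NetworkPolicy change falls in the window.
outage = datetime(2025, 6, 1, 12, 0)
history = [
    ChangeRecord("a1b2c3", "NetworkPolicy", datetime(2025, 6, 1, 11, 45)),
    ChangeRecord("d4e5f6", "Deployment", datetime(2025, 6, 1, 11, 50)),
    ChangeRecord("0f9e8d", "NetworkPolicy", datetime(2025, 6, 1, 8, 0)),
]
suspects = suspect_changes(history, outage)
print([c.commit for c in suspects])
```

Temporal proximity alone does not prove causation; in practice a candidate like this would still need to be checked against the policy's selectors and the affected VM's labels.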

✅ Use Case 3: High Disk I/O Latency

An alert from Datadog or Prometheus shows elevated I/O latency on a VM.
Hawkeye pulls PVC metrics and compares read/write patterns over the past 2 hours.
It inspects the host disk layer for competing workloads and maps them back to node-specific diagnostics.
Through iterative narrowing, it identifies noisy neighbors causing contention—and suggests node affinity rules or PVC migration.
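
One simple way to approximate the noisy-neighbor step is to correlate the VM's latency series with each co-located workload's I/O rate over the same window. The sketch below uses a plain Pearson correlation as a stand-in technique; it is an assumption for illustration, not a description of how Hawkeye ranks workloads, and the workload names and sample values are made up.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length and at least 2 samples")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_noisy_neighbors(
    vm_latency_ms: Sequence[float],
    neighbor_iops: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Rank co-located workloads by how closely their I/O tracks the VM's latency."""
    scores = {name: pearson(vm_latency_ms, series) for name, series in neighbor_iops.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical samples taken at the same timestamps on one node.
vm_latency = [5, 6, 30, 32, 7, 5]
neighbors = {
    "web-frontend": [300, 310, 295, 305, 300, 298],
    "batch-etl": [100, 120, 900, 950, 110, 100],
}
ranking = rank_noisy_neighbors(vm_latency, neighbors)
print(ranking[0][0])
```

A high correlation only nominates a suspect. Confirming contention would still require the host-level disk diagnostics described above.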

How Hawkeye Makes It Possible

Hawkeye combines deep telemetry access with agentic reasoning, built on the following pillars:

🔍 Surgical Data Extraction: Filters telemetry to retrieve only the relevant data across time and context, minimizing model overload.
🔁 Iterative Chain-of-Thought: Models reason step by step, refining hypotheses like an SRE would in a war room.
📡 Multi-Source Observability: Hooks into Prometheus, Splunk, Datadog, AWS CloudWatch, OpenTelemetry, and direct kubectl/virtctl access to unify structured and unstructured signals.
🛠️ Agentic Actions: Not just detection—Hawkeye suggests or performs remediation actions (restart, migrate, patch, etc.) with audit tracking.
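
To make the first pillar concrete, here is a minimal sketch of one way to narrow telemetry before it reaches a model: keep only log lines near the incident that match a few keywords, and cap the total size. The function name, the keyword list, the ten-minute window, and the character budget are all illustrative assumptions rather than Hawkeye's actual filtering logic.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple

LogRecord = Tuple[datetime, str]  # (timestamp, message)


def extract_relevant_logs(
    records: Sequence[LogRecord],
    incident_time: datetime,
    window: timedelta = timedelta(minutes=10),
    keywords: Sequence[str] = ("error", "oom", "evict"),
    max_chars: int = 2000,
) -> List[LogRecord]:
    """Keep keyword-matching records near the incident, within a size budget."""
    start, end = incident_time - window, incident_time + window
    candidates = [
        (ts, msg) for ts, msg in records
        if start <= ts <= end and any(k in msg.lower() for k in keywords)
    ]
    # Prefer records closest in time to the incident when the budget is tight.
    candidates.sort(key=lambda rec: abs(rec[0] - incident_time))
    selected: List[LogRecord] = []
    used = 0
    for ts, msg in candidates:
        if used + len(msg) > max_chars:
            break
        selected.append((ts, msg))
        used += len(msg)
    return sorted(selected)  # return in chronological order


# Hypothetical log lines around a 12:00 incident.
incident = datetime(2025, 6, 1, 12, 0)
logs = [
    (datetime(2025, 6, 1, 11, 0), "error: disk full"),
    (datetime(2025, 6, 1, 11, 55), "Memory cgroup OOM kill in pod batch-etl"),
    (datetime(2025, 6, 1, 11, 58), "GET /healthz 200"),
    (datetime(2025, 6, 1, 12, 1), "Evicting pod web-1 due to memory pressure"),
]
relevant = extract_relevant_logs(logs, incident)
for ts, msg in relevant:
    print(ts.isoformat(), msg)
```

Keyword matching is the crudest possible relevance filter; the point of the sketch is the shape of the step (window, filter, budget), not the specific heuristic.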

A New Era of Compute Needs a New Kind of SRE

If VMware was the old guard of virtualization—with an ecosystem built for the 2000s—KubeVirt represents the next generation: cloud-native, open, and extensible. But to make it viable in production, we need a modern operational brain to sit on top of the stack.

With Hawkeye, we’re making KubeVirt not just possible, but operable—by turning GenAI and telemetry into a surgical, intelligent, and agentic SRE that enterprises can trust.

Because deploying GenAI in infrastructure isn’t about who can do it first—it’s about who can do it responsibly, safely, and scalably.

Ready to see Hawkeye in action?
Drop us a note at neubird.ai and let’s talk agentic SRE for your KubeVirt stack.

Written by Vinod Jayaraman, Co-Founder
