HolmesGPT

Added May 4, 2026
Agent & Tooling
Open Source
Python | Docker | Large Language Models | Model Context Protocol | AI Agents | CLI | Agent & Tooling | Model & Inference Framework | Automation, Workflow & RPA | Protocol, API & Integration

Open-source SRE Agent for production incident investigation and root cause analysis, integrating 38+ observability data sources with 24/7 passive health checking.

HolmesGPT is an open-source SRE Agent and CNCF Sandbox project that performs root cause analysis by actively querying multiple observability data sources through an agentic loop. It ships with 38+ built-in Toolsets covering Prometheus, Loki, Elasticsearch, Kubernetes, AWS/Azure/GCP, PagerDuty, Jira, and more, and can be extended through the Model Context Protocol (MCP).

Core Capabilities

Root Cause Analysis Engine

  • Agentic Loop Reasoning: Autonomously plans query steps, iteratively invokes data source tools, and converges on root causes
  • Structured Segmented Output: Customizable segments including Alert Explanation, Conclusions and Possible Root Causes, Related Logs, with per-section prompt customization
  • Multi-LLM Support: OpenAI, Anthropic, Azure AI Foundry, AWS Bedrock, Google Gemini/Vertex AI, Ollama, OpenRouter, GitHub Models, OpenAI-Compatible, Robusta AI, and custom providers
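
The agentic loop described in the first bullet can be illustrated with a minimal sketch. This is not HolmesGPT's actual implementation: the `TOOLS` table, the `Step` record, and the `stub_planner` function (which stands in for the LLM) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical read-only tools; real toolsets would query Kubernetes, Prometheus, etc.
TOOLS: Dict[str, Callable[[str], str]] = {
    "pod_logs": lambda pod: f"{pod}: OOMKilled, memory limit exceeded",
    "pod_events": lambda pod: f"{pod}: Back-off restarting failed container",
}


@dataclass
class Step:
    tool: str
    output: str


def stub_planner(question: str, history: List[Step]) -> Optional[str]:
    """Stand-in for the LLM: pick the next unused tool, or None to stop."""
    used = {step.tool for step in history}
    for name in TOOLS:
        if name not in used:
            return name
    return None


def investigate(question: str, pod: str,
                planner: Callable[[str, List[Step]], Optional[str]] = stub_planner,
                max_steps: int = 5) -> List[Step]:
    """Plan, call a tool, record the evidence, and repeat until the planner stops."""
    history: List[Step] = []
    for _ in range(max_steps):  # hard cap so the loop always terminates
        tool = planner(question, history)
        if tool is None:
            break
        history.append(Step(tool, TOOLS[tool](pod)))
    return history
```

In a real agent the planner would be an LLM call that sees the accumulated evidence and either requests another tool or produces the final analysis; the step cap is one simple way to bound cost.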

Large-Scale Data Processing

  • Server-side filtering, JSON tree traversal, and Tool Output Transformers to keep large payloads out of the LLM context window
  • Per-tool memory limits, streaming large results to disk, and automatic output budget management to prevent OOM on petabyte-scale observability datasets
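
As a rough illustration of the ideas above, the sketch below selects a subtree from a JSON tool response and then enforces a character budget before the result would reach the model. The function names (`select_path`, `fit_to_budget`, `transform_tool_output`) and the dotted-path syntax are assumptions made for this example, not HolmesGPT's transformer API.

```python
import json
from typing import Any


def select_path(doc: Any, path: str) -> Any:
    """Walk a dotted path such as 'items.0.status' through dicts and lists."""
    node = doc
    for key in path.split("."):
        node = node[int(key)] if isinstance(node, list) else node[key]
    return node


def fit_to_budget(text: str, max_chars: int) -> str:
    """Keep the first max_chars characters and append a note on what was dropped."""
    if len(text) <= max_chars:
        return text
    dropped = len(text) - max_chars
    return text[:max_chars] + f"\n[truncated {dropped} chars]"


def transform_tool_output(raw_json: str, path: str, max_chars: int = 2000) -> str:
    """Reduce a large JSON payload to one subtree, then apply the output budget."""
    subtree = select_path(json.loads(raw_json), path)
    return fit_to_budget(json.dumps(subtree, sort_keys=True), max_chars)
```

Filtering before truncating matters: cutting a raw payload at a fixed length tends to discard the relevant field, whereas selecting the subtree first keeps the budget for the data the investigation needs.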

Operator Mode (24/7 Passive Health Checking)

  • Runs as a Kubernetes Operator for continuous background health checks
  • Deployment Verification: Co-deployed with business Pods to auto-verify health after new version releases
  • Regression Capture: Continuous monitoring to catch issues before users notice
  • Automated Remediation Chain: Issue detected → Slack notification → GitHub PR auto-created for fixes
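
The detect-then-notify chain can be sketched as a single health-check pass that triggers follow-up actions for each failure. This is a simplified, hypothetical model: `run_health_checks` and its arguments are invented for illustration, and a real Operator would run such checks on a schedule inside the cluster rather than once.

```python
from typing import Callable, Dict, List


def run_health_checks(checks: Dict[str, Callable[[], bool]],
                      on_failure: List[Callable[[str], None]]) -> List[str]:
    """Run each named check once; for every failure, invoke the actions in order."""
    failed: List[str] = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False  # a check that crashes is treated as unhealthy
        if not healthy:
            failed.append(name)
            for action in on_failure:  # e.g. notify a channel, then open a PR
                action(name)
    return failed
```

Passing the actions as an ordered list mirrors the chain in the last bullet: each failure flows through notification first and remediation second.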

Bidirectional Alert Integration

  • Ingest: Pulls alerts from AlertManager / PagerDuty / OpsGenie / Jira to trigger investigations
  • Write-back: Posts analysis results back to source systems or pushes to Slack / Microsoft Teams
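
The ingest and write-back flow can be sketched as a small interface with two methods. The `AlertSource` protocol, the `InMemorySource` stand-in, and `process_alerts` are hypothetical names for this example; they do not reflect HolmesGPT's actual integration classes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Protocol


@dataclass
class Alert:
    id: str
    title: str


class AlertSource(Protocol):
    """Anything that can hand out alerts and accept an analysis in return."""
    def fetch_alerts(self) -> Iterable[Alert]: ...
    def write_back(self, alert_id: str, analysis: str) -> None: ...


class InMemorySource:
    """Stand-in for AlertManager, PagerDuty, and similar systems."""
    def __init__(self, alerts: Iterable[Alert]) -> None:
        self.alerts = list(alerts)
        self.comments: Dict[str, str] = {}

    def fetch_alerts(self) -> List[Alert]:
        return list(self.alerts)

    def write_back(self, alert_id: str, analysis: str) -> None:
        self.comments[alert_id] = analysis


def process_alerts(source: AlertSource,
                   investigate: Callable[[Alert], str]) -> List[str]:
    """Ingest every alert, investigate it, and post the result to the source."""
    handled: List[str] = []
    for alert in source.fetch_alerts():
        source.write_back(alert.id, investigate(alert))
        handled.append(alert.id)
    return handled
```

Because the loop depends only on the two-method interface, the same code path could serve any alerting system that implements it, while a chat destination would only need the write-back half.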

Data Source Coverage

Category                  Representative Sources
Metrics/Alerts            Prometheus, AlertManager, VictoriaMetrics
Logs                      Loki, Elasticsearch/OpenSearch, Splunk, Coralogix, VictoriaLogs
Distributed Tracing       Tempo, Datadog, NewRelic
Container Orchestration   Kubernetes, OpenShift, ArgoCD, Helm, Docker
Cloud Platforms (MCP)     AWS, Azure, GCP
Databases                 PostgreSQL, MySQL, ClickHouse, MariaDB, SQL Server, MongoDB
Message Queues            Kafka, RabbitMQ
Alerting/Ticketing        PagerDuty, OpsGenie, Jira, ServiceNow, Sentry, Zabbix
Knowledge/Docs            Confluence, Notion, Slab, Internet
CI/CD                     GitHub, Jenkins (MCP)
Other                     Crossplane, Robusta, Prefect (MCP), Kubernetes Remediation (MCP), Cilium, KubeVela

Architecture Highlights

  • Module Structure: holmes/ (main Agent logic and Toolset management), holmes_operator/ (Kubernetes Operator implementation), experimental/ag-ui/ (experimental Web UI), helm/ (Helm Chart)
  • Toolset Abstraction Layer: Each data source implements a group of tools that are registered uniformly with the Agent; some are integrated externally via MCP
  • Unified LLM Interface: Multiple providers are abstracted behind a single calling interface via an OpenAI-compatible protocol
  • Security Design: Read-only defaults, RBAC compliance, designed for safe production use
  • Dependency Management: Poetry (pyproject.toml, poetry.lock), Pre-commit hooks
  • Benchmarking: 150+ test scenarios comparing different LLMs on root cause analysis tasks
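
The Toolset Abstraction Layer bullet describes grouping tools per data source and registering them uniformly. A minimal sketch of that pattern follows; `Tool`, `Toolset`, and `ToolRegistry` are hypothetical classes written for this example and should not be read as the classes in the holmes/ package.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[..., str]


@dataclass
class Toolset:
    """A named group of tools belonging to one data source."""
    name: str
    tools: List[Tool] = field(default_factory=list)


class ToolRegistry:
    """Flat lookup of every tool, keyed as '<toolset>.<tool>'."""
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, toolset: Toolset) -> None:
        for tool in toolset.tools:
            key = f"{toolset.name}.{tool.name}"
            if key in self._tools:
                raise ValueError(f"duplicate tool: {key}")
            self._tools[key] = tool

    def names(self) -> List[str]:
        return sorted(self._tools)

    def call(self, key: str, **kwargs: str) -> str:
        return self._tools[key].run(**kwargs)
```

With this shape, the agent only needs the registry: built-in toolsets and externally provided ones can be registered the same way, and the tool names and descriptions are what the model would be shown when choosing its next step.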

Installation & Usage

CLI Installation (Homebrew):

brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt

Pipx Installation:

pipx install holmesgpt

Quick Start:

export ANTHROPIC_API_KEY="your-api-key"
holmes ask "what is wrong with the user-profile-import pod?" --model="anthropic/claude-sonnet-4-5-20250929"
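
For scripting, the same command can be driven from Python through a subprocess. The sketch below uses only the `ask` subcommand and `--model` flag shown in the Quick Start; the helper names `build_ask_command` and `run_ask` are invented for this example, and it assumes `holmes` is on the PATH with the API key already exported.

```python
import subprocess
from typing import List


def build_ask_command(question: str, model: str) -> List[str]:
    """Build the argv for `holmes ask`, using only the flags shown in the Quick Start."""
    return ["holmes", "ask", question, f"--model={model}"]


def run_ask(question: str, model: str) -> str:
    """Run the CLI and return its stdout; raises CalledProcessError on a non-zero exit."""
    result = subprocess.run(build_ask_command(question, model),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the question contains quotes or other special characters.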

Other Deployment Modes: Helm Chart (K8s Operator), Docker Compose (HTTP Server), Slack/Teams Bot (via Robusta), Backstage plugin, K9s plugin, Python SDK

Typical Scenarios

  • Production Incident Investigation: Ingest alerts (e.g., CrashLoopBackOff), auto-query K8s logs/events/Prometheus metrics, output root cause analysis and remediation suggestions
  • CI/CD Troubleshooting: Analyze build logs and deployment status to locate pipeline failures
  • Prometheus Alert Investigation: Auto-query related metrics, generate PromQL charts, explain alert semantics
  • Passive Health Checking: Operator mode runs scheduled health checks to catch issues before users notice
  • Deployment Verification: Auto-verify service health after new version releases, with rollback or notification on anomalies

Created by Robusta.Dev with significant contributions from Microsoft. Licensed under Apache-2.0. Latest release: v0.26.0.
