HolmesGPT

Open-source SRE Agent for production incident investigation and root cause analysis, integrating 38+ observability data sources with 24/7 passive health checking.

HolmesGPT is an open-source SRE Agent and CNCF Sandbox project that performs root cause analysis by actively querying multiple observability data sources through an agentic loop. It ships with 38+ built-in Toolsets covering Prometheus, Loki, Elasticsearch, Kubernetes, AWS/Azure/GCP, PagerDuty, Jira, and more, and can be extended through the Model Context Protocol (MCP).

Core Capabilities

Root Cause Analysis Engine

Agentic Loop Reasoning: Autonomously plans query steps, iteratively invokes data source tools, and converges on root causes
Structured Segmented Output: Customizable output sections such as "Alert Explanation", "Conclusions and Possible Root Causes", and "Related Logs", each with its own customizable prompt
Multi-LLM Support: OpenAI, Anthropic, Azure AI Foundry, AWS Bedrock, Google Gemini/Vertex AI, Ollama, OpenRouter, GitHub Models, OpenAI-Compatible, Robusta AI, and custom providers
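
The agentic loop described above can be illustrated with a minimal sketch. This is not HolmesGPT's implementation: `Step`, `run_agentic_loop`, `scripted_planner`, and the fake tools are hypothetical names invented for the example, and the planner is a scripted stand-in for the LLM call that would normally choose the next tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration only; not HolmesGPT's real API.
ToolFn = Callable[[str], str]
Observation = Tuple[str, str]  # (tool name, tool output)


@dataclass
class Step:
    tool: Optional[str]   # tool to call next, or None when the planner is done
    argument: str = ""    # argument passed to the tool
    conclusion: str = ""  # final answer, used only when tool is None


def run_agentic_loop(
    question: str,
    tools: Dict[str, ToolFn],
    plan_next_step: Callable[[str, List[Observation]], Step],
    max_steps: int = 5,
) -> str:
    """Ask the planner for a step, run the tool, feed the result back, repeat."""
    observations: List[Observation] = []
    for _ in range(max_steps):
        step = plan_next_step(question, observations)
        if step.tool is None:
            return step.conclusion  # the planner has converged on an answer
        if step.tool not in tools:
            observations.append((step.tool, "error: unknown tool"))
            continue
        observations.append((step.tool, tools[step.tool](step.argument)))
    return "inconclusive: step budget exhausted"


def scripted_planner(question: str, observations: List[Observation]) -> Step:
    """Stand-in for an LLM: check pod status, then logs, then conclude."""
    if len(observations) == 0:
        return Step(tool="pod_status", argument="user-profile-import")
    if len(observations) == 1:
        return Step(tool="pod_logs", argument="user-profile-import")
    status, logs = observations[0][1], observations[1][1]
    return Step(tool=None, conclusion=f"{status}; last log line: {logs}")


# Fake tools returning canned strings; real tools would query a data source.
FAKE_TOOLS: Dict[str, ToolFn] = {
    "pod_status": lambda name: f"{name} is in CrashLoopBackOff",
    "pod_logs": lambda name: "OOMKilled: memory limit exceeded",
}

result = run_agentic_loop(
    "what is wrong with the user-profile-import pod?", FAKE_TOOLS, scripted_planner
)
print(result)
```

The `max_steps` cap is a design choice of the sketch: it guarantees termination even if the planner never converges.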

Large-Scale Data Processing

Server-side filtering, JSON tree traversal, and Tool Output Transformers to keep large payloads out of the LLM context window
Per-tool memory limits, streaming large results to disk, and automatic output budget management to prevent OOM on petabyte-scale observability datasets
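
Two of the ideas above, JSON tree traversal and an output budget, can be sketched in a few lines. The function names `summarize_json_tree` and `apply_output_budget` and their parameters are assumptions made for this example; they do not describe HolmesGPT's actual transformers.

```python
import json
from typing import Any


def summarize_json_tree(value: Any, max_depth: int = 2, max_items: int = 3) -> Any:
    """Walk a JSON value and replace deep or long parts with short placeholders."""
    if isinstance(value, dict):
        if max_depth == 0:
            return f"<object with {len(value)} keys>"
        return {
            key: summarize_json_tree(child, max_depth - 1, max_items)
            for key, child in value.items()
        }
    if isinstance(value, list):
        if max_depth == 0:
            return f"<array with {len(value)} items>"
        head = [
            summarize_json_tree(child, max_depth - 1, max_items)
            for child in value[:max_items]
        ]
        if len(value) > max_items:
            head.append(f"<{len(value) - max_items} more items>")
        return head
    return value  # scalars pass through unchanged


def apply_output_budget(text: str, max_chars: int) -> str:
    """Truncate tool output to a character budget and say how much was dropped."""
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return text[:max_chars] + f"\n[truncated {omitted} characters]"


# A made-up payload shaped loosely like a list of pods.
payload = {
    "items": [
        {"name": f"pod-{i}", "status": {"phase": "Running", "restarts": i}}
        for i in range(10)
    ]
}
summary = summarize_json_tree(payload, max_depth=3, max_items=2)
print(apply_output_budget(json.dumps(summary), max_chars=200))
```

Summarizing the tree before applying the budget means the model sees the overall shape of the payload rather than an arbitrary prefix of it.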

Operator Mode (24/7 Passive Health Checking)

Runs as a Kubernetes Operator for continuous background health checks
Deployment Verification: Deployed alongside application Pods to automatically verify health after each new release
Regression Capture: Continuous monitoring to catch issues before users notice
Automated Remediation Chain: Issue detected → Slack notification → GitHub PR auto-created for fixes
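
The detect, notify, and open-PR chain can be pictured as a list of steps applied to every failed check. The sketch below is only an illustration under that assumption: `CheckResult`, `run_health_checks`, `notify_slack`, and `open_fix_pr` are hypothetical, and the two steps return strings instead of calling Slack or GitHub. A real operator would run such a pass on a schedule; the sketch runs it once.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    name: str
    healthy: bool
    detail: str = ""


Check = Callable[[], CheckResult]
RemediationStep = Callable[[CheckResult], str]


def run_health_checks(checks: List[Check], on_failure: List[RemediationStep]) -> List[str]:
    """Run every check once; send each failure through all steps, in order."""
    actions: List[str] = []
    for check in checks:
        result = check()
        if result.healthy:
            continue
        for step in on_failure:
            actions.append(step(result))
    return actions


# Stand-ins for real integrations: they only describe what would happen.
def notify_slack(result: CheckResult) -> str:
    return f"slack: {result.name} failed ({result.detail})"


def open_fix_pr(result: CheckResult) -> str:
    return f"pr: proposed fix for {result.name}"


checks: List[Check] = [
    lambda: CheckResult("api-latency", healthy=True),
    lambda: CheckResult("checkout-deploy", healthy=False, detail="readiness probe failing"),
]
actions = run_health_checks(checks, [notify_slack, open_fix_pr])
for line in actions:
    print(line)
```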

Bidirectional Alert Integration

Ingest: Pulls alerts from AlertManager / PagerDuty / OpsGenie / Jira to trigger investigations
Write-back: Posts analysis results back to source systems or pushes to Slack / Microsoft Teams
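
One way to picture the ingest and write-back halves is a small adapter interface that each alert source implements. The `AlertSource` protocol, the `InMemorySource` class, and `investigate_all` below are hypothetical and exist only for this example; they are not HolmesGPT's integration API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


@dataclass
class Alert:
    id: str
    title: str


class AlertSource(Protocol):
    """Hypothetical adapter: anything that can list alerts and accept results."""

    def fetch_alerts(self) -> List[Alert]: ...

    def write_back(self, alert_id: str, analysis: str) -> None: ...


@dataclass
class InMemorySource:
    """Test double standing in for AlertManager, PagerDuty, and similar systems."""

    alerts: List[Alert]
    comments: Dict[str, str] = field(default_factory=dict)

    def fetch_alerts(self) -> List[Alert]:
        return list(self.alerts)

    def write_back(self, alert_id: str, analysis: str) -> None:
        self.comments[alert_id] = analysis


def investigate_all(source: AlertSource, investigate: Callable[[Alert], str]) -> int:
    """Ingest each alert, run the investigation, and post the result back."""
    handled = 0
    for alert in source.fetch_alerts():
        source.write_back(alert.id, investigate(alert))
        handled += 1
    return handled


source = InMemorySource([Alert("a1", "KubePodCrashLooping"), Alert("a2", "HighErrorRate")])
count = investigate_all(source, lambda alert: f"analysis of {alert.title}")
print(count, source.comments)
```

Keeping ingest and write-back on the same adapter makes it straightforward to return the analysis to the system the alert came from.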

Data Source Coverage

Category	Representative Sources
Metrics/Alerts	Prometheus, AlertManager, VictoriaMetrics
Logs	Loki, Elasticsearch/OpenSearch, Splunk, Coralogix, VictoriaLogs
Distributed Tracing	Tempo, Datadog, NewRelic
Container Orchestration	Kubernetes, OpenShift, ArgoCD, Helm, Docker
Cloud Platforms (MCP)	AWS, Azure, GCP
Databases	PostgreSQL, MySQL, ClickHouse, MariaDB, SQL Server, MongoDB
Message Queues	Kafka, RabbitMQ
Alerting/Ticketing	PagerDuty, OpsGenie, Jira, ServiceNow, Sentry, Zabbix
Knowledge/Docs	Confluence, Notion, Slab, Internet
CI/CD	GitHub, Jenkins (MCP)
Other	Crossplane, Robusta, Prefect (MCP), Kubernetes Remediation (MCP), Cilium, KubeVela

Architecture Highlights

Module Structure: holmes/ (main Agent logic and Toolset management), holmes_operator/ (Kubernetes Operator implementation), experimental/ag-ui/ (experimental Web UI), helm/ (Helm Chart)
Toolset Abstraction Layer: Each data source implements a group of tools that are registered uniformly with the Agent; some sources are integrated externally via MCP
Unified LLM Interface: Multiple providers are abstracted behind a single calling interface via an OpenAI-compatible protocol
Security Design: Read-only defaults, RBAC compliance, designed for safe production use
Dependency Management: Poetry (pyproject.toml, poetry.lock), Pre-commit hooks
Benchmarking: 150+ test scenarios comparing different LLMs on root cause analysis tasks
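
The Toolset abstraction and the read-only default can be sketched together as a small registry. Everything here (`Tool`, `Toolset`, `ToolRegistry`, the `allow_writes` flag, and the `toolset.tool` naming scheme) is a hypothetical illustration of the idea, not the project's actual classes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[..., str]
    read_only: bool = True  # assumption: tools are read-only unless marked otherwise


@dataclass
class Toolset:
    name: str
    tools: List[Tool]


class ToolRegistry:
    """Registers tools from every toolset under a uniform 'toolset.tool' key."""

    def __init__(self, allow_writes: bool = False) -> None:
        self._tools: Dict[str, Tool] = {}
        self._allow_writes = allow_writes

    def register(self, toolset: Toolset) -> None:
        for tool in toolset.tools:
            if not tool.read_only and not self._allow_writes:
                continue  # read-only default: mutating tools are not exposed
            key = f"{toolset.name}.{tool.name}"
            if key in self._tools:
                raise ValueError(f"duplicate tool: {key}")
            self._tools[key] = tool

    def names(self) -> List[str]:
        return sorted(self._tools)

    def call(self, key: str, **kwargs: str) -> str:
        return self._tools[key].run(**kwargs)


# A made-up toolset with one read-only tool and one mutating tool.
kubernetes = Toolset(
    "kubernetes",
    [
        Tool("get_events", lambda namespace: f"events in {namespace}"),
        Tool("delete_pod", lambda name: f"deleted {name}", read_only=False),
    ],
)

registry = ToolRegistry()
registry.register(kubernetes)
print(registry.names())
print(registry.call("kubernetes.get_events", namespace="default"))
```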

Installation & Usage

CLI Installation (Homebrew):

brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt

Pipx Installation:

pipx install holmesgpt

Quick Start:

export ANTHROPIC_API_KEY="your-api-key"
holmes ask "what is wrong with the user-profile-import pod?" --model="anthropic/claude-sonnet-4-5-20250929"
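
For scripting, the same invocation can be assembled from Python. This sketch uses only the `holmes ask` subcommand and `--model` option shown above; the helper names `build_holmes_command` and `run_holmes` are hypothetical, and the API key is assumed to be set in the environment already.

```python
import shlex
import shutil
import subprocess
from typing import List


def build_holmes_command(question: str, model: str) -> List[str]:
    """Build the argv list for the quick-start command shown above."""
    return ["holmes", "ask", question, f"--model={model}"]


def run_holmes(question: str, model: str) -> str:
    """Run the CLI and return its stdout; requires `holmes` on PATH."""
    if shutil.which("holmes") is None:
        raise FileNotFoundError("holmes CLI not found on PATH")
    completed = subprocess.run(
        build_holmes_command(question, model),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


command = build_holmes_command(
    "what is wrong with the user-profile-import pod?",
    "anthropic/claude-sonnet-4-5-20250929",
)
# Print the shell-quoted command instead of running it.
print(shlex.join(command))
```

Passing the arguments as a list avoids shell-quoting problems when the question contains spaces or quotes.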

Other Deployment Modes: Helm Chart (K8s Operator), Docker Compose (HTTP Server), Slack/Teams Bot (via Robusta), Backstage plugin, K9s plugin, Python SDK

Typical Scenarios

Production Incident Investigation: Ingest alerts (e.g., CrashLoopBackOff), auto-query K8s logs/events/Prometheus metrics, output root cause analysis and remediation suggestions
CI/CD Troubleshooting: Analyze build logs and deployment status to locate pipeline failures
Prometheus Alert Investigation: Auto-query related metrics, generate PromQL charts, explain alert semantics
Passive Health Checking: Operator mode runs scheduled health checks to catch issues before users notice
Deployment Verification: Auto-verify service health after new version releases, with rollback or notification on anomalies

Created by Robusta.Dev with significant contributions from Microsoft. Licensed under Apache-2.0; latest release v0.26.0.
