Open-source SRE Agent for production incident investigation and root cause analysis, integrating 38+ observability data sources with 24/7 passive health checking.
HolmesGPT is a CNCF Sandbox open-source SRE Agent that performs root cause analysis by actively querying multiple observability data sources through an agentic loop. It ships with 38+ built-in Toolsets covering Prometheus, Loki, Elasticsearch, Kubernetes, AWS/Azure/GCP, PagerDuty, Jira, and more, and can be extended through the Model Context Protocol (MCP).
Core Capabilities
Root Cause Analysis Engine
- Agentic Loop Reasoning: Autonomously plans query steps, iteratively invokes data source tools, and converges on root causes
- Structured Segmented Output: Customizable segments including Alert Explanation, Conclusions and Possible Root Causes, Related Logs, with per-section prompt customization
- Multi-LLM Support: OpenAI, Anthropic, Azure AI Foundry, AWS Bedrock, Google Gemini/Vertex AI, Ollama, OpenRouter, GitHub Models, OpenAI-Compatible, Robusta AI, and custom providers
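The agentic loop described above can be sketched in a few lines of Python. This is a minimal illustration, not HolmesGPT's actual implementation: `run_agentic_loop`, `scripted_planner`, and the two fake tools are hypothetical names, and the planner is a hard-coded stand-in for an LLM call.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; not HolmesGPT's real classes.
Step = Tuple[str, str, str]      # (tool name, argument, observation)
Action = Tuple[str, str, str]    # ("call", tool, arg) or ("answer", text, "")


def run_agentic_loop(
    question: str,
    tools: Dict[str, Callable[[str], str]],
    planner: Callable[[str, List[Step]], Action],
    max_steps: int = 5,
) -> Tuple[str, List[Step]]:
    """Ask the planner for the next action until it answers or the budget runs out."""
    history: List[Step] = []
    for _ in range(max_steps):
        action, name, arg = planner(question, history)
        if action == "answer":
            return name, history
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        history.append((name, arg, observation))
    return "no conclusion within step budget", history


def scripted_planner(question: str, history: List[Step]) -> Action:
    """Stand-in for an LLM: chooses the next step from what has been observed."""
    if not history:
        return ("call", "pod_events", "user-profile-import")
    if len(history) == 1 and "OOMKilled" in history[0][2]:
        return ("call", "pod_logs", "user-profile-import")
    return ("answer", "Likely root cause: container exceeds its memory limit", "")


# Fake tools returning canned strings; a real toolset would query a data source.
fake_tools = {
    "pod_events": lambda pod: f"{pod}: last state OOMKilled",
    "pod_logs": lambda pod: f"{pod}: loading large import file into memory",
}

answer, steps = run_agentic_loop(
    "what is wrong with the user-profile-import pod?", fake_tools, scripted_planner
)
print(answer)
for tool_name, arg, observation in steps:
    print(f"  {tool_name}({arg}) -> {observation}")
```

The `max_steps` bound is a design choice in this sketch: it keeps a planner that never converges from looping indefinitely.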
Large-Scale Data Processing
- Server-side filtering, JSON tree traversal, and Tool Output Transformers to keep large payloads out of the LLM context window
- Per-tool memory limits, streaming large results to disk, and automatic output budget management to prevent OOM on petabyte-scale observability datasets
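To make the idea of keeping large payloads out of the context window concrete, the sketch below combines a small JSON tree traversal with a character budget. Both helpers (`select_path`, `fit_to_budget`) and the sample payload are assumptions made for illustration; they are not HolmesGPT's Tool Output Transformer API.

```python
import json
from typing import Any, List


def select_path(doc: Any, path: List[str]) -> Any:
    """Walk a parsed JSON tree; a '*' segment maps over every item of a list."""
    if not path:
        return doc
    head, rest = path[0], path[1:]
    if head == "*":
        return [select_path(item, rest) for item in doc]
    return select_path(doc[head], rest)


def fit_to_budget(text: str, max_chars: int) -> str:
    """Cut text to max_chars and say how much was dropped, so the model knows."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + f"... [truncated {len(text) - max_chars} chars]"


# Made-up payload shaped like a pod listing with bulky fields the model rarely needs.
pods = {
    "items": [
        {"metadata": {"name": "api-0", "managedFields": ["x"] * 500},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "api-1", "managedFields": ["x"] * 500},
         "status": {"phase": "Pending"}},
    ]
}

names = select_path(pods, ["items", "*", "metadata", "name"])
phases = select_path(pods, ["items", "*", "status", "phase"])
summary = json.dumps(dict(zip(names, phases)))

print(len(json.dumps(pods)), "chars raw vs", len(summary), "chars selected")
print(fit_to_budget(summary, 30))
```

Selecting fields before truncating matters: truncation alone may cut off exactly the part of the payload that explains the failure.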
Operator Mode (24/7 Passive Health Checking)
- Runs as a Kubernetes Operator for continuous background health checks
- Deployment Verification: Deployed alongside application Pods to verify health automatically after each new release
- Regression Capture: Continuous monitoring to catch issues before users notice
- Automated Remediation Chain: Issue detected → Slack notification → GitHub PR auto-created for fixes
Bidirectional Alert Integration
- Ingest: Pulls alerts from AlertManager / PagerDuty / OpsGenie / Jira to trigger investigations
- Write-back: Posts analysis results back to source systems or pushes to Slack / Microsoft Teams
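The ingest and write-back directions can be illustrated with a small sketch. It assumes an AlertManager webhook payload (an `alerts` list whose entries carry `status` and `labels`) and a Slack incoming-webhook body of the form `{"text": ...}`; the `Investigation` class and both functions are hypothetical and do not mirror HolmesGPT's internal code. Nothing is sent over the network here.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Investigation:
    """Hypothetical record linking an alert to the question put to the agent."""
    alert_name: str
    question: str
    labels: Dict[str, str]


def ingest_alertmanager(payload: Dict[str, Any]) -> List[Investigation]:
    """Turn each firing alert in a webhook payload into an investigation request."""
    investigations = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no investigation
        labels = alert.get("labels", {})
        name = labels.get("alertname", "unknown")
        target = labels.get("pod") or labels.get("instance") or "unknown target"
        investigations.append(
            Investigation(name, f"Why is {name} firing for {target}?", labels)
        )
    return investigations


def slack_message(inv: Investigation, analysis: str) -> Dict[str, str]:
    """Build the JSON body for a Slack incoming webhook (not posted here)."""
    return {"text": f"*{inv.alert_name}*\n{analysis}"}


# Made-up sample payload for the sketch.
sample = {
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "KubePodCrashLooping", "pod": "user-profile-import"}},
        {"status": "resolved",
         "labels": {"alertname": "HighLatency", "instance": "api:8080"}},
    ],
}

pending = ingest_alertmanager(sample)
print(pending[0].question)
print(slack_message(pending[0], "analysis text produced by the agent goes here"))
```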
Data Source Coverage
| Category | Representative Sources |
|---|---|
| Metrics/Alerts | Prometheus, AlertManager, VictoriaMetrics |
| Logs | Loki, Elasticsearch/OpenSearch, Splunk, Coralogix, VictoriaLogs |
| Distributed Tracing | Tempo, Datadog, New Relic |
| Container Orchestration | Kubernetes, OpenShift, ArgoCD, Helm, Docker |
| Cloud Platforms (MCP) | AWS, Azure, GCP |
| Databases | PostgreSQL, MySQL, ClickHouse, MariaDB, SQL Server, MongoDB |
| Message Queues | Kafka, RabbitMQ |
| Alerting/Ticketing | PagerDuty, OpsGenie, Jira, ServiceNow, Sentry, Zabbix |
| Knowledge/Docs | Confluence, Notion, Slab, Internet |
| CI/CD | GitHub, Jenkins (MCP) |
| Other | Crossplane, Robusta, Prefect (MCP), Kubernetes Remediation (MCP), Cilium, KubeVela |
Architecture Highlights
- Module Structure: holmes/ (main Agent logic and Toolset management), holmes_operator/ (Kubernetes Operator implementation), experimental/ag-ui/ (experimental Web UI), helm/ (Helm Chart)
- Toolset Abstraction Layer: Each data source implements a group of tools, uniformly registered with the Agent; some are integrated externally via MCP
- Unified LLM Interface: Multiple providers abstracted into a unified calling interface via OpenAI-Compatible protocol
- Security Design: Read-only defaults, RBAC compliance, designed for safe production use
- Dependency Management: Poetry (pyproject.toml, poetry.lock), Pre-commit hooks
- Benchmarking: 150+ test scenarios comparing different LLMs on root cause analysis tasks
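One way to picture the Toolset abstraction and the read-only default from the list above is a small registry. The classes below (`Tool`, `Toolset`, `ToolRegistry`) and the example tools are hypothetical and written only for this sketch; HolmesGPT's real classes and configuration may look quite different.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[..., str]
    read_only: bool = True  # tools are assumed read-only unless marked otherwise


@dataclass
class Toolset:
    name: str
    tools: List[Tool] = field(default_factory=list)


class ToolRegistry:
    """Single place where every toolset's tools are registered for the agent."""

    def __init__(self, allow_writes: bool = False) -> None:
        self._tools: Dict[str, Tool] = {}
        self._allow_writes = allow_writes

    def register(self, toolset: Toolset) -> None:
        for tool in toolset.tools:
            if not tool.read_only and not self._allow_writes:
                continue  # mutating tools are skipped by default
            key = f"{toolset.name}.{tool.name}"
            if key in self._tools:
                raise ValueError(f"duplicate tool: {key}")
            self._tools[key] = tool

    def names(self) -> List[str]:
        return sorted(self._tools)

    def call(self, key: str, **kwargs: str) -> str:
        return self._tools[key].run(**kwargs)


# Example toolset with canned behaviour; real tools would talk to a cluster.
kubernetes = Toolset("kubernetes", [
    Tool("get_pod", "Describe a pod", lambda pod: f"pod {pod}: Running"),
    Tool("delete_pod", "Delete a pod", lambda pod: f"deleted {pod}", read_only=False),
])

registry = ToolRegistry()
registry.register(kubernetes)
print(registry.names())
print(registry.call("kubernetes.get_pod", pod="api-0"))
```

Namespacing each key with the toolset name keeps tools from different data sources from colliding when they share a short name.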
Installation & Usage
CLI Installation (Homebrew):
brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt
Pipx Installation:
pipx install holmesgpt
Quick Start:
export ANTHROPIC_API_KEY="your-api-key"
holmes ask "what is wrong with the user-profile-import pod?" --model="anthropic/claude-sonnet-4-5-20250929"
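For scripting, the Quick Start command can be wrapped from Python. The sketch below uses only the `holmes ask` subcommand and `--model` flag shown above; the helper names `build_ask_command` and `run_ask` are hypothetical, and `run_ask` assumes the CLI is installed and the provider API key is already exported in the environment.

```python
import shlex
import subprocess
from typing import List


def build_ask_command(question: str, model: str) -> List[str]:
    """Build the argv list for `holmes ask`, mirroring the Quick Start command."""
    return ["holmes", "ask", question, f"--model={model}"]


def run_ask(question: str, model: str, timeout_s: float = 300.0) -> str:
    """Run the CLI and return its stdout; raises if the command fails or times out."""
    result = subprocess.run(
        build_ask_command(question, model),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout


cmd = build_ask_command(
    "what is wrong with the user-profile-import pod?",
    "anthropic/claude-sonnet-4-5-20250929",
)
# Show the shell-quoted form without executing anything.
print(shlex.join(cmd))
```

Passing the question as a single list element avoids shell quoting problems when it contains spaces or punctuation.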
Other Deployment Modes: Helm Chart (K8s Operator), Docker Compose (HTTP Server), Slack/Teams Bot (via Robusta), Backstage plugin, K9s plugin, Python SDK
Typical Scenarios
- Production Incident Investigation: Ingest alerts (e.g., CrashLoopBackOff), auto-query K8s logs/events/Prometheus metrics, output root cause analysis and remediation suggestions
- CI/CD Troubleshooting: Analyze build logs and deployment status to locate pipeline failures
- Prometheus Alert Investigation: Auto-query related metrics, generate PromQL charts, explain alert semantics
- Passive Health Checking: Operator mode runs scheduled health checks to catch issues before users notice
- Deployment Verification: Auto-verify service health after new version releases, with rollback or notification on anomalies
Created by Robusta.Dev with significant contributions from Microsoft, licensed under Apache-2.0, latest release v0.26.0.