ClawProBench

Added May 3, 2026 · Agent & Tooling · Open Source

Tags: Python, Large Language Models, AI Agents, CLI, Agent & Tooling, Model & Inference Framework, Education & Research Resources

Transparent live-first benchmark harness for evaluating LLM Agent capability inside the OpenClaw runtime, with deterministic scoring and multi-dimensional assessment.

ClawProBench is maintained by suyoumo. The current version is v1.1.2, released under the Apache-2.0 license.

Core Features

  • Live Execution: All evaluations execute in the local OpenClaw runtime—no simulation
  • Deterministic Scoring: Scenario-level deterministic automated scoring via the custom_checks module with pluggable extension
  • Multi-Trial Reliability: Supports --trials N; core leaderboard metric is pass^3 (3-trial full pass rate)
  • Composite Score: FinalScore = 100 × S^0.40 × r_all^0.45 × r_any^0.15, where S is the average_score, r_all the pass^3 rate, and r_any the pass@3 rate (see the sketch after this list)
  • Checkpoint Resume: Supports --continue, --resume-from, --rerun-execution-failures
  • Isolated Execution: Same-machine isolation via --openclaw-profile, --openclaw-state-dir, etc.
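
As a concrete illustration of how these metrics combine, the sketch below computes pass^3, pass@3, and FinalScore from per-scenario trial scores. It is a minimal sketch, not ClawProBench's actual code: the function name, the 0–1 score normalization behind S, and the pass threshold are all assumptions.

# Illustrative only: combining average_score, pass^3, and pass@3 into
# FinalScore per the formula above. Names and normalization are assumed.
def final_score(trial_scores: list[list[float]], pass_threshold: float = 1.0) -> float:
    """trial_scores[i][j] = score of scenario i on trial j, assumed in [0, 1]."""
    n = len(trial_scores)
    # S: mean per-scenario average score
    s = sum(sum(t) / len(t) for t in trial_scores) / n
    # r_all: pass^3 -- fraction of scenarios where every trial passes
    r_all = sum(all(x >= pass_threshold for x in t) for t in trial_scores) / n
    # r_any: pass@3 -- fraction of scenarios where at least one trial passes
    r_any = sum(any(x >= pass_threshold for x in t) for t in trial_scores) / n
    return 100 * (s ** 0.40) * (r_all ** 0.45) * (r_any ** 0.15)

# Three scenarios, three trials each -> about 53.5
print(final_score([[1.0, 1.0, 1.0], [1.0, 0.5, 1.0], [0.0, 0.0, 1.0]]))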

Assessment Dimensions & Profiles

Eight dimensions: Capability, Efficiency, Planning, Safety, Tool Use, Constraints, Error Recovery, and Synthesis.

Five profiles:

  • core: 26 scenarios
  • intelligence: 95 scenarios
  • coverage: 7 scenarios
  • native: 36 scenarios
  • full: 102 scenarios

Architecture

ClawProBench/
├── run.py                 # CLI entry (inventory / dry / run / compare)
├── harness/               # Core evaluation engine
│   ├── loader             # Scenario loader
│   ├── runner             # Runner
│   ├── scoring            # Scoring logic
│   ├── reporting          # Report generation
│   └── live OpenClaw bridge  # OpenClaw runtime bridge
├── scenarios/             # YAML scenario definitions
├── datasets/              # Seed data & setup/teardown scripts
├── custom_checks/         # Scenario-specific scoring
├── frameworks/            # Framework adapters
├── mock_tools/            # Mock tools
├── config/                # Configuration files
├── tests/                 # Regression tests
└── results/               # Generated output
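
run.py exposes the inventory, dry, run, and compare subcommands shown above. The sketch below maps that documented CLI surface onto argparse; only the subcommand and flag names come from this page, while defaults, help text, and internal wiring are assumptions.

# Minimal argparse sketch of run.py's documented CLI surface.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="run.py")
    sub = p.add_subparsers(dest="command", required=True)
    sub.add_parser("inventory")                 # list available scenarios
    sub.add_parser("dry")                       # plan without executing (assumed)
    run = sub.add_parser("run")                 # live evaluation
    run.add_argument("--trials", type=int, default=3)   # default is assumed
    run.add_argument("--continue", dest="cont", action="store_true")
    run.add_argument("--resume-from")
    run.add_argument("--rerun-execution-failures", action="store_true")
    run.add_argument("--openclaw-profile")      # same-machine isolation
    run.add_argument("--openclaw-state-dir")
    sub.add_parser("compare")                   # comparative reports
    return p

print(build_parser().parse_args(["run", "--trials", "3"]))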

Key design: scenarios are defined in YAML and executed for real through the live bridge into the OpenClaw runtime; multi-provider integration is supported (bailiancodingplan, volcengine-plan, siliconflow, openrouter, etc.).
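
The scenario schema itself is not documented on this page. The following is a hypothetical sketch of what a YAML scenario plus a matching custom_checks scorer could look like; every field name, path, and function below is an assumption for illustration only.

# Hypothetical sketch of a YAML scenario and a pluggable custom check.
import yaml  # PyYAML

SCENARIO_YAML = """
id: fs-cleanup-01
dimension: Tool Use
prompt: Remove every *.tmp file under the workspace, but keep keep.tmp.
setup: datasets/fs_cleanup/setup.sh      # seed data (hypothetical path)
check: custom_checks.fs_cleanup          # scoring plugin (hypothetical)
"""

def fs_cleanup_check(workspace_files: set[str]) -> float:
    """Deterministic scenario-specific scorer: 1.0 on full pass, else 0.0."""
    tmp_left = {f for f in workspace_files if f.endswith(".tmp")}
    return 1.0 if tmp_left == {"keep.tmp"} else 0.0

scenario = yaml.safe_load(SCENARIO_YAML)
print(scenario["id"], "->", fs_cleanup_check({"keep.tmp", "report.md"}))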

Report Output

Each run reports avg_score, max_score, a coverage summary, cost, latency, and resume metadata; the compare subcommand produces comparative reports across runs.
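
The on-disk report format is not published here, so the sketch below only illustrates the kind of side-by-side view a compare-style report would give. The flat summary structure and the units (cost in USD, latency in seconds) are assumptions; the field names mirror the metrics listed above.

# Hypothetical comparative view of two run summaries.
summary_a = {"avg_score": 71.4, "max_score": 100, "cost": 2.10, "latency": 38.5}
summary_b = {"avg_score": 64.9, "max_score": 100, "cost": 1.45, "latency": 29.2}

for key in ("avg_score", "max_score", "cost", "latency"):
    delta = summary_a[key] - summary_b[key]
    print(f"{key:>10}: {summary_a[key]:>7} vs {summary_b[key]:>7}  (diff {delta:+.2f})")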

Leaderboard Ecosystem

The public leaderboard covers 57+ models, including DeepSeek, Qwen, GLM, Claude, GPT, Kimi, Mimo, Doubao, Gemini, Hunyuan, LongCat, Ling, and MiniMax, with support for onboarding third-party providers.

Unconfirmed Items

  • The README does not provide an official repo or docs link for the OpenClaw runtime
  • No associated academic paper found
  • No HuggingFace page found
  • Leaderboard data is self-run by the author team; no independent third-party audit mechanism is mentioned
