Transparent, live-first benchmark harness for evaluating LLM agent capability inside the OpenClaw runtime, with deterministic scoring and multi-dimensional assessment.

ClawProBench is maintained by suyoumo; the current version is v1.1.2, licensed under Apache-2.0.
## Core Features
- Live Execution: All evaluations execute in the local OpenClaw runtime; nothing is simulated
- Deterministic Scoring: Scenario-level deterministic automated scoring via the `custom_checks` module, with pluggable extensions
- Multi-Trial Reliability: Supports `--trials N`; the core leaderboard metric is pass^3, the rate at which all 3 trials pass
- Composite Score: `FinalScore = 100 × S^0.40 × r_all^0.45 × r_any^0.15`, integrating average_score (S), pass^3 (r_all), and pass@3 (r_any); see the sketch after this list
- Checkpoint Resume: Supports `--continue`, `--resume-from`, and `--rerun-execution-failures`
- Isolated Execution: Same-machine isolation via `--openclaw-profile`, `--openclaw-state-dir`, etc.
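For concreteness, here is a minimal sketch of how the composite score could be derived from per-trial results. Only the formula itself comes from the feature list above; the data shape and every name below are illustrative assumptions, not ClawProBench's actual API.

```python
# Minimal composite-score sketch. The formula is from the feature list above;
# the data shape and all names here are illustrative assumptions, not the real API.
from statistics import mean

def final_score(scenarios: list[dict]) -> float:
    """Each entry holds 3-trial pass flags and normalized scores in [0, 1],
    e.g. {"passes": [True, True, False], "scores": [1.0, 1.0, 0.4]}."""
    s = mean(mean(sc["scores"]) for sc in scenarios)     # average_score (S)
    r_all = mean(all(sc["passes"]) for sc in scenarios)  # pass^3: all trials pass
    r_any = mean(any(sc["passes"]) for sc in scenarios)  # pass@3: any trial passes
    return 100 * s**0.40 * r_all**0.45 * r_any**0.15

print(final_score([
    {"passes": [True, True, True],  "scores": [1.0, 1.0, 1.0]},
    {"passes": [True, False, True], "scores": [1.0, 0.4, 1.0]},
]))  # ≈ 70.2 on this toy input
```

Because the factors multiply, a model that never passes all three trials of any scenario (r_all = 0) scores 0 overall, so the metric strongly rewards consistency.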
## Assessment Dimensions & Profiles
Eight dimensions: Capability, Efficiency, Planning, Safety, Tool Use, Constraints, Error Recovery, and Synthesis.
Five profiles:
- core: 26 scenarios
- intelligence: 95 scenarios
- coverage: 7 scenarios
- native: 36 scenarios
- full: 102 scenarios
## Architecture
```
ClawProBench/
├── run.py           # CLI entry (inventory / dry / run / compare)
├── harness/         # Core evaluation engine
│   ├── loader       # Scenario loader
│   ├── runner       # Runner
│   ├── scoring      # Scoring logic
│   ├── reporting    # Report generation
│   └── live bridge  # OpenClaw runtime bridge
├── scenarios/       # YAML scenario definitions
├── datasets/        # Seed data & setup/teardown scripts
├── custom_checks/   # Scenario-specific scoring
├── frameworks/      # Framework adapters
├── mock_tools/      # Mock tools
├── config/          # Configuration files
├── tests/           # Regression tests
└── results/         # Generated output
```
Key design: scenarios are defined in YAML and executed for real against the OpenClaw runtime through the live bridge; multi-provider integration is supported (bailiancodingplan, volcengine-plan, siliconflow, openrouter, etc.).
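To make the YAML-first design concrete, the sketch below invents a tiny scenario file and loads it with PyYAML. Every field name (id, dimension, prompt, setup, checks) is hypothetical; it illustrates the idea rather than ClawProBench's actual schema.

```python
# Hypothetical scenario definition; all field names are invented for
# illustration and do not reflect ClawProBench's actual schema.
import yaml  # assumes PyYAML is available

SCENARIO = """
id: fs-rename-batch
dimension: Tool Use
prompt: Rename every .txt file in the workspace to .md
setup: datasets/fs_seed.sh           # hypothetical seed script
checks:
  - module: custom_checks.fs_rename  # hypothetical deterministic check
"""

scenario = yaml.safe_load(SCENARIO)
print(scenario["id"], "->", [c["module"] for c in scenario["checks"]])
```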
## Report Output
Reports include avg_score, max_score, a coverage summary, cost, latency, and resume metadata; the compare subcommand produces side-by-side comparative reports.
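As a rough sketch of what downstream tooling around the compare step might resemble, the snippet below diffs the documented top-level metrics between two result files. The JSON layout and file paths are assumptions; only the field names come from the report description above.

```python
# Sketch of a result comparison; the metric names are from the report
# description, while the JSON layout and file paths are assumptions.
import json
from pathlib import Path

def compare(path_a: str, path_b: str) -> None:
    a, b = (json.loads(Path(p).read_text()) for p in (path_a, path_b))
    for key in ("avg_score", "max_score", "cost", "latency"):
        if key in a and key in b:
            print(f"{key}: {a[key]} -> {b[key]} (delta {b[key] - a[key]:+.3f})")

compare("results/run_a.json", "results/run_b.json")  # hypothetical output paths
```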
## Leaderboard Ecosystem
The public leaderboard covers 57+ models, including DeepSeek, Qwen, GLM, Claude, GPT, Kimi, Mimo, Doubao, Gemini, Hunyuan, LongCat, Ling, and MiniMax, and supports onboarding of third-party providers.
## Unconfirmed Items
- Official repo/docs link for the OpenClaw runtime not provided in README
- No associated academic paper found
- No HuggingFace page found
- Leaderboard data is self-run by the author team; no independent third-party audit mechanism mentioned