Transparent, live-first benchmark harness for evaluating LLM agent capability inside the OpenClaw runtime, with deterministic scoring and multi-dimensional assessment.

ClawProBench is maintained by suyoumo; the current version is v1.1.2, licensed under Apache-2.0.
## Core Features
- Live Execution: All evaluations execute in the local OpenClaw runtime; nothing is simulated
- Deterministic Scoring: Scenario-level deterministic automated scoring via the `custom_checks` module, with pluggable extensions
- Multi-Trial Reliability: Supports `--trials N`; the core leaderboard metric is pass^3, the rate at which all 3 trials pass
- Composite Score: `FinalScore = 100 × S^0.40 × r_all^0.45 × r_any^0.15`, integrating average_score (S), pass^3 (r_all), and pass@3 (r_any); see the sketch after this list
- Checkpoint Resume: Supports `--continue`, `--resume-from`, and `--rerun-execution-failures`
- Isolated Execution: Same-machine isolation via `--openclaw-profile`, `--openclaw-state-dir`, etc.
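For concreteness, here is a minimal sketch of how the composite score could be derived from per-trial results. Only the formula itself comes from the feature list above; the data shape and every name below are illustrative assumptions, not ClawProBench's actual API.

```python
# Minimal composite-score sketch. The formula is from the feature list above;
# the data shape and all names here are illustrative assumptions, not the real API.
from statistics import mean

def final_score(scenarios: list[dict]) -> float:
    """Each entry holds 3-trial pass flags and normalized scores in [0, 1],
    e.g. {"passes": [True, True, False], "scores": [1.0, 1.0, 0.4]}."""
    s = mean(mean(sc["scores"]) for sc in scenarios)     # average_score (S)
    r_all = mean(all(sc["passes"]) for sc in scenarios)  # pass^3: all trials pass
    r_any = mean(any(sc["passes"]) for sc in scenarios)  # pass@3: any trial passes
    return 100 * s**0.40 * r_all**0.45 * r_any**0.15

print(final_score([
    {"passes": [True, True, True],  "scores": [1.0, 1.0, 1.0]},
    {"passes": [True, False, True], "scores": [1.0, 0.4, 1.0]},
]))  # ≈ 70.2 on this toy input
```

Because the factors multiply, a model that never passes all three trials of any scenario (r_all = 0) scores 0 overall, so the metric strongly rewards consistency.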
## Assessment Dimensions & Profiles
Eight dimensions: Capability, Efficiency, Planning, Safety, Tool Use, Constraints, Error Recovery, and Synthesis.
Five profiles:
- core: 26 scenarios
- intelligence: 95 scenarios
- coverage: 7 scenarios
- native: 36 scenarios
- full: 102 scenarios
## Architecture
```
ClawProBench/
├── run.py           # CLI entry (inventory / dry / run / compare)
├── harness/         # Core evaluation engine
│   ├── loader       # Scenario loader
│   ├── runner       # Runner
│   ├── scoring      # Scoring logic
│   ├── reporting    # Report generation
│   └── live bridge  # OpenClaw runtime bridge
├── scenarios/       # YAML scenario definitions
├── datasets/        # Seed data & setup/teardown scripts
├── custom_checks/   # Scenario-specific scoring
├── frameworks/      # Framework adapters
├── mock_tools/      # Mock tools
├── config/          # Configuration files
├── tests/           # Regression tests
└── results/         # Generated output
```
Key design: scenarios are defined in YAML and executed for real against the OpenClaw runtime through the live bridge; multi-provider integration is supported (bailiancodingplan, volcengine-plan, siliconflow, openrouter, etc.).
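To make the YAML-first design concrete, the sketch below invents a tiny scenario file and loads it with PyYAML. Every field name (id, dimension, prompt, setup, checks) is hypothetical; it illustrates the idea rather than ClawProBench's actual schema.

```python
# Hypothetical scenario definition; all field names are invented for
# illustration and do not reflect ClawProBench's actual schema.
import yaml  # assumes PyYAML is available

SCENARIO = """
id: fs-rename-batch
dimension: Tool Use
prompt: Rename every .txt file in the workspace to .md
setup: datasets/fs_seed.sh           # hypothetical seed script
checks:
  - module: custom_checks.fs_rename  # hypothetical deterministic check
"""

scenario = yaml.safe_load(SCENARIO)
print(scenario["id"], "->", [c["module"] for c in scenario["checks"]])
```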
## Report Output
Reports include avg_score, max_score, a coverage summary, cost, latency, and resume metadata; the compare subcommand produces side-by-side comparative reports.
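As a rough sketch of what downstream tooling around the compare step might resemble, the snippet below diffs the documented top-level metrics between two result files. The JSON layout and file paths are assumptions; only the field names come from the report description above.

```python
# Sketch of a result comparison; the metric names are from the report
# description, while the JSON layout and file paths are assumptions.
import json
from pathlib import Path

def compare(path_a: str, path_b: str) -> None:
    a, b = (json.loads(Path(p).read_text()) for p in (path_a, path_b))
    for key in ("avg_score", "max_score", "cost", "latency"):
        if key in a and key in b:
            print(f"{key}: {a[key]} -> {b[key]} (delta {b[key] - a[key]:+.3f})")

compare("results/run_a.json", "results/run_b.json")  # hypothetical output paths
```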
## Leaderboard Ecosystem
The public leaderboard covers 57+ models, including DeepSeek, Qwen, GLM, Claude, GPT, Kimi, Mimo, Doubao, Gemini, Hunyuan, LongCat, Ling, and MiniMax, and supports onboarding of third-party providers.
## Unconfirmed Items
- Official repo/docs link for the OpenClaw runtime not provided in README
- No associated academic paper found
- No HuggingFace page found
- Leaderboard data is self-run by the author team; no independent third-party audit mechanism mentioned