ClawProBench

An LLM agent benchmark for the OpenClaw runtime, with live execution, deterministic scoring, and multi-dimensional capability evaluation.

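ClawProBench is a benchmark framework for evaluating LLM agents on the OpenClaw runtime. It is maintained directly by suyoumo, is currently at version v1.1.2, and is released under the Apache-2.0 open-source license.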

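Core Features

Live execution: every evaluation actually runs in a local OpenClaw runtime rather than in a simulation
Deterministic automatic scoring: the custom_checks module makes deterministic, scenario-level pass/fail decisions, and the scoring logic is pluggable and extensible
Multi-trial reliability: --trials N repeats each run N times; the headline leaderboard metric is pass^3 (the rate at which all 3 trials pass)
Composite score: FinalScore = 100 × S^0.40 × r_all^0.45 × r_any^0.15, combining average_score, pass^3, and pass@3
Resumable runs: supported through --continue, --resume-from, and --rerun-execution-failures
Isolated runs: --openclaw-profile, --openclaw-state-dir, and related options isolate runs on the same machine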
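
The composite score above can be sketched in a few lines. This is a minimal illustration, not the project's own scoring code. It assumes that S is average_score normalized to [0, 1], that r_all is pass^3 (the share of scenarios where every trial passed), and that r_any is pass@3 (the share where at least one trial passed). The function names and the per-scenario trial mapping are hypothetical. Because the exponents sum to 1.00, the formula is a weighted geometric mean, so a zero in any factor drives the score to zero.

```python
from typing import Mapping, Sequence


def pass_all_rate(trials: Mapping[str, Sequence[bool]]) -> float:
    """Share of scenarios in which every trial passed (pass^k)."""
    if not trials:
        raise ValueError("no scenarios given")
    return sum(all(outcomes) for outcomes in trials.values()) / len(trials)


def pass_any_rate(trials: Mapping[str, Sequence[bool]]) -> float:
    """Share of scenarios in which at least one trial passed (pass@k)."""
    if not trials:
        raise ValueError("no scenarios given")
    return sum(any(outcomes) for outcomes in trials.values()) / len(trials)


def final_score(avg_score: float, r_all: float, r_any: float) -> float:
    """FinalScore = 100 * S^0.40 * r_all^0.45 * r_any^0.15.

    Assumption: all three inputs are already normalized to [0, 1].
    """
    for name, value in (("avg_score", avg_score), ("r_all", r_all), ("r_any", r_any)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 100.0 * avg_score ** 0.40 * r_all ** 0.45 * r_any ** 0.15


# Hypothetical outcomes for four scenarios, three trials each.
trials = {
    "scenario_a": [True, True, True],
    "scenario_b": [True, False, True],
    "scenario_c": [False, False, False],
    "scenario_d": [True, True, True],
}
r_all = pass_all_rate(trials)  # 2 of 4 scenarios passed all three trials
r_any = pass_any_rate(trials)  # 3 of 4 scenarios passed at least once
score = final_score(0.8, r_all, r_any)
```

The largest exponent sits on r_all, so under these assumptions a model that passes inconsistently across trials loses more than one with a slightly lower average score.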

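Evaluation Dimensions and Profiles

The benchmark covers eight evaluation dimensions: Capability, Efficiency, Planning, Safety, Tool Use, Constraints, Error Recovery, and Synthesis.

Five evaluation profiles are available:

core: 26 scenarios (core evaluation)
intelligence: 95 scenarios (in-depth intelligence evaluation)
coverage: 7 scenarios (coverage evaluation)
native: 36 scenarios (native capability evaluation)
full: 102 scenarios (full evaluation)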

Architecture

ClawProBench/
├── run.py                 # CLI entry point (inventory / dry / run / compare)
├── harness/               # Core evaluation engine
│   ├── loader             # Scenario loader
│   ├── runner             # Runner
│   ├── scoring            # Scoring logic
│   ├── reporting          # Report generation
│   └── live OpenClaw bridge  # Bridge to the OpenClaw runtime
├── scenarios/             # Evaluation scenarios defined in YAML
├── datasets/              # Seed data and setup/teardown scripts
├── custom_checks/         # Scenario-specific scoring logic
├── frameworks/            # Framework adapters
├── mock_tools/            # Mock tools
├── config/                # Configuration files
├── tests/                 # Regression tests
└── results/               # Generated output
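
To make the role of scenario-specific, pluggable checks more concrete, here is a minimal sketch of one way such a check could be structured. It is not ClawProBench's actual custom_checks API. The registry, the `register_check` decorator, the `CheckResult` type, the scenario id `demo_summary`, and the artifact name `summary.txt` are all hypothetical. The point is only that a deterministic check is a pure function of the artifacts a run produced, so the same artifacts always yield the same verdict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping


@dataclass(frozen=True)
class CheckResult:
    passed: bool
    score: float  # normalized to [0, 1]
    detail: str


# Hypothetical registry: scenario id -> check function.
CheckFn = Callable[[Mapping[str, str]], CheckResult]
_CHECKS: Dict[str, CheckFn] = {}


def register_check(scenario_id: str) -> Callable[[CheckFn], CheckFn]:
    """Register a check for one scenario; duplicate ids are rejected."""
    def decorator(fn: CheckFn) -> CheckFn:
        if scenario_id in _CHECKS:
            raise ValueError(f"duplicate check for scenario {scenario_id!r}")
        _CHECKS[scenario_id] = fn
        return fn
    return decorator


@register_check("demo_summary")
def check_demo_summary(artifacts: Mapping[str, str]) -> CheckResult:
    """Pass only if the produced summary mentions every required term."""
    text = artifacts.get("summary.txt", "").lower()
    required = ("revenue", "risk")
    found = [term for term in required if term in text]
    missing = sorted(set(required) - set(found))
    detail = "ok" if not missing else "missing: " + ", ".join(missing)
    return CheckResult(passed=not missing,
                       score=len(found) / len(required),
                       detail=detail)


def run_check(scenario_id: str, artifacts: Mapping[str, str]) -> CheckResult:
    """Look up and run the check registered for a scenario."""
    return _CHECKS[scenario_id](artifacts)


result = run_check("demo_summary", {"summary.txt": "Revenue grew."})
```

The check takes no model-graded input and uses no randomness, so repeated runs over the same artifacts cannot disagree, which is the property that repeated-trial metrics rely on.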

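Key design points: evaluation tasks are defined in YAML and executed for real through the live OpenClaw bridge, which integrates closely with the runtime. Multiple providers are supported, including bailiancodingplan, volcengine-plan, siliconflow, and openrouter.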

Report Output

Reports include avg_score, max_score, a coverage summary, cost, latency, and resume metadata. The compare subcommand generates comparison reports.

Leaderboard Ecosystem

The public leaderboard covers 57+ models, including DeepSeek, Qwen, GLM, Claude, GPT, Kimi, Mimo, Doubao, Gemini, Hunyuan, LongCat, Ling, and MiniMax, and supports third-party model providers.

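Open Questions

The README does not link to an official repository or documentation for the OpenClaw runtime
No associated academic paper was found
No HuggingFace page was found
Leaderboard results are run and published by the authors' own team; no independent third-party audit mechanism is mentioned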
