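Benchmarks

The 3we benchmark suite provides standardized tasks and metrics for evaluating navigation policies, exploration strategies, and VLM-based agents. Results can be submitted to the public leaderboard for comparison with other methods.

Running Benchmarks

Run the full benchmark suite from the CLI: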

threewe benchmark run

This executes all registered tasks in simulation and writes a results JSON file.

Options

threewe benchmark run --task navigation    # run only navigation tasks
threewe benchmark run --task exploration   # run only exploration tasks
threewe benchmark run --task vlm           # run only VLM instruction tasks
threewe benchmark run --episodes 100       # override episode count
threewe benchmark run --seed 42            # set random seed for reproducibility
threewe benchmark run --output results.json
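When comparing methods it is common to repeat a run over several seeds. The sketch below only builds the command lines from the flags listed above; it assumes, without confirmation from this page, that `--task`, `--seed`, `--episodes`, and `--output` can be combined in one invocation. The helper name `build_run_command` and the output file naming are hypothetical.

```python
import shlex


def build_run_command(task, seed, episodes=None, output=None):
    """Build the argv list for one `threewe benchmark run` invocation.

    Assumption: the documented flags can be combined in a single call.
    """
    argv = ["threewe", "benchmark", "run", "--task", task, "--seed", str(seed)]
    if episodes is not None:
        argv += ["--episodes", str(episodes)]
    if output is not None:
        argv += ["--output", output]
    return argv


if __name__ == "__main__":
    for seed in (0, 1, 2):
        argv = build_run_command(
            "navigation",
            seed,
            episodes=100,
            output=f"results_nav_seed{seed}.json",  # hypothetical file name
        )
        # Print instead of executing; to launch, pass argv to
        # subprocess.run(argv, check=True).
        print(shlex.join(argv))
```

Building the argument list first and printing it keeps the sweep easy to inspect before anything is launched.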

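Task Types

Navigation Tasks

Point-to-point navigation in known and unknown environments. The robot must reach the goal position while avoiding obstacles.

Metrics: success rate, SPL (Success weighted by Path Length), collision rate, mean time.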
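SPL is not defined further on this page. A minimal sketch follows, assuming the commonly used definition from the embodied navigation literature: the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is 1 on success and 0 otherwise, l_i is the shortest-path length, and p_i is the length of the path actually taken. Whether 3we computes SPL in exactly this way is an assumption, and the function name `spl` is illustrative.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes:        iterable of bools, one per episode
    shortest_lengths: shortest-path length per episode (must be > 0)
    path_lengths:     length of the path the agent actually took
    """
    successes = list(successes)
    shortest_lengths = list(shortest_lengths)
    path_lengths = list(path_lengths)
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    if not successes:
        return 0.0

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            raise ValueError("shortest-path lengths must be positive")
        if success:
            # A failed episode contributes 0; a successful one is
            # discounted by how much longer than optimal the path was.
            total += shortest / max(taken, shortest)
    return total / len(successes)


if __name__ == "__main__":
    # One optimal success, one success at twice the optimal length, one failure.
    print(spl([True, True, False], [5.0, 4.0, 6.0], [5.0, 8.0, 3.0]))
```

Under this definition SPL can never exceed the plain success rate, which is why the two are reported side by side.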

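Exploration Tasks

Maximize coverage of an unknown environment within a time budget.

Metrics: area covered (%), time to reach 90% coverage, number of revisits.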
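To make the three exploration metrics concrete, here is one possible reading of them on a grid map. The grid-cell representation, the definition of a revisit as re-entering a previously visited cell, and all function names are assumptions for illustration, not the definitions 3we uses.

```python
def coverage_percent(observed_cells, free_cells):
    """Percentage of free cells that have been observed."""
    if not free_cells:
        raise ValueError("free_cells must not be empty")
    return 100.0 * len(observed_cells & free_cells) / len(free_cells)


def time_to_coverage(timeline, threshold=90.0):
    """First timestamp at which coverage reaches the threshold.

    timeline: iterable of (time_s, coverage_percent) pairs in time order.
    Returns None if the threshold is never reached.
    """
    for time_s, coverage in timeline:
        if coverage >= threshold:
            return time_s
    return None


def revisit_count(cell_path):
    """Number of times the agent re-enters a cell it visited earlier."""
    seen = set()
    revisits = 0
    previous = None
    for cell in cell_path:
        # Staying in the same cell across consecutive steps is not a revisit.
        if cell != previous and cell in seen:
            revisits += 1
        seen.add(cell)
        previous = cell
    return revisits


if __name__ == "__main__":
    free = {(0, 0), (0, 1), (1, 0), (1, 1)}
    print(coverage_percent({(0, 0), (0, 1)}, free))
    print(time_to_coverage([(10, 40.0), (20, 91.0), (30, 95.0)]))
    print(revisit_count([(0, 0), (0, 1), (0, 0), (1, 0)]))
```

Returning `None` when 90% is never reached mirrors the blank entries that a results table would show for methods that do not get there.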

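VLM Instruction Tasks

Follow a natural-language instruction to reach a semantic goal (for example, "go to the kitchen table").

Metrics: success rate, trajectory efficiency, instruction completion accuracy.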

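Scene List

The benchmark evaluates across a diverse set of scenes:

Scene ID	Description	Size (m)	Obstacles	Difficulty
`empty_room`	Open 5x5 room with no obstacles	5x5	0	Easy
`simple_corridor`	L-shaped corridor	8x3	2	Easy
`cluttered_room`	Furnished room	6x6	12	Medium
`maze_small`	Grid maze, 3x3 cells	6x6	Walls	Medium
`office_floor`	Multi-room office layout	15x10	25+	Hard
`warehouse`	Open space with shelving aisles	20x15	40+	Hard
`apartment`	Realistic apartment with rooms	12x8	30+	Hard
`dynamic_pedestrians`	Room with moving pedestrians	8x8	5 + 3 moving	Hard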
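When iterating on a method it can help to start with the easier scenes. The sketch below copies the scene IDs and difficulty levels from the table into a plain dictionary and filters by level. The labels `easy`, `medium`, and `hard` and the helper `scenes_by_difficulty` are illustrative; the threewe package may or may not expose this information itself.

```python
# Scene IDs and difficulty levels transcribed from the scene table.
SCENE_DIFFICULTY = {
    "empty_room": "easy",
    "simple_corridor": "easy",
    "cluttered_room": "medium",
    "maze_small": "medium",
    "office_floor": "hard",
    "warehouse": "hard",
    "apartment": "hard",
    "dynamic_pedestrians": "hard",
}


def scenes_by_difficulty(level):
    """Return scene IDs with the given difficulty, in table order."""
    return [scene for scene, d in SCENE_DIFFICULTY.items() if d == level]


if __name__ == "__main__":
    for level in ("easy", "medium", "hard"):
        print(level, scenes_by_difficulty(level))
```

The resulting list has the same shape as the `scenes` argument that the custom policy example passes to `runner.run`.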

Running a Custom Policy

Register your policy and run it in the benchmark:

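from threewe.benchmark import BenchmarkRunner, register_policy

# Register the policy class under a name the runner can look up.
@register_policy("my_policy")
class MyPolicy:
    def __init__(self, env):
        # Keep a handle to the environment passed in by the benchmark.
        self.env = env

    def act(self, observation):
        # Your policy logic here: compute and return an action
        # for the given observation.
        raise NotImplementedError("replace with your policy logic")

# Run the registered policy on the navigation tasks in one scene.
runner = BenchmarkRunner(policy="my_policy")
results = runner.run(tasks=["navigation"], scenes=["cluttered_room"])
print(results.summary())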

Submitting Results

After running the benchmark, submit your results to the public leaderboard:

threewe benchmark submit results.json --name "My Method" --paper "https://arxiv.org/abs/..."

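Submissions are verified server-side by re-running a subset of episodes with the code or checkpoint you provide.

Baseline Results

Method	Navigation success rate	Navigation SPL	Exploration to 90% (s)	VLM success rate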
Random	5.2%	0.03	—	2.1%
Bug2	62.4%	0.48	—	—
DWA Planner	78.1%	0.61	145s	—
Frontier Exploration	—	—	89s	—
PPO (3we default)	84.3%	0.72	67s	—
VLM Nav (GPT-4o)	—	—	—	71.5%

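Reproducibility

Every benchmark run records:

Git commit hash of the threewe package
Full configuration YAML
Random seed
System information (CPU, GPU, OS)

This metadata is stored alongside the results to ensure full reproducibility.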
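For readers who want to capture comparable metadata in their own scripts, the sketch below gathers the same kinds of fields using only the Python standard library. It is not the threewe implementation: the function names and dictionary keys are hypothetical, and GPU detection is left out because it needs vendor-specific tools.

```python
import json
import platform
import subprocess


def git_commit_hash(repo_dir="."):
    """Return the current commit hash, or None if git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a repository.
        return None
    return result.stdout.strip()


def collect_run_metadata(seed, config_yaml_text, repo_dir="."):
    """Collect run metadata as a JSON-serializable dictionary.

    The key names are illustrative, not the schema threewe uses.
    """
    return {
        "git_commit": git_commit_hash(repo_dir),
        "config_yaml": config_yaml_text,
        "seed": seed,
        "system": {
            "os": platform.platform(),
            "cpu": platform.processor() or platform.machine(),
            "python": platform.python_version(),
        },
    }


if __name__ == "__main__":
    metadata = collect_run_metadata(seed=42, config_yaml_text="episodes: 100\n")
    print(json.dumps(metadata, indent=2))
```

Storing the configuration as raw text rather than a parsed object keeps the record identical to what was actually on disk at run time.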