# Benchmarks
The 3we benchmark suite provides standardized tasks and metrics for evaluating navigation policies, exploration strategies, and VLM-based agents. Results can be submitted to the public leaderboard for comparison with other approaches.
## Running Benchmarks

Use the CLI to run the full benchmark suite:

```bash
threewe benchmark run
```

This executes all registered tasks in simulation and produces a results JSON file.
### Options

```bash
threewe benchmark run --task navigation    # run only navigation tasks
threewe benchmark run --task exploration   # run only exploration tasks
threewe benchmark run --task vlm           # run only VLM instruction tasks
threewe benchmark run --episodes 100       # override episode count
threewe benchmark run --seed 42            # set random seed for reproducibility
threewe benchmark run --output results.json
```

## Task Types

### Navigation Tasks
Point-to-point navigation in known and unknown environments. The robot must reach a goal position while avoiding obstacles.
Metrics: Success rate, SPL (Success weighted by Path Length), collision rate, average time.
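SPL follows the standard definition: success weighted by the ratio of the shortest-path length to the length of the path the agent actually took, averaged over episodes. A minimal sketch of the computation; the per-episode field names here are illustrative, not the suite's results schema:

```python
def spl(episodes):
    """Average SPL over a list of per-episode records.

    Assumed fields per episode (illustrative only):
      success  -- 1 if the goal was reached, else 0
      shortest -- shortest-path distance from start to goal, in meters
      taken    -- length of the path the agent actually traveled, in meters
    """
    total = 0.0
    for ep in episodes:
        # SPL_i = S_i * l_i / max(p_i, l_i); max() guards against
        # recorded paths shorter than the shortest-path estimate.
        total += ep["success"] * ep["shortest"] / max(ep["taken"], ep["shortest"])
    return total / len(episodes)
```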
### Exploration Tasks

Maximize coverage of an unknown environment within a time budget.
Metrics: Area covered (%), time to 90% coverage, number of revisits.
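Time to 90% coverage can be read directly off a coverage-versus-time trace. The `(time, fraction)` tuples below are a hypothetical log format, not necessarily what the suite emits:

```python
def time_to_coverage(trace, target=0.9):
    """Return the first timestamp at which covered fraction reaches `target`.

    `trace` is a time-ordered list of (time_s, covered_fraction) samples.
    Returns None if coverage never reached the target within the budget.
    """
    for t, covered in trace:
        if covered >= target:
            return t
    return None

# Example: reaches 90% coverage at t = 80 s.
print(time_to_coverage([(0, 0.0), (30, 0.52), (80, 0.93)]))
```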
### VLM Instruction Tasks

Follow natural language instructions to reach semantic goals (e.g., “go to the kitchen table”).
Metrics: Success rate, trajectory efficiency, instruction completion accuracy.
## Scene List

Benchmarks are evaluated across a diverse set of scenes:
| Scene ID | Description | Size (m) | Obstacles | Difficulty |
|---|---|---|---|---|
| empty_room | Open 5x5 room, no obstacles | 5x5 | 0 | Easy |
| simple_corridor | L-shaped hallway | 8x3 | 2 | Easy |
| cluttered_room | Room with furniture | 6x6 | 12 | Medium |
| maze_small | Grid maze, 3x3 cells | 6x6 | walls | Medium |
| office_floor | Multi-room office layout | 15x10 | 25+ | Hard |
| warehouse | Open space with shelving aisles | 20x15 | 40+ | Hard |
| apartment | Realistic apartment with rooms | 12x8 | 30+ | Hard |
| dynamic_pedestrians | Room with moving agents | 8x8 | 5 + 3 moving | Hard |
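Scene IDs double as the values accepted by the benchmark runner's `scenes` argument (see "Running a Custom Policy" below). A minimal sketch restricting a run to the two Easy scenes, assuming a policy named "my_policy" has already been registered:

```python
from threewe.benchmark import BenchmarkRunner

# Evaluate only the Easy scenes; IDs come from the table above.
runner = BenchmarkRunner(policy="my_policy")
results = runner.run(tasks=["navigation"], scenes=["empty_room", "simple_corridor"])
print(results.summary())
```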
## Running a Custom Policy

Register your policy and run it against the benchmark:
```python
from threewe.benchmark import BenchmarkRunner, register_policy

@register_policy("my_policy")
class MyPolicy:
    def __init__(self, env):
        self.env = env

    def act(self, observation):
        # Your policy logic here
        return action

runner = BenchmarkRunner(policy="my_policy")
results = runner.run(tasks=["navigation"], scenes=["cluttered_room"])
print(results.summary())
```
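Before plugging in a real policy, it can help to smoke-test the harness with a trivial agent. The sketch below assumes the `env` handed to the policy exposes a gym-style `action_space` with a `sample()` method, which these docs do not guarantee:

```python
@register_policy("random_baseline")
class RandomPolicy:
    def __init__(self, env):
        self.env = env

    def act(self, observation):
        # Ignore the observation and act randomly -- a lower-bound baseline
        # useful only for confirming the benchmark runs end to end.
        return self.env.action_space.sample()
```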
## Submitting Results

After running the benchmark, submit your results to the public leaderboard:
```bash
threewe benchmark submit results.json --name "My Method" --paper "https://arxiv.org/abs/..."
```

Submissions are validated server-side by re-running a subset of episodes with your provided code or checkpoint.
## Baseline Results

| Method | Nav. Success | Nav. SPL | Time to 90% Coverage (s) | VLM Success |
|---|---|---|---|---|
| Random | 5.2% | 0.03 | — | 2.1% |
| Bug2 | 62.4% | 0.48 | — | — |
| DWA Planner | 78.1% | 0.61 | 145 | — |
| Frontier Exploration | — | — | 89 | — |
| PPO (3we default) | 84.3% | 0.72 | 67 | — |
| VLM Nav (GPT-4o) | — | — | — | 71.5% |
## Reproducibility
Section titled “Reproducibility”All benchmark runs record:
- Git commit hash of the `threewe` package
- Full configuration YAML
- Random seeds
- System information (CPU, GPU, OS)
This metadata is stored alongside results for full reproducibility.
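For custom tooling that needs the same bookkeeping, equivalent metadata can be gathered in a few lines. This is an illustrative sketch, not the suite's internal implementation; GPU details would need a vendor-specific query (e.g., `nvidia-smi`) and are omitted here:

```python
import json
import platform
import subprocess

def collect_run_metadata(seed: int, config_path: str) -> dict:
    """Gather reproducibility metadata similar to what the suite records."""
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    return {
        "git_commit": commit,
        "config": config_path,  # path to the full configuration YAML
        "seed": seed,
        "system": {
            "machine": platform.machine(),
            "processor": platform.processor(),
            "os": platform.platform(),
            "python": platform.python_version(),
        },
    }

print(json.dumps(collect_run_metadata(seed=42, config_path="config.yaml"), indent=2))
```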