# Benchmarks
The 3we benchmark suite provides standardized tasks and metrics for evaluating navigation policies, exploration strategies, and VLM-based agents. Results can be submitted to the public leaderboard for comparison with other approaches.
## Running Benchmarks

Use the CLI to run the full benchmark suite:

```bash
threewe benchmark run
```

This executes all registered tasks in simulation and produces a results JSON file.
### Options

```bash
threewe benchmark run --task navigation    # run only navigation tasks
threewe benchmark run --task exploration   # run only exploration tasks
threewe benchmark run --task vlm           # run only VLM instruction tasks
threewe benchmark run --episodes 100       # override episode count
threewe benchmark run --seed 42            # set random seed for reproducibility
threewe benchmark run --output results.json
```

## Task Types

### Navigation Tasks
Point-to-point navigation in known and unknown environments. The robot must reach a goal position while avoiding obstacles.
Metrics: Success rate, SPL (Success weighted by Path Length), collision rate, average time.
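SPL follows the standard definition: success weighted by the ratio of the shortest-path length to the length of the path the agent actually took, averaged over episodes. A minimal sketch of the computation; the per-episode field names here are illustrative, not the suite's results schema:

```python
def spl(episodes):
    """Average SPL over a list of per-episode records.

    Assumed fields per episode (illustrative only):
      success  -- 1 if the goal was reached, else 0
      shortest -- shortest-path distance from start to goal, in meters
      taken    -- length of the path the agent actually traveled, in meters
    """
    total = 0.0
    for ep in episodes:
        # SPL_i = S_i * l_i / max(p_i, l_i); max() guards against
        # recorded paths shorter than the shortest-path estimate.
        total += ep["success"] * ep["shortest"] / max(ep["taken"], ep["shortest"])
    return total / len(episodes)
```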
### Exploration Tasks

Maximize coverage of an unknown environment within a time budget.
Metrics: Area covered (%), time to 90% coverage, number of revisits.
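Time to 90% coverage can be read directly off a coverage-versus-time trace. The `(time, fraction)` tuples below are a hypothetical log format, not necessarily what the suite emits:

```python
def time_to_coverage(trace, target=0.9):
    """Return the first timestamp at which covered fraction reaches `target`.

    `trace` is a time-ordered list of (time_s, covered_fraction) samples.
    Returns None if coverage never reached the target within the budget.
    """
    for t, covered in trace:
        if covered >= target:
            return t
    return None

# Example: reaches 90% coverage at t = 80 s.
print(time_to_coverage([(0, 0.0), (30, 0.52), (80, 0.93)]))
```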
### VLM Instruction Tasks

Follow natural language instructions to reach semantic goals (e.g., “go to the kitchen table”).
Metrics: Success rate, trajectory efficiency, instruction completion accuracy.
## Scene List

Benchmarks are evaluated across a diverse set of scenes:
| Scene ID | Description | Size (m) | Obstacles | Difficulty |
|---|---|---|---|---|
| empty_room | Open 5x5 room, no obstacles | 5x5 | 0 | Easy |
| simple_corridor | L-shaped hallway | 8x3 | 2 | Easy |
| cluttered_room | Room with furniture | 6x6 | 12 | Medium |
| maze_small | Grid maze, 3x3 cells | 6x6 | walls | Medium |
| office_floor | Multi-room office layout | 15x10 | 25+ | Hard |
| warehouse | Open space with shelving aisles | 20x15 | 40+ | Hard |
| apartment | Realistic apartment with rooms | 12x8 | 30+ | Hard |
| dynamic_pedestrians | Room with moving agents | 8x8 | 5 + 3 moving | Hard |
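Scene IDs double as the values accepted by the benchmark runner's `scenes` argument (see "Running a Custom Policy" below). A minimal sketch restricting a run to the two Easy scenes, assuming a policy named "my_policy" has already been registered:

```python
from threewe.benchmark import BenchmarkRunner

# Evaluate only the Easy scenes; IDs come from the table above.
runner = BenchmarkRunner(policy="my_policy")
results = runner.run(tasks=["navigation"], scenes=["empty_room", "simple_corridor"])
print(results.summary())
```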
## Running a Custom Policy

Register your policy and run it against the benchmark:
```python
from threewe.benchmark import BenchmarkRunner, register_policy

@register_policy("my_policy")
class MyPolicy:
    def __init__(self, env):
        self.env = env

    def act(self, observation):
        # Your policy logic here
        return action

runner = BenchmarkRunner(policy="my_policy")
results = runner.run(tasks=["navigation"], scenes=["cluttered_room"])
print(results.summary())
```
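Before plugging in a real policy, it can help to smoke-test the harness with a trivial agent. The sketch below assumes the `env` handed to the policy exposes a gym-style `action_space` with a `sample()` method, which these docs do not guarantee:

```python
@register_policy("random_baseline")
class RandomPolicy:
    def __init__(self, env):
        self.env = env

    def act(self, observation):
        # Ignore the observation and act randomly -- a lower-bound baseline
        # useful only for confirming the benchmark runs end to end.
        return self.env.action_space.sample()
```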
## Submitting Results

After running the benchmark, submit your results to the public leaderboard:
```bash
threewe benchmark submit results.json --name "My Method" --paper "https://arxiv.org/abs/..."
```

Submissions are validated server-side by re-running a subset of episodes with your provided code or checkpoint.
## Baseline Results

| Method | Nav. Success | Nav. SPL | Time to 90% Coverage (s) | VLM Success |
|---|---|---|---|---|
| Random | 5.2% | 0.03 | — | 2.1% |
| Bug2 | 62.4% | 0.48 | — | — |
| DWA Planner | 78.1% | 0.61 | 145 | — |
| Frontier Exploration | — | — | 89 | — |
| PPO (3we default) | 84.3% | 0.72 | 67 | — |
| VLM Nav (GPT-4o) | — | — | — | 71.5% |
## Reproducibility
Section titled “Reproducibility”All benchmark runs record:
- Git commit hash of the `threewe` package
- Full configuration YAML
- Random seeds
- System information (CPU, GPU, OS)
This metadata is stored alongside results for full reproducibility.
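For custom tooling that needs the same bookkeeping, equivalent metadata can be gathered in a few lines. This is an illustrative sketch, not the suite's internal implementation; GPU details would need a vendor-specific query (e.g., `nvidia-smi`) and are omitted here:

```python
import json
import platform
import subprocess

def collect_run_metadata(seed: int, config_path: str) -> dict:
    """Gather reproducibility metadata similar to what the suite records."""
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    return {
        "git_commit": commit,
        "config": config_path,  # path to the full configuration YAML
        "seed": seed,
        "system": {
            "machine": platform.machine(),
            "processor": platform.processor(),
            "os": platform.platform(),
            "python": platform.python_version(),
        },
    }

print(json.dumps(collect_run_metadata(seed=42, config_path="config.yaml"), indent=2))
```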