Benchmarks

The 3we benchmark suite provides standardized tasks and metrics for evaluating navigation policies, exploration strategies, and VLM-based agents. Results can be submitted to the public leaderboard for comparison with other approaches.

Use the CLI to run the full benchmark suite:

Terminal window
threewe benchmark run

This executes all registered tasks in simulation and produces a results JSON file.

Terminal window
threewe benchmark run --task navigation # run only navigation tasks
threewe benchmark run --task exploration # run only exploration tasks
threewe benchmark run --task vlm # run only VLM instruction tasks
threewe benchmark run --episodes 100 # override episode count
threewe benchmark run --seed 42 # set random seed for reproducibility
threewe benchmark run --output results.json

Navigation

Point-to-point navigation in known and unknown environments. The robot must reach a goal position while avoiding obstacles.

Metrics: Success rate, SPL (Success weighted by Path Length), collision rate, average time.
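
SPL follows the standard definition used in embodied-navigation benchmarks: each episode's success indicator is weighted by the ratio of the shortest-path length to the length of the path actually taken, then averaged over episodes. A minimal sketch, assuming (purely for illustration) per-episode records with success, shortest_path, and path_length fields; these names are not part of the threewe results schema:

def spl(episodes):
    """Success weighted by Path Length, averaged over episode records."""
    total = 0.0
    for ep in episodes:
        if ep["success"]:
            # Shortest geodesic distance divided by the longer of the two paths,
            # so any detour can only lower the score.
            total += ep["shortest_path"] / max(ep["path_length"], ep["shortest_path"])
    return total / len(episodes)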

Exploration

Maximize coverage of an unknown environment within a time budget.

Metrics: Area covered (%), time to 90% coverage, number of revisits.
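
Coverage is computed from the portion of traversable space the robot has observed. A minimal sketch of the two headline numbers, assuming (hypothetically) a per-timestep boolean explored-cell grid and a known count of free cells; the argument names are illustrative rather than the threewe logging format:

def coverage_metrics(explored_masks, timestamps, free_cells):
    """Return (final coverage %, time in seconds to reach 90% coverage).

    explored_masks: boolean occupancy grids (e.g. numpy arrays), one per timestep
    timestamps:     simulation time in seconds for each grid
    free_cells:     total number of traversable cells in the scene
    """
    coverage = [mask.sum() / free_cells for mask in explored_masks]
    time_to_90 = next((t for t, c in zip(timestamps, coverage) if c >= 0.9), None)
    return 100.0 * coverage[-1], time_to_90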

VLM Instruction Following

Follow natural language instructions to reach semantic goals (e.g., “go to the kitchen table”).

Metrics: Success rate, trajectory efficiency, instruction completion accuracy.

Benchmarks are evaluated across a diverse set of scenes:

| Scene ID            | Description                     | Size (m) | Obstacles    | Difficulty |
| ------------------- | ------------------------------- | -------- | ------------ | ---------- |
| empty_room          | Open 5x5 room, no obstacles     | 5x5      | 0            | Easy       |
| simple_corridor     | L-shaped hallway                | 8x3      | 2            | Easy       |
| cluttered_room      | Room with furniture             | 6x6      | 12           | Medium     |
| maze_small          | Grid maze, 3x3 cells            | 6x6      | walls        | Medium     |
| office_floor        | Multi-room office layout        | 15x10    | 25+          | Hard       |
| warehouse           | Open space with shelving aisles | 20x15    | 40+          | Hard       |
| apartment           | Realistic apartment with rooms  | 12x8     | 30+          | Hard       |
| dynamic_pedestrians | Room with moving agents         | 8x8      | 5 + 3 moving | Hard       |

Register your policy and run it against the benchmark:

from threewe.benchmark import BenchmarkRunner, register_policy

@register_policy("my_policy")
class MyPolicy:
    def __init__(self, env):
        self.env = env

    def act(self, observation):
        # Your policy logic here
        return action

runner = BenchmarkRunner(policy="my_policy")
results = runner.run(tasks=["navigation"], scenes=["cluttered_room"])
print(results.summary())
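
For a first end-to-end run it can help to register a trivial baseline before wiring in a real policy. The sketch below assumes, purely for illustration, that an action is a (linear velocity, angular velocity) tuple; check the environment documentation for the actual action and observation formats:

from threewe.benchmark import register_policy

@register_policy("always_forward")
class AlwaysForward:
    """Illustrative baseline that ignores observations and drives straight ahead."""

    def __init__(self, env):
        self.env = env

    def act(self, observation):
        # Hypothetical action format: (linear velocity m/s, angular velocity rad/s).
        return (0.5, 0.0)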

After running the benchmark, submit your results to the public leaderboard:

Terminal window
threewe benchmark submit results.json --name "My Method" --paper "https://arxiv.org/abs/..."

Submissions are validated server-side by re-running a subset of episodes with your provided code or checkpoint.

Current leaderboard:

| Method               | Nav. Success | Nav. SPL | Explore 90% (s) | VLM Success |
| -------------------- | ------------ | -------- | --------------- | ----------- |
| Random               | 5.2%         | 0.03     | –               | 2.1%        |
| Bug2                 | 62.4%        | 0.48     | –               | –           |
| DWA Planner          | 78.1%        | 0.61     | 145             | –           |
| Frontier Exploration | –            | –        | 89              | –           |
| PPO (3we default)    | 84.3%        | 0.72     | 67              | –           |
| VLM Nav (GPT-4o)     | –            | –        | –               | 71.5%       |

All benchmark runs record:

  • Git commit hash of the threewe package
  • Full configuration YAML
  • Random seeds
  • System information (CPU, GPU, OS)

This metadata is stored alongside results for full reproducibility.
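
If you log additional experiments outside the built-in runner, the same fields can be captured with the Python standard library (GPU details require vendor tooling). A rough sketch, not the threewe internals; the function and key names are illustrative:

import platform
import subprocess

def run_metadata(seed, config_path):
    """Collect reproducibility metadata similar to the fields listed above."""
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    with open(config_path) as f:
        config_yaml = f.read()
    return {
        "git_commit": commit,
        "config": config_yaml,
        "seed": seed,
        "system": {
            "machine": platform.machine(),
            "os": platform.platform(),
            "python": platform.python_version(),
        },
    }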