Evaluation Framework

Standardized Benchmarks for Embodied AI

Three tasks, seven scenes, and a public leaderboard. Reproducible evaluation across simulation and real hardware.

Benchmark Tasks

Three Core Evaluation Tasks

PointNav

Navigate to a goal given in absolute coordinates. Metrics: Success Rate (SR) and Success weighted by Path Length (SPL).

100 episodes per evaluation
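
As an illustration of the two PointNav metrics, the sketch below computes SR and SPL using the standard definition of SPL: for each episode, success multiplied by the ratio of shortest-path length to the longer of the agent's path and the shortest path, averaged over all episodes. The `Episode` record and its field names are assumptions made for this example and are not the benchmark's actual results schema.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical per-episode record; field names are illustrative only.
    success: bool         # did the agent reach the goal?
    shortest_path: float  # shortest start-to-goal distance (assumed > 0)
    agent_path: float     # length of the path the agent actually travelled


def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            # Failed episodes contribute 0; successful ones are scaled by
            # how close the agent's path was to the shortest path.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=5.0),
    Episode(success=True, shortest_path=4.0, agent_path=8.0),
    Episode(success=False, shortest_path=6.0, agent_path=3.0),
    Episode(success=True, shortest_path=2.0, agent_path=2.5),
]

print(f"SR  = {success_rate(episodes):.3f}")
print(f"SPL = {spl(episodes):.3f}")
```

A method that succeeds often but takes long detours scores a high SR and a noticeably lower SPL, which is why the two are reported together.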

ObjectNav

Navigate to a semantic object category (e.g., ‘chair’, ‘table’). Metrics: SR + SPL + Discovery Distance.

100 episodes per evaluation

Exploration

Maximize coverage of an unknown environment within a time budget. Metrics: Coverage % + Efficiency ratio.

100 episodes per evaluation
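
One plausible way to compute Coverage % is as the share of traversable grid cells the agent has observed by the end of the episode. The sketch below assumes two same-sized occupancy-style grids (`free_mask` and `observed_mask`, both hypothetical names); the benchmark's actual map representation may differ. The efficiency ratio is not defined on this page, so it is left out.

```python
def coverage_percent(free_mask, observed_mask):
    """Percentage of traversable cells that the agent has observed.

    Both arguments are 2D lists of 0/1 values with identical shape.
    This representation is an assumption made for illustration.
    """
    free_cells = 0
    seen_cells = 0
    for free_row, observed_row in zip(free_mask, observed_mask):
        for is_free, is_observed in zip(free_row, observed_row):
            if is_free:
                free_cells += 1
                if is_observed:
                    seen_cells += 1
    # Observed cells that are not traversable (walls) are ignored.
    return 100.0 * seen_cells / free_cells if free_cells else 0.0


free_mask = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
observed_mask = [
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
]

print(f"Coverage: {coverage_percent(free_mask, observed_mask):.1f}%")
```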

Rankings

Public Leaderboard

Rank	Method	Task	Backend	SR (%)	SPL	Date	Code

Get Listed

Submit Your Results

Run the benchmark CLI

Execute the standardized benchmark suite with your method against the chosen task and backend.

Results JSON is generated

The run writes its results to a JSON file. For a PointNav run, output is saved to ./results/benchmark_pointnav.json

Submit via GitHub PR

Open a pull request to the leaderboard repository with your results JSON and method description.

Terminal

$ threewe benchmark run \
    --task pointnav \
    --episodes 100 \
    --backend gazebo

[INFO] Loading benchmark suite...
[INFO] Task: PointNav | Episodes: 100
[INFO] Backend: Gazebo Harmonic
[INFO] Running episode 1/100...
...
[OK] Benchmark complete.
[OK] Results saved to ./results/benchmark_pointnav.json

Methodology

Evaluation Protocol

Our evaluation protocol ensures fair, reproducible comparisons across all submissions.

Fixed Seeds

All evaluations use fixed seeds for reproducibility across runs and hardware.
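
A minimal sketch of how fixed seeds can make episode setup repeatable: a single base seed is expanded into one seed per episode, and each episode draws its start pose from its own generator. The function names, the base seed value, and the sampling range are assumptions for illustration; the benchmark's real seeding scheme is not described on this page.

```python
import random


def episode_seeds(base_seed, num_episodes):
    """Derive one seed per episode from a single base seed."""
    rng = random.Random(base_seed)
    return [rng.randrange(2**32) for _ in range(num_episodes)]


def sample_start(seed, half_extent=5.0):
    """Sample a hypothetical (x, y) start position for one episode."""
    rng = random.Random(seed)
    return (rng.uniform(-half_extent, half_extent),
            rng.uniform(-half_extent, half_extent))


# The same base seed yields the same per-episode seeds and start poses.
seeds_a = episode_seeds(1234, 100)
seeds_b = episode_seeds(1234, 100)
print(seeds_a == seeds_b)
print(sample_start(seeds_a[0]) == sample_start(seeds_b[0]))
```

Giving each episode its own seed means a single episode can be re-run in isolation without replaying the ones before it.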

Minimum 100 Episodes

Each task requires a minimum of 100 episodes so that reported metrics are statistically meaningful.
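
To give a sense of what 100 episodes buys, the sketch below computes a Wilson score interval for a success rate. This is a standard statistical tool used here for illustration, not part of the stated protocol. With 80 successes out of 100 episodes, the 95% interval spans roughly 71% to 87%, so differences of a few points between methods at this sample size should be read with care.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(80, 100)
print(f"SR = 80.0%, 95% interval = [{100 * low:.1f}%, {100 * high:.1f}%]")

# More episodes narrow the interval for the same observed success rate.
low_1k, high_1k = wilson_interval(800, 1000)
print(f"width at n=100: {high - low:.3f}, at n=1000: {high_1k - low_1k:.3f}")
```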

Full Trajectory Logs

Results must include full trajectory logs for verification and analysis.
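
One reason full trajectories are useful for verification is that path-based metrics can be recomputed from them. The sketch below sums the distances between consecutive logged positions to recover the agent's path length, the quantity SPL depends on. It assumes a trajectory is a list of (x, y) positions; the actual log format is not specified on this page.

```python
import math


def path_length(positions):
    """Total distance travelled along a sequence of (x, y) positions."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))


# Hypothetical logged trajectory for one episode.
trajectory = [(0.0, 0.0), (3.0, 4.0), (3.0, 6.0)]
print(f"Path length: {path_length(trajectory):.2f}")
```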

Sim + Real Accepted

Both simulation and real-hardware submissions are accepted and tracked separately.

Video Evidence

Real-hardware submissions require video evidence of at least 10 representative episodes.

Open Source Code

All submissions must link to publicly available source code for reproducibility.

Get Started

Ready to Benchmark?

Run the standardized evaluation suite on your method and join the public leaderboard.
