Standardized Benchmarks for Embodied AI
3 tasks × 7 scenes × public leaderboard. Reproducible evaluation across simulation and real hardware.
Three Core Evaluation Tasks
PointNav
Navigate to absolute coordinates. Metrics: Success Rate (SR) + Success weighted by Path Length (SPL).
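As a rough illustration of how the two PointNav metrics relate, the sketch below computes SR and SPL from per-episode records. It assumes the commonly used definition of SPL, the mean over episodes of S_i · l_i / max(p_i, l_i), where l_i is the shortest-path length and p_i is the length of the path the agent took. The `Episode` class and its field names are illustrative and not part of the benchmark's output format.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Illustrative per-episode record (field names are assumptions)."""
    success: bool          # did the agent reach the goal?
    shortest_path: float   # shortest start-to-goal distance, assumed > 0
    agent_path: float      # length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Mean over all episodes of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for e in episodes:
        if e.success:  # failed episodes contribute 0
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=5.0),   # optimal path
    Episode(success=True, shortest_path=5.0, agent_path=10.0),  # twice as long
    Episode(success=False, shortest_path=5.0, agent_path=3.0),  # failure
]
print(f"SR  = {success_rate(episodes):.3f}")
print(f"SPL = {spl(episodes):.3f}")
```

Because failed episodes score zero and detours are penalized, SPL can never exceed SR for the same set of episodes.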
100 episodes per evaluation
ObjectNav
Navigate to a semantic object category (e.g., ‘chair’, ‘table’). Metrics: SR + SPL + Discovery Distance.
100 episodes per evaluation
Exploration
Maximize coverage of unknown environment within time budget. Metrics: Coverage % + Efficiency ratio.
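The page does not spell out how Coverage % is measured, so the following is only one plausible reading: the share of free cells in a ground-truth occupancy grid that the agent has observed by the end of the time budget. The grid encoding, the `coverage_percent` function, and the sample data are all assumptions made for illustration. The Efficiency ratio is left out because its definition is not given here.

```python
from typing import List

FREE, OCCUPIED = 0, 1  # assumed cell encoding


def coverage_percent(ground_truth: List[List[int]],
                     observed: List[List[bool]]) -> float:
    """Percentage of free ground-truth cells that were observed."""
    free_total = 0
    free_seen = 0
    for gt_row, seen_row in zip(ground_truth, observed):
        for cell, seen in zip(gt_row, seen_row):
            if cell == FREE:
                free_total += 1
                free_seen += int(seen)
    return 100.0 * free_seen / free_total if free_total else 0.0


# 3x3 toy map: 7 free cells, 2 occupied.
ground_truth = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
]
# Cells the agent has seen; observed occupied cells do not count.
observed = [
    [True, True, True],
    [True, False, False],
    [False, False, False],
]
print(f"Coverage: {coverage_percent(ground_truth, observed):.1f}%")
```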
100 episodes per evaluation
Public Leaderboard
| Rank | Method | Task | Backend | SR (%) | SPL | Date | Code |
|---|---|---|---|---|---|---|---|
Submit Your Results
Run the benchmark CLI
Execute the standardized benchmark suite with your method against the chosen task and backend.
Results JSON is generated
Output is saved at ./results/benchmark_pointnav.json
Submit via GitHub PR
Open a pull request to the leaderboard repository with your results JSON and method description.
$ threewe benchmark run \
    --task pointnav \
    --episodes 100 \
    --backend gazebo
[INFO] Loading benchmark suite...
[INFO] Task: PointNav | Episodes: 100
[INFO] Backend: Gazebo Harmonic
[INFO] Running episode 1/100...
...
[OK] Benchmark complete.
[OK] Results saved to ./results/benchmark_pointnav.json
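Before opening the pull request, it may help to sanity-check the results file locally. The sketch below is a hypothetical pre-submission check, not part of the `threewe` CLI. The results schema is not documented on this page, so the field names `episodes` and `trajectory` are assumptions, and the demo runs on a synthetic file rather than real benchmark output.

```python
import json
import tempfile
from pathlib import Path
from typing import List

MIN_EPISODES = 100  # minimum episode count required per task


def check_results(path: Path) -> List[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    data = json.loads(path.read_text())
    episodes = data.get("episodes", [])  # hypothetical field name
    if len(episodes) < MIN_EPISODES:
        problems.append(
            f"only {len(episodes)} episodes, need at least {MIN_EPISODES}"
        )
    # "trajectory" is also a hypothetical field name.
    missing = [i for i, ep in enumerate(episodes) if not ep.get("trajectory")]
    if missing:
        problems.append(f"{len(missing)} episodes lack a trajectory log")
    return problems


# Synthetic example: 99 episodes, one short of the minimum.
sample = {
    "task": "pointnav",
    "episodes": [{"success": True, "trajectory": [[0, 0], [1, 0]]}] * 99,
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "benchmark_pointnav.json"
    path.write_text(json.dumps(sample))
    issues = check_results(path)
print(issues)
```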
Evaluation Protocol
Our evaluation protocol ensures fair, reproducible comparisons across all submissions.
Fixed Seeds
All evaluations use fixed seeds for reproducibility across runs and hardware.
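One way to make episodes repeatable regardless of execution order or machine is to derive each episode's seed from a base seed plus the task name and episode index, then use a private random generator per episode. This is a minimal sketch of that idea, not the benchmark's actual seeding scheme. `BASE_SEED`, `episode_seed`, and `sample_goal` are hypothetical names.

```python
import hashlib
import random
from typing import Tuple

BASE_SEED = 1234  # hypothetical value; the benchmark's seed is not stated here


def episode_seed(base_seed: int, task: str, episode_index: int) -> int:
    """Derive a stable 32-bit seed from (base seed, task, episode index)."""
    key = f"{base_seed}:{task}:{episode_index}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def sample_goal(seed: int) -> Tuple[float, float]:
    """Draw an illustrative goal position from a per-episode generator."""
    rng = random.Random(seed)  # private generator, no shared global state
    return (round(rng.uniform(-10, 10), 3), round(rng.uniform(-10, 10), 3))


# The same episodes produce the same goals even when run in reverse order.
run_a = [sample_goal(episode_seed(BASE_SEED, "pointnav", i)) for i in range(100)]
run_b = [sample_goal(episode_seed(BASE_SEED, "pointnav", i))
         for i in reversed(range(100))]
print(run_a == run_b[::-1])
```

Hashing the key instead of adding the index to the base seed keeps the seeds for different tasks from overlapping in an obvious way.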
Minimum 100 Episodes
Each task requires at least 100 episodes so that reported metrics are statistically meaningful.
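To get a feel for what the episode count buys, the sketch below computes a Wilson score interval for a measured success rate. It is a generic statistical illustration and not a procedure the leaderboard is stated to use. With 100 episodes and a 50% success rate, the 95% interval is roughly ±10 percentage points wide on each side.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Compare interval widths for a 50% success rate at different episode counts.
for n in (20, 100, 500):
    lo, hi = wilson_interval(successes=n // 2, n=n)
    print(f"n={n:4d}  SR=50%  95% CI = [{lo:.3f}, {hi:.3f}]")
```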
Full Trajectory Logs
Results must include full trajectory logs for verification and analysis.
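Full logs let a reviewer recompute path-dependent metrics instead of trusting reported numbers. As a small sketch, assuming a trajectory is logged as an ordered list of (x, y) positions (the real log format is not specified on this page), the travelled path length used in SPL can be recovered like this:

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def path_length(trajectory: Sequence[Point]) -> float:
    """Sum of straight-line distances between consecutive logged positions."""
    return sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))


# Illustrative log: 3 m along x, then 4 m along y.
trajectory = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
print(path_length(trajectory))  # 7.0
```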
Sim + Real Accepted
Both simulation and real-hardware submissions are accepted and tracked separately.
Video Evidence
Real-hardware submissions require video evidence of at least 10 representative episodes.
Open Source Code
All submissions must link to publicly available source code for reproducibility.
Ready to Benchmark?
Run the standardized evaluation suite on your method and join the public leaderboard.