# VLM Navigation
The 3we platform supports Vision-Language Model (VLM) navigation, where natural language instructions are grounded in camera observations to produce robot actions. This enables instruction-following behaviors like “go to the red chair” or “navigate to the kitchen” without pre-built maps or waypoints.
## Pipeline Overview

When you call `robot.execute_instruction(...)`, the SDK launches a perception-action loop:
- **Capture**: the robot takes an image from its camera via `robot.get_image()`
- **Reason**: the image and instruction are sent to GPT-4o (or any OpenAI-compatible VLM)
- **Parse**: the model returns a structured JSON action
- **Execute**: the robot executes the action (move forward, rotate, stop)
- **Repeat**: the loop continues until the model outputs `"done"` or the step limit is reached
```text
┌──────────────────────────────────────────────────────────────┐
│                   Perception-Action Loop                      │
│                                                               │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────┐    │
│  │  Camera  │───▶│ VLM API  │───▶│  Parser  │───▶│ Motor │    │
│  │  Image   │    │  GPT-4o  │    │   JSON   │    │  Cmd  │    │
│  └──────────┘    └──────────┘    └──────────┘    └───────┘    │
│       ▲                                              │        │
│       └──────────────────────────────────────────────┘        │
│                      Loop until "done"                        │
└──────────────────────────────────────────────────────────────┘
```
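In code, the loop reduces to a few lines. The sketch below is illustrative only, not the SDK's implementation: it assumes `runner.max_steps` is readable, that `plan()` returns the raw JSON string, and that `robot.execute_action()` accepts the parsed dict; none of these details is guaranteed by the API shown on this page.

```python
import json

from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner


async def perception_action_loop(robot: Robot, runner: VLMRunner, instruction: str) -> bool:
    """Illustrative loop body; robot.execute_instruction() does all of this for you."""
    for _ in range(runner.max_steps):          # bounded: never loop forever (assumed attribute)
        image = robot.get_image()              # 1. Capture
        raw = runner.plan(image, instruction)  # 2. Reason (one VLM call; assumed to return JSON text)
        action = json.loads(raw)               # 3. Parse
        if action["action"] == "done":         # 5. Terminate when the model says so
            return True
        robot.execute_action(action)           # 4. Execute (assumed to accept an action dict)
    return False                               # step limit hit without "done"
```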
## Basic Example

```python
import asyncio

from threewe import Robot


async def main():
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        result = await robot.execute_instruction(
            "Navigate to the blue door on the left"
        )
        print(f"Success: {result.success}")
        print(f"Description: {result.description}")
        print(f"Images collected: {len(result.images)}")


asyncio.run(main())
```

## Using VLMRunner Directly
For more control, use the `VLMRunner` class directly:
```python
import asyncio

from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner


async def main():
    runner = VLMRunner(
        model="gpt-4o",
        max_steps=30,
        temperature=0.0,
    )

    async with Robot(backend="gazebo", scene="office_v2") as robot:
        image = robot.get_image()
        action_json = runner.plan(image, "go to the door on the left")
        print(f"VLM decided: {action_json}")


asyncio.run(main())
```

## Action Schema
The VLM outputs exactly one JSON object per step:
```json
{
  "action": "move_forward",
  "distance": 0.5,
  "reason": "I can see a red bottle ahead on the right side"
}
```

Supported actions:
| Action | Parameters | Description |
|---|---|---|
| `move_forward` | `distance` (meters) | Drive straight ahead |
| `rotate_left` | `angle` (radians) | Turn counter-clockwise |
| `rotate_right` | `angle` (radians) | Turn clockwise |
| `stop` | — | Halt motion |
| `done` | — | Task is complete |
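Because model output is not guaranteed to be well-formed, it is worth validating an action before executing it. Here is a minimal sketch that checks a response against the schema above; `parse_action` and `ALLOWED` are illustrative names, not part of the SDK:

```python
import json

# Maps each action to its required numeric parameter (None = no parameter)
ALLOWED = {
    "move_forward": "distance",  # meters
    "rotate_left": "angle",      # radians
    "rotate_right": "angle",     # radians
    "stop": None,
    "done": None,
}


def parse_action(raw: str) -> dict:
    """Parse one VLM response and check it against the action schema."""
    action = json.loads(raw)  # raises ValueError on malformed JSON
    name = action.get("action")
    if name not in ALLOWED:
        raise ValueError(f"unknown action: {name!r}")
    param = ALLOWED[name]
    if param is not None and not isinstance(action.get(param), (int, float)):
        raise ValueError(f"{name} requires a numeric {param!r}")
    return action


print(parse_action('{"action": "move_forward", "distance": 0.5, "reason": "clear path"}'))
```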
## Configuration

The VLM runner can be configured through environment variables:
```bash
# Use GPT-4o (default)
export OPENAI_API_KEY="sk-..."

# Or use Qwen-VL via a compatible endpoint
export THREEWE_VLM_MODEL="qwen-vl-max"
export THREEWE_VLM_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"

# Tune behavior
export THREEWE_VLM_MAX_STEPS=30
export THREEWE_VLM_TEMPERATURE=0.0
```

Or create a runner from environment variables:
```python
from threewe.ai.vlm_runner import VLMRunner

runner = VLMRunner.from_env()
```

## Model Options
| Backend | Model | Config |
|---|---|---|
| OpenAI | GPT-4o, GPT-4-turbo | `OPENAI_API_KEY` |
| Qwen | Qwen-VL-Max, Qwen-VL-Plus | `THREEWE_VLM_BASE_URL` + compatible key |
| Local | LLaVA, CogVLM | Custom `base_url` pointing to a local server |
| Azure OpenAI | GPT-4o | Azure endpoint via `base_url` |
Any model that supports the OpenAI chat completions API with image inputs works out of the box.
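For example, you could point the runner at a locally served LLaVA model. The model name and URL below are placeholders for your own deployment, and passing `base_url` to the constructor is an assumption based on the table above:

```python
from threewe.ai.vlm_runner import VLMRunner

# Any OpenAI-compatible server works here, e.g. vLLM serving LLaVA locally.
# Model name and base URL are examples; substitute your deployment's values.
runner = VLMRunner(
    model="llava-v1.6-mistral-7b",
    base_url="http://localhost:8000/v1",
    max_steps=30,
)
```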
## Step-by-Step Debugging

The `on_step` callback lets you see what the model thinks at each step:
```python
import asyncio

from threewe import Robot
from threewe.ai.vlm_runner import execute_vlm_instruction


def log_step(step_num: int, raw_response: str, parsed_action: dict) -> None:
    print(f"\n--- Step {step_num} ---")
    print(f"  Raw: {raw_response}")
    print(f"  Action: {parsed_action.get('action', 'N/A')}")
    print(f"  Reason: {parsed_action.get('reason', 'N/A')}")


async def main():
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        result = await execute_vlm_instruction(
            robot,
            instruction="navigate to the kitchen area",
            model="gpt-4o",
            max_steps=25,
            on_step=log_step,
        )
        print(f"\nResult: success={result.success}")


asyncio.run(main())
```

## Chinese Language Support
The VLM runner detects CJK characters in instructions and adjusts its system prompt accordingly:
```python
async with Robot(backend="gazebo") as robot:
    # Instruction: "find the red bottle and stop next to it"
    result = await robot.execute_instruction("找到红色的瓶子并停在它旁边")
    print(f"结果: {result.description}")  # "结果" means "result"
```

The model will reason in Chinese in the `reason` field while keeping action keys in English for reliable parsing.
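A typical step from such a run might look like the following (illustrative output, not captured from a real session):

```json
{
  "action": "move_forward",
  "distance": 0.4,
  "reason": "前方右侧可以看到一个红色的瓶子，继续向前靠近它"
}
```

The `reason` translates to "a red bottle is visible ahead on the right; keep moving toward it", while `action` and `distance` stay machine-parseable.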
## Safety

The VLM navigator includes built-in safety layers:
- **LiDAR override**: Regardless of VLM output, the robot stops if an obstacle is within the safety distance (15 cm by default); see the sketch after this list.
- **Velocity limits**: Hard-coded maximums of 0.5 m/s linear and 1.0 rad/s angular.
- **Communication watchdog**: If no valid action is received within 200 ms, the robot halts.
- **Step limit**: The loop terminates after `max_steps` iterations to prevent infinite loops.
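The first two layers act as a filter between the VLM's requested motion and the motor commands. A simplified sketch of that idea, not the SDK's actual implementation; the thresholds mirror the defaults listed above:

```python
MAX_LINEAR = 0.5        # m/s, linear velocity hard limit
MAX_ANGULAR = 1.0       # rad/s, angular velocity hard limit
SAFETY_DISTANCE = 0.15  # m, default LiDAR override threshold


def safe_velocity(linear: float, angular: float, min_lidar_range: float) -> tuple[float, float]:
    """Clamp a requested velocity, stopping outright if an obstacle is too close."""
    if min_lidar_range < SAFETY_DISTANCE:
        # LiDAR override: ignore whatever the VLM asked for and halt
        return 0.0, 0.0
    # Velocity limits: clamp to the hard-coded maxima
    linear = max(-MAX_LINEAR, min(MAX_LINEAR, linear))
    angular = max(-MAX_ANGULAR, min(MAX_ANGULAR, angular))
    return linear, angular
```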
## Combining VLM with VLA Models

Use the VLM for high-level reasoning and a VLA for fine-grained motor control:
```python
import asyncio

from threewe import Robot
from threewe.ai.vla_runner import VLARunner
from threewe.ai.vlm_runner import VLMRunner


async def hybrid_control():
    vlm = VLMRunner(model="gpt-4o", max_steps=10)
    vla = VLARunner.from_pretrained("lerobot/act_3we_nav")

    async with Robot(backend="gazebo") as robot:
        for step in range(50):
            image = robot.get_image()

            # Re-plan with the VLM every 10 steps
            if step % 10 == 0:
                plan = vlm.plan(image, "navigate to the charging station")
                print(f"VLM plan: {plan}")

            # The VLA produces a low-level action on every step
            obs = robot.get_observation(modalities=["image", "lidar", "velocity"])
            action = vla.predict(obs, instruction="go to charging station")
            robot.execute_action(action)


asyncio.run(hybrid_control())
```

## Performance
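The hybrid pattern above re-plans with the VLM only every tenth step for a reason: a single GPT-4o call costs roughly 800 ms (see the table below), which would dominate the control loop if paid on every step, while the VLA handles the intermediate steps at much lower latency.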
Section titled “Performance”| Metric | Value |
|---|---|
| VLM API latency (GPT-4o) | ~800 ms per step |
| Image encoding time | ~5 ms |
| Total loop time per step | ~1.2 s (including motion) |
| Typical task completion | 3–8 steps |
| Token usage per step | ~800 input, ~50 output |
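At ~1.2 s per step and 3–8 steps per task, a typical instruction completes in roughly 4–10 seconds of wall-clock time, with VLM API latency accounting for the bulk of each step.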