
VLM Navigation

The 3we platform supports Vision-Language Model (VLM) navigation, where natural language instructions are grounded in camera observations to produce robot actions. This enables instruction-following behaviors like “go to the red chair” or “navigate to the kitchen” without pre-built maps or waypoints.

When you call robot.execute_instruction(...), the SDK launches a perception-action loop:

  1. Capture — The robot takes an image from its camera (robot.get_image())
  2. Reason — The image + instruction are sent to GPT-4o (or any OpenAI-compatible VLM)
  3. Parse — The model returns a structured JSON action
  4. Execute — The robot executes the action (move forward, rotate, stop)
  5. Repeat — Until the model outputs "done" or the step limit is reached
┌──────────────────────────────────────────────────────────────┐
│                    Perception-Action Loop                     │
│                                                                │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌────────┐  │
│  │  Camera  │───▶│ VLM API  │───▶│  Parser  │───▶│ Motor  │  │
│  │  Image   │    │  GPT-4o  │    │   JSON   │    │  Cmd   │  │
│  └──────────┘    └──────────┘    └──────────┘    └────────┘  │
│       ▲                                               │      │
│       └───────────────────────────────────────────────┘      │
│                      Loop until "done"                        │
└──────────────────────────────────────────────────────────────┘
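
The simplest way to run this loop is to call robot.execute_instruction(), which drives it end to end:
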
import asyncio
from threewe import Robot

async def main():
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        result = await robot.execute_instruction(
            "Navigate to the blue door on the left"
        )
        print(f"Success: {result.success}")
        print(f"Description: {result.description}")
        print(f"Images collected: {len(result.images)}")

asyncio.run(main())

For more control, use the VLMRunner class directly:

import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner

async def main():
    runner = VLMRunner(
        model="gpt-4o",
        max_steps=30,
        temperature=0.0,
    )
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        image = robot.get_image()
        action_json = runner.plan(image, "go to the door on the left")
        print(f"VLM decided: {action_json}")

asyncio.run(main())

The VLM outputs exactly one JSON object per step:

{
  "action": "move_forward",
  "distance": 0.5,
  "reason": "I can see a red bottle ahead on the right side"
}
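
Responses are parsed strictly before execution. A minimal sketch of the kind of validation involved (parse_action and VALID_ACTIONS here are illustrative names, not part of the SDK):

import json

# Illustrative helper: extract and validate the single JSON action object.
VALID_ACTIONS = {"move_forward", "rotate_left", "rotate_right", "stop", "done"}

def parse_action(raw_response: str) -> dict:
    # Tolerate models that wrap the JSON in extra prose or code fences.
    start, end = raw_response.find("{"), raw_response.rfind("}")
    if start == -1 or end == -1:
        return {"action": "stop", "reason": "no JSON object found"}
    action = json.loads(raw_response[start:end + 1])
    if action.get("action") not in VALID_ACTIONS:
        return {"action": "stop", "reason": f"unknown action: {action.get('action')}"}
    return action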

Supported actions:

Action          Parameters           Description
move_forward    distance (meters)    Drive straight ahead
rotate_left     angle (radians)      Turn counter-clockwise
rotate_right    angle (radians)      Turn clockwise
stop            (none)               Halt motion
done            (none)               Task is complete
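
One way to picture how a parsed action becomes motion is to map it onto a velocity command. The helper below is an illustrative sketch, not the SDK's executor; the (linear, angular, duration) tuple format and the speed constants are assumptions for this example:

# Illustrative only: turn a parsed action dict into (linear m/s, angular rad/s, duration s).
LINEAR_SPEED = 0.5   # m/s
ANGULAR_SPEED = 1.0  # rad/s

def action_to_command(action: dict) -> tuple[float, float, float] | None:
    name = action.get("action")
    if name == "move_forward":
        distance = float(action.get("distance", 0.0))
        return (LINEAR_SPEED, 0.0, distance / LINEAR_SPEED)
    if name == "rotate_left":
        angle = float(action.get("angle", 0.0))
        return (0.0, ANGULAR_SPEED, angle / ANGULAR_SPEED)
    if name == "rotate_right":
        angle = float(action.get("angle", 0.0))
        return (0.0, -ANGULAR_SPEED, angle / ANGULAR_SPEED)
    # "stop" and "done" produce no motion command.
    return None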

The VLM runner can be configured through environment variables:

# Use GPT-4o (default)
export OPENAI_API_KEY="sk-..."
# Or use Qwen-VL via a compatible endpoint
export THREEWE_VLM_MODEL="qwen-vl-max"
export THREEWE_VLM_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
# Tune behavior
export THREEWE_VLM_MAX_STEPS=30
export THREEWE_VLM_TEMPERATURE=0.0

Or create a runner from environment variables:

from threewe.ai.vlm_runner import VLMRunner
runner = VLMRunner.from_env()
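
You can also point the runner at a non-OpenAI endpoint explicitly. A minimal sketch, assuming VLMRunner accepts a base_url keyword that mirrors THREEWE_VLM_BASE_URL (verify against your SDK version):

from threewe.ai.vlm_runner import VLMRunner

# Assumption: a base_url keyword mirroring THREEWE_VLM_BASE_URL.
runner = VLMRunner(
    model="qwen-vl-max",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    max_steps=30,
    temperature=0.0,
)

Supported backends: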

Backend         Model                       Config
OpenAI          GPT-4o, GPT-4-turbo         OPENAI_API_KEY
Qwen            Qwen-VL-Max, Qwen-VL-Plus   THREEWE_VLM_BASE_URL + compatible key
Local           LLaVA, CogVLM               Custom base_url pointing to local server
Azure OpenAI    GPT-4o                      Azure endpoint via base_url

Any model that supports the OpenAI chat completions API with image inputs works out of the box.
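
For reference, the request shape the runner relies on is the standard chat completions call with an image part. This standalone illustration uses the openai client directly and is not threewe code; the prompt text is only a placeholder:

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a camera frame as a base64 data URL.
with open("frame.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.0,
    messages=[
        {"role": "system", "content": "Reply with exactly one JSON action object."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "go to the door on the left"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)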

The on_step callback lets you see what the model thinks at each step:

import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import execute_vlm_instruction

def log_step(step_num: int, raw_response: str, parsed_action: dict) -> None:
    print(f"\n--- Step {step_num} ---")
    print(f"  Raw: {raw_response}")
    print(f"  Action: {parsed_action.get('action', 'N/A')}")
    print(f"  Reason: {parsed_action.get('reason', 'N/A')}")

async def main():
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        result = await execute_vlm_instruction(
            robot,
            instruction="navigate to the kitchen area",
            model="gpt-4o",
            max_steps=25,
            on_step=log_step,
        )
        print(f"\nResult: success={result.success}")

asyncio.run(main())

The VLM runner detects CJK characters in instructions and adjusts its system prompt accordingly:

# Instruction: "Find the red bottle and stop next to it"
async with Robot(backend="gazebo") as robot:
    result = await robot.execute_instruction("找到红色的瓶子并停在它旁边")
    print(f"结果: {result.description}")  # 结果 = "Result"

The model will reason in Chinese for the reason field while keeping action keys in English for reliable parsing.
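
For illustration only (not a captured response), a step under a Chinese instruction might return:

{
  "action": "move_forward",
  "distance": 0.4,
  "reason": "前方偏右可以看到红色的瓶子，继续前进"
}

The action key and value stay in English, while the reason notes that the red bottle is visible ahead and slightly to the right.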

The VLM navigator includes built-in safety layers:

  • LiDAR override: Regardless of VLM output, the robot stops if an obstacle is within the safety distance (15cm default).
  • Velocity limits: Hard-coded max 0.5 m/s linear, 1.0 rad/s angular.
  • Communication watchdog: If no valid action is received within 200ms, the robot halts.
  • Step limit: The loop terminates after max_steps iterations to prevent infinite loops.
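
A minimal sketch of how these checks compose around each commanded velocity, using the documented limits. This is standalone logic with illustrative names, not the SDK's internal implementation:

import time

SAFETY_DISTANCE_M = 0.15
MAX_LINEAR_MS = 0.5
MAX_ANGULAR_RADS = 1.0
WATCHDOG_TIMEOUT_S = 0.2

def safe_velocity(linear: float, angular: float,
                  min_lidar_range_m: float,
                  last_action_time_s: float) -> tuple[float, float]:
    # LiDAR override: stop regardless of what the VLM asked for.
    if min_lidar_range_m < SAFETY_DISTANCE_M:
        return 0.0, 0.0
    # Communication watchdog: halt if no valid action arrived within 200 ms.
    if time.monotonic() - last_action_time_s > WATCHDOG_TIMEOUT_S:
        return 0.0, 0.0
    # Velocity limits: clamp to the hard-coded maxima.
    linear = max(-MAX_LINEAR_MS, min(MAX_LINEAR_MS, linear))
    angular = max(-MAX_ANGULAR_RADS, min(MAX_ANGULAR_RADS, angular))
    return linear, angular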

Use the VLM for high-level reasoning and a VLA for fine-grained motor control:

import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner
from threewe.ai.vla_runner import VLARunner

async def hybrid_control():
    vlm = VLMRunner(model="gpt-4o", max_steps=10)
    vla = VLARunner.from_pretrained("lerobot/act_3we_nav")
    async with Robot(backend="gazebo") as robot:
        for step in range(50):
            image = robot.get_image()
            # Every 10 steps, ask the VLM for a high-level plan.
            if step % 10 == 0:
                plan = vlm.plan(image, "navigate to the charging station")
                print(f"VLM plan: {plan}")
            # Every step, let the VLA produce the low-level motor action.
            obs = robot.get_observation(modalities=["image", "lidar", "velocity"])
            action = vla.predict(obs, instruction="go to charging station")
            robot.execute_action(action)

asyncio.run(hybrid_control())
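
Typical performance figures: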

Metric                        Value
VLM API latency (GPT-4o)      ~800 ms per step
Image encoding time           ~5 ms
Total loop time per step      ~1.2 s (including motion)
Typical task completion       3-8 steps
Token usage per step          ~800 input tokens, ~50 output tokens