# VLM Navigation
The 3we platform supports Vision-Language Model (VLM) navigation, where natural language instructions are grounded in camera observations to produce robot actions. This enables instruction-following behaviors like “go to the red chair” or “navigate to the kitchen” without pre-built maps or waypoints.
## Pipeline Overview

When you call `robot.execute_instruction(...)`, the SDK launches a perception-action loop:
- **Capture**: the robot takes an image from its camera via `robot.get_image()`
- **Reason**: the image and instruction are sent to GPT-4o (or any OpenAI-compatible VLM)
- **Parse**: the model returns a structured JSON action
- **Execute**: the robot executes the action (move forward, rotate, stop)
- **Repeat**: the loop continues until the model outputs `"done"` or the step limit is reached
```text
┌──────────────────────────────────────────────────────────────┐
│                   Perception-Action Loop                      │
│                                                               │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────┐    │
│  │  Camera  │───▶│ VLM API  │───▶│  Parser  │───▶│ Motor │    │
│  │  Image   │    │  GPT-4o  │    │   JSON   │    │  Cmd  │    │
│  └──────────┘    └──────────┘    └──────────┘    └───────┘    │
│       ▲                                              │        │
│       └──────────────────────────────────────────────┘        │
│                      Loop until "done"                        │
└──────────────────────────────────────────────────────────────┘
```
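In code, the loop reduces to a few lines. The sketch below is illustrative only, not the SDK's implementation: it assumes `runner.max_steps` is readable, that `plan()` returns the raw JSON string, and that `robot.execute_action()` accepts the parsed dict; none of these details is guaranteed by the API shown on this page.

```python
import json

from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner


async def perception_action_loop(robot: Robot, runner: VLMRunner, instruction: str) -> bool:
    """Illustrative loop body; robot.execute_instruction() does all of this for you."""
    for _ in range(runner.max_steps):          # bounded: never loop forever (assumed attribute)
        image = robot.get_image()              # 1. Capture
        raw = runner.plan(image, instruction)  # 2. Reason (one VLM call; assumed to return JSON text)
        action = json.loads(raw)               # 3. Parse
        if action["action"] == "done":         # 5. Terminate when the model says so
            return True
        robot.execute_action(action)           # 4. Execute (assumed to accept an action dict)
    return False                               # step limit hit without "done"
```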
## Basic Example

```python
import asyncio

from threewe import Robot


async def main():
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        result = await robot.execute_instruction(
            "Navigate to the blue door on the left"
        )
        print(f"Success: {result.success}")
        print(f"Description: {result.description}")
        print(f"Images collected: {len(result.images)}")


asyncio.run(main())
```

## Using VLMRunner Directly
For more control, use the `VLMRunner` class directly:
```python
import asyncio

from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner


async def main():
    runner = VLMRunner(
        model="gpt-4o",
        max_steps=30,
        temperature=0.0,
    )

    async with Robot(backend="gazebo", scene="office_v2") as robot:
        image = robot.get_image()
        action_json = runner.plan(image, "go to the door on the left")
        print(f"VLM decided: {action_json}")


asyncio.run(main())
```

## Action Schema
The VLM outputs exactly one JSON object per step:
```json
{
  "action": "move_forward",
  "distance": 0.5,
  "reason": "I can see a red bottle ahead on the right side"
}
```

Supported actions:
| Action | Parameters | Description |
|---|---|---|
| `move_forward` | `distance` (meters) | Drive straight ahead |
| `rotate_left` | `angle` (radians) | Turn counter-clockwise |
| `rotate_right` | `angle` (radians) | Turn clockwise |
| `stop` | — | Halt motion |
| `done` | — | Task is complete |
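Because model output is not guaranteed to be well-formed, it is worth validating an action before executing it. Here is a minimal sketch that checks a response against the schema above; `parse_action` and `ALLOWED` are illustrative names, not part of the SDK:

```python
import json

# Maps each action to its required numeric parameter (None = no parameter)
ALLOWED = {
    "move_forward": "distance",  # meters
    "rotate_left": "angle",      # radians
    "rotate_right": "angle",     # radians
    "stop": None,
    "done": None,
}


def parse_action(raw: str) -> dict:
    """Parse one VLM response and check it against the action schema."""
    action = json.loads(raw)  # raises ValueError on malformed JSON
    name = action.get("action")
    if name not in ALLOWED:
        raise ValueError(f"unknown action: {name!r}")
    param = ALLOWED[name]
    if param is not None and not isinstance(action.get(param), (int, float)):
        raise ValueError(f"{name} requires a numeric {param!r}")
    return action


print(parse_action('{"action": "move_forward", "distance": 0.5, "reason": "clear path"}'))
```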
## Configuration

The VLM runner can be configured through environment variables:
```bash
# Use GPT-4o (default)
export OPENAI_API_KEY="sk-..."

# Or use Qwen-VL via a compatible endpoint
export THREEWE_VLM_MODEL="qwen-vl-max"
export THREEWE_VLM_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"

# Tune behavior
export THREEWE_VLM_MAX_STEPS=30
export THREEWE_VLM_TEMPERATURE=0.0
```

Or create a runner from environment variables:
```python
from threewe.ai.vlm_runner import VLMRunner

runner = VLMRunner.from_env()
```

## Model Options
| Backend | Model | Config |
|---|---|---|
| OpenAI | GPT-4o, GPT-4-turbo | `OPENAI_API_KEY` |
| Qwen | Qwen-VL-Max, Qwen-VL-Plus | `THREEWE_VLM_BASE_URL` + compatible key |
| Local | LLaVA, CogVLM | Custom `base_url` pointing to a local server |
| Azure OpenAI | GPT-4o | Azure endpoint via `base_url` |
Any model that supports the OpenAI chat completions API with image inputs works out of the box.
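For example, you could point the runner at a locally served LLaVA model. The model name and URL below are placeholders for your own deployment, and passing `base_url` to the constructor is an assumption based on the table above:

```python
from threewe.ai.vlm_runner import VLMRunner

# Any OpenAI-compatible server works here, e.g. vLLM serving LLaVA locally.
# Model name and base URL are examples; substitute your deployment's values.
runner = VLMRunner(
    model="llava-v1.6-mistral-7b",
    base_url="http://localhost:8000/v1",
    max_steps=30,
)
```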
## Step-by-Step Debugging

The `on_step` callback lets you see what the model thinks at each step:
```python
import asyncio

from threewe import Robot
from threewe.ai.vlm_runner import execute_vlm_instruction


def log_step(step_num: int, raw_response: str, parsed_action: dict) -> None:
    print(f"\n--- Step {step_num} ---")
    print(f"  Raw: {raw_response}")
    print(f"  Action: {parsed_action.get('action', 'N/A')}")
    print(f"  Reason: {parsed_action.get('reason', 'N/A')}")


async def main():
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        result = await execute_vlm_instruction(
            robot,
            instruction="navigate to the kitchen area",
            model="gpt-4o",
            max_steps=25,
            on_step=log_step,
        )
        print(f"\nResult: success={result.success}")


asyncio.run(main())
```

## Chinese Language Support
The VLM runner detects CJK characters in instructions and adjusts its system prompt accordingly:
```python
async with Robot(backend="gazebo") as robot:
    # Instruction: "find the red bottle and stop next to it"
    result = await robot.execute_instruction("找到红色的瓶子并停在它旁边")
    print(f"结果: {result.description}")  # "结果" means "result"
```

The model will reason in Chinese in the `reason` field while keeping action keys in English for reliable parsing.
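A typical step from such a run might look like the following (illustrative output, not captured from a real session):

```json
{
  "action": "move_forward",
  "distance": 0.4,
  "reason": "前方右侧可以看到一个红色的瓶子，继续向前靠近它"
}
```

The `reason` translates to "a red bottle is visible ahead on the right; keep moving toward it", while `action` and `distance` stay machine-parseable.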
## Safety

The VLM navigator includes built-in safety layers:
- **LiDAR override**: Regardless of VLM output, the robot stops if an obstacle is within the safety distance (15 cm by default); see the sketch after this list.
- **Velocity limits**: Hard-coded maximums of 0.5 m/s linear and 1.0 rad/s angular.
- **Communication watchdog**: If no valid action is received within 200 ms, the robot halts.
- **Step limit**: The loop terminates after `max_steps` iterations to prevent infinite loops.
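The first two layers act as a filter between the VLM's requested motion and the motor commands. A simplified sketch of that idea, not the SDK's actual implementation; the thresholds mirror the defaults listed above:

```python
MAX_LINEAR = 0.5        # m/s, linear velocity hard limit
MAX_ANGULAR = 1.0       # rad/s, angular velocity hard limit
SAFETY_DISTANCE = 0.15  # m, default LiDAR override threshold


def safe_velocity(linear: float, angular: float, min_lidar_range: float) -> tuple[float, float]:
    """Clamp a requested velocity, stopping outright if an obstacle is too close."""
    if min_lidar_range < SAFETY_DISTANCE:
        # LiDAR override: ignore whatever the VLM asked for and halt
        return 0.0, 0.0
    # Velocity limits: clamp to the hard-coded maxima
    linear = max(-MAX_LINEAR, min(MAX_LINEAR, linear))
    angular = max(-MAX_ANGULAR, min(MAX_ANGULAR, angular))
    return linear, angular
```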
## Combining VLM with VLA Models

Use the VLM for high-level reasoning and a VLA for fine-grained motor control:
```python
import asyncio

from threewe import Robot
from threewe.ai.vla_runner import VLARunner
from threewe.ai.vlm_runner import VLMRunner


async def hybrid_control():
    vlm = VLMRunner(model="gpt-4o", max_steps=10)
    vla = VLARunner.from_pretrained("lerobot/act_3we_nav")

    async with Robot(backend="gazebo") as robot:
        for step in range(50):
            image = robot.get_image()

            # Re-plan with the VLM every 10 steps
            if step % 10 == 0:
                plan = vlm.plan(image, "navigate to the charging station")
                print(f"VLM plan: {plan}")

            # The VLA produces a low-level action on every step
            obs = robot.get_observation(modalities=["image", "lidar", "velocity"])
            action = vla.predict(obs, instruction="go to charging station")
            robot.execute_action(action)


asyncio.run(hybrid_control())
```

## Performance
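The hybrid pattern above re-plans with the VLM only every tenth step for a reason: a single GPT-4o call costs roughly 800 ms (see the table below), which would dominate the control loop if paid on every step, while the VLA handles the intermediate steps at much lower latency.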
Section titled “Performance”| Metric | Value |
|---|---|
| VLM API latency (GPT-4o) | ~800 ms per step |
| Image encoding time | ~5 ms |
| Total loop time per step | ~1.2 s (including motion) |
| Typical task completion | 3–8 steps |
| Token usage per step | ~800 input, ~50 output |
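At ~1.2 s per step and 3–8 steps per task, a typical instruction completes in roughly 4–10 seconds of wall-clock time, with VLM API latency accounting for the bulk of each step.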