VLM 导航

3we 平台支持视觉语言模型 (VLM) 导航，其中自然语言指令通过相机观测被关联到机器人动作。这使得无需预构建地图或航点即可实现指令跟随行为，例如”去红色椅子那里”或”导航到厨房”。

流程概览

当您调用 robot.execute_instruction(...) 时，SDK 会启动一个感知-动作循环：

捕获 — 机器人从摄像头获取图像（robot.get_image()）
推理 — 图像和指令被发送到 GPT-4o（或任何兼容 OpenAI 的 VLM）
解析 — 模型返回结构化的 JSON 动作
执行 — 机器人执行动作（前进、旋转、停止）
重复 — 直到模型输出 "done" 或达到步数限制

┌─────────────────────────────────────────────────────────────┐
│                    Perception-Action Loop                     │
│                                                              │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────┐ │
│  │  Camera   │───▶│  VLM API │───▶│  Parser  │───▶│ Motor │ │
│  │  Image    │    │  GPT-4o  │    │  JSON    │    │  Cmd  │ │
│  └──────────┘    └──────────┘    └──────────┘    └───────┘ │
│       ▲                                               │      │
│       └───────────────────────────────────────────────┘      │
│                         Loop until "done"                     │
└─────────────────────────────────────────────────────────────┘

基本示例

import asyncio
from threewe import Robot

async def main():
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        result = await robot.execute_instruction(
            "Navigate to the blue door on the left"
        )
        print(f"Success: {result.success}")
        print(f"Description: {result.description}")
        print(f"Images collected: {len(result.images)}")

asyncio.run(main())

直接使用 VLMRunner

如需更精细的控制，可以直接使用 VLMRunner 类：

import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner

async def main():
    runner = VLMRunner(
        model="gpt-4o",
        max_steps=30,
        temperature=0.0,
    )

    async with Robot(backend="gazebo", scene="office_v2") as robot:
        image = robot.get_image()
        action_json = runner.plan(image, "go to the door on the left")
        print(f"VLM decided: {action_json}")

asyncio.run(main())

动作模式

VLM 每步输出一个 JSON 对象：

{
    "action": "move_forward",
    "distance": 0.5,
    "reason": "I can see a red bottle ahead on the right side"
}

支持的动作：

动作	参数	描述
`move_forward`	`distance`（米）	直线前进
`rotate_left`	`angle`（弧度）	逆时针旋转
`rotate_right`	`angle`（弧度）	顺时针旋转
`stop`	—	停止运动
`done`	—	任务完成

配置

VLM runner 可以通过环境变量进行配置：

# Use GPT-4o (default)
export OPENAI_API_KEY="sk-..."

# Or use Qwen-VL via a compatible endpoint
export THREEWE_VLM_MODEL="qwen-vl-max"
export THREEWE_VLM_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"

# Tune behavior
export THREEWE_VLM_MAX_STEPS=30
export THREEWE_VLM_TEMPERATURE=0.0

或者从环境变量创建 runner：

from threewe.ai.vlm_runner import VLMRunner

runner = VLMRunner.from_env()

模型选项

后端	模型	配置
OpenAI	GPT-4o, GPT-4-turbo	`OPENAI_API_KEY`
Qwen	Qwen-VL-Max, Qwen-VL-Plus	`THREEWE_VLM_BASE_URL` + 兼容密钥
本地	LLaVA, CogVLM	自定义 `base_url` 指向本地服务器
Azure OpenAI	GPT-4o	通过 `base_url` 使用 Azure 端点

任何支持 OpenAI chat completions API 且支持图像输入的模型均可直接使用。

逐步调试

on_step 回调让您可以查看模型在每一步的思考：

import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import execute_vlm_instruction

def log_step(step_num: int, raw_response: str, parsed_action: dict) -> None:
    print(f"\n--- Step {step_num} ---")
    print(f"  Raw: {raw_response}")
    print(f"  Action: {parsed_action.get('action', 'N/A')}")
    print(f"  Reason: {parsed_action.get('reason', 'N/A')}")

async def main():
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        result = await execute_vlm_instruction(
            robot,
            instruction="navigate to the kitchen area",
            model="gpt-4o",
            max_steps=25,
            on_step=log_step,
        )
        print(f"\nResult: success={result.success}")

asyncio.run(main())

中文支持

VLM runner 会检测指令中的 CJK 字符并相应调整系统提示词：

async with Robot(backend="gazebo") as robot:
    result = await robot.execute_instruction("找到红色的瓶子并停在它旁边")
    print(f"结果: {result.description}")

模型会使用中文来填写 reason 字段，同时保持动作键名为英文以确保可靠解析。

安全机制

VLM 导航器包含内置的安全层：

LiDAR 覆盖：无论 VLM 输出什么，如果障碍物在安全距离（默认 15cm）以内，机器人都会停止。
速度限制：硬编码最大线速度 0.5 m/s，最大角速度 1.0 rad/s。
通信看门狗：如果 200ms 内未收到有效动作，机器人停止。
步数限制：循环在 max_steps 次迭代后终止，以防止无限循环。

结合 VLM 与 VLA 模型

使用 VLM 进行高层推理，使用 VLA 进行精细的运动控制：

import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner
from threewe.ai.vla_runner import VLARunner

async def hybrid_control():
    vlm = VLMRunner(model="gpt-4o", max_steps=10)
    vla = VLARunner.from_pretrained("lerobot/act_3we_nav")

    async with Robot(backend="gazebo") as robot:
        for step in range(50):
            image = robot.get_image()

            if step % 10 == 0:
                plan = vlm.plan(image, "navigate to the charging station")
                print(f"VLM plan: {plan}")

            obs = robot.get_observation(modalities=["image", "lidar", "velocity"])
            action = vla.predict(obs, instruction="go to charging station")
            robot.execute_action(action)

asyncio.run(hybrid_control())

性能

指标	值
VLM API 延迟 (GPT-4o)	每步约 800ms
图像编码时间	约 5ms
每步总循环时间	约 1.2s（含运动）
典型任务完成步数	3-8 步
每步 token 用量	约 800 输入，约 50 输出