
30 Lines of Python: Letting GPT-4o Control a Real Robot

What if you could give a robot a natural-language instruction and have it understand, plan, and act on its own, all in roughly 30 lines of Python? No ROS2 knowledge, no custom planner, no elaborate prompt engineering.

With the threewe package, this is not hypothetical. It works today, with the same code running in Gazebo simulation and on real hardware.

import asyncio
from threewe import Robot

async def main():
    async with Robot(backend="gazebo") as robot:
        result = await robot.execute_instruction("find the red bottle and stop near it")
        print(f"Success: {result.success}")
        print(f"Description: {result.description}")

asyncio.run(main())

That is the entire program. The robot observes its environment through a camera, reasons about what it sees with GPT-4o, and executes motion commands until the task is complete. Let's break down what happens inside.


When you call robot.execute_instruction(...), the SDK internally runs a compact perception-action loop. On each iteration:

  1. Capture: the robot grabs an image from its camera (robot.get_image())
  2. Reason: the image plus the instruction is sent to GPT-4o (or any OpenAI-compatible VLM)
  3. Parse: the model returns a structured JSON action
  4. Execute: the robot performs the action (move forward, rotate, stop)
  5. Repeat: until the model outputs "done" or the step limit is reached

The loop runs for at most 20 steps by default, configurable per call.

┌─────────────────────────────────────────────────────────────┐
│                   Perception-Action Loop                     │
│                                                               │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────┐  │
│  │  Camera  │───▶│ VLM API  │───▶│  Parser  │───▶│ Motor │  │
│  │  Image   │    │  GPT-4o  │    │   JSON   │    │  Cmd  │  │
│  └──────────┘    └──────────┘    └──────────┘    └───────┘  │
│       ▲                                              │       │
│       └──────────────────────────────────────────────┘       │
│                     Loop until "done"                        │
└─────────────────────────────────────────────────────────────┘
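
In code, the loop is roughly the following sketch. The apply_action callable is illustrative (it stands in for whatever dispatches motion commands to the backend); this is not the SDK's internal implementation.

import json
from typing import Awaitable, Callable

# Simplified sketch of the loop inside execute_instruction (illustrative, not the SDK's internals).
async def perception_action_loop(
    robot,
    runner,                                            # e.g. a VLMRunner
    instruction: str,
    apply_action: Callable[[dict], Awaitable[None]],   # dispatches move_forward / rotate / stop
    max_steps: int = 20,
) -> bool:
    for step in range(max_steps):
        image = robot.get_image()                      # 1. capture a camera frame
        raw = runner.plan(image, instruction)          # 2. reason: image + instruction -> VLM
        try:
            action = json.loads(raw)                   # 3. parse the structured JSON action
        except json.JSONDecodeError:
            continue                                   #    invalid JSON: skip this step
        if action.get("action") == "done":
            return True                                # model reports the task is complete
        await apply_action(action)                     # 4. execute the motion command
    return False                                       # step budget exhausted without "done"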

Under the hood, execute_instruction delegates to the VLMRunner class. You can use it directly for finer-grained control:

import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner

async def main():
    runner = VLMRunner(
        model="gpt-4o",
        max_steps=30,
        temperature=0.0,
    )
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        # Single-step reasoning: get one action plan from current view
        image = robot.get_image()
        action_json = runner.plan(image, "go to the door on the left")
        print(f"VLM decided: {action_json}")

asyncio.run(main())

The VLMRunner.plan() method performs a single perception-reasoning step. It encodes the camera image as a base64 JPEG, builds a system prompt that constrains the model to structured JSON output, and returns the raw response.
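
One such step boils down to a single multimodal chat-completion request. The sketch below shows the general shape of that request using the standard openai client; the prompt wording and encoding details are assumptions, not the SDK's actual code.

import base64
import io

from openai import OpenAI
from PIL import Image

def encode_jpeg_b64(image) -> str:
    """Encode a BGR numpy image as a base64 JPEG string (BGR input assumed, as in the examples above)."""
    pil = Image.fromarray(image[:, :, ::-1])  # BGR -> RGB
    buf = io.BytesIO()
    pil.save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode()

def plan_one_step(image, instruction: str) -> str:
    """Roughly what a single plan() call amounts to: one image, one instruction, one JSON reply."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[
            {"role": "system",
             "content": "You control a mobile robot. Reply with exactly one JSON object "
                        "containing an \"action\", its parameter, and a \"reason\"."},
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_jpeg_b64(image)}"}},
            ]},
        ],
    )
    return resp.choices[0].message.content  # raw response string, like plan() returns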

The VLM is instructed to output exactly one JSON object per step:

{
  "action": "move_forward",
  "distance": 0.5,
  "reason": "I can see a red bottle ahead on the right side"
}

Supported actions:

Action        | Parameter          | Description
move_forward  | distance (meters)  | Move straight ahead
rotate_left   | angle (radians)    | Rotate counterclockwise
rotate_right  | angle (radians)    | Rotate clockwise
stop          | (none)             | Stop moving
done          | (none)             | Task complete

The structured JSON constraint means there is no free-text parsing. The model either returns valid JSON, or the step is skipped.
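
Parsing therefore reduces to a json.loads plus a sanity check on the action name. A minimal sketch of that idea (not the SDK's actual parser):

import json

VALID_ACTIONS = {"move_forward", "rotate_left", "rotate_right", "stop", "done"}

def parse_action(raw: str) -> dict | None:
    """Return the parsed action dict, or None if the step should be skipped."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None                             # not valid JSON: skip this step
    if action.get("action") not in VALID_ACTIONS:
        return None                             # unknown action name: skip this step
    return action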


The VLM runner can be configured entirely through environment variables, so you can switch models without touching code:

# Use GPT-4o (default)
export OPENAI_API_KEY="sk-..."
# Or use Qwen-VL via a compatible endpoint
export THREEWE_VLM_MODEL="qwen-vl-max"
export THREEWE_VLM_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export OPENAI_API_KEY="sk-..."
# Tune behavior
export THREEWE_VLM_MAX_STEPS=30
export THREEWE_VLM_TEMPERATURE=0.0

You can also build a runner directly from the environment:

from threewe.ai.vlm_runner import VLMRunner
runner = VLMRunner.from_env()
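
Conceptually, from_env() maps the documented environment variables onto the constructor arguments shown earlier, roughly like this (an illustrative sketch with assumed defaults, not the library source):

import os
from threewe.ai.vlm_runner import VLMRunner

# Roughly what VLMRunner.from_env() amounts to, assuming the env vars documented above
runner = VLMRunner(
    model=os.environ.get("THREEWE_VLM_MODEL", "gpt-4o"),
    max_steps=int(os.environ.get("THREEWE_VLM_MAX_STEPS", "20")),
    temperature=float(os.environ.get("THREEWE_VLM_TEMPERATURE", "0.0")),
)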

When developing VLM-controlled behaviors, you want to see what the model is thinking at every step. The on_step callback gives you exactly that:

import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import execute_vlm_instruction

def log_step(step_num: int, raw_response: str, parsed_action: dict) -> None:
    print(f"\n--- Step {step_num} ---")
    print(f"  Raw: {raw_response}")
    print(f"  Action: {parsed_action.get('action', 'N/A')}")
    print(f"  Reason: {parsed_action.get('reason', 'N/A')}")

async def main():
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        result = await execute_vlm_instruction(
            robot,
            instruction="navigate to the kitchen area",
            model="gpt-4o",
            max_steps=25,
            on_step=log_step,
        )
        print(f"\nFinal result: success={result.success}")
        print(f"Description: {result.description}")
        print(f"Images collected: {len(result.images)}")

asyncio.run(main())

The output during execution looks like this:

--- Step 0 ---
Raw: {"action": "rotate_left", "angle": 0.8, "reason": "Looking for kitchen, turning to scan the room"}
Action: rotate_left
Reason: Looking for kitchen, turning to scan the room
--- Step 1 ---
Raw: {"action": "move_forward", "distance": 1.0, "reason": "I can see a counter and sink ahead, moving toward kitchen"}
Action: move_forward
Reason: I can see a counter and sink ahead, moving toward kitchen
--- Step 2 ---
Raw: {"action": "move_forward", "distance": 0.5, "reason": "Getting closer to the kitchen counter"}
Action: move_forward
Reason: Getting closer to the kitchen counter
--- Step 3 ---
Raw: {"action": "done", "reason": "I am now in the kitchen area near the counter"}
Action: done
Reason: I am now in the kitchen area near the counter
Final result: success=True
Description: I am now in the kitchen area near the counter
Images collected: 4

Here is a concrete example. Instruction: "find the red bottle and stop near it."

Step 0: the robot captures its first image. It sees an office with a desk, chairs, and a shelf. The VLM replies: rotate_left, angle=0.6, reason="scanning room for red bottle"

Step 1: after rotating, the camera shows a different view, and a shelf holding assorted objects is visible. The VLM replies: move_forward, distance=1.2, reason="I can see something red on the shelf ahead"

Step 2: now closer, the VLM can clearly identify a red bottle on the second shelf level. It replies: move_forward, distance=0.8, reason="approaching the red bottle on the shelf"

Step 3: the robot is now roughly 0.3 m from the shelf. The VLM replies: done, reason="I am near the red bottle on the shelf"

Total execution: 4 VLM calls, roughly 3 seconds of API time plus 2.5 seconds of robot motion. The ExecutionResult object contains success=True, the description text, and all 4 camera images collected along the way.


The exact same script runs on real hardware. Only one parameter changes:

# In simulation
async with Robot(backend="gazebo") as robot:
    result = await robot.execute_instruction("find the red bottle")

# On real hardware -- literally the only change
async with Robot(backend="real") as robot:
    result = await robot.execute_instruction("find the red bottle")

The VLM reasoning is identical because it operates on camera images and does not care where they come from. The motion commands (move_forward, rotate) are executed by a backend abstraction layer that maps to ROS2/Nav2 on real hardware and to the Gazebo physics engine in simulation.
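
A minimal sketch of what such a backend abstraction can look like; the interface and class name below are illustrative, not identifiers taken from the SDK.

from abc import ABC, abstractmethod

# The VLM loop only ever calls a few motion primitives, so swapping
# backend="gazebo" for backend="real" swaps the implementation, not the loop.
class MotionBackend(ABC):
    @abstractmethod
    async def move_forward(self, distance: float) -> None:
        """Drive straight ahead by `distance` meters."""

    @abstractmethod
    async def rotate(self, angle: float) -> None:
        """Rotate in place by `angle` radians (positive = counterclockwise)."""

    @abstractmethod
    async def stop(self) -> None:
        """Halt all motion immediately."""

# In simulation the implementation talks to Gazebo; on real hardware it issues
# ROS2/Nav2 goals. The perception-action loop itself is unchanged either way.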


The VLM runner detects CJK characters in the instruction and adjusts its system prompt accordingly:

async with Robot(backend="gazebo") as robot:
    result = await robot.execute_instruction("找到红色的瓶子并停在它旁边")
    print(f"结果: {result.description}")

The model will then reason in Chinese in the reason field while keeping the action keys in English so parsing stays reliable.
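
Detecting CJK text comes down to a Unicode range check. A simplified sketch of the idea (not the SDK's implementation):

import re

# Common CJK blocks: unified ideographs, hiragana/katakana, hangul
_CJK_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")

def contains_cjk(text: str) -> bool:
    return _CJK_RE.search(text) is not None

def build_system_prompt(instruction: str) -> str:
    base = "Reply with exactly one JSON object. Keep the action keys in English."
    if contains_cjk(instruction):
        base += ' Write the "reason" field in the same language as the instruction.'
    return base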


For research workflows, you may want a VLM for high-level reasoning and a VLA (Vision-Language-Action) model for fine-grained motor control:

import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner
from threewe.ai.vla_runner import VLARunner

async def hybrid_control():
    vlm = VLMRunner(model="gpt-4o", max_steps=10)
    vla = VLARunner.from_pretrained("lerobot/act_3we_nav")
    async with Robot(backend="gazebo") as robot:
        for step in range(50):
            image = robot.get_image()
            # High-level: ask VLM what to do
            if step % 10 == 0:
                plan = vlm.plan(image, "navigate to the charging station")
                print(f"VLM plan: {plan}")
            # Low-level: VLA generates smooth motor commands
            obs = robot.get_observation(modalities=["image", "lidar", "velocity"])
            action = vla.predict(obs, instruction="go to charging station")
            robot.execute_action(action)

asyncio.run(hybrid_control())

This pattern uses the VLM as a "strategic advisor" that fires every N steps, while the VLA handles motor commands frame by frame and produces smooth trajectories.


Metric                            | Value
VLM API latency (GPT-4o)          | ~800 ms per step
Image encoding time               | ~5 ms
Total loop time per step          | ~1.2 s (including motion)
Typical steps to complete a task  | 3-8
Tokens per step                   | ~800 input, ~50 output

For latency-sensitive applications, consider:

  • Capping execution time with max_steps=10
  • Switching to a local VLM (Qwen-VL, LLaVA) via THREEWE_VLM_BASE_URL
  • Caching VLM responses for similar scenes (see the sketch after this list)
  • Using the VLA runner for reactive control and calling the VLM only when replanning is needed
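
A caching layer can be as simple as keying responses on a coarse fingerprint of the camera frame plus the instruction. The sketch below is one possible approach, not a threewe feature; cached_plan and _scene_key are hypothetical helpers.

import hashlib

from PIL import Image

_response_cache: dict[tuple[str, str], str] = {}

def _scene_key(image) -> str:
    # Coarse perceptual key: downscale to 16x16 grayscale and hash the bytes.
    # Only near-identical frames collapse to the same key; a real implementation
    # would likely use a proper perceptual hash with a similarity threshold.
    small = Image.fromarray(image[:, :, ::-1]).convert("L").resize((16, 16))
    return hashlib.sha1(small.tobytes()).hexdigest()

def cached_plan(runner, image, instruction: str) -> str:
    key = (_scene_key(image), instruction)
    if key not in _response_cache:
        _response_cache[key] = runner.plan(image, instruction)  # VLMRunner.plan(), as above
    return _response_cache[key]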

The loop is designed to be resilient:

import asyncio
from threewe import Robot, NavigationError, TimeoutError

async def robust_vlm_control():
    async with Robot(backend="gazebo") as robot:
        try:
            result = await robot.execute_instruction(
                "find the exit door and stop in front of it"
            )
            if result.success:
                print(f"Task completed: {result.description}")
            else:
                print(f"Task failed after max steps: {result.description}")
            # result.images contains all camera frames for analysis
            print(f"Collected {len(result.images)} frames for post-mortem")
        except NavigationError as e:
            print(f"Navigation failed: {e}")
        except TimeoutError as e:
            print(f"Operation timed out: {e}")

asyncio.run(robust_vlm_control())

If the VLM returns invalid JSON, the step is skipped and the loop continues. If a navigation action fails (obstacle, timeout), the next VLM step sees the current state and can adapt.


Backend       | Models                     | Configuration
OpenAI        | GPT-4o, GPT-4-turbo        | OPENAI_API_KEY
Qwen          | Qwen-VL-Max, Qwen-VL-Plus  | THREEWE_VLM_BASE_URL + a compatible key
Local         | LLaVA, CogVLM              | Custom base_url pointing to a local server
Azure OpenAI  | GPT-4o                     | Azure endpoint configured via base_url

Any model that supports the OpenAI chat completions API and accepts image input works out of the box.
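
For example, pointing the runner at a locally served model only takes the environment variables shown earlier. The endpoint URL and model name below are placeholders for whatever your local server actually exposes:

import os
from threewe.ai.vlm_runner import VLMRunner

# Placeholder values for a local OpenAI-compatible server (e.g. one serving LLaVA)
os.environ["THREEWE_VLM_BASE_URL"] = "http://localhost:8000/v1"
os.environ["THREEWE_VLM_MODEL"] = "llava-v1.6-34b"
os.environ["OPENAI_API_KEY"] = "not-needed-for-local"  # many local servers ignore the key

runner = VLMRunner.from_env()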


Here is a complete, runnable example that searches for an object, logs every step, and saves the trajectory images:

"""VLM-controlled object search.
Requirements:
pip install threewe[ai]
export OPENAI_API_KEY="sk-..."
Usage:
python vlm_search.py
"""
import asyncio
import json
from pathlib import Path
from threewe import Robot
from threewe.ai.vlm_runner import execute_vlm_instruction
INSTRUCTION = "find the red bottle and stop near it"
OUTPUT_DIR = Path("vlm_results")
def on_step(step: int, raw: str, action: dict) -> None:
print(f"[Step {step:02d}] {action.get('action', '?'):14s} | {action.get('reason', '')}")
async def main() -> None:
OUTPUT_DIR.mkdir(exist_ok=True)
async with Robot(backend="gazebo", scene="office_v2") as robot:
result = await execute_vlm_instruction(
robot,
instruction=INSTRUCTION,
model="gpt-4o",
max_steps=20,
on_step=on_step,
)
print(f"\n{'='*50}")
print(f"Success: {result.success}")
print(f"Description: {result.description}")
print(f"Steps taken: {len(result.images)}")
# Save trajectory images for analysis
for i, img in enumerate(result.images):
path = OUTPUT_DIR / f"step_{i:03d}.jpg"
try:
from PIL import Image
import numpy as np
pil_img = Image.fromarray(img[:, :, ::-1]) # BGR to RGB
pil_img.save(str(path))
except ImportError:
pass
print(f"Images saved to: {OUTPUT_DIR}/")
if __name__ == "__main__":
asyncio.run(main())

1. Install the package

pip install threewe[ai]

This installs the core SDK along with the openai and Pillow dependencies required for the VLM integration.

2. Set your API key

export OPENAI_API_KEY="sk-your-key-here"

3. Launch the simulation (optional, for testing)

threewe launch --backend gazebo --scene office_v2
4. Run the agent

import asyncio
from threewe import Robot

async def main():
    async with Robot(backend="gazebo") as robot:
        result = await robot.execute_instruction("find the red bottle and stop near it")
        print(f"Done! Success={result.success}: {result.description}")

asyncio.run(main())

# When you are ready for real hardware, change only the backend:
async with Robot(backend="real") as robot:
    result = await robot.execute_instruction("find the red bottle and stop near it")

No other code changes are required.


This is the simplest entry point into embodied AI. From here, you can:

  • Train a VLA model on the trajectories collected during VLM-controlled runs
  • Use the Gymnasium environments (threewe.gym) for reinforcement learning
  • Run the benchmarks (threewe benchmark run --task objectnav) to measure your agent
  • Submit results to the community leaderboard

The key insight: you do not need to understand ROS2, Nav2, SLAM, or motor control to get a robot doing useful work through language. The threewe SDK abstracts all of that behind a Python-native interface.