
30 Lines of Python: Letting GPT-4o Control a Real Robot

What if you could give a robot a natural-language instruction and have it understand, plan, and act on its own, all in roughly 30 lines of Python? No ROS2 knowledge, no custom planner, no elaborate prompt engineering.

With the threewe package, this is not hypothetical. It works today, with the same code running in Gazebo simulation and on real hardware.

import asyncio
from threewe import Robot

async def main():
    async with Robot(backend="gazebo") as robot:
        result = await robot.execute_instruction("find the red bottle and stop near it")
        print(f"Success: {result.success}")
        print(f"Description: {result.description}")

asyncio.run(main())

That is the entire program. The robot observes its environment through a camera, reasons about what it sees with GPT-4o, and executes motion commands until the task is complete. Let's break down what happens inside.


When you call robot.execute_instruction(...), the SDK internally runs a compact perception-action loop. On each iteration:

  1. Capture: the robot grabs an image from its camera (robot.get_image())
  2. Reason: the image plus the instruction is sent to GPT-4o (or any OpenAI-compatible VLM)
  3. Parse: the model returns a structured JSON action
  4. Execute: the robot performs the action (move forward, rotate, stop)
  5. Repeat: until the model outputs "done" or the step limit is reached

The loop runs for at most 20 steps by default, configurable per call.

┌─────────────────────────────────────────────────────────────┐
│                   Perception-Action Loop                     │
│                                                               │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────┐  │
│  │  Camera  │───▶│ VLM API  │───▶│  Parser  │───▶│ Motor │  │
│  │  Image   │    │  GPT-4o  │    │   JSON   │    │  Cmd  │  │
│  └──────────┘    └──────────┘    └──────────┘    └───────┘  │
│       ▲                                              │       │
│       └──────────────────────────────────────────────┘       │
│                     Loop until "done"                        │
└─────────────────────────────────────────────────────────────┘
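
In code, the loop is roughly the following sketch. The apply_action callable is illustrative (it stands in for whatever dispatches motion commands to the backend); this is not the SDK's internal implementation.

import json
from typing import Awaitable, Callable

# Simplified sketch of the loop inside execute_instruction (illustrative, not the SDK's internals).
async def perception_action_loop(
    robot,
    runner,                                            # e.g. a VLMRunner
    instruction: str,
    apply_action: Callable[[dict], Awaitable[None]],   # dispatches move_forward / rotate / stop
    max_steps: int = 20,
) -> bool:
    for step in range(max_steps):
        image = robot.get_image()                      # 1. capture a camera frame
        raw = runner.plan(image, instruction)          # 2. reason: image + instruction -> VLM
        try:
            action = json.loads(raw)                   # 3. parse the structured JSON action
        except json.JSONDecodeError:
            continue                                   #    invalid JSON: skip this step
        if action.get("action") == "done":
            return True                                # model reports the task is complete
        await apply_action(action)                     # 4. execute the motion command
    return False                                       # step budget exhausted without "done"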

Under the hood, execute_instruction delegates to the VLMRunner class. You can use it directly for finer-grained control:

import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner

async def main():
    runner = VLMRunner(
        model="gpt-4o",
        max_steps=30,
        temperature=0.0,
    )
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        # Single-step reasoning: get one action plan from current view
        image = robot.get_image()
        action_json = runner.plan(image, "go to the door on the left")
        print(f"VLM decided: {action_json}")

asyncio.run(main())

The VLMRunner.plan() method performs a single perception-reasoning step. It encodes the camera image as a base64 JPEG, builds a system prompt that constrains the model to structured JSON output, and returns the raw response.
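
One such step boils down to a single multimodal chat-completion request. The sketch below shows the general shape of that request using the standard openai client; the prompt wording and encoding details are assumptions, not the SDK's actual code.

import base64
import io

from openai import OpenAI
from PIL import Image

def encode_jpeg_b64(image) -> str:
    """Encode a BGR numpy image as a base64 JPEG string (BGR input assumed, as in the examples above)."""
    pil = Image.fromarray(image[:, :, ::-1])  # BGR -> RGB
    buf = io.BytesIO()
    pil.save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode()

def plan_one_step(image, instruction: str) -> str:
    """Roughly what a single plan() call amounts to: one image, one instruction, one JSON reply."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[
            {"role": "system",
             "content": "You control a mobile robot. Reply with exactly one JSON object "
                        "containing an \"action\", its parameter, and a \"reason\"."},
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_jpeg_b64(image)}"}},
            ]},
        ],
    )
    return resp.choices[0].message.content  # raw response string, like plan() returns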

The VLM is instructed to output exactly one JSON object per step:

{
  "action": "move_forward",
  "distance": 0.5,
  "reason": "I can see a red bottle ahead on the right side"
}

Supported actions:

Action        | Parameter          | Description
move_forward  | distance (meters)  | Move straight ahead
rotate_left   | angle (radians)    | Rotate counterclockwise
rotate_right  | angle (radians)    | Rotate clockwise
stop          | (none)             | Stop moving
done          | (none)             | Task complete

The structured JSON constraint means there is no free-text parsing. The model either returns valid JSON, or the step is skipped.
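
Parsing therefore reduces to a json.loads plus a sanity check on the action name. A minimal sketch of that idea (not the SDK's actual parser):

import json

VALID_ACTIONS = {"move_forward", "rotate_left", "rotate_right", "stop", "done"}

def parse_action(raw: str) -> dict | None:
    """Return the parsed action dict, or None if the step should be skipped."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None                             # not valid JSON: skip this step
    if action.get("action") not in VALID_ACTIONS:
        return None                             # unknown action name: skip this step
    return action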


The VLM runner can be configured entirely through environment variables, so you can switch models without touching code:

# Use GPT-4o (default)
export OPENAI_API_KEY="sk-..."
# Or use Qwen-VL via a compatible endpoint
export THREEWE_VLM_MODEL="qwen-vl-max"
export THREEWE_VLM_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export OPENAI_API_KEY="sk-..."
# Tune behavior
export THREEWE_VLM_MAX_STEPS=30
export THREEWE_VLM_TEMPERATURE=0.0

You can also build a runner directly from the environment:

from threewe.ai.vlm_runner import VLMRunner
runner = VLMRunner.from_env()
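
Conceptually, from_env() maps the documented environment variables onto the constructor arguments shown earlier, roughly like this (an illustrative sketch with assumed defaults, not the library source):

import os
from threewe.ai.vlm_runner import VLMRunner

# Roughly what VLMRunner.from_env() amounts to, assuming the env vars documented above
runner = VLMRunner(
    model=os.environ.get("THREEWE_VLM_MODEL", "gpt-4o"),
    max_steps=int(os.environ.get("THREEWE_VLM_MAX_STEPS", "20")),
    temperature=float(os.environ.get("THREEWE_VLM_TEMPERATURE", "0.0")),
)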

When developing VLM-controlled behaviors, you want to see what the model is thinking at every step. The on_step callback gives you exactly that:

import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import execute_vlm_instruction

def log_step(step_num: int, raw_response: str, parsed_action: dict) -> None:
    print(f"\n--- Step {step_num} ---")
    print(f"  Raw: {raw_response}")
    print(f"  Action: {parsed_action.get('action', 'N/A')}")
    print(f"  Reason: {parsed_action.get('reason', 'N/A')}")

async def main():
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        result = await execute_vlm_instruction(
            robot,
            instruction="navigate to the kitchen area",
            model="gpt-4o",
            max_steps=25,
            on_step=log_step,
        )
        print(f"\nFinal result: success={result.success}")
        print(f"Description: {result.description}")
        print(f"Images collected: {len(result.images)}")

asyncio.run(main())

The output during execution looks like this:

--- Step 0 ---
Raw: {"action": "rotate_left", "angle": 0.8, "reason": "Looking for kitchen, turning to scan the room"}
Action: rotate_left
Reason: Looking for kitchen, turning to scan the room
--- Step 1 ---
Raw: {"action": "move_forward", "distance": 1.0, "reason": "I can see a counter and sink ahead, moving toward kitchen"}
Action: move_forward
Reason: I can see a counter and sink ahead, moving toward kitchen
--- Step 2 ---
Raw: {"action": "move_forward", "distance": 0.5, "reason": "Getting closer to the kitchen counter"}
Action: move_forward
Reason: Getting closer to the kitchen counter
--- Step 3 ---
Raw: {"action": "done", "reason": "I am now in the kitchen area near the counter"}
Action: done
Reason: I am now in the kitchen area near the counter
Final result: success=True
Description: I am now in the kitchen area near the counter
Images collected: 4

Here is a concrete example. Instruction: "find the red bottle and stop near it."

Step 0: the robot captures its first image. It sees an office with a desk, chairs, and a shelf. The VLM replies: rotate_left, angle=0.6, reason="scanning room for red bottle"

Step 1: after rotating, the camera shows a different view, and a shelf holding assorted objects is visible. The VLM replies: move_forward, distance=1.2, reason="I can see something red on the shelf ahead"

Step 2: now closer, the VLM can clearly identify a red bottle on the second shelf level. It replies: move_forward, distance=0.8, reason="approaching the red bottle on the shelf"

Step 3: the robot is now roughly 0.3 m from the shelf. The VLM replies: done, reason="I am near the red bottle on the shelf"

Total execution: 4 VLM calls, roughly 3 seconds of API time plus 2.5 seconds of robot motion. The ExecutionResult object contains success=True, the description text, and all 4 camera images collected along the way.


The exact same script runs on real hardware. Only one parameter changes:

# In simulation
async with Robot(backend="gazebo") as robot:
    result = await robot.execute_instruction("find the red bottle")

# On real hardware -- literally the only change
async with Robot(backend="real") as robot:
    result = await robot.execute_instruction("find the red bottle")

The VLM reasoning is identical because it operates on camera images and does not care where they come from. The motion commands (move_forward, rotate) are executed by a backend abstraction layer that maps to ROS2/Nav2 on real hardware and to the Gazebo physics engine in simulation.
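
A minimal sketch of what such a backend abstraction can look like; the interface and class name below are illustrative, not identifiers taken from the SDK.

from abc import ABC, abstractmethod

# The VLM loop only ever calls a few motion primitives, so swapping
# backend="gazebo" for backend="real" swaps the implementation, not the loop.
class MotionBackend(ABC):
    @abstractmethod
    async def move_forward(self, distance: float) -> None:
        """Drive straight ahead by `distance` meters."""

    @abstractmethod
    async def rotate(self, angle: float) -> None:
        """Rotate in place by `angle` radians (positive = counterclockwise)."""

    @abstractmethod
    async def stop(self) -> None:
        """Halt all motion immediately."""

# In simulation the implementation talks to Gazebo; on real hardware it issues
# ROS2/Nav2 goals. The perception-action loop itself is unchanged either way.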


The VLM runner detects CJK characters in the instruction and adjusts its system prompt accordingly:

async with Robot(backend="gazebo") as robot:
    result = await robot.execute_instruction("找到红色的瓶子并停在它旁边")
    print(f"结果: {result.description}")

The model will then reason in Chinese in the reason field while keeping the action keys in English so parsing stays reliable.
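
Detecting CJK text comes down to a Unicode range check. A simplified sketch of the idea (not the SDK's implementation):

import re

# Common CJK blocks: unified ideographs, hiragana/katakana, hangul
_CJK_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")

def contains_cjk(text: str) -> bool:
    return _CJK_RE.search(text) is not None

def build_system_prompt(instruction: str) -> str:
    base = "Reply with exactly one JSON object. Keep the action keys in English."
    if contains_cjk(instruction):
        base += ' Write the "reason" field in the same language as the instruction.'
    return base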


For research workflows, you may want a VLM for high-level reasoning and a VLA (Vision-Language-Action) model for fine-grained motor control:

import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner
from threewe.ai.vla_runner import VLARunner

async def hybrid_control():
    vlm = VLMRunner(model="gpt-4o", max_steps=10)
    vla = VLARunner.from_pretrained("lerobot/act_3we_nav")
    async with Robot(backend="gazebo") as robot:
        for step in range(50):
            image = robot.get_image()
            # High-level: ask VLM what to do
            if step % 10 == 0:
                plan = vlm.plan(image, "navigate to the charging station")
                print(f"VLM plan: {plan}")
            # Low-level: VLA generates smooth motor commands
            obs = robot.get_observation(modalities=["image", "lidar", "velocity"])
            action = vla.predict(obs, instruction="go to charging station")
            robot.execute_action(action)

asyncio.run(hybrid_control())

This pattern uses the VLM as a "strategic advisor" that fires every N steps, while the VLA handles motor commands frame by frame and produces smooth trajectories.


Metric                            | Value
VLM API latency (GPT-4o)          | ~800 ms per step
Image encoding time               | ~5 ms
Total loop time per step          | ~1.2 s (including motion)
Typical steps to complete a task  | 3-8
Tokens per step                   | ~800 input, ~50 output

For latency-sensitive applications, consider:

  • Capping execution time with max_steps=10
  • Switching to a local VLM (Qwen-VL, LLaVA) via THREEWE_VLM_BASE_URL
  • Caching VLM responses for similar scenes (see the sketch after this list)
  • Using the VLA runner for reactive control and calling the VLM only when replanning is needed
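
A caching layer can be as simple as keying responses on a coarse fingerprint of the camera frame plus the instruction. The sketch below is one possible approach, not a threewe feature; cached_plan and _scene_key are hypothetical helpers.

import hashlib

from PIL import Image

_response_cache: dict[tuple[str, str], str] = {}

def _scene_key(image) -> str:
    # Coarse perceptual key: downscale to 16x16 grayscale and hash the bytes.
    # Only near-identical frames collapse to the same key; a real implementation
    # would likely use a proper perceptual hash with a similarity threshold.
    small = Image.fromarray(image[:, :, ::-1]).convert("L").resize((16, 16))
    return hashlib.sha1(small.tobytes()).hexdigest()

def cached_plan(runner, image, instruction: str) -> str:
    key = (_scene_key(image), instruction)
    if key not in _response_cache:
        _response_cache[key] = runner.plan(image, instruction)  # VLMRunner.plan(), as above
    return _response_cache[key]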

The loop is designed to be resilient:

import asyncio
from threewe import Robot, NavigationError, TimeoutError

async def robust_vlm_control():
    async with Robot(backend="gazebo") as robot:
        try:
            result = await robot.execute_instruction(
                "find the exit door and stop in front of it"
            )
            if result.success:
                print(f"Task completed: {result.description}")
            else:
                print(f"Task failed after max steps: {result.description}")
            # result.images contains all camera frames for analysis
            print(f"Collected {len(result.images)} frames for post-mortem")
        except NavigationError as e:
            print(f"Navigation failed: {e}")
        except TimeoutError as e:
            print(f"Operation timed out: {e}")

asyncio.run(robust_vlm_control())

If the VLM returns invalid JSON, the step is skipped and the loop continues. If a navigation action fails (obstacle, timeout), the next VLM step sees the current state and can adapt.


Backend       | Models                     | Configuration
OpenAI        | GPT-4o, GPT-4-turbo        | OPENAI_API_KEY
Qwen          | Qwen-VL-Max, Qwen-VL-Plus  | THREEWE_VLM_BASE_URL + a compatible key
Local         | LLaVA, CogVLM              | Custom base_url pointing to a local server
Azure OpenAI  | GPT-4o                     | Azure endpoint configured via base_url

Any model that supports the OpenAI chat completions API and accepts image input works out of the box.
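
For example, pointing the runner at a locally served model only takes the environment variables shown earlier. The endpoint URL and model name below are placeholders for whatever your local server actually exposes:

import os
from threewe.ai.vlm_runner import VLMRunner

# Placeholder values for a local OpenAI-compatible server (e.g. one serving LLaVA)
os.environ["THREEWE_VLM_BASE_URL"] = "http://localhost:8000/v1"
os.environ["THREEWE_VLM_MODEL"] = "llava-v1.6-34b"
os.environ["OPENAI_API_KEY"] = "not-needed-for-local"  # many local servers ignore the key

runner = VLMRunner.from_env()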


Here is a complete, runnable example that searches for an object, logs every step, and saves the trajectory images:

"""VLM-controlled object search.
Requirements:
pip install threewe[ai]
export OPENAI_API_KEY="sk-..."
Usage:
python vlm_search.py
"""
import asyncio
import json
from pathlib import Path
from threewe import Robot
from threewe.ai.vlm_runner import execute_vlm_instruction
INSTRUCTION = "find the red bottle and stop near it"
OUTPUT_DIR = Path("vlm_results")
def on_step(step: int, raw: str, action: dict) -> None:
print(f"[Step {step:02d}] {action.get('action', '?'):14s} | {action.get('reason', '')}")
async def main() -> None:
OUTPUT_DIR.mkdir(exist_ok=True)
async with Robot(backend="gazebo", scene="office_v2") as robot:
result = await execute_vlm_instruction(
robot,
instruction=INSTRUCTION,
model="gpt-4o",
max_steps=20,
on_step=on_step,
)
print(f"\n{'='*50}")
print(f"Success: {result.success}")
print(f"Description: {result.description}")
print(f"Steps taken: {len(result.images)}")
# Save trajectory images for analysis
for i, img in enumerate(result.images):
path = OUTPUT_DIR / f"step_{i:03d}.jpg"
try:
from PIL import Image
import numpy as np
pil_img = Image.fromarray(img[:, :, ::-1]) # BGR to RGB
pil_img.save(str(path))
except ImportError:
pass
print(f"Images saved to: {OUTPUT_DIR}/")
if __name__ == "__main__":
asyncio.run(main())

1. Install the package

pip install threewe[ai]

This installs the core SDK along with the openai and Pillow dependencies required for the VLM integration.

2. Set your API key

export OPENAI_API_KEY="sk-your-key-here"

3. Launch the simulation (optional, for testing)

threewe launch --backend gazebo --scene office_v2
4. Run the agent

import asyncio
from threewe import Robot

async def main():
    async with Robot(backend="gazebo") as robot:
        result = await robot.execute_instruction("find the red bottle and stop near it")
        print(f"Done! Success={result.success}: {result.description}")

asyncio.run(main())

# When you are ready for real hardware, change only the backend:
async with Robot(backend="real") as robot:
    result = await robot.execute_instruction("find the red bottle and stop near it")

No other code changes are required.


This is the simplest entry point into embodied AI. From here, you can:

  • Train a VLA model on the trajectories collected during VLM-controlled runs
  • Use the Gymnasium environments (threewe.gym) for reinforcement learning
  • Run the benchmarks (threewe benchmark run --task objectnav) to measure your agent
  • Submit results to the community leaderboard

The key insight: you do not need to understand ROS2, Nav2, SLAM, or motor control to get a robot doing useful work through language. The threewe SDK abstracts all of that behind a Python-native interface.