30 Lines of Python: Letting GPT-4o Control a Real Robot
What if you could give a robot a natural-language instruction and have it understand, plan, and act autonomously, all in just 30 lines of Python? No ROS2 knowledge, no custom planner, no elaborate prompt engineering.
With the threewe package, this is not hypothetical. It works today, running the same code in Gazebo simulation and on real hardware.
```python
import asyncio
from threewe import Robot

async def main():
    async with Robot(backend="gazebo") as robot:
        result = await robot.execute_instruction("find the red bottle and stop near it")
        print(f"Success: {result.success}")
        print(f"Description: {result.description}")

asyncio.run(main())
```

That is the entire program. The robot observes its environment through a camera, reasons about what it sees with GPT-4o, and issues motion commands until the task is complete. Let's break down what happens inside.
The VLM Perception-Action Loop
When you call `robot.execute_instruction(...)`, the SDK internally starts a compact perception-action loop. Each iteration:
- **Capture**: the robot takes an image from its camera (`robot.get_image()`)
- **Reason**: the image and the instruction are sent to GPT-4o (or any OpenAI-compatible VLM)
- **Parse**: the model returns a structured JSON action
- **Act**: the robot executes the action (move forward, rotate, stop)
- **Repeat**: until the model outputs `"done"` or the step limit is reached

By default the loop runs for at most 20 steps, configurable per call.
```
┌──────────────────────────────────────────────────────────────┐
│                    Perception-Action Loop                     │
│                                                               │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────┐    │
│  │  Camera  │───▶│ VLM API  │───▶│  Parser  │───▶│ Motor │    │
│  │  Image   │    │  GPT-4o  │    │  JSON    │    │  Cmd  │    │
│  └──────────┘    └──────────┘    └──────────┘    └───────┘    │
│       ▲                                              │        │
│       └──────────────────────────────────────────────┘        │
│                    Loop until "done"                          │
└──────────────────────────────────────────────────────────────┘
```
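Conceptually, the control flow fits in a few lines. The sketch below is a simplified illustration of the loop, not the SDK's actual implementation; it reuses the public calls shown elsewhere in this post (`robot.get_image()`, `runner.plan()`, `robot.execute_action()`):

```python
import json

def perception_action_loop(robot, runner, instruction: str, max_steps: int = 20) -> bool:
    """Simplified sketch of the loop behind execute_instruction()."""
    for _ in range(max_steps):
        image = robot.get_image()              # 1. Capture
        raw = runner.plan(image, instruction)  # 2. Reason: one VLM call
        try:
            action = json.loads(raw)           # 3. Parse the structured JSON action
        except json.JSONDecodeError:
            continue                           # invalid JSON: skip this step
        if action.get("action") == "done":     # the model declares the task complete
            return True
        robot.execute_action(action)           # 4. Act, then repeat
    return False                               # step budget exhausted
```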
Understanding VLMRunner

Under the hood, `execute_instruction` delegates to the `VLMRunner` class. You can use it directly for more control:
```python
import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner

async def main():
    runner = VLMRunner(
        model="gpt-4o",
        max_steps=30,
        temperature=0.0,
    )

    async with Robot(backend="gazebo", scene="office_v2") as robot:
        # Single-step reasoning: get one action plan from current view
        image = robot.get_image()
        action_json = runner.plan(image, "go to the door on the left")
        print(f"VLM decided: {action_json}")

asyncio.run(main())
```

The `VLMRunner.plan()` method performs a single perception-reasoning step. It encodes the camera image as a base64 JPEG, builds a system prompt that constrains the model to structured JSON output, and returns the raw response.
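For intuition, here is roughly what such a step looks like written against the standard OpenAI vision request format. This is an illustrative sketch, not the SDK source: the real system prompt, client handling, and encoding live inside `VLMRunner`, and the use of `cv2` for JPEG encoding is an assumption:

```python
import base64

import cv2                      # assumes OpenCV-style BGR ndarray frames
from openai import OpenAI

def plan_sketch(image, instruction: str) -> str:
    """Roughly what a single perception-reasoning step involves."""
    ok, jpeg = cv2.imencode(".jpg", image)            # frame -> JPEG bytes
    b64 = base64.b64encode(jpeg.tobytes()).decode()   # JPEG -> base64 for the API

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[
            {"role": "system",
             "content": "You control a robot. Respond with exactly one JSON action object."},
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ]},
        ],
    )
    return response.choices[0].message.content        # the raw JSON string
```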
The VLM is instructed to output exactly one JSON object per step:
{ "action": "move_forward", "distance": 0.5, "reason": "I can see a red bottle ahead on the right side"}支持的动作:
| Action | Parameter | Description |
|---|---|---|
| `move_forward` | `distance` (meters) | Move forward in a straight line |
| `rotate_left` | `angle` (radians) | Rotate counterclockwise |
| `rotate_right` | `angle` (radians) | Rotate clockwise |
| `stop` | — | Stop all motion |
| `done` | — | Task complete |
Constraining the output to structured JSON means there is no free-text parsing. The model either returns valid JSON or the step is skipped.
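If you parse responses yourself (for example around `VLMRunner.plan()`), the same skip-on-invalid policy is easy to reproduce. A minimal sketch, with the schema mirroring the action table above; the helper itself is illustrative, not part of the SDK:

```python
import json
from typing import Optional

# Which parameter each action requires, per the table above (None = no parameter)
REQUIRED_PARAM = {
    "move_forward": "distance",
    "rotate_left": "angle",
    "rotate_right": "angle",
    "stop": None,
    "done": None,
}

def parse_action(raw: str) -> Optional[dict]:
    """Return a validated action dict, or None to skip the step."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None                # a bare string or list is not a valid action
    param = REQUIRED_PARAM.get(action.get("action"), "unknown")
    if param == "unknown":
        return None                # unrecognized action name
    if param and not isinstance(action.get(param), (int, float)):
        return None                # missing or non-numeric parameter
    return action
```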
Configuration via Environment Variables
The VLM runner can be configured entirely through environment variables, making it easy to switch models without changing code:
```bash
# Use GPT-4o (default)
export OPENAI_API_KEY="sk-..."

# Or use Qwen-VL via a compatible endpoint
export THREEWE_VLM_MODEL="qwen-vl-max"
export THREEWE_VLM_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
export OPENAI_API_KEY="sk-..."

# Tune behavior
export THREEWE_VLM_MAX_STEPS=30
export THREEWE_VLM_TEMPERATURE=0.0
```

You can also create a runner directly from the environment:
```python
from threewe.ai.vlm_runner import VLMRunner

runner = VLMRunner.from_env()
```
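Plausibly, `from_env()` amounts to reading the variables above with sensible defaults. A sketch only; the real classmethod lives in `threewe.ai.vlm_runner`, and the `base_url` keyword is an assumption:

```python
import os

from threewe.ai.vlm_runner import VLMRunner

def vlm_runner_from_env() -> VLMRunner:
    """Illustrative equivalent of VLMRunner.from_env() (defaults assumed)."""
    return VLMRunner(
        model=os.environ.get("THREEWE_VLM_MODEL", "gpt-4o"),
        base_url=os.environ.get("THREEWE_VLM_BASE_URL"),        # assumed keyword
        max_steps=int(os.environ.get("THREEWE_VLM_MAX_STEPS", "20")),
        temperature=float(os.environ.get("THREEWE_VLM_TEMPERATURE", "0.0")),
    )
```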
Debugging with Per-Step Callbacks

When developing VLM-controlled behaviors, you want to see what the model is thinking at every step. The `on_step` callback provides exactly that:
```python
import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import execute_vlm_instruction

def log_step(step_num: int, raw_response: str, parsed_action: dict) -> None:
    print(f"\n--- Step {step_num} ---")
    print(f"  Raw: {raw_response}")
    print(f"  Action: {parsed_action.get('action', 'N/A')}")
    print(f"  Reason: {parsed_action.get('reason', 'N/A')}")

async def main():
    async with Robot(backend="gazebo", scene="office_v2") as robot:
        result = await execute_vlm_instruction(
            robot,
            instruction="navigate to the kitchen area",
            model="gpt-4o",
            max_steps=25,
            on_step=log_step,
        )
        print(f"\nFinal result: success={result.success}")
        print(f"Description: {result.description}")
        print(f"Images collected: {len(result.images)}")

asyncio.run(main())
```

During execution, the output looks like this:
```
--- Step 0 ---
  Raw: {"action": "rotate_left", "angle": 0.8, "reason": "Looking for kitchen, turning to scan the room"}
  Action: rotate_left
  Reason: Looking for kitchen, turning to scan the room

--- Step 1 ---
  Raw: {"action": "move_forward", "distance": 1.0, "reason": "I can see a counter and sink ahead, moving toward kitchen"}
  Action: move_forward
  Reason: I can see a counter and sink ahead, moving toward kitchen

--- Step 2 ---
  Raw: {"action": "move_forward", "distance": 0.5, "reason": "Getting closer to the kitchen counter"}
  Action: move_forward
  Reason: Getting closer to the kitchen counter

--- Step 3 ---
  Raw: {"action": "done", "reason": "I am now in the kitchen area near the counter"}
  Action: done
  Reason: I am now in the kitchen area near the counter

Final result: success=True
Description: I am now in the kitchen area near the counter
Images collected: 4
```

What Actually Happens at Runtime
Here is a concrete walkthrough. The instruction: "find the red bottle and stop near it."
Step 0: The robot captures its first image. It sees an office with a desk, chairs, and shelves. The VLM responds: `rotate_left, angle=0.6, reason="scanning room for red bottle"`.

Step 1: After the rotation, the camera shows a different angle. A shelf with assorted objects is visible. The VLM responds: `move_forward, distance=1.2, reason="I can see something red on the shelf ahead"`.

Step 2: Closer now. The VLM can clearly identify the red bottle on the second shelf level. It responds: `move_forward, distance=0.8, reason="approaching the red bottle on the shelf"`.

Step 3: The robot is now about 0.3 m from the shelf. The VLM responds: `done, reason="I am near the red bottle on the shelf"`.
Total execution: 4 VLM calls, about 3 seconds of API time and 2.5 seconds of robot motion. The `ExecutionResult` object contains `success=True`, the description text, and all 4 camera images collected along the way.
Sim2Real: The Same Code on Real Hardware
The exact same script runs on real hardware. Only one argument changes:
```python
# In simulation
async with Robot(backend="gazebo") as robot:
    result = await robot.execute_instruction("find the red bottle")

# On real hardware -- literally the only change
async with Robot(backend="real") as robot:
    result = await robot.execute_instruction("find the red bottle")
```

The VLM reasoning is identical because it operates on camera images and does not care where they come from. The motion commands (`move_forward`, `rotate`) are executed by the backend abstraction layer, which maps to ROS2/Nav2 on real hardware and to the Gazebo physics engine in simulation.
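One way to picture that split is as a small motion interface with one implementation per target. This is a hypothetical sketch of the idea, not the SDK's actual class layout:

```python
from typing import Protocol

class MotionBackend(Protocol):
    """What any backend must provide to the perception-action loop."""
    def move_forward(self, distance: float) -> None: ...
    def rotate(self, angle: float) -> None: ...
    def stop(self) -> None: ...

class GazeboBackend:
    """Maps motion commands onto the Gazebo physics engine."""
    def move_forward(self, distance: float) -> None:
        ...  # step the simulated base forward by `distance` meters

class RealBackend:
    """Maps the same commands onto ROS2/Nav2 on the physical robot."""
    def move_forward(self, distance: float) -> None:
        ...  # e.g. publish a velocity command or Nav2 goal over ROS2
```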
The VLM runner detects CJK characters in the instruction and adjusts its system prompt accordingly:
async with Robot(backend="gazebo") as robot: result = await robot.execute_instruction("找到红色的瓶子并停在它旁边") print(f"结果: {result.description}")模型将在 reason 字段中使用中文进行推理,同时保持动作键(action keys)为英文以确保可靠解析。
Combining a VLM with VLA Models
For research workflows, you may want a VLM for high-level reasoning and a VLA (Vision-Language-Action) model for fine-grained motor control:
```python
import asyncio
from threewe import Robot
from threewe.ai.vlm_runner import VLMRunner
from threewe.ai.vla_runner import VLARunner

async def hybrid_control():
    vlm = VLMRunner(model="gpt-4o", max_steps=10)
    vla = VLARunner.from_pretrained("lerobot/act_3we_nav")

    async with Robot(backend="gazebo") as robot:
        for step in range(50):
            image = robot.get_image()

            # High-level: ask VLM what to do
            if step % 10 == 0:
                plan = vlm.plan(image, "navigate to the charging station")
                print(f"VLM plan: {plan}")

            # Low-level: VLA generates smooth motor commands
            obs = robot.get_observation(modalities=["image", "lidar", "velocity"])
            action = vla.predict(obs, instruction="go to charging station")
            robot.execute_action(action)

asyncio.run(hybrid_control())
```

This pattern uses the VLM as a "strategic advisor" triggered every N steps, while the VLA handles motor commands frame by frame and produces smooth trajectories.
Typical performance characteristics of the loop:

| Metric | Value |
|---|---|
| VLM API latency (GPT-4o) | ~800 ms per step |
| Image encoding time | ~5 ms |
| Total loop time per step | ~1.2 s (including motion) |
| Typical steps to complete a task | 3-8 |
| Token usage per step | ~800 input, ~50 output |
For latency-sensitive applications, consider:
- Bounding execution time with `max_steps=10`
- Switching to a local VLM (Qwen-VL, LLaVA) via `THREEWE_VLM_BASE_URL`
- Caching VLM responses for similar scenes (see the sketch after this list)
- Using the VLA runner for reactive control and calling the VLM only when replanning is needed
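For the caching idea above, one hypothetical approach is to key responses on a coarse hash of the camera frame, so visually near-identical views reuse the previous plan instead of paying for a new API call. A sketch, assuming OpenCV-style ndarray frames:

```python
import hashlib

class VLMResponseCache:
    """Illustrative cache: reuse VLM plans for near-identical views."""

    def __init__(self) -> None:
        self._cache: dict[str, str] = {}

    def _key(self, image, instruction: str) -> str:
        # Downsample aggressively so small pixel noise maps to the same key.
        thumb = image[::32, ::32].tobytes()
        return hashlib.md5(thumb).hexdigest() + instruction

    def plan(self, runner, image, instruction: str) -> str:
        key = self._key(image, instruction)
        if key not in self._cache:
            self._cache[key] = runner.plan(image, instruction)  # cache miss: real call
        return self._cache[key]
```

An exact hash only catches nearly identical frames; a perceptual hash or embedding distance would generalize better, at the cost of extra dependencies.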
The loop is designed to be resilient to failures:
```python
import asyncio
from threewe import Robot, NavigationError, TimeoutError

async def robust_vlm_control():
    async with Robot(backend="gazebo") as robot:
        try:
            result = await robot.execute_instruction(
                "find the exit door and stop in front of it"
            )
            if result.success:
                print(f"Task completed: {result.description}")
            else:
                print(f"Task failed after max steps: {result.description}")
                # result.images contains all camera frames for analysis
                print(f"Collected {len(result.images)} frames for post-mortem")
        except NavigationError as e:
            print(f"Navigation failed: {e}")
        except TimeoutError as e:
            print(f"Operation timed out: {e}")

asyncio.run(robust_vlm_control())
```

If the VLM returns invalid JSON, that step is skipped and the loop continues. If a navigation action fails (obstacle, timeout), the next VLM step sees the current state and can adapt.
Supported VLM Backends
| Backend | Models | Configuration |
|---|---|---|
| OpenAI | GPT-4o, GPT-4-turbo | `OPENAI_API_KEY` |
| Qwen | Qwen-VL-Max, Qwen-VL-Plus | `THREEWE_VLM_BASE_URL` plus a compatible key |
| Local | LLaVA, CogVLM | Custom `base_url` pointing at the local server |
| Azure OpenAI | GPT-4o | Azure endpoint via `base_url` |
Any model that exposes the OpenAI chat completions API and accepts image input works out of the box.
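For example, pointing the runner at a locally served LLaVA behind an OpenAI-compatible server might look like this; host, port, and model name are placeholders for your own deployment:

```bash
# Any OpenAI-compatible serving stack works; values below are examples only
export THREEWE_VLM_BASE_URL="http://localhost:8000/v1"
export THREEWE_VLM_MODEL="llava-v1.6-34b"
export OPENAI_API_KEY="placeholder"   # some local servers still expect a key to be set
```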
Complete Runnable Example: Object Search
Here is a complete, runnable example that searches for an object, logs every step, and saves the trajectory images:
"""VLM-controlled object search.
Requirements: pip install threewe[ai] export OPENAI_API_KEY="sk-..."
Usage: python vlm_search.py"""
import asyncioimport jsonfrom pathlib import Path
from threewe import Robotfrom threewe.ai.vlm_runner import execute_vlm_instruction
INSTRUCTION = "find the red bottle and stop near it"OUTPUT_DIR = Path("vlm_results")
def on_step(step: int, raw: str, action: dict) -> None: print(f"[Step {step:02d}] {action.get('action', '?'):14s} | {action.get('reason', '')}")
async def main() -> None: OUTPUT_DIR.mkdir(exist_ok=True)
async with Robot(backend="gazebo", scene="office_v2") as robot: result = await execute_vlm_instruction( robot, instruction=INSTRUCTION, model="gpt-4o", max_steps=20, on_step=on_step, )
print(f"\n{'='*50}") print(f"Success: {result.success}") print(f"Description: {result.description}") print(f"Steps taken: {len(result.images)}")
# Save trajectory images for analysis for i, img in enumerate(result.images): path = OUTPUT_DIR / f"step_{i:03d}.jpg" try: from PIL import Image import numpy as np
pil_img = Image.fromarray(img[:, :, ::-1]) # BGR to RGB pil_img.save(str(path)) except ImportError: pass
print(f"Images saved to: {OUTPUT_DIR}/")
if __name__ == "__main__": asyncio.run(main())pip install threewe[ai]这会安装核心 SDK 以及 VLM 集成所需的 openai 和 Pillow 依赖。
2. Set the API Key
```bash
export OPENAI_API_KEY="sk-your-key-here"
```

3. Launch the Simulation (Optional, for Testing)
```bash
threewe launch --backend gazebo --scene office_v2
```

4. Run the Script
Section titled “4. 运行脚本”import asynciofrom threewe import Robot
async def main(): async with Robot(backend="gazebo") as robot: result = await robot.execute_instruction("find the red bottle and stop near it") print(f"Done! Success={result.success}: {result.description}")
asyncio.run(main())5. 切换到真实硬件
```python
async with Robot(backend="real") as robot:
    result = await robot.execute_instruction("find the red bottle and stop near it")
```

No other code changes are needed.
This is the simplest entry point into embodied AI. From here, you can:
- Train VLA models on trajectories collected from VLM executions
- Use the Gymnasium environment (`threewe.gym`) for reinforcement-learning training
- Run benchmarks (`threewe benchmark run --task objectnav`) to measure your agent
- Submit results to the community leaderboard
The key insight: you do not need to understand ROS2, Nav2, SLAM, or motor control to make a robot do useful work through language. The threewe SDK abstracts all of it behind a Python-native interface.