VLAs for Navigation

Introduction

Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for grounding natural-language instructions in perception and control, enabling robots to execute high-level commands specified in human-friendly terms. In this project, we consider a navigation setting in which a mobile robot (Unitree Go2) receives an egocentric RGB-D observation together with a short natural-language instruction (e.g., “go to the red box”, “come to the door”) and must output low-level navigation commands (such as velocity commands or waypoints) that drive it toward the described target [1,2].
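
To make the intended interface concrete, the sketch below shows one possible way to structure the policy inputs and outputs in Python; the field names, shapes, and units are illustrative assumptions rather than a fixed specification.

    # Hypothetical observation/action interface for the language-conditioned
    # navigation policy; field names, shapes, and units are assumptions only.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class NavObservation:
        rgb: np.ndarray        # egocentric RGB image, e.g. (H, W, 3), uint8
        depth: np.ndarray      # aligned depth map, e.g. (H, W), meters
        instruction: str       # natural-language command, e.g. "go to the red box"

    @dataclass
    class NavAction:
        vx: float              # forward velocity command [m/s]
        vy: float              # lateral velocity command [m/s]
        wz: float              # yaw-rate command [rad/s]

    # Alternatively, the policy could output a short-horizon waypoint
    # (x, y, heading) in the robot frame instead of velocity commands.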

Conditioning actions on language offers several advantages over traditional goal specification mechanisms (e.g., manually defined goal coordinates or semantic labels): it allows users to specify tasks in an intuitive way, enables compositional generalization to novel instructions and scenes, and naturally integrates contextual information about objects and landmarks. At the same time, deploying VLA models on real robots raises important questions regarding data efficiency, robustness, sim-to-real transfer, and integration with existing navigation stacks. This project aims to study these aspects in the context of a quadruped robot platform.

Objectives

The main objectives of the project are:

  • Conduct a comprehensive survey of VLA models for navigation and related embodied instruction-following methods, highlighting why language-conditioned policies are valuable and what open challenges remain.
  • Develop a simulated navigation benchmark in Isaac Lab / Isaac Sim (see the task-specification sketch after this list) comprising:
    • a Go2 quadruped robot equipped with an RGB-D camera,
    • a set of objects and landmarks (e.g., boxes, chairs, doors) with associated textual descriptions,
    • a suite of tasks such as point-goal and object-goal navigation specified through natural-language instructions.
  • Implement a basic VLA policy (see the policy sketch after this list) that includes:
    • a vision encoder for egocentric RGB-D observations,
    • a language encoder for natural-language instructions,
    • a fusion module and MLP head that output low-level navigation actions.
  • Design and implement a data-collection and training pipeline (e.g., scripted policies, exploration policies, or teacher demonstrations) for learning the VLA navigation policy in simulation (see the behavior-cloning sketch after this list).
  • Deploy and evaluate the learned policy on the physical Go2 platform, assessing success rate, robustness to real-world sensing noise, and sim-to-real transfer quality.
  • Release a clean, well-documented, and reproducible codebase hosted in a Git repository, including configuration files and scripts for simulation, training, and deployment.
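
For the simulated benchmark, one possible way to describe episodes is a small task specification such as the sketch below; the field names, the task taxonomy, and the success criterion are assumptions for illustration rather than a fixed schema.

    # Hypothetical task/episode specification for the simulated benchmark.
    # Field names and the two task types are illustrative assumptions.
    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class SceneObject:
        name: str                             # asset identifier, e.g. "red_box"
        description: str                      # text used to build instructions
        position: Tuple[float, float, float]  # world-frame position [m]

    @dataclass
    class NavTask:
        task_type: str                        # "point_goal" or "object_goal"
        instruction: str                      # e.g. "go to the red box"
        goal_xy: Tuple[float, float]          # goal position in the world frame [m]
        success_radius: float = 0.5           # success threshold [m]

    @dataclass
    class Episode:
        start_pose: Tuple[float, float, float]         # robot start (x [m], y [m], yaw [rad])
        scene_objects: List[SceneObject] = field(default_factory=list)
        task: Optional[NavTask] = None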
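
For the basic VLA policy, the following PyTorch sketch illustrates the intended structure (vision encoder, language encoder, fusion module, MLP head). The small CNN, the GRU-based language encoder, the layer sizes, and the 3-dimensional velocity output are placeholder choices; a pretrained vision-language backbone could be substituted.

    # Minimal sketch of the planned VLA policy; architectural choices and
    # dimensions are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class VisionEncoder(nn.Module):
        """Encodes a 4-channel egocentric RGB-D image into a feature vector."""
        def __init__(self, out_dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, out_dim),
            )

        def forward(self, rgbd: torch.Tensor) -> torch.Tensor:
            return self.net(rgbd)  # (B, out_dim)

    class LanguageEncoder(nn.Module):
        """Encodes a tokenized instruction with an embedding + GRU.
        A pretrained text encoder (e.g. CLIP) could be swapped in here."""
        def __init__(self, vocab_size: int = 10000, out_dim: int = 256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, 128)
            self.gru = nn.GRU(128, out_dim, batch_first=True)

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            _, h = self.gru(self.embed(tokens))
            return h[-1]  # (B, out_dim)

    class VLANavPolicy(nn.Module):
        """Fuses vision and language features and outputs a velocity command."""
        def __init__(self, feat_dim: int = 256, action_dim: int = 3):
            super().__init__()
            self.vision = VisionEncoder(feat_dim)
            self.language = LanguageEncoder(out_dim=feat_dim)
            self.head = nn.Sequential(
                nn.Linear(2 * feat_dim, 256), nn.ReLU(),
                nn.Linear(256, action_dim),  # (vx, vy, wz)
            )

        def forward(self, rgbd: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
            fused = torch.cat([self.vision(rgbd), self.language(tokens)], dim=-1)
            return self.head(fused)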
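
For the data-collection and training pipeline, plain behavior cloning on teacher demonstrations is one plausible starting point. The loop below assumes a dataset that yields (RGB-D tensor, instruction-token tensor, expert action) triples; the dataset itself, the hyperparameters, and the MSE loss are placeholders.

    # Hypothetical behavior-cloning loop over demonstrations collected in
    # simulation (e.g. from a scripted or privileged teacher policy).
    # `demo_dataset` is assumed to yield (rgbd, tokens, expert_action) tensors.
    import torch
    from torch.utils.data import DataLoader

    def train_bc(policy: VLANavPolicy, demo_dataset, epochs: int = 10, lr: float = 3e-4):
        loader = DataLoader(demo_dataset, batch_size=64, shuffle=True)
        optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
        loss_fn = torch.nn.MSELoss()
        for _ in range(epochs):
            for rgbd, tokens, expert_action in loader:
                pred = policy(rgbd, tokens)        # predicted (vx, vy, wz)
                loss = loss_fn(pred, expert_action)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return policy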

Contacts

Georges Jetti: georges.jetti@polimi.it
Michael Khayyat: michael.khayyat@polimi.it
Stefano Arrigoni: stefano.arrigoni@polimi.it

References

[1] Cheng, An-Chieh, et al. “NaVILA: Legged Robot Vision-Language-Action Model for Navigation.” arXiv preprint arXiv:2412.04453 (2024).

[2] Qin, Xinyao, et al. “Integrating Diffusion-based Multi-task Learning with Online Reinforcement Learning for Robust Quadruped Robot Control.” arXiv preprint arXiv:2507.05674 (2025).