ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

¹Southeast University   ²Shanghai AI Laboratory

Abstract

Visual navigation is a fundamental capability for autonomous home-assistance robots, enabling the execution of long-horizon tasks such as object search. While recent methods have leveraged Large Language Models (LLMs) to incorporate commonsense reasoning and improve exploration efficiency, their planning processes remain constrained by textual representations, which cannot adequately capture spatial occupancy or scene geometry, both critical factors for informed navigation decisions. In this work, we explore whether Vision-Language Models (VLMs) can achieve mapless visual navigation using only onboard RGB/RGB-D streams, unlocking their potential for spatial perception and planning. To this end, we develop the imagination-powered navigation framework ImagineNav++, which imagines future observations at valuable candidate robot views and thereby reduces the complex navigation planning process to a simple best-view image selection problem for the VLM. Specifically, we first introduce a future-view imagination module that distills human navigation preferences to generate semantically meaningful candidate viewpoints with high exploration potential. These imagined future views then serve as visual prompts for the VLM to identify the most informative viewpoint. To maintain spatial consistency, we develop a selective foveation memory mechanism that hierarchically integrates keyframe observations through a sparse-to-dense framework, constructing a compact yet comprehensive memory for long-term spatial reasoning. This integrated approach effectively transforms the challenging goal-oriented navigation problem into a series of tractable point-goal navigation tasks. Extensive experiments on open-vocabulary object and instance navigation benchmarks demonstrate that ImagineNav++ achieves state-of-the-art performance in the mapless setting, even surpassing most cumbersome map-based methods, underscoring the importance of scene imagination and scene memory in VLM-based spatial reasoning.
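
The selective foveation memory can be pictured as a two-level store: a dense window of the most recent observations plus a sparse set of older keyframes that are kept only when they add new information. The short Python sketch below is one illustrative reading of that idea under stated assumptions; the class name FoveationMemory, the novelty-score interface, and the size/threshold defaults are placeholders for exposition, not the paper's implementation.

# A minimal sketch of a sparse-to-dense keyframe memory, assuming a
# user-supplied novelty score between a frame and the existing keyframes.
from collections import deque
from typing import Any, Callable, Deque, List

class FoveationMemory:
    def __init__(self, novelty: Callable[[Any, List[Any]], float],
                 dense_size: int = 4, sparse_size: int = 8, threshold: float = 0.5):
        self.novelty = novelty                              # novelty of a frame w.r.t. kept keyframes
        self.dense: Deque[Any] = deque(maxlen=dense_size)   # dense window of recent observations
        self.sparse: List[Any] = []                         # compact long-term keyframes
        self.sparse_size = sparse_size
        self.threshold = threshold

    def add(self, frame: Any) -> None:
        # A frame about to fall out of the dense window is promoted to the
        # sparse store only if it is sufficiently novel.
        if len(self.dense) == self.dense.maxlen:
            oldest = self.dense[0]
            if self.novelty(oldest, self.sparse) > self.threshold:
                self.sparse.append(oldest)
                self.sparse = self.sparse[-self.sparse_size:]
        self.dense.append(frame)

    def snapshot(self) -> List[Any]:
        # Sparse-to-dense ordering: coarse long-term context first, recent detail last.
        return self.sparse + list(self.dense)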

Approach


The framework consists of four key components: the future-view imagination module (Where2Imagine + NVS), the selective foveation memory module, the VLM-based high-level planner, and the low-level PointNav controller. Through an iterative cycle of imagination, reasoning, and execution, it decomposes long-horizon goal-oriented navigation into tractable sub-tasks without requiring explicit mapping.
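
For concreteness, the sketch below shows one way these components could be wired into that imagination-reasoning-execution cycle. The interfaces propose_views, render_view, select_view, pointnav, and goal_reached, together with the NavComponents container, are hypothetical stand-ins for Where2Imagine, the NVS renderer, the VLM best-view prompt, and the PointNav policy; they are not the released API.

# A minimal sketch of the iterative loop, with all component interfaces
# assumed (hypothetical placeholders, not the authors' actual code).
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Pose = Tuple[float, float, float]   # (x, y, yaw) relative to the current robot frame
Image = Any                         # an RGB(-D) observation; format left abstract

@dataclass
class NavComponents:
    propose_views: Callable[[Image], List[Pose]]                  # Where2Imagine: candidate viewpoints
    render_view: Callable[[Image, Pose], Image]                   # NVS: imagined observation at a pose
    select_view: Callable[[str, List[Image], List[Image]], int]   # VLM: index of the most informative view
    pointnav: Callable[[Pose], Image]                             # low-level controller, returns new observation
    goal_reached: Callable[[Image, str], bool]                    # e.g., open-vocabulary detection check

def navigate(goal: str, first_obs: Image, c: NavComponents,
             memory_limit: int = 8, max_steps: int = 50) -> bool:
    """Decompose goal-oriented navigation into a sequence of point-goal sub-tasks."""
    obs = first_obs
    memory: List[Image] = []                                    # keyframe memory (see foveation sketch above)
    for _ in range(max_steps):
        if c.goal_reached(obs, goal):
            return True
        candidates = c.propose_views(obs)                       # semantically meaningful viewpoints
        imagined = [c.render_view(obs, p) for p in candidates]  # future-view imagination
        best = c.select_view(goal, imagined, memory)            # VLM best-view selection
        memory.append(obs)
        memory = memory[-memory_limit:]                         # crude cap; the paper uses selective foveation
        obs = c.pointnav(candidates[best])                      # execute the point-goal sub-task
    return False

Each iteration converts one high-level decision into a point-goal sub-task handed to the low-level controller, which is exactly the decomposition described in the caption above.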

Results

Experiment 1: Object-Goal Navigation

Table I. Comparison with previous work on object-goal navigation.

Experiment 2: Instance-Image-Goal Navigation

Table II. Comparison with previous work on instance-image-goal navigation.

Navigation Demo

BibTeX

@misc{wang2025imaginenavpromptingvisionlanguagemodels,
      title={ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination}, 
      author={Teng Wang and Xinxin Zhao and Wenzhe Cai and Changyin Sun},
      year={2025},
      eprint={2512.17435},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.17435}, 
}