Navigating unfamiliar environments presents significant challenges for household robots, requiring the ability to recognize and reason about novel decorations and layouts. Existing reinforcement learning methods cannot be directly transferred to new environments, as they typically rely on extensive mapping and exploration, which is time-consuming and inefficient. To address these challenges, we transfer the logical knowledge and generalization ability of pre-trained foundation models to zero-shot navigation. By integrating a large vision-language model with a diffusion network, our approach, named NavigateDiff, constructs a visual predictor that continuously predicts the agent's potential observation at the next step, helping the robot generate robust actions. Furthermore, to account for the temporal nature of navigation, we introduce historical information so that the predicted image remains aligned with the navigation scene. We then design an information fusion framework that embeds the predicted future frames as guidance into a goal-reaching policy to solve downstream image navigation tasks. This approach enhances navigation control and generalization across both simulated and real-world environments. Through extensive experiments, we demonstrate the robustness and versatility of our method, showcasing its potential to improve the efficiency and effectiveness of robotic navigation in diverse settings.
NavigateDiff utilizes the logical knowledge and generalization ability of pre-trained foundation models to improve zero-shot navigation in the presence of novel environments, scenes, and objects. How can we achieve this when foundation models trained on general-purpose Internet data do not provide direct guidance for selecting low-level navigation actions? Our key insight is to decouple the navigation problem into two stages: (I) generating intermediate future goals that must be reached to navigate successfully, and (II) learning low-level control policies for reaching these future goals. In Stage (I), we build a Predictor by combining a Multimodal Large Language Model (MLLM) with a diffusion model, fine-tuned in a parameter-efficient manner and designed specifically for Image Navigation. Stage (II) involves training a Fusion Navigation Policy on image navigation data and testing it in new environments. We describe data collection and each of these stages in detail below and summarize the resulting navigation algorithm.
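To make this two-stage decomposition concrete, below is a minimal interface sketch; the class and method names (FuturePredictor, FusionPolicy, predict_future, act) are illustrative placeholders, not the released NavigateDiff code.

```python
# Minimal interface sketch of the two-stage decomposition.
# All names here are illustrative placeholders, not the released implementation.
from dataclasses import dataclass
import numpy as np


@dataclass
class NavObservation:
    rgb: np.ndarray       # current egocentric RGB frame, shape (H, W, 3)
    goal_rgb: np.ndarray  # goal image the agent must reach, shape (H, W, 3)


class FuturePredictor:
    """Stage (I): MLLM + diffusion model that imagines the next intermediate goal frame."""

    def predict_future(self, obs: NavObservation, instruction: str) -> np.ndarray:
        """Return a predicted future frame conditioned on the current observation,
        the goal image, and the textual instruction."""
        raise NotImplementedError


class FusionPolicy:
    """Stage (II): goal-reaching policy guided by the predicted future frame."""

    def act(self, obs: NavObservation, future_frame: np.ndarray) -> int:
        """Return a discrete low-level action (e.g., move forward, turn, stop)."""
        raise NotImplementedError
```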
Multimodal Large Language Model. Given a current observation x_t, a goal image x_g, and a textual instruction y, the Predictor generates a future frame. As shown in Fig. 2, the current observation and goal image are processed through a frozen image encoder, while the textual instruction is tokenized and passed to the LLM. The Predictor then generates special image tokens based on the instruction's intent, though these tokens are initially limited to the language modality. They are subsequently sent to the Future Frame Prediction Model to produce the final future frame prediction. In practice, we use LLaVA as our base model, which consists of a pre-trained CLIP visual encoder (ViT-L/14) and Vicuna-7B as the LLM backbone. To fine-tune the LLM, we apply LoRA, introducing trainable low-rank matrices to adapt the model to the task.
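As a rough illustration of the parameter-efficient fine-tuning described above, the sketch below attaches LoRA adapters to a Vicuna-7B backbone with the Hugging Face peft library; the model identifier, target modules, and hyperparameters are assumptions rather than the paper's exact configuration.

```python
# Hedged sketch: LoRA fine-tuning of the LLM backbone with the peft library.
# Model id, target modules, and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

lora_config = LoraConfig(
    r=16,                                  # rank of the trainable low-rank matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # adapt only the attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

llm = get_peft_model(llm, lora_config)     # base weights stay frozen
llm.print_trainable_parameters()           # only the LoRA matrices are trainable
```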
NavigateDiff leverages a pretrained Future Frame Predictor to generate future frames based on the current observation and navigation task information. A fusion navigation policy then executes the actions needed to reach those future frames. Alternating between these two steps enables the agent to accomplish the navigation task.
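The alternation between prediction and control can be summarized by the closed-loop sketch below, reusing the placeholder interfaces from the earlier sketch; the environment API (reset, step, reached_goal) and the STOP action index are assumptions.

```python
# Closed-loop navigation sketch; the environment API and STOP index are assumptions.
STOP = 0

def navigate(env, predictor: FuturePredictor, policy: FusionPolicy,
             instruction: str, max_steps: int = 500) -> bool:
    obs = env.reset()
    for _ in range(max_steps):
        # Stage (I): imagine the next intermediate goal frame.
        future_frame = predictor.predict_future(obs, instruction)
        # Stage (II): take the low-level action that moves toward that frame.
        action = policy.act(obs, future_frame)
        if action == STOP:
            break
        obs = env.step(action)
    return env.reached_goal()
```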
As shown in Fig. 4, we design a Hybrid Fusion approach to fuse image features and compare its performance with Early Fusion and Late Fusion. In Early Fusion, the current observation, future frame prediction, and goal image are concatenated along the RGB channels and then passed through a visual encoder for feature extraction. While this method can capture pixel-level semantic relationships among the three images, it struggles to capture the logical relationships among them. In contrast, Late Fusion processes the three images separately through the visual encoder and then fuses them at the feature level, but this approach fails to capture pixel-level semantic correlations, leading to suboptimal performance. Our Hybrid Fusion is designed to combine the strengths of both.
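To make the contrast concrete, the sketch below shows the three fusion variants at the tensor level; the encoders and dimensions are placeholders, and the actual Hybrid Fusion module may combine the two streams differently.

```python
# Illustrative fusion variants; encoders and dimensions are placeholders,
# not the exact architecture used in the paper.
import torch
import torch.nn as nn


class FusionVariants(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        # Early fusion: one encoder over the 9-channel stack (3 RGB images).
        self.early_encoder = nn.Conv2d(9, feat_dim, kernel_size=8, stride=8)
        # Late / hybrid fusion: a shared per-image encoder plus a feature-level fuser.
        self.single_encoder = nn.Conv2d(3, feat_dim, kernel_size=8, stride=8)
        self.fuse = nn.Linear(3 * feat_dim, feat_dim)

    def early(self, obs, future, goal):
        x = torch.cat([obs, future, goal], dim=1)          # (B, 9, H, W): pixel-level mixing
        return self.early_encoder(x).flatten(2).mean(-1)   # (B, feat_dim)

    def late(self, obs, future, goal):
        feats = [self.single_encoder(img).flatten(2).mean(-1) for img in (obs, future, goal)]
        return self.fuse(torch.cat(feats, dim=-1))         # feature-level mixing

    def hybrid(self, obs, future, goal):
        # Combine both cues: pixel-level stack plus per-image feature fusion.
        return self.early(obs, future, goal) + self.late(obs, future, goal)
```

With obs, future, and goal batches of shape (B, 3, 128, 128), each variant returns a (B, 512) feature that could be fed to a policy head.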
We use three image-level metrics to evaluate the Predictor's generation ability: (1) Fréchet Inception Distance (FID), (2) Peak Signal-to-Noise Ratio (PSNR), and (3) Learned Perceptual Image Patch Similarity (LPIPS), each measuring the similarity between the generated future frame and the ground truth. In terms of the image-level metrics in Tab. I, our Predictor outperforms IP2P by a large margin (0.66, 0.14, and 1.43, respectively) on all three metrics on the Gibson dataset. In Fig. 5, we also visualize predicted future frame sequences and trajectory rollouts in the Gibson dataset. We observe that generating future frames one by one efficiently guides the PolicyNet in action generation.
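For reference, the three metrics can be computed with off-the-shelf implementations such as torchmetrics, as in the hedged sketch below; the tensors are random stand-ins, and the preprocessing (value ranges, feature size) follows common defaults rather than the paper's exact evaluation setup.

```python
# Hedged sketch: scoring predicted frames against ground truth with torchmetrics.
# The tensors are random stand-ins; preprocessing choices are illustrative defaults.
import torch
from torchmetrics.image import (
    FrechetInceptionDistance,
    PeakSignalNoiseRatio,
    LearnedPerceptualImagePatchSimilarity,
)

fid = FrechetInceptionDistance(feature=64)                       # uint8 images in [0, 255]
psnr = PeakSignalNoiseRatio(data_range=1.0)                      # floats in [0, 1]
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex")   # floats in [-1, 1]

pred_u8 = torch.randint(0, 256, (8, 3, 128, 128), dtype=torch.uint8)  # predicted frames
gt_u8 = torch.randint(0, 256, (8, 3, 128, 128), dtype=torch.uint8)    # ground-truth frames
pred_f, gt_f = pred_u8.float() / 255.0, gt_u8.float() / 255.0

fid.update(gt_u8, real=True)
fid.update(pred_u8, real=False)
psnr.update(pred_f, gt_f)
lpips.update(pred_f * 2 - 1, gt_f * 2 - 1)                       # rescale to [-1, 1]

print(fid.compute(), psnr.compute(), lpips.compute())
```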
In Tab. II, we present a detailed comparison between our model and several state-of-the-art approaches across various metrics. The results highlight the superior performance of our model, particularly in challenging navigation scenarios. To further evaluate the generalization capability of our approach, we conducted additional experiments using a smaller dataset that poses a greater challenge in terms of data availability. Despite the reduced dataset size, our model not only maintained its performance but also outperformed the baseline models. This demonstrates the robustness and adaptability of our model, suggesting it can effectively generalize to new environments even with limited training data.
As illustrated in Tab. III, we test the model on the MP3D dataset as part of a cross-task evaluation. Our NavigateDiff achieves a 68.0% Success Rate (SR) and 41.1% Success weighted by Path Length (SPL) while using a smaller training dataset, surpassing both existing methods trained on the full dataset and the baseline.
In our real-world experiments, we focused on indoor environments to evaluate the zero-shot navigation capabilities of our model, NavigateDiff. As illustrated in Fig. 6, we conducted tests in three types of indoor environments: an office, a parking lot, and a corridor. Each environment represents a unique set of challenges in terms of layout, lighting, and obstacles. The office setting is characterized by cluttered spaces, including desks, chairs, and other furniture. The indoor parking lot represents a semi-structured environment with clearly defined paths and open spaces but is filled with parked vehicles that act as static obstacles. The corridor is a long, narrow space with fewer obstacles but presents challenges in terms of navigation through tight spaces and sharp turns.
As detailed in Tab. IV, we evaluate the performance of NavigateDiff in terms of success rate and SPL. The metrics across the three real-world scenarios demonstrate that our model consistently surpasses the baseline.
As shown in Tab. V, we evaluate different fusion strategies on the Gibson ImageNav task. Our proposed Hybrid Fusion achieves 91.0% SR and 64.8% SPL, significantly outperforming both Early Fusion and Late Fusion. These results demonstrate the effectiveness of Hybrid Fusion in integrating future frames into the navigation policy.
We introduced NavigateDiff, a novel approach that leverages logical reasoning and generalization capabilities of pretrained foundation models to enhance zero-shot navigation by predicting future observations for robust action generation. By integrating temporal information and using a Hybrid Fusion framework to guide policy decisions, our approach significantly improves navigation performance. Extensive experiments demonstrate its efficiency and adaptability in both simulated and real-world environments.
@article{qin2025navigatediff,
title={NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants},
author={Qin, Yiran and Sun, Ao and Hong, Yuze and Wang, Benyou and Zhang, Ruimao},
journal={arXiv preprint arXiv:2502.13894},
year={2025}
}