Exploration and Application of Vision-Language Navigation (VLN) for Legged Robots in Subway Tunnel Environments

January 30th, 2026

This thesis investigates whether Vision-and-Language Navigation (VLN) can be reliably transferred from conventional benchmarks to subway tunnel environments, enabling a quadruped robot to execute inspection-oriented navigation tasks under constrained geometry, degraded visibility, and limited connectivity. The work is motivated by recent vision-language-action approaches that connect language grounding with embodied control for legged platforms (e.g., NaVILA) [1]; the applicability of such paradigms to tunnel settings, however, remains underexplored.

The study uses an existing tunnel environment dataset (visual and structural information) and a high-fidelity tunnel simulation setup to train and evaluate a VLN model. Evaluation will focus on instruction-following success, path efficiency, robustness to tunnel-specific disturbances, and (optionally) transfer to real-world deployment on a physical quadruped robot, following standard VLN evaluation practices [2,6].
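
As a point of reference for these metrics, a minimal Python sketch of success rate and SPL (success weighted by path length), following the standard definitions in [6], could look as follows; the episode fields and the 3 m success radius are illustrative assumptions, not requirements of this proposal.

# Minimal sketch of standard VLN metrics (success rate and SPL); episode fields
# and the 3 m success radius are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class EpisodeResult:
    dist_to_goal: float          # distance between stop position and goal (m)
    path_length: float           # length of the path actually traveled (m)
    shortest_path_length: float  # geodesic start-to-goal distance (m)

def success_rate(results: List[EpisodeResult], radius: float = 3.0) -> float:
    """Fraction of episodes that stop within `radius` meters of the goal."""
    return sum(r.dist_to_goal <= radius for r in results) / len(results)

def spl(results: List[EpisodeResult], radius: float = 3.0) -> float:
    """Success weighted by normalized inverse path length, as in [6]."""
    total = 0.0
    for r in results:
        success = r.dist_to_goal <= radius
        total += success * r.shortest_path_length / max(r.path_length, r.shortest_path_length)
    return total / len(results)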

This thesis is co-advised by me and Fan Yang (yang@icom.rwth-aachen.de) at ICoM (Institute for Construction Management, Digital Engineering and Robotics in Construction). The second supervisor is Dr. Hendrik Morgenstern (morgenstern@icom.rwth-aachen.de).

Thesis Type
  • Master
Student
Jing Wu
Status
Running
Presentation room
Seminar room I5 - 6202
Supervisor(s)
Stefan Decker
Advisor(s)
Yixin Peng
Contact
peng@dbis.rwth-aachen.de

Background

Classical autonomous navigation is often realized as a pipeline with multiple interacting modules (perception, mapping/localization, planning, control), where each stage introduces modeling choices and potential error propagation [4, 5]. While modularity supports interpretability and engineering control, it can be brittle when sensing quality degrades or when assumptions (e.g., stable illumination, reliable communication) are violated.

VLN addresses navigation from a different angle: it frames navigation as an embodied grounding problem where an agent must interpret natural-language instructions in the context of its visual observations and select actions accordingly [2, 6]. Benchmark-driven progress in VLN has been enabled by large-scale indoor datasets and simulators, most prominently the Room-to-Room (R2R) benchmark and the associated Matterport3D simulation ecosystem [6], as well as embodied-AI platforms such as Habitat [7]. Building on this foundation, modern VLN systems increasingly leverage transformer-based multimodal encoders (e.g., Recurrent VLN-BERT) to improve instruction-vision alignment and history-aware decision making [8], and data augmentation / pragmatic reasoning schemes to reduce supervision bottlenecks [9]. Recent surveys further document a shift toward foundation-model-driven embodied planning and reasoning, and discuss how such models may reshape VLN method design and evaluation [3].
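
To make the instruction-vision alignment idea concrete, the sketch below shows a single cross-attention step in which instruction token features attend over visual features before an action is predicted; the dimensions, module layout, and action set are illustrative assumptions and do not reproduce any specific cited architecture.

# Illustrative single cross-attention step: instruction tokens attend over visual
# features, then a discrete action distribution is predicted. Dimensions and layout
# are assumptions, not a reproduction of any cited model.
import torch
import torch.nn as nn

class InstructionVisionAlignment(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, num_actions: int = 6):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.action_head = nn.Linear(dim, num_actions)

    def forward(self, lang_tokens: torch.Tensor, vis_tokens: torch.Tensor) -> torch.Tensor:
        # lang_tokens: (B, L, dim) instruction features; vis_tokens: (B, V, dim) visual features
        fused, _ = self.cross_attn(query=lang_tokens, key=vis_tokens, value=vis_tokens)
        # Pool the instruction-conditioned representation and score candidate actions.
        return self.action_head(fused.mean(dim=1))

# Example: batch of 2, 20 instruction tokens, 36 visual view features, 256-dim embeddings.
model = InstructionVisionAlignment()
action_logits = model(torch.randn(2, 20, 256), torch.randn(2, 36, 256))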

However, most established VLN settings implicitly assume everyday operating conditions (e.g., indoor homes/offices) with comparatively favorable sensing and infrastructure [2, 6]. Subway tunnels differ substantially: tunnel inspection literature highlights environmental factors such as absence of natural light, dust/humidity, and other adverse conditions that complicate perception and operation [10]. Work on underground robotics more broadly emphasizes persistent challenges including communication limitations and degraded visibility in subterranean environments [11]. Moreover, narrow corridors with irregular obstacles create stringent geometric constraints that stress both locomotion and planning; recent studies on quadruped inspection in cable-tunnel-like environments underscore these constraints and their practical relevance [12]. These characteristics motivate a systematic investigation of how far VLN—especially when paired with legged locomotion—can be adapted to safety-critical tunnel applications, and what training/evaluation practices are required to achieve robust performance [1, 3].


Tasks

a) Comprehensive review and reproduction of reference methods

  • Study selected VLN / vision-language-action reference methods with emphasis on:
    • Model architecture, training strategy, and data requirements [1, 2, 3, 6, 7, 8, 9]
    • Evaluation protocols and common failure modes [2, 3, 6]
  • Reproduce reported results on the benchmarks used by NaVILA [1].

b) Dataset preparation and preprocessing

  • Convert the existing tunnel environment data into a VLN-ready format (a possible episode schema is sketched after this list), including:
    • trajectory representation, observations, and action/state definitions
    • train/validation/test splits and quality controls (noise, imbalance, missing data)
  • Deliverables:
    • documented dataset schema and preprocessing pipeline
    • reproducible data generation scripts and a dataset statistics report
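
A possible episode schema for the VLN-ready conversion in (b) is sketched below; all field names, units, and the discrete action set are assumptions to be refined against the actual tunnel data.

# Hypothetical episode schema for the VLN-ready tunnel dataset; field names, units,
# and the discrete action set are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Observation:
    rgb_path: str        # camera frame recorded/rendered at this step
    depth_path: str      # aligned depth image, if available
    pose: List[float]    # [x, y, z, qx, qy, qz, qw] in the tunnel reference frame

@dataclass
class TunnelVLNEpisode:
    episode_id: str
    split: str                                         # "train" | "val" | "test"
    instruction: str                                   # natural-language navigation/inspection instruction
    observations: List[Observation] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)   # e.g. "forward", "turn_left", "turn_right", "stop"
    geodesic_distance: float = 0.0                     # shortest start-to-goal distance (m)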

c) Model training and fine-tuning

  • Select a baseline VLN model and adapt it to tunnel characteristics (e.g., following a VLA-style two-level design for legged robots) [1].
  • Train on the tunnel dataset (a minimal fine-tuning loop is sketched after this list) with iterative tuning of:
    • hyperparameters and optimization schedule
    • regularization and robustness interventions
  • Deliverables:
    • training code, configuration files, and an ablation plan
    • trained checkpoints and training logs (reproducibility package)
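
As a starting point for (c), a minimal fine-tuning loop in PyTorch might look as follows; the model interface, batch fields, and hyperparameters are placeholders rather than the actual recipe of any cited method.

# Minimal fine-tuning loop sketch (PyTorch); model interface, batch fields, and
# hyperparameters are placeholders.
import torch
from torch.utils.data import DataLoader

def finetune(model, train_set, epochs: int = 10, lr: float = 1e-4, weight_decay: float = 0.01):
    loader = DataLoader(train_set, batch_size=8, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs * len(loader))
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        for batch in loader:
            # The batch is assumed to provide instruction/visual features and ground-truth actions.
            logits = model(batch["lang_tokens"], batch["vis_tokens"])
            loss = loss_fn(logits, batch["action_label"])
            optimizer.zero_grad()
            loss.backward()
            # Gradient clipping as a simple robustness/regularization intervention.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
        # Keep per-epoch checkpoints and logs for the reproducibility package.
        torch.save(model.state_dict(), f"tunnel_vln_epoch{epoch:02d}.pt")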

d) Simulation-based validation and (optional) real-world deployment

  • Evaluate in a high-fidelity tunnel simulation environment, reporting:
    • navigation success rate and path efficiency
    • collision/obstacle avoidance behavior
    • responsiveness and correctness w.r.t. language instructions
    • robustness to low-light / perceptual degradation scenarios (as available in simulation; a simple perturbation sketch follows this list) [10, 11, 12]
  • Optional: deploy on a physical quadruped robot for field testing; assess sim-to-real transfer, identify systematic discrepancies, and document failure cases.
  • Deliverables:
    • evaluation report with quantitative metrics and qualitative error analysis
    • summary of practical feasibility and recommendations for tunnel VLN deployment
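
For the robustness evaluation in (d), low-light and sensor-degradation scenarios could be approximated in simulation with simple image perturbations, as sketched below; the brightness and noise parameters are illustrative assumptions and may differ from actual tunnel conditions.

# Simple perceptual-degradation perturbations for robustness evaluation in simulation;
# brightness and noise parameters are illustrative assumptions.
from typing import Optional
import numpy as np

def degrade_observation(rgb: np.ndarray, brightness: float = 0.3, noise_std: float = 0.05,
                        rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """Darken an RGB frame (values in [0, 1]) and add sensor-like Gaussian noise."""
    rng = rng or np.random.default_rng()
    degraded = rgb * brightness                                  # absence of natural light
    degraded = degraded + rng.normal(0.0, noise_std, rgb.shape)  # dust/sensor noise
    return np.clip(degraded, 0.0, 1.0)

# Example: re-run the same evaluation episodes at progressively lower light levels.
light_levels = [1.0, 0.5, 0.25, 0.1]
# for level in light_levels:
#     frame_degraded = degrade_observation(frame, brightness=level)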

References

  1. Cheng, A.-C., Ji, Y., Yang, Z., Gongye, Z., Zou, X., Kautz, J., Bıyık, E., Yin, H., Liu, S., Wang, X.: NaVILA: Legged Robot Vision-Language-Action Model for Navigation. arXiv preprint arXiv:2412.04453 (2024)
  2. Gu, J., Stefani, E., Wu, Q., Thomason, J., Wang, X.E.: Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 7606–7623. Association for Computational Linguistics (2022)
  3. Zhang, Y., Ma, Z., Li, J., Qiao, Y., Wang, Z., Chai, J., Wu, Q., Bansal, M., Kordjamshidi, P.: Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models. Transactions on Machine Learning Research (TMLR) (2024)
  4. Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. MIT Press (2005)
  5. LaValle, S.M.: Planning Algorithms. Cambridge University Press (2006). https://doi.org/10.1017/CBO9780511546877
  6. Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., van den Hengel, A.: Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3674–3683 (2018)
  7. Savva, M., Chang, A.X., Dosovitskiy, A., Funkhouser, T., Koltun, V.: Habitat: A Platform for Embodied AI Research. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9339–9347 (2019)
  8. Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: A Recurrent Vision-and-Language BERT for Navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1643–1653 (2021)
  9. Fried, D., Hu, R., Cirik, V., Rohrbach, A., Andreas, J., Morency, L.-P., Berg-Kirkpatrick, T., Saenko, K., Klein, D., Darrell, T.: Speaker-Follower Models for Vision-and-Language Navigation. In: Advances in Neural Information Processing Systems 31 (NeurIPS). pp. 3318–3329 (2018)
  10. Menendez, E., Victores, J.G., Montero, R., Martínez, S., Balaguer, C.: Tunnel Structural Inspection and Assessment Using an Autonomous Robotic System. Automation in Construction 87, 117–126 (2018). https://doi.org/10.1016/j.autcon.2017.12.001
  11. Konieczna-Fuławka, M., Koval, A., Nikolakopoulos, G., Fumagalli, M., Santas Moreu, L., Vigara-Puche, V., Müller, J., Prenner, M.: Autonomous Mobile Inspection Robots in Deep Underground Mining—The Current State of the Art and Future Perspectives. Sensors 25(12), 3598 (2025). https://doi.org/10.3390/s25123598
  12. Wu, J., Huang, Y., Lai, Y., Yang, S., Zhang, C.: Obstacle Avoidance Inspection Method of Cable Tunnel for Quadruped Robot Based on Particle Swarm Algorithm and Neural Network. Scientific Reports 15, 36065 (2025). https://doi.org/10.1038/s41598-025-19903-w

Prerequisites
  • Solid programming skills in Python; experience with deep learning frameworks (e.g., PyTorch)
  • Background in robotics navigation concepts (state estimation / SLAM, planning, control)
  • Familiarity with multimodal learning (vision-language models, transformers)
  • Experience with robotics simulation and deployment workflows