This thesis investigates whether Vision-and-Language Navigation (VLN) can be reliably transferred from conventional benchmarks to subway tunnel environments, enabling a quadruped robot to execute inspection-oriented navigation tasks under constrained geometry, degraded visibility, and limited connectivity. The work is motivated by recent vision-language-action approaches that connect language grounding with embodied control for legged platforms (e.g., NaVILA) [1]; the applicability of such paradigms to tunnel settings, however, remains underexplored. The study uses an existing tunnel environment dataset (visual and structural information) and a high-fidelity tunnel simulation setup to train and evaluate a VLN model. Evaluation will focus on instruction-following success, path efficiency, robustness to tunnel-specific disturbances, and (optionally) transfer to real-world deployment on a physical quadruped robot, following standard VLN evaluation practices [2, 6]. The thesis is co-advised by Yixin Peng and Fan Yang (yang@icom.rwth-aachen.de) at ICoM (Institute for Construction Management, Digital Engineering and Robotics in Construction). The second supervisor is Dr. Hendrik Morgenstern (morgenstern@icom.rwth-aachen.de).
Thesis Type |
Student | Jing Wu
Status | Running
Presentation room | Seminar room I5 - 6202
Supervisor(s) | Stefan Decker
Advisor(s) | Yixin Peng
Contact | peng@dbis.rwth-aachen.de
Background
Classical autonomous navigation is often realized as a pipeline with multiple interacting modules (perception, mapping/localization, planning, control), where each stage introduces modeling choices and potential error propagation [4, 5]. While modularity supports interpretability and engineering control, it can be brittle when sensing quality degrades or when assumptions (e.g., stable illumination, reliable communication) are violated.
VLN addresses navigation from a different angle: it frames navigation as an embodied grounding problem where an agent must interpret natural-language instructions in the context of its visual observations and select actions accordingly [2, 6]. Benchmark-driven progress in VLN has been enabled by large-scale indoor datasets and simulators—most prominently the Room-to-Room (R2R) benchmark and the associated Matterport3D simulation ecosystem [6]—as well as embodied-AI platforms such as Habitat [7]. Building on this foundation, modern VLN systems increasingly leverage transformer-based multimodal encoders (e.g., VLN↻BERT) to improve instruction–vision alignment and history-aware decision making [8], and data augmentation / pragmatic reasoning schemes to reduce supervision bottlenecks [9]. Recent surveys further document a shift toward foundation-model-driven embodied planning and reasoning, and discuss how such models may reshape VLN method design and evaluation [3].
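To make this framing concrete, the following minimal PyTorch-style sketch shows a single decision step of an instruction-conditioned, history-aware agent. It is an illustrative toy, not the architecture of any cited system; all module and variable names are placeholders.

```python
# Minimal, illustrative sketch of one VLN decision step: the agent grounds its
# current visual observation in the instruction and selects a discrete action.
# Names and dimensions are placeholders, not the design of any cited system.
import torch
import torch.nn as nn

class ToyVLNPolicy(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_actions=4):
        super().__init__()
        self.instr_embed = nn.Embedding(vocab_size, d_model)    # instruction tokens
        self.obs_proj = nn.Linear(2048, d_model)                 # pre-extracted image features
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.history = nn.GRUCell(d_model, d_model)              # history-aware state
        self.action_head = nn.Linear(d_model, n_actions)         # e.g. FORWARD/LEFT/RIGHT/STOP

    def step(self, instr_tokens, obs_feat, h_prev):
        instr = self.instr_embed(instr_tokens)                   # (B, L, d)
        obs = self.obs_proj(obs_feat).unsqueeze(1)               # (B, 1, d)
        # Cross-attention grounds the observation in the language instruction.
        fused, _ = self.cross_attn(query=obs, key=instr, value=instr)
        h = self.history(fused.squeeze(1), h_prev)               # update recurrent history
        return self.action_head(h), h                            # action logits + new state

# One step with dummy inputs.
policy = ToyVLNPolicy()
tokens = torch.randint(0, 1000, (1, 12))                         # tokenized instruction
obs = torch.randn(1, 2048)                                       # visual feature vector
logits, h = policy.step(tokens, obs, torch.zeros(1, 256))
```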
However, most established VLN settings implicitly assume everyday operating conditions (e.g., indoor homes/offices) with comparatively favorable sensing and infrastructure [2, 6]. Subway tunnels differ substantially: tunnel inspection literature highlights environmental factors such as absence of natural light, dust/humidity, and other adverse conditions that complicate perception and operation [10]. Work on underground robotics more broadly emphasizes persistent challenges including communication limitations and degraded visibility in subterranean environments [11]. Moreover, narrow corridors with irregular obstacles create stringent geometric constraints that stress both locomotion and planning; recent studies on quadruped inspection in cable-tunnel-like environments underscore these constraints and their practical relevance [12]. These characteristics motivate a systematic investigation of how far VLN—especially when paired with legged locomotion—can be adapted to safety-critical tunnel applications, and what training/evaluation practices are required to achieve robust performance [1, 3].
Tasks
a) Comprehensive review and reproduction of reference methods
- Study selected VLN / vision-language-action reference methods, with emphasis on the design choices most relevant to tunnel deployment and legged locomotion [1, 3].
- Reproduce reported results on the benchmarks used by NaVILA [1].
b) Dataset preparation and preprocessing
- Convert the existing tunnel environment data into a VLN-ready format (a schema sketch follows this task), including:
  - trajectory representation, observations, and action/state definitions
  - train/validation/test splits and quality controls (noise, imbalance, missing data)
- Deliverables:
  - documented dataset schema and preprocessing pipeline
  - reproducible data generation scripts and a dataset statistics report
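As a hedged starting point for the dataset schema and split generation, the sketch below shows one possible VLN-ready episode record and a deterministic train/val/test split. All field names, paths, and the example instruction are assumptions to be aligned with the actual tunnel data and the chosen baseline's data loader.

```python
# Hypothetical VLN-ready episode schema and deterministic split generation;
# field names and paths are placeholders, not the actual tunnel dataset format.
import json
import random
from pathlib import Path

def make_episode(episode_id, instruction, poses, image_paths):
    """One navigation episode: language instruction + trajectory + observations."""
    return {
        "episode_id": episode_id,
        "instruction": instruction,                 # natural-language inspection task
        "trajectory": [                             # per-step robot state
            {"position": list(p[:3]), "heading": p[3]} for p in poses
        ],
        "observations": image_paths,                # RGB (and optionally depth) frames
        "actions": [],                              # to be derived from consecutive poses
    }

def split_episodes(episodes, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Deterministic train/val/test split for reproducibility."""
    rng = random.Random(seed)
    episodes = episodes[:]
    rng.shuffle(episodes)
    n = len(episodes)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    return {
        "train": episodes[:n_train],
        "val": episodes[n_train:n_train + n_val],
        "test": episodes[n_train + n_val:],
    }

if __name__ == "__main__":
    eps = [make_episode(i, "Walk to the next junction and inspect the cable tray.",
                        [(0.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)],
                        ["frames/0000.png", "frames/0001.png"]) for i in range(10)]
    splits = split_episodes(eps)
    Path("splits.json").write_text(json.dumps(
        {k: [e["episode_id"] for e in v] for k, v in splits.items()}, indent=2))
```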
c) Model training and fine-tuning
- Select a baseline VLN model and adapt it to tunnel characteristics (e.g., following a VLA-style two-level design for legged robots) [1].
- Train on the tunnel dataset with iterative tuning of the following (a training-loop sketch follows this task):
  - hyperparameters and optimization schedule
  - regularization and robustness interventions
- Deliverables:
  - training code, configuration files, and an ablation plan
  - trained checkpoints and training logs (reproducibility package)
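The skeleton below sketches how a fine-tuning run could be organized: optimizer and learning-rate schedule, gradient clipping as one example robustness intervention, and per-epoch checkpointing for the reproducibility package. The hyperparameter values and the stand-in model are placeholders, not the settings of the adapted baseline.

```python
# Illustrative fine-tuning loop skeleton; hyperparameter values are placeholders
# to be filled in by the ablation plan, not recommended settings.
import torch
from torch.utils.data import DataLoader, TensorDataset

def finetune(model, dataset, epochs=10, lr=1e-4, weight_decay=0.01, clip=1.0):
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(optim, T_max=epochs * len(loader))
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for x, y in loader:
            logits = model(x)
            loss = loss_fn(logits, y)
            optim.zero_grad()
            loss.backward()
            # Gradient clipping as a simple robustness/regularization intervention.
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optim.step()
            sched.step()
        # Checkpoint every epoch so runs are resumable and reproducible.
        torch.save({"epoch": epoch, "model": model.state_dict(),
                    "optim": optim.state_dict()}, f"checkpoint_{epoch:03d}.pt")

# Usage with a stand-in model and dummy data (replace with the adapted VLN baseline).
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
data = TensorDataset(torch.randn(64, 32), torch.randint(0, 4, (64,)))
finetune(model, data, epochs=2)
```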
d) Simulation-based validation and (optional) real-world deployment
- Evaluate in a high-fidelity tunnel simulation environment, reporting instruction-following success, path efficiency, and robustness to tunnel-specific disturbances, following standard VLN evaluation practice [2, 6] (a metric sketch follows this task).
- Optional: deploy on a physical quadruped robot for field testing; assess sim-to-real transfer, identify systematic discrepancies, and document failure cases.
- Deliverables:
  - evaluation report with quantitative metrics and qualitative error analysis
  - summary of practical feasibility and recommendations for tunnel VLN deployment
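For the quantitative part of the evaluation report, the sketch below computes the standard VLN metrics [2, 6] (success rate, navigation error, and SPL) from predicted trajectories. The 3 m success radius and the input layout are assumptions that would need adapting to tunnel-scale geometry.

```python
# Sketch of standard VLN metrics (success, navigation error, SPL) per [2, 6];
# the success radius and data layout are assumptions, not fixed thresholds.
import numpy as np

def path_length(path):
    path = np.asarray(path, dtype=float)
    return float(np.sum(np.linalg.norm(np.diff(path, axis=0), axis=1)))

def evaluate_episode(pred_path, goal, shortest_dist, success_radius=3.0):
    nav_error = float(np.linalg.norm(np.asarray(pred_path[-1], float) - np.asarray(goal, float)))
    success = nav_error <= success_radius
    taken = path_length(pred_path)
    # SPL weights success by path efficiency: shortest / max(shortest, taken).
    spl = float(success) * shortest_dist / max(shortest_dist, taken, 1e-6)
    return {"success": success, "navigation_error": nav_error, "spl": spl}

# Usage: one simulated episode with dummy coordinates.
metrics = evaluate_episode(
    pred_path=[(0, 0, 0), (5, 3, 0), (9, 0, 0)],
    goal=(10, 0, 0),
    shortest_dist=10.0,
)
print(metrics)  # e.g. {'success': True, 'navigation_error': 1.0, 'spl': ~0.92}
```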
References
- [1] Cheng, A.-C., Ji, Y., Yang, Z., Gongye, Z., Zou, X., Kautz, J., Bıyık, E., Yin, H., Liu, S., Wang, X.: NaVILA: Legged Robot Vision-Language-Action Model for Navigation. arXiv preprint arXiv:2412.04453 (2024)
- [2] Gu, J., Stefani, E., Wu, Q., Thomason, J., Wang, X.E.: Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 7606–7623. Association for Computational Linguistics (2022)
- [3] Zhang, Y., Ma, Z., Li, J., Qiao, Y., Wang, Z., Chai, J., Wu, Q., Bansal, M., Kordjamshidi, P.: Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models. Transactions on Machine Learning Research (TMLR) (2024)
- [4] Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. MIT Press (2005)
- [5] LaValle, S.M.: Planning Algorithms. Cambridge University Press (2006). https://doi.org/10.1017/CBO9780511546877
- [6] Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., van den Hengel, A.: Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3674–3683 (2018)
- [7] Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D., Batra, D.: Habitat: A Platform for Embodied AI Research. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9339–9347 (2019)
- [8] Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: A Recurrent Vision-and-Language BERT for Navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1643–1653 (2021)
- [9] Fried, D., Hu, R., Cirik, V., Rohrbach, A., Andreas, J., Morency, L.-P., Berg-Kirkpatrick, T., Saenko, K., Klein, D., Darrell, T.: Speaker-Follower Models for Vision-and-Language Navigation. In: Advances in Neural Information Processing Systems 31 (NeurIPS). pp. 3318–3329 (2018)
- [10] Menendez, E., Victores, J.G., Montero, R., Martínez, S., Balaguer, C.: Tunnel Structural Inspection and Assessment Using an Autonomous Robotic System. Automation in Construction 87, 117–126 (2018). https://doi.org/10.1016/j.autcon.2017.12.001
- [11] Konieczna-Fuławka, M., Koval, A., Nikolakopoulos, G., Fumagalli, M., Santas Moreu, L., Vigara-Puche, V., Müller, J., Prenner, M.: Autonomous Mobile Inspection Robots in Deep Underground Mining—The Current State of the Art and Future Perspectives. Sensors 25(12), 3598 (2025). https://doi.org/10.3390/s25123598
- [12] Wu, J., Huang, Y., Lai, Y., Yang, S., Zhang, C.: Obstacle Avoidance Inspection Method of Cable Tunnel for Quadruped Robot Based on Particle Swarm Algorithm and Neural Network. Scientific Reports 15, 36065 (2025). https://doi.org/10.1038/s41598-025-19903-w
Prerequisites
- Solid programming skills in Python; experience with deep learning frameworks (e.g., PyTorch)
- Background in robotics navigation concepts (state estimation / SLAM, planning, control)
- Familiarity with multimodal learning (vision-language models, transformers)
- Experience with robotics simulation and deployment workflows