This thesis investigates whether Vision-and-Language Navigation (VLN) can be reliably transferred from conventional benchmarks to subway tunnel environments, enabling a quadruped robot to execute inspection-oriented navigation tasks under constrained geometry, degraded visibility, and limited connectivity. The work is motivated by recent vision-language-action approaches that connect language grounding with embodied control for legged platforms (e.g., NaVILA) [1]; the applicability of such paradigms to tunnel settings, however, remains underexplored. The study uses an existing tunnel environment dataset (visual and structural information) and a high-fidelity tunnel simulation setup to train and evaluate a VLN model. Evaluation will focus on instruction-following success, path efficiency, robustness to tunnel-specific disturbances, and (optionally) transfer to real-world deployment on a physical quadruped robot, following standard VLN evaluation practices [2, 6]. The thesis is co-advised by Yixin Peng and Fan Yang (yang@icom.rwth-aachen.de) at ICoM (Institute for Construction Management, Digital Engineering and Robotics in Construction). The second supervisor is Dr. Hendrik Morgenstern (morgenstern@icom.rwth-aachen.de).
Thesis Type |
Student | Jing Wu
Status | Running
Presentation room | Seminar room I5 - 6202
Supervisor(s) | Stefan Decker
Advisor(s) | Yixin Peng
Contact | peng@dbis.rwth-aachen.de
Background
Classical autonomous navigation is often realized as a pipeline with multiple interacting modules (perception, mapping/localization, planning, control), where each stage introduces modeling choices and potential error propagation [4, 5]. While modularity supports interpretability and engineering control, it can be brittle when sensing quality degrades or when assumptions (e.g., stable illumination, reliable communication) are violated.
VLN addresses navigation from a different angle: it frames navigation as an embodied grounding problem where an agent must interpret natural-language instructions in the context of its visual observations and select actions accordingly [2, 6]. Benchmark-driven progress in VLN has been enabled by large-scale indoor datasets and simulators—most prominently the Room-to-Room (R2R) benchmark and the associated Matterport3D simulation ecosystem [6]—as well as embodied-AI platforms such as Habitat [7]. Building on this foundation, modern VLN systems increasingly leverage transformer-based multimodal encoders (e.g., VLN↻BERT) to improve instruction–vision alignment and history-aware decision making [8], and data augmentation / pragmatic reasoning schemes to reduce supervision bottlenecks [9]. Recent surveys further document a shift toward foundation-model-driven embodied planning and reasoning, and discuss how such models may reshape VLN method design and evaluation [3].
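To make this framing concrete, the following minimal PyTorch-style sketch shows a single decision step of an instruction-conditioned, history-aware agent. It is an illustrative toy, not the architecture of any cited system; all module and variable names are placeholders.

```python
# Minimal, illustrative sketch of one VLN decision step: the agent grounds its
# current visual observation in the instruction and selects a discrete action.
# Names and dimensions are placeholders, not the design of any cited system.
import torch
import torch.nn as nn

class ToyVLNPolicy(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_actions=4):
        super().__init__()
        self.instr_embed = nn.Embedding(vocab_size, d_model)    # instruction tokens
        self.obs_proj = nn.Linear(2048, d_model)                 # pre-extracted image features
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.history = nn.GRUCell(d_model, d_model)              # history-aware state
        self.action_head = nn.Linear(d_model, n_actions)         # e.g. FORWARD/LEFT/RIGHT/STOP

    def step(self, instr_tokens, obs_feat, h_prev):
        instr = self.instr_embed(instr_tokens)                   # (B, L, d)
        obs = self.obs_proj(obs_feat).unsqueeze(1)               # (B, 1, d)
        # Cross-attention grounds the observation in the language instruction.
        fused, _ = self.cross_attn(query=obs, key=instr, value=instr)
        h = self.history(fused.squeeze(1), h_prev)               # update recurrent history
        return self.action_head(h), h                            # action logits + new state

# One step with dummy inputs.
policy = ToyVLNPolicy()
tokens = torch.randint(0, 1000, (1, 12))                         # tokenized instruction
obs = torch.randn(1, 2048)                                       # visual feature vector
logits, h = policy.step(tokens, obs, torch.zeros(1, 256))
```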
However, most established VLN settings implicitly assume everyday operating conditions (e.g., indoor homes/offices) with comparatively favorable sensing and infrastructure [2, 6]. Subway tunnels differ substantially: tunnel inspection literature highlights environmental factors such as absence of natural light, dust/humidity, and other adverse conditions that complicate perception and operation [10]. Work on underground robotics more broadly emphasizes persistent challenges including communication limitations and degraded visibility in subterranean environments [11]. Moreover, narrow corridors with irregular obstacles create stringent geometric constraints that stress both locomotion and planning; recent studies on quadruped inspection in cable-tunnel-like environments underscore these constraints and their practical relevance [12]. These characteristics motivate a systematic investigation of how far VLN—especially when paired with legged locomotion—can be adapted to safety-critical tunnel applications, and what training/evaluation practices are required to achieve robust performance [1, 3].
Tasks
a) Comprehensive review and reproduction of reference methods
- Study selected VLN / vision-language-action reference methods, with emphasis on the design choices most relevant to tunnel deployment and legged locomotion [1, 3].
- Reproduce reported results on the benchmarks used by NaVILA [1].
b) Dataset preparation and preprocessing
- Convert the existing tunnel environment data into a VLN-ready format (a schema sketch follows this task), including:
  - trajectory representation, observations, and action/state definitions
  - train/validation/test splits and quality controls (noise, imbalance, missing data)
- Deliverables:
  - documented dataset schema and preprocessing pipeline
  - reproducible data generation scripts and a dataset statistics report
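As a hedged starting point for the dataset schema and split generation, the sketch below shows one possible VLN-ready episode record and a deterministic train/val/test split. All field names, paths, and the example instruction are assumptions to be aligned with the actual tunnel data and the chosen baseline's data loader.

```python
# Hypothetical VLN-ready episode schema and deterministic split generation;
# field names and paths are placeholders, not the actual tunnel dataset format.
import json
import random
from pathlib import Path

def make_episode(episode_id, instruction, poses, image_paths):
    """One navigation episode: language instruction + trajectory + observations."""
    return {
        "episode_id": episode_id,
        "instruction": instruction,                 # natural-language inspection task
        "trajectory": [                             # per-step robot state
            {"position": list(p[:3]), "heading": p[3]} for p in poses
        ],
        "observations": image_paths,                # RGB (and optionally depth) frames
        "actions": [],                              # to be derived from consecutive poses
    }

def split_episodes(episodes, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Deterministic train/val/test split for reproducibility."""
    rng = random.Random(seed)
    episodes = episodes[:]
    rng.shuffle(episodes)
    n = len(episodes)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    return {
        "train": episodes[:n_train],
        "val": episodes[n_train:n_train + n_val],
        "test": episodes[n_train + n_val:],
    }

if __name__ == "__main__":
    eps = [make_episode(i, "Walk to the next junction and inspect the cable tray.",
                        [(0.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)],
                        ["frames/0000.png", "frames/0001.png"]) for i in range(10)]
    splits = split_episodes(eps)
    Path("splits.json").write_text(json.dumps(
        {k: [e["episode_id"] for e in v] for k, v in splits.items()}, indent=2))
```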
c) Model training and fine-tuning
- Select a baseline VLN model and adapt it to tunnel characteristics (e.g., following a VLA-style two-level design for legged robots) [1].
- Train on the tunnel dataset with iterative tuning of the following (a training-loop sketch follows this task):
  - hyperparameters and optimization schedule
  - regularization and robustness interventions
- Deliverables:
  - training code, configuration files, and an ablation plan
  - trained checkpoints and training logs (reproducibility package)
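The skeleton below sketches how a fine-tuning run could be organized: optimizer and learning-rate schedule, gradient clipping as one example robustness intervention, and per-epoch checkpointing for the reproducibility package. The hyperparameter values and the stand-in model are placeholders, not the settings of the adapted baseline.

```python
# Illustrative fine-tuning loop skeleton; hyperparameter values are placeholders
# to be filled in by the ablation plan, not recommended settings.
import torch
from torch.utils.data import DataLoader, TensorDataset

def finetune(model, dataset, epochs=10, lr=1e-4, weight_decay=0.01, clip=1.0):
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(optim, T_max=epochs * len(loader))
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for x, y in loader:
            logits = model(x)
            loss = loss_fn(logits, y)
            optim.zero_grad()
            loss.backward()
            # Gradient clipping as a simple robustness/regularization intervention.
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optim.step()
            sched.step()
        # Checkpoint every epoch so runs are resumable and reproducible.
        torch.save({"epoch": epoch, "model": model.state_dict(),
                    "optim": optim.state_dict()}, f"checkpoint_{epoch:03d}.pt")

# Usage with a stand-in model and dummy data (replace with the adapted VLN baseline).
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
data = TensorDataset(torch.randn(64, 32), torch.randint(0, 4, (64,)))
finetune(model, data, epochs=2)
```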
d) Simulation-based validation and (optional) real-world deployment
- Evaluate in a high-fidelity tunnel simulation environment, reporting instruction-following success, path efficiency, and robustness to tunnel-specific disturbances, following standard VLN evaluation practice [2, 6] (a metric sketch follows this task).
- Optional: deploy on a physical quadruped robot for field testing; assess sim-to-real transfer, identify systematic discrepancies, and document failure cases.
- Deliverables:
  - evaluation report with quantitative metrics and qualitative error analysis
  - summary of practical feasibility and recommendations for tunnel VLN deployment
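For the quantitative part of the evaluation report, the sketch below computes the standard VLN metrics [2, 6] (success rate, navigation error, and SPL) from predicted trajectories. The 3 m success radius and the input layout are assumptions that would need adapting to tunnel-scale geometry.

```python
# Sketch of standard VLN metrics (success, navigation error, SPL) per [2, 6];
# the success radius and data layout are assumptions, not fixed thresholds.
import numpy as np

def path_length(path):
    path = np.asarray(path, dtype=float)
    return float(np.sum(np.linalg.norm(np.diff(path, axis=0), axis=1)))

def evaluate_episode(pred_path, goal, shortest_dist, success_radius=3.0):
    nav_error = float(np.linalg.norm(np.asarray(pred_path[-1], float) - np.asarray(goal, float)))
    success = nav_error <= success_radius
    taken = path_length(pred_path)
    # SPL weights success by path efficiency: shortest / max(shortest, taken).
    spl = float(success) * shortest_dist / max(shortest_dist, taken, 1e-6)
    return {"success": success, "navigation_error": nav_error, "spl": spl}

# Usage: one simulated episode with dummy coordinates.
metrics = evaluate_episode(
    pred_path=[(0, 0, 0), (5, 3, 0), (9, 0, 0)],
    goal=(10, 0, 0),
    shortest_dist=10.0,
)
print(metrics)  # e.g. {'success': True, 'navigation_error': 1.0, 'spl': ~0.92}
```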
References
- [1] Cheng, A.-C., Ji, Y., Yang, Z., Gongye, Z., Zou, X., Kautz, J., Bıyık, E., Yin, H., Liu, S., Wang, X.: NaVILA: Legged Robot Vision-Language-Action Model for Navigation. arXiv preprint arXiv:2412.04453 (2024)
- [2] Gu, J., Stefani, E., Wu, Q., Thomason, J., Wang, X.E.: Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 7606–7623. Association for Computational Linguistics (2022)
- [3] Zhang, Y., Ma, Z., Li, J., Qiao, Y., Wang, Z., Chai, J., Wu, Q., Bansal, M., Kordjamshidi, P.: Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models. Transactions on Machine Learning Research (TMLR) (2024)
- [4] Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. MIT Press (2005)
- [5] LaValle, S.M.: Planning Algorithms. Cambridge University Press (2006). https://doi.org/10.1017/CBO9780511546877
- [6] Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., van den Hengel, A.: Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3674–3683 (2018)
- [7] Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D., Batra, D.: Habitat: A Platform for Embodied AI Research. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9339–9347 (2019)
- [8] Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: A Recurrent Vision-and-Language BERT for Navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1643–1653 (2021)
- [9] Fried, D., Hu, R., Cirik, V., Rohrbach, A., Andreas, J., Morency, L.-P., Berg-Kirkpatrick, T., Saenko, K., Klein, D., Darrell, T.: Speaker-Follower Models for Vision-and-Language Navigation. In: Advances in Neural Information Processing Systems 31 (NeurIPS). pp. 3318–3329 (2018)
- [10] Menendez, E., Victores, J.G., Montero, R., Martínez, S., Balaguer, C.: Tunnel Structural Inspection and Assessment Using an Autonomous Robotic System. Automation in Construction 87, 117–126 (2018). https://doi.org/10.1016/j.autcon.2017.12.001
- [11] Konieczna-Fuławka, M., Koval, A., Nikolakopoulos, G., Fumagalli, M., Santas Moreu, L., Vigara-Puche, V., Müller, J., Prenner, M.: Autonomous Mobile Inspection Robots in Deep Underground Mining—The Current State of the Art and Future Perspectives. Sensors 25(12), 3598 (2025). https://doi.org/10.3390/s25123598
- [12] Wu, J., Huang, Y., Lai, Y., Yang, S., Zhang, C.: Obstacle Avoidance Inspection Method of Cable Tunnel for Quadruped Robot Based on Particle Swarm Algorithm and Neural Network. Scientific Reports 15, 36065 (2025). https://doi.org/10.1038/s41598-025-19903-w
Prerequisites
- Solid programming skills in Python; experience with deep learning frameworks (e.g., PyTorch)
- Background in robotics navigation concepts (state estimation / SLAM, planning, control)
- Familiarity with multimodal learning (vision-language models, transformers)
- Experience with robotics simulation and deployment workflows