
DBIS

A Natural Language Interface for the Semantic Data Lake system (SEDAR) via LLMs

July 17th, 2024

Thesis Type
  • Master
Status
Running
Presentation room
Seminar room I5 6202
Supervisor(s)
Stefan Decker
Christoph Quix
Advisor(s)
Sayed Hoseini
Contact
sayed.hoseini@hs-niederrhein.de

Motivation

Data lake systems have been proposed for several years as repositories in which heterogeneous data can be stored and merged [1]. Developing such systems requires combining technologies from the fields of big data, databases and machine learning. In recent years, several student projects have developed elements of a data lake system called SEDAR, which are now to be integrated into the production environment. The backend is written in Python (Flask) and the frontend in JavaScript (React). Interested parties are welcome to read this paper [3], watch this video and browse the code.

The landscape of scientific research is undergoing a transformative paradigm shift with the advent of Large Language Models (LLMs). Leveraging advanced natural language processing capabilities, LLMs have the potential to enhance information retrieval, knowledge synthesis, and hypothesis generation.

In data management, interacting with databases in natural language is a particularly relevant application. Because SEDAR deploys various heterogeneous databases and machine learning models in the backend, integrating their data and metadata can be challenging. The goal of this project is to develop a module that knows the available APIs of the data lake system, returns relevant information about data sources, and provides relevant context for reasoning and decision making. The application should answer natural-language questions from the user about the data assets stored in the lake and also act as an agent that facilitates interactions with the system (e.g. “create a new ML experiment”, “upload this dataset”, …). Key challenges in building such applications include orchestration, data engineering, prompt engineering, debugging, and evaluation. Collaboration between different skill sets (e.g. prompt engineering and data engineering) is an important consideration.
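One way to make a module "aware" of the available APIs is to keep a machine-readable catalogue of endpoints and render it into the instructions given to the LLM. The following is a minimal sketch of that idea only. The endpoint names, paths and the JSON answer format are hypothetical placeholders and are not taken from SEDAR's actual backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """One backend operation the LLM is allowed to request."""
    name: str
    method: str
    path: str
    description: str


# Hypothetical catalogue; the real SEDAR endpoints may look different.
CATALOGUE = [
    Endpoint("list_datasets", "GET", "/datasets",
             "List the datasets stored in the lake."),
    Endpoint("upload_dataset", "POST", "/datasets",
             "Upload a new dataset."),
    Endpoint("create_experiment", "POST", "/experiments",
             "Create a new ML experiment."),
]


def build_system_prompt(catalogue):
    """Render the catalogue as plain-text instructions for an LLM."""
    lines = [
        "You can call the following API endpoints.",
        'Answer with a JSON object: {"endpoint": <name>, "arguments": {...}}.',
        "",
    ]
    for ep in catalogue:
        lines.append(f"- {ep.name}: {ep.method} {ep.path} ({ep.description})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_system_prompt(CATALOGUE))
```

Keeping the catalogue as data rather than as hand-written prompt text means the same source can later feed a retrieval step or a validator, whichever design the thesis ends up choosing.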

The focus of the work can be chosen by the students in consultation with the supervisors. Possible tasks are described below. The work need not be strictly limited to these objectives and should ideally develop the prototype further as a whole:

Tasks:

  • Become familiar with SEDAR
  • Identify the most relevant tasks for a semantic data management platform from contemporary scientific literature
  • Develop a draft for a Natural Language Interface for SEDAR based on LLMs
  • Implement a module that teaches an LLM to act as an agent that uses SEDAR’s backend APIs (e.g. prompt engineering, RAG architecture, …)
  • Develop verification and validation functions for the generated API calls
  • Demonstrate the usability and effectiveness of the implemented extension on a use case from the i2DACH project
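As an illustration of the verification and validation task above, the following sketch checks a JSON API call produced by an LLM against a small specification before it would be sent to the backend. It is only a sketch under stated assumptions: the endpoint names, argument names and the call format are hypothetical and do not describe SEDAR's real API.

```python
import json

# Hypothetical specification: endpoint name -> expected argument types.
API_SPEC = {
    "create_experiment": {
        "required": {"name": str, "dataset_id": str},
        "optional": {"description": str},
    },
    "upload_dataset": {
        "required": {"title": str, "file_path": str},
        "optional": {},
    },
}


def validate_call(raw, spec=API_SPEC):
    """Return a list of problems; an empty list means the call passed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["call must be a JSON object"]

    name = call.get("endpoint")
    if not isinstance(name, str) or name not in spec:
        return [f"unknown endpoint: {name!r}"]

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    required = spec[name]["required"]
    allowed = {**required, **spec[name]["optional"]}
    errors = []
    # Every required argument must be present.
    for key in required:
        if key not in args:
            errors.append(f"missing argument: {key}")
    # Every supplied argument must be known and have the expected type.
    for key, value in args.items():
        if key not in allowed:
            errors.append(f"unexpected argument: {key}")
        elif not isinstance(value, allowed[key]):
            errors.append(f"argument {key} should be {allowed[key].__name__}")
    return errors


if __name__ == "__main__":
    generated = ('{"endpoint": "create_experiment", '
                 '"arguments": {"name": "exp1", "dataset_id": "d42"}}')
    print(validate_call(generated))
```

Returning a list of problems instead of raising on the first one lets the error messages be fed back to the LLM for a corrected attempt, which is one possible design for the agent loop.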

Supervision:

This thesis will be formally supervised by Prof. Stefan Decker (1st) and Christoph Quix (2nd) at i5 of the Computer Science department. It is a joint effort between the HSNR University in Krefeld and the RWTH, with Sayed Hoseini as the main student advisor. Due to this setup, most of the meetings will be held online.

Interested? Questions? Please contact us!

Sayed Hoseini, M.Sc. – sayed.hoseini@hs-niederrhein.de

Please send a CV, a Transcript of Records, and a short description of yourself (work and programming experience) that explains why you are interested in this project.

Literature:

  1. C. Quix, R. Hai: Data Lake. In S. Sakr, A.Y. Zomaya (Eds.): Encyclopedia of Big Data Technologies, Springer 2019. https://doi.org/10.1007/978-3-319-63962-8_7-1
  2. Karmaker, Shubhra Kanti, et al. “AutoML to date and beyond: Challenges and opportunities.” ACM Computing Surveys (CSUR), 54.8 (2021): 1-36.
  3. Hoseini, Sayed et al. SEDAR: A Semantic Data Reservoir for Heterogeneous Datasets. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23), 2023. Association for Computing Machinery, New York, NY, USA, 5056–5060. https://doi.org/10.1145/3583780.3614753

Prerequisites:
  • Proven experience in software programming
  • Python OO programming
  • Frontend programming with JavaScript, preferably with the framework React