Thesis Type |
Student | Khalil Baydoun
Status | Running
Presentation room | Seminar room I5 6202
Supervisor(s) | Stefan Decker, Sandra Geisler
Advisor(s) | Ahmad Hemid, Anastasiia Belova
Contact | ahmad.hemid@fit.fraunhofer.de, belova@dbis.rwth-aachen.de
Motivation:
Knowledge Graphs (KGs) integrate data from various sources, representing entities such as events, places, and people, together with the relationships between them. Since Google introduced its Knowledge Graph in 2012, KGs have been widely adopted in applications such as chatbots, search engines, and recommendation systems. For instance, integrating large language models (LLMs) with KGs enables chatbots to answer complex questions more accurately by leveraging semantic links between concepts [1]. KGs also enhance traditional recommender systems by offering transparent explanations that build user trust [3].
Companies now use KGs for tasks such as knowledge representation and LLM augmentation: KG-enhanced LLMs ground language models in structured knowledge, while LLM-augmented KGs use LLMs to support KG tasks such as question answering [4]. As KGs grow, user-friendly query languages are needed to extract relevant data. The labeled property graph model, used in graph databases like Neo4j, has driven the creation of Cypher, a query language for graph data [DFR23]. However, learning Cypher can be challenging, which has prompted interest in querying graphs in natural language. LLMs can translate such questions into Cypher, though challenges remain, such as the generation of incorrect or incomplete queries.
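To make the translation task concrete, the following minimal sketch asks an LLM to turn a natural-language question into a Cypher query. The movie-style schema, model name, and prompt are illustrative assumptions using the OpenAI Python client, not the setup prescribed by this thesis:

```python
# Minimal sketch: translating a natural-language question into Cypher with an LLM.
# Assumptions: the OpenAI Python client is installed, OPENAI_API_KEY is set, and the
# movie-style schema below is a hypothetical example rather than the thesis dataset.
from openai import OpenAI

SCHEMA = (
    "Node labels: Person(name), Movie(title, released)\n"
    "Relationship types: (Person)-[:ACTED_IN]->(Movie), (Person)-[:DIRECTED]->(Movie)"
)

client = OpenAI()

def nl_to_cypher(question: str) -> str:
    """Ask the model for a single Cypher query that answers the question."""
    system_prompt = (
        "Translate the user's question into one Cypher query for this schema:\n"
        f"{SCHEMA}\nReturn only the query, with no explanation."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any of the models compared in the thesis could be used
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(nl_to_cypher("Which actors appeared in movies released after 2010?"))
```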
To improve LLM performance, customization methods like in-context learning (ICL), fine-tuning, and Retrieval-Augmented Generation (RAG) are employed. ICL provides example-based guidance, fine-tuning tailors models for specific tasks, and RAG augments LLM input with real-time external data, helping overcome the model’s static nature [2]. RAG is particularly effective in addressing token limits and is more cost-efficient than fine-tuning. Lastly, the absence of an evaluation benchmark for Cypher query generation necessitates a custom framework to assess accuracy.
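As a rough illustration of how RAG can support Cypher generation, the sketch below retrieves the schema snippets most relevant to a question and prepends them to the prompt before query generation. The keyword-overlap retriever, schema snippets, and function names are simplifying assumptions for illustration; an actual pipeline would typically retrieve over the real Neo4j schema using vector embeddings:

```python
# Minimal RAG sketch: retrieve relevant schema snippets and inject them into the prompt.
# Assumptions: the snippet list and keyword-overlap ranking are illustrative stand-ins;
# a real setup would use embedding-based retrieval over the actual Neo4j schema.

# Hypothetical "knowledge base" of schema fragments that could be retrieved.
SCHEMA_SNIPPETS = [
    "Node label Person has properties: name, born",
    "Node label Movie has properties: title, released",
    "Relationship ACTED_IN connects Person to Movie",
    "Relationship DIRECTED connects Person to Movie",
]

def retrieve(question: str, snippets: list[str], k: int = 2) -> list[str]:
    """Rank snippets by naive keyword overlap with the question and keep the top k."""
    q_words = set(question.lower().split())
    ranked = sorted(snippets, key=lambda s: -len(q_words & set(s.lower().split())))
    return ranked[:k]

def build_rag_prompt(question: str) -> str:
    """Augment the user question with the retrieved schema context."""
    context = "\n".join(retrieve(question, SCHEMA_SNIPPETS))
    return (
        "Use only the following schema information when writing the Cypher query:\n"
        f"{context}\n\nQuestion: {question}\nCypher:"
    )

if __name__ == "__main__":
    print(build_rag_prompt("Which movies did a given person direct?"))
```

Because only the retrieved snippets are added to the prompt rather than the whole schema, this kind of augmentation also helps stay within token limits and keeps per-query costs lower than fine-tuning, in line with the motivation above.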
The focus of the work can be chosen by the student in consultation with the supervisors. Possible tasks are described below; the thesis should not be strictly limited to these objectives but should ideally further develop the prototype as a whole:
Tasks:
The contributions of this thesis center on evaluating and improving the performance of LLMs in generating Cypher queries, focusing in particular on the following tasks:
- LLM Performance Comparison:
  - Comparing how well leading commercial LLMs (e.g., Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro) and open-source models (e.g., Llama 3.1) generate accurate Cypher queries. This task assesses which models perform best when translating natural language into a structured query language.
- RAG Efficiency:
  - Measuring the efficiency of the Retrieval-Augmented Generation (RAG) architecture in generating Cypher queries. This involves analyzing performance metrics such as speed, cost, and accuracy, in the context of how RAG integrates external information at query time.
- Correctness Evaluation Framework:
  - Developing a framework to evaluate and validate the correctness of Cypher queries generated by LLMs. This task focuses on ensuring that the generated queries exactly match the labels of nodes and relationships in the Neo4j database (see the sketch after this list).
- Impact of RAG on LLM Performance:
  - Investigating how the baseline performance of LLMs (without RAG) compares to LLMs integrated with RAG. This task examines the overall impact of RAG on query generation performance.
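As a possible starting point for the correctness evaluation framework, the sketch below extracts node labels and relationship types from a generated Cypher query with simple regular expressions and compares them against the schema reported by Neo4j via db.labels() and db.relationshipTypes(). The connection details and regex patterns are illustrative assumptions; robust checking would require a proper Cypher parser:

```python
# Minimal sketch of a correctness check: do the labels and relationship types used in a
# generated Cypher query exist in the target Neo4j database?
# Assumptions: connection details are placeholders; the regexes only cover simple
# MATCH patterns and are not a full Cypher parser.
import re
from neo4j import GraphDatabase

URI, AUTH = "bolt://localhost:7687", ("neo4j", "password")  # placeholders

def db_schema(driver) -> tuple[set[str], set[str]]:
    """Fetch the node labels and relationship types known to the database."""
    with driver.session() as session:
        labels = {r["label"] for r in session.run("CALL db.labels()")}
        rel_types = {r["relationshipType"] for r in session.run("CALL db.relationshipTypes()")}
    return labels, rel_types

def used_identifiers(cypher: str) -> tuple[set[str], set[str]]:
    """Pull label and relationship-type names out of simple node/relationship patterns."""
    labels = set(re.findall(r"\(\s*\w*\s*:\s*(\w+)", cypher))
    rel_types = set(re.findall(r"\[\s*\w*\s*:\s*(\w+)", cypher))
    return labels, rel_types

def check_query(cypher: str) -> dict:
    """Report any labels or relationship types that do not exist in the database schema."""
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        known_labels, known_rels = db_schema(driver)
    labels, rels = used_identifiers(cypher)
    return {
        "unknown_labels": labels - known_labels,
        "unknown_relationship_types": rels - known_rels,
    }

if __name__ == "__main__":
    print(check_query("MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name, m.title"))
```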
Supervisors:
This thesis will be formally supervised by Prof. Stefan Decker (primary supervisor) and Sandra Geisler (secondary supervisor) at the i5 Chair of the Computer Science Department.
Advisors:
Anastasiia Belova, M.Sc. – belova@dbis.rwth-aachen.de and Ahmad Hemid, M.Sc. – ahmad.hemid@fit.fraunhofer.de
Literature:
[1] Enayat Rajabi, Allu Niya George, and Karishma Kumar. The Electronic Library, Vol. 42, No. 3, pp. 483-497. https://doi.org/10.1108/EL-03-2023-0066
[2] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. Retrieval-Augmented Generation for Large Language Models: A Survey, 2024. https://doi.org/10.48550/arXiv.2312.10997
[3] Jin-Cheng Zhang, Azlan Mohd Zain, Kai-Qing Zhou, Xi Chen, and Ren-Min Zhang. A review of recommender systems based on knowledge graph embedding. Expert Systems with Applications, Vol. 250, 2024, 123876, ISSN 0957-4174. https://doi.org/10.1016/j.eswa.2024.123876
[4] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang and X. Wu, “Unifying Large Language Models and Knowledge Graphs: A Roadmap,” in IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 7, pp. 3580-3599, July 2024, https://doi.org/10.1109/TKDE.2024.3352100.
Prerequisites:
- Proven experience in software programming
- Object-oriented programming in Python
- Knowledge of Large Language Models
- Knowledge of Semantic Web technologies