DBIS

An Empirical Study of Open Source Large Language Models (OSLLMs)

November 20th, 2023

Open Source Software (OSS) revolutionised the computing world about three decades ago. One of the principles of OSS guarantees software developers, companies, researchers, and students the freedom to change and improve the software. Characterised by active community involvement (bazaar-style software development), OSS development has produced category-killer Operating Systems (e.g., Debian, Ubuntu) and applications (e.g., the Apache HTTP Server, Firefox). 

The computer science community is now riding another revolution: that of Large Language Models (LLMs). LLM variants, both commercial and open source, come with billions of parameters that developers can fine-tune to control how the system generates text (tokens). Commercial LLMs (e.g., ChatGPT) are proprietary and expensive to deploy and use. They have also been criticised for their hallucinations, lack of transparency, and the potential for monopolisation by big corporations.

Research methodology

Thesis Type
  • Master
Student
Du Cheng
Status
Running
Proposal on
12/04/2024 11:30 am
Proposal room
Seminar room I5 6202
Presentation room
Seminar room I5 6202
Supervisor(s)
Stefan Decker
Advisor(s)
Sulayman K. Sowe
Yongli Mou
Alexander Neumann
Contact
sowe@dbis.rwth-aachen.de
mou@dbis.rwth-aachen.de
neumann@dbis.rwth-aachen.de

With Open-Source Large Language Models (OSLLMs), software developers, researchers, and students can customise and improve LLMs according to their needs. However, there is little or no empirical research to help us understand the OSLLM landscape.

The aim of this thesis research project is to answer research questions that can help us empirically understand the landscape of OSLLM projects. For example:

  • What are the knowledge artefacts generated, archived, and shared by OSLLM developers and users?
  • Who are the main code contributors in an OSLLM project?
  • Who are the developers or users who improve or customise the source code of an LLM?
  • What is the discussion atmosphere in an LLM project like?
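
As an illustration of how the question about the main code contributors might be approached, the following is a minimal Python sketch that counts commits per author in a local Git clone and reports how concentrated the contributions are. It is not the method prescribed for this thesis: the function names (`read_author_log`, `count_commits_by_author`, `top_contributor_share`) and the sample authors are hypothetical, and commit counts are only one possible measure of contribution.

```python
import subprocess
from collections import Counter


def read_author_log(repo_path: str) -> str:
    """Return one 'Name <email>' line per commit for a local Git clone.

    Assumes `git` is installed and repo_path points to a cloned repository.
    """
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%aN <%aE>"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def count_commits_by_author(log_text: str) -> Counter:
    """Count commits per author from text with one author per line."""
    authors = (line.strip() for line in log_text.splitlines())
    return Counter(author for author in authors if author)


def top_contributor_share(counts: Counter, k: int = 1) -> float:
    """Fraction of all commits made by the k most active authors."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


# Hypothetical sample data standing in for real `git log` output.
sample_log = (
    "Alice <alice@example.org>\n"
    "Bob <bob@example.org>\n"
    "Alice <alice@example.org>\n"
)
sample_counts = count_commits_by_author(sample_log)
print(sample_counts.most_common())
print(top_contributor_share(sample_counts, k=1))
```

For a real project, `count_commits_by_author(read_author_log("path/to/clone"))` would replace the sample data. Identities would likely still need cleaning, since the same person can commit under several names or e-mail addresses.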

Tasks:

  1. Carry out a literature review of Open-Source Large Language Models (OSLLMs) and document the following (see the example in OSLLM Tasks (1–3) Documentation):
    • the code repository of each project (e.g., GitHub) and
    • the platforms (discussion forums, mailing lists, social media, blogs, etc.) they use to disseminate project information.
  2. Extract and analyse the developers' code contributions in the code repositories of the OSLLMs reviewed in T1, in a process similar to Contribution analytics.
  3. Extract and analyse developers' or users' discussions (posts and replies) on the respective platforms of the OSLLMs reviewed in T1.
  4. Document the methodologies and challenges encountered in T2 and T3.
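
To make T3 more concrete, the following is a minimal Python sketch of how posts and replies, once extracted, might be grouped into threads and summarised. It assumes each post carries an identifier, the identifier of the post it replies to, and an author. The `Post` class, its field names, and the sample data are hypothetical, and real forums and mailing lists will need platform-specific extraction first.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    post_id: str
    parent_id: Optional[str]  # None marks the post that starts a thread
    author: str


def build_reply_index(posts):
    """Map each post id to the list of posts that reply to it directly."""
    children = defaultdict(list)
    for post in posts:
        if post.parent_id is not None:
            children[post.parent_id].append(post)
    return children


def thread_stats(posts):
    """Summarise each thread: number of replies, depth, distinct authors."""
    children = build_reply_index(posts)
    stats = {}
    for root in (p for p in posts if p.parent_id is None):
        replies, depth, authors = 0, 0, {root.author}
        stack = [(root, 0)]  # depth-first walk over the reply tree
        while stack:
            post, level = stack.pop()
            depth = max(depth, level)
            for reply in children.get(post.post_id, []):
                replies += 1
                authors.add(reply.author)
                stack.append((reply, level + 1))
        stats[root.post_id] = {
            "replies": replies,
            "depth": depth,
            "participants": len(authors),
        }
    return stats


# Hypothetical sample: one thread with a nested reply, one unanswered post.
sample_posts = [
    Post("p1", None, "alice"),
    Post("p2", "p1", "bob"),
    Post("p3", "p2", "alice"),
    Post("p4", None, "carol"),
]
print(thread_stats(sample_posts))
```

Counts such as replies per thread or participants per thread could then feed the statistical analysis, for example to compare how active the discussions are across the OSLLM projects reviewed in T1.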

Relevant references:

  1. Zhao et al. (2023). A Survey of Large Language Models. https://arxiv.org/pdf/2303.18223.pdf.
  2. Easy-to-read introductions to open-source vs. commercial LLMs: https://medium.com/@hamsa.a.j/the-rise-of-open-source-llms-bdf566393107 and https://bdtechtalks.com/2023/05/29/open-source-llms-cerebras-gpt/
  3. You can find everything about OSLLMs projects at: https://huggingface.co/ . You will work a lot on this platform!
  4. Sulayman K. Sowe, Ioannis Stamelos, Lefteris Angelis, Understanding Knowledge Sharing Activities in Free/Open Source Software Projects: An Empirical Study, Journal of Systems and Software, Vol. 81 (3), 2008, Pages 431-446, https://doi.org/10.1016/j.jss.2007.03.086. Check the methodology outline in Fig. 2.
  5. Kaiyuan Gao, et al. (2023). Examining User-Friendly and Open-Sourced Large GPT Models: A Survey on Language, Multimodal, and Scientific GPT Models, eprint (2308.14149), https://arxiv.org/abs/2308.14149
  6. Hamsa Jama (2023). The Rise of Open Source LLMs: A Journey into the Future of AI. Available at: https://medium.com/@hamsa.a.j/the-rise-of-open-source-llms-bdf566393107 (accessed 18.09.2023).
  7. Ben Dickson (May 29, 2023). Understanding the impact of open-source language models. Available at: https://bdtechtalks.com/2023/05/29/open-source-llms-cerebras-gpt/ (accessed 18.09.2023).
  8. Forcepoint (2023). ChatGPT on Open vs. Closed Source Large Language Models for Internal AI Projects.

Prerequisites:
  1. Good programming skills in Python or a suitable scripting language.
  2. Experience in using and managing software repositories, version control systems (VCS), and hosting platforms such as GitHub, GitLab, Bitbucket, SVN, etc.
  3. Ability to carry out statistical analysis using R or SPSS.
  4. Knowledge of the structure and nature of discussion threads in forums and mailing lists.
  5. Good reading and writing skills in German and English.
  6. Ability to quickly adapt to working in a large multicultural academic environment.