Thesis Type |
|
Student |
Pedram Ahmadiyeh |
Status |
Running |
Proposal on |
07/06/2024 11:15 am |
Proposal room |
Seminar room I5 6202 |
Supervisor(s) |
Stefan Decker |
Advisor(s) |
Mehdi Akbari G. osen |
Contact |
mehdi.akbari.gurabi@fit.fraunhofer.de oemer.sen@fit.fraunhofer.de |
The primary aim of this thesis is to explore the data infrastructure necessary for training large language models (LLMs) specifically for incident response playbooks in the field of cybersecurity, with an emphasis on adhering to the CACAO format at a later production stage. This research addresses the significant challenge of limited data availability for incident response playbooks, which is crucial for training LLMs. A central question is whether synthetically generated data can effectively train LLMs to produce high-quality incident response playbooks.
The thesis will include an extensive review and analysis of existing research on data generation for deep learning applications, particularly focusing on large language models. This will involve a comparative analysis of the latest research and state-of-the-art methodologies. The thesis will establish a structured and systematic categorization of various methods and approaches in this field.
A comprehensive requirement analysis is essential to determine criteria for both evaluation and the selection of suitable technologies for the thesis objectives. Based on these criteria, the thesis will model the envisioned data, detailing its specifications and attributes. This model will form the basis for developing a concept and design for the data generation approach, which will then be implemented using the selected technologies.
Furthermore, the thesis will develop a detailed investigation procedure. This will include a thorough description of experimental setups, investigation environments, evaluation methods, and assumptions and conditions. This procedure will be employed to systematically assess the suitability of the developed data generation approach in addressing the scarcity of data for training LLMs.
Finally, based on the findings, the thesis will propose guidelines for the data generation approach. It will also offer a demonstrative example of the developed approach, showcasing its practical application and effectiveness in generating data for LLM training in the context of cybersecurity incident response playbooks.
These are resources for the thesis topic as an example:
- Playbook Examples and guidelines: These links will provide simple practical examples of cybersecurity playbooks:
- https://github.com/phantomcyber/playbooks
- https://gitlab.com/syntax-ir/playbooks
- https://publica-rest.fraunhofer.de/server/api/core/bitstreams/76b8ef20-de93-45cb-b8dc-17de7a8ad354/content
- https://www.cisa.gov/sites/default/files/publications/Federal_Government_Cybersecurity_Incident_and_Vulnerability_Response_Playbooks_508C.pdf
- OASIS CACAO Specification: This document details the Collaborative Automated Course of Action Operations (CACAO) standard for cybersecurity playbooks: https://docs.oasis-open.org/cacao/security-playbooks/v2.0/security-playbooks-v2.0.html
- Fine-tuning with OpenAI: This resource from OpenAI discusses fine-tuning, a method for effectively utilizing Large Language Models (LLMs) like GPT-3 or GPT-4. Understanding how to fine-tune models that results in better outcomes will be key in automating the playbook translation process: https://platform.openai.com/docs/guides/fine-tuning/when-to-use-fine-tuning
- Some relevant articles:
- https://dl.acm.org/doi/pdf/10.1145/3604237.3626908
- https://link.springer.com/chapter/10.1007/978-3-031-48421-6_7
- arxiv.org/pdf/2107.06499.pdf
- sec21-carlini-extracting.pdf (usenix.org)
- arxiv.org/pdf/1909.08053.pdf?trk=public_post_comment-text
- Planning for Natural Language Failures with the AI Playbook (acm.org)
- 2305.02783.pdf (arxiv.org)
Basic knowledge in the domains of cyber security, Natural Language Processing (Specifically, Generative AI).