Single-Cell Centric Biomedical Foundation Models for Cancer

October 17th, 2024

This thesis aims to develop a single-cell-centric biomedical foundation model that leverages the capabilities of generative pre-trained transformers to enhance the analysis of single-cell RNA sequencing (scRNA-seq) data. The model will address critical tasks in single-cell biology, such as cell-type annotation, perturbation prediction, identification of pathogenic cells, and gene network inference.

This thesis is co-supervised by Sikander Hayat and Rafael Kramann, Department of Medicine II, University Hospital Aachen.

Please send your application to Yongli Mou, M.Sc. (mou@dbis.rwth-aachen.de) and CC Dr. Sikander Hayat (shayat@ukaachen.de).

Thesis Type	Master
Student	Ang Li
Status	Running
Presentation room	Seminar room I5 6202
Supervisor(s)	Stefan Decker
Advisor(s)	Yongli Mou
Contact	mou@dbis.rwth-aachen.de

Background

The rapid growth of single-cell sequencing technologies has enabled researchers to study cellular diversity in greater detail, which is crucial for understanding disease mechanisms, developmental biology, and therapeutic responses. However, existing models for single-cell data often lack scalability and generalizability. Foundation models, particularly those built on transformer architectures, have demonstrated versatility across domains such as language and computer vision by capturing task-agnostic knowledge. Applied to single-cell biology, such a model could handle the high-dimensional nature of single-cell data and provide a unified framework that supports a wide range of biological inquiries.

Objectives

Develop a foundation model and pre-train it on large-scale single-cell data.
Fine-tune the model to perform tasks such as cell-type annotation, perturbation prediction, gene network inference, and prediction of pathogenic cells.

Tasks

Literature review and analysis of the current state of the art
- Review the existing single-cell analysis tools and foundation models in biomedicine.
- Identify gaps in current methodologies and challenges specific to single-cell data.
Data collection and preprocessing
- Compile a comprehensive dataset for model training, including scRNA-seq, single-cell ATAC-seq and spatial data from public databases.
  - Data sources: Single-cell datasets will be collected from:
- Apply preprocessing techniques such as normalization, batch effect correction, and data augmentation to handle sparsity and noise in single-cell data.
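
The normalization step mentioned above can be illustrated with a minimal sketch. It assumes raw counts are held as a plain cells-by-genes list of lists and uses library-size scaling followed by a log1p transform, which is one common choice rather than a method fixed by this thesis. The function name `normalize_counts` and the `target_sum` default are hypothetical.

```python
import math

def normalize_counts(counts, target_sum=10_000.0):
    """Library-size normalize each cell, then apply log1p.

    counts: list of per-cell lists of raw gene counts (cells x genes).
    target_sum: assumed per-cell scaling target (illustrative default).
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # An empty cell stays all-zero instead of dividing by zero.
            normalized.append([0.0] * len(cell))
            continue
        scale = target_sum / total
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized

# Toy data: three cells, four genes; the last cell has no counts at all.
raw = [[0, 3, 1, 0], [10, 0, 0, 30], [0, 0, 0, 0]]
norm = normalize_counts(raw)
```

After scaling, a gene that makes up the same fraction of two cells' counts gets the same value regardless of sequencing depth, and zero counts stay zero, so the sparsity pattern is preserved.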

Model development, pre-training and fine-tuning, and evaluation
- Design a transformer-based model architecture that incorporates specialized attention mechanisms for single-cell data.
- Train the model using self-supervised learning for pre-training and supervised fine-tuning for specific tasks.
- Evaluate model performance on cell-type classification, perturbation response prediction, and gene network inference tasks.
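
As a rough illustration of the self-supervised pre-training step, the sketch below hides a random subset of one cell's expression values and scores a reconstruction only at the hidden positions. This is a generic masked-value objective under stated assumptions, not the architecture or loss chosen for this thesis. The names `MASK_VALUE`, `mask_expression`, and `masked_mse`, as well as the mask ratio, are hypothetical.

```python
import random

MASK_VALUE = -1.0  # hypothetical sentinel marking a hidden expression value

def mask_expression(values, mask_ratio=0.15, seed=None):
    """Hide a random subset of one cell's expression values.

    Assumes a non-empty list of values. Returns (masked_values, positions);
    a model would be trained to reconstruct the originals at those positions.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(values) * mask_ratio))
    positions = sorted(rng.sample(range(len(values)), n_mask))
    masked = list(values)
    for i in positions:
        masked[i] = MASK_VALUE
    return masked, positions

def masked_mse(predicted, original, positions):
    """Mean squared error computed only over the masked positions."""
    return sum((predicted[i] - original[i]) ** 2 for i in positions) / len(positions)

# Toy normalized expression vector for a single cell (eight genes).
cell = [0.0, 2.1, 0.0, 5.3, 1.2, 0.0, 0.7, 3.3]
masked, positions = mask_expression(cell, mask_ratio=0.25, seed=0)
```

In an actual training loop, the masked vector would be fed to the transformer and `masked_mse` (or a comparable loss) would compare its output with the original values. Restricting the loss to masked positions keeps the model from being rewarded for copying visible inputs.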

References

Cui H, Wang C, Maan H, Pang K, Luo F, Duan N, Wang B. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods. 2024 Feb 26:1-1.

Prerequisites

Knowledge of machine learning, biology, and multi-omics (e.g., genomics, proteomics, transcriptomics, epigenomics)
Programming language – Python
Deep Learning Framework – PyTorch, Transformers

Related Projects

WestAI - KI-Services aus NRW für Deutschland
