This thesis aims to develop a single-cell-centric biomedical foundation model that uses generative pre-trained transformers to improve the analysis of single-cell RNA sequencing (scRNA-seq) data. The model will address key tasks in single-cell biology, such as cell-type annotation, perturbation prediction, identification of pathogenic cells, and gene network inference. The thesis is co-supervised by Sikander Hayat and Rafael Kramann, Department of Medicine II, University Hospital Aachen. Please send your application to Yongli Mou, M.Sc. (mou@dbis.rwth-aachen.de) and CC Dr. Sikander Hayat (shayat@ukaachen.de).
Thesis Type |
Status | Open
Presentation room | Seminar room I5 6202
Supervisor(s) | Stefan Decker
Advisor(s) | Yongli Mou
Contact | mou@dbis.rwth-aachen.de
Background
The rapid growth of single-cell sequencing technologies has enabled researchers to study cellular diversity in greater detail, which is crucial for understanding disease mechanisms, developmental biology, and therapeutic responses. However, existing models for single-cell data often scale poorly to large datasets and generalize poorly across tissues, conditions, and tasks. Foundation models, particularly those built on transformer architectures, have demonstrated versatility across domains such as language and computer vision by capturing task-agnostic knowledge during pre-training. By analogy, a single-cell foundation model could learn from the high-dimensional expression profiles of many cells and provide a unified framework that supports a wide range of biological inquiries.
Objectives
- Develop a foundation model and pre-train it on large-scale single-cell data
- Fine-tune the model to perform tasks such as cell-type annotation, perturbation prediction, gene network inference, and identification of pathogenic cells
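To make the pre-training objective more concrete, the following is a minimal sketch of one possible self-supervised setup: hiding a random subset of a cell's (already discretized) expression values so that a model can be trained to predict them, in the spirit of masked-value objectives used by generative single-cell models. The function name `mask_expression`, the sentinel `MASK_VALUE`, and the default masking ratio are illustrative assumptions, not part of the thesis description.

```python
import random

MASK_VALUE = -1  # hypothetical sentinel marking a hidden expression value


def mask_expression(binned, mask_ratio=0.15, rng=None):
    """Hide a random subset of one cell's binned expression values.

    Returns the masked copy and a dict {position: original value} that
    would serve as the prediction targets during pre-training.
    """
    if not binned:
        return [], {}
    rng = rng or random.Random()
    # Mask at least one position, but never more than the cell contains.
    n_mask = min(len(binned), max(1, round(mask_ratio * len(binned))))
    positions = rng.sample(range(len(binned)), n_mask)
    masked = list(binned)
    targets = {}
    for i in positions:
        targets[i] = binned[i]
        masked[i] = MASK_VALUE
    return masked, targets


# Usage: one cell with ten genes, expression already discretized into bins.
cell = [0, 1, 2, 0, 3, 2, 1, 0, 3, 1]
masked_cell, targets = mask_expression(cell, mask_ratio=0.3, rng=random.Random(0))
```

In a real pipeline the masked sequence would be fed to the transformer and the loss computed only at the masked positions; the masking ratio is a hyperparameter to be tuned.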
Tasks
- Literature review and analysis of the current state of the art
  - Review the existing single-cell analysis tools and foundation models in biomedicine.
  - Identify gaps in current methodologies and challenges specific to single-cell data.
- Data collection and preprocessing
  - Compile a comprehensive dataset for model training, including scRNA-seq, single-cell ATAC-seq, and spatial data from public databases.
    - Data sources: Single-cell datasets will be collected from:
  - Apply preprocessing techniques such as normalization, batch effect correction, and data augmentation to handle sparsity and noise in single-cell data.
- Model development, pre-training and fine-tuning, and evaluation
  - Design a transformer-based model architecture that incorporates specialized attention mechanisms for single-cell data.
  - Train the model using a combination of self-supervised learning for pre-training and supervised fine-tuning for specific tasks.
  - Evaluate model performance on cell-type classification, perturbation response prediction, and gene network inference tasks.
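As an illustration of the preprocessing step listed above, here is a minimal sketch of two common operations on a single cell's raw counts: library-size normalization followed by a log transform, and rank-based binning of the non-zero values into discrete tokens. It is written in plain Python for readability; the function names, `target_sum`, and `n_bins` are assumptions for this example, and an actual pipeline would more likely rely on an established toolkit such as Scanpy. Batch effect correction and data augmentation are not covered here.

```python
import bisect
import math


def normalize_log1p(counts, target_sum=10_000.0):
    """Scale one cell's raw counts to a fixed total, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # empty cell: nothing to scale
    return [math.log1p(c * target_sum / total) for c in counts]


def bin_expression(values, n_bins=5):
    """Map non-zero values to integer bins 1..n_bins by rank within the cell.

    Zeros stay 0, so sparsity is preserved as its own token.
    """
    nonzero = sorted(v for v in values if v > 0)
    if not nonzero:
        return [0] * len(values)
    binned = []
    for v in values:
        if v <= 0:
            binned.append(0)
            continue
        # Number of non-zero values strictly below v in this cell.
        rank = bisect.bisect_left(nonzero, v)
        binned.append(1 + min(n_bins - 1, rank * n_bins // len(nonzero)))
    return binned


# Usage: five genes measured in one cell.
raw_counts = [0, 1, 3, 0, 6]
normalized = normalize_log1p(raw_counts)
tokens = bin_expression(normalized, n_bins=3)
```

Binning per cell rather than across the whole dataset makes the tokens less sensitive to differences in sequencing depth between batches, at the cost of discarding absolute expression levels.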
References
- Cui H, Wang C, Maan H, Pang K, Luo F, Duan N, Wang B. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods. 2024 Feb 26:1-1.
Prerequisites
- Knowledge of machine learning, biology, and multi-omics (e.g., genomics, proteomics, transcriptomics, epigenomics)
- Programming language: Python
- Deep learning frameworks: PyTorch, Transformers