Cancer remains one of the most devastating diseases worldwide, with high mortality rates and limited long-term treatment success. Although therapies have advanced, most current interventions provide only marginal improvements in overall survival and progression-free survival. A major reason for this limited success is the heterogeneity of cancer cells: even within a single tumor, cells can differ dramatically in their molecular states, drug sensitivities, and potential for metastasis.
Single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to study this complexity. Unlike traditional bulk RNA sequencing, scRNA-seq profiles gene expression at the resolution of individual cells, revealing subtle differences in cell populations, lineage trajectories, and cell–cell interactions. Recent advances in platforms such as 10x Genomics have enabled the profiling of millions of cells in a single experiment, offering unprecedented insights into disease biology.
Despite this progress, one crucial gap remains: there is no comprehensive, cancer-focused cell atlas that integrates diverse datasets to capture malignant cell heterogeneity across tissues. Moreover, while large foundation models have transformed fields like natural language processing and computer vision, their potential in single-cell biology, and in cancer biology specifically, remains largely unexplored. Early efforts such as scGPT and Geneformer demonstrate the promise of large-scale, self-supervised learning in biology, but no cross-organ, cancer-specific foundation model yet exists to connect cellular states with patient outcomes such as survival and treatment resistance.
| Manager(s) | Stefan Decker, Sandra Geisler |
| Funding | Exzellenzuniversität (EXU); RWTH ERS Seed Fund |
| Project Start | October 01, 2025 |
| Project End | September 30, 2026 |
| Status | Running |
Our project, Mosaic, aims to close this gap by building the first malignant cell atlas based on single-cell transcriptomics and powered by a deep-learning foundation model. Mosaic will unify and standardize publicly available scRNA-seq datasets, providing a resource for systematically studying cancer cell heterogeneity across tissues.
We envision Mosaic as a cross-disciplinary effort, with contributions spanning computer engineering, AI, and biology:
- Computer Engineering Dimension
  - Develop an integrated data preprocessing and harmonization pipeline to handle massive-scale single-cell transcriptomics data efficiently.
  - Ensure scalability and reproducibility, enabling robust training of foundation models across diverse datasets.
- AI Dimension
  - Build a cancer-specific foundation model tailored to the unique characteristics of single-cell transcriptomics data.
  - Leverage state-of-the-art architectures such as Transformers and Graph Neural Networks (GNNs) to learn robust and transferable representations.
  - Enable downstream tasks such as cell-type annotation, gene regulatory network inference, perturbation modeling, and phenotype prediction.
- Biology Dimension
  - Decode the heterogeneity of malignant cells across different cancers, identifying shared and distinct cellular states.
  - Link cellular states to clinical phenotypes, including overall survival and progression-free survival, enabling predictive insights into patient outcomes.
  - Nominate novel precision drug targets through AI-guided perturbation modeling and network analysis.
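To make the preprocessing and harmonization goal concrete, here is a minimal sketch of one common way to bring scRNA-seq count matrices from different studies onto a shared footing. It is an illustration only, not the Mosaic pipeline: the function name `harmonize`, the `target_sum` of 10,000 counts per cell, and the choice to keep only genes present in every dataset are all assumptions made for this example.

```python
import numpy as np


def harmonize(datasets, target_sum=1e4):
    """Align count matrices on shared genes, then normalize and log-transform.

    datasets: list of (counts, genes) pairs, where counts has shape
    (n_cells, n_genes) and genes lists the column names.
    Hypothetical helper for illustration only.
    """
    # Keep only genes measured in every dataset; sort for a stable order.
    shared = sorted(set.intersection(*(set(genes) for _, genes in datasets)))
    blocks = []
    for counts, genes in datasets:
        column_of = {gene: i for i, gene in enumerate(genes)}
        cols = [column_of[gene] for gene in shared]
        sub = np.asarray(counts, dtype=float)[:, cols]
        # Per-cell library size, computed on the shared genes only.
        totals = sub.sum(axis=1, keepdims=True)
        totals[totals == 0] = 1.0  # leave all-zero cells at zero
        # Scale each cell to target_sum counts, then apply log(1 + x).
        blocks.append(np.log1p(sub / totals * target_sum))
    # Stack all cells into a single matrix with one shared column order.
    return np.vstack(blocks), shared
```

Calling `harmonize([(counts_a, genes_a), (counts_b, genes_b)])` returns a single cells-by-genes matrix together with the shared gene list. Intersecting gene sets is the simplest option; an atlas-scale pipeline would also need to handle gene identifier mapping, quality filtering, and batch effects, which this sketch leaves out.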
By unifying single-cell transcriptomics and foundation models, Mosaic aims to serve as a standardized, community-wide resource. This atlas will not only deepen our understanding of malignant cell states but also accelerate the discovery of precision medicine therapies in cancer.