Learned compression and latent-space analytics for multi-modal spatial biology
Code: BBSRC-DFA_2026_27
Primary Supervisor: Cédric M. John
Email: Cedric.john@qmul.ac.uk
Institute: Digital Environment Research Institute
Secondary Supervisor: Ines Sequeira
Email: i.sequeira@qmul.ac.uk
Institute: Institute of Dentistry
Abstract:
Spatial biology technologies such as highly multiplexed imaging and spatial transcriptomics generate extremely high-dimensional data, in which each spatial location encodes a rich molecular profile. While these datasets offer unprecedented insight into tissue organisation, cellular heterogeneity, and biological processes, their size and complexity present major challenges for storage, computation, sharing, and downstream analysis. This PhD project will develop novel deep learning methods to learn compact, biologically meaningful latent representations of spatial biology data that enable efficient compression and high-fidelity reconstruction.
The core objective is to design and benchmark task-aware representation learning frameworks, including convolutional and transformer-based autoencoders and vector-quantised latent models, tailored to the spatial and multimodal nature of omics data. These models will be optimised to balance compression ratio, reconstruction accuracy, and preservation of biological signal. Crucially, the project will evaluate whether compressed latent spaces retain sufficient information to support downstream analytical tasks, such as identification of distinct cell populations, detection of rare cellular states, spatial niche characterisation, and tissue-state classification.
Where co-registered multimodal data are available, the project will explore shared or coupled latent representations across modalities to assess whether exploiting cross-modal structure improves compression efficiency and preservation of biological signal. Compression performance will be analysed as a function of spatial representation, progressing from dense, image-like molecular fields to discretely sampled and event-based spatial transcriptomic representations.
Using large-scale spatial datasets made available by the second supervisor, the project will establish reproducible benchmarks for learned compression in spatial biology.
Lay Summary:
Modern biology can now create incredibly detailed images of tissues, where each tiny point in the image contains information about the genes and proteins inside cells and how they are organised in space. These “spatial biology” technologies allow scientists to study how different cell types are arranged, how they interact with their neighbours, and how complex tissues are structured. However, a major challenge is that these datasets are enormous, often so large that they are difficult to store, share, and analyse efficiently.
This PhD project aims to develop new artificial intelligence (AI) tools to make these large spatial biology datasets much smaller, while preserving the important biological information they contain. The idea is similar to how photos or videos can be compressed on a phone to save space.
Using a type of AI known as deep learning, the project will train computer models to transform very large spatial datasets into compact summaries, known as “latent representations”. These summaries must be small enough to reduce storage and computational demands, but detailed enough to preserve meaningful biological patterns. The research will test whether compressed data can still be used to answer important biological questions, such as distinguishing different cell populations, detecting rare cellular states, and understanding how cells are organised within tissues.
This project will help researchers work more effectively with large datasets, support data sharing and collaboration, and enable new insights into the organisation and function of complex biological systems.
Aims and Objectives:
The primary aim of the project is to develop and validate deep learning methods that generate compact, biologically faithful latent representations of spatial biology data. To achieve this, the project will first design and implement generative encoder–decoder architectures tailored to multi-channel spatial omics data, exploring convolutional, transformer-based, and variational formulations. Particular attention will be paid to the structure of the latent space, investigating continuous and discrete representations, spatially structured embeddings, and regularisation strategies that encourage compactness without loss of biological signal.
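As an illustration of what a discrete latent representation involves, the following minimal sketch shows only the lookup step of a vector-quantised model: each continuous latent vector is replaced by the index of its nearest codebook entry. It is not the project's method. The encoder, decoder, and codebook learning are omitted, and the function name, codebook size, and random data are assumptions chosen for illustration.

```python
import numpy as np


def vector_quantise(latents, codebook):
    """Map each latent vector to its nearest codebook entry (squared Euclidean)."""
    # Pairwise squared distances, shape (n_vectors, n_codes)
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]


# Hypothetical sizes: 16 codes of dimension 4, 100 latent vectors.
# Random values stand in for a trained codebook and real encoder outputs.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))
latents = rng.normal(size=(100, 4))

indices, quantised = vector_quantise(latents, codebook)

# Each vector is now stored as one index, costing log2(16) = 4 bits
bits_per_vector = float(np.log2(len(codebook)))
```

In this sketch the stored representation is the integer index per vector, so the codebook size directly sets the bit budget of the discrete latent space.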
The project will then establish a rigorous benchmarking framework to quantify compression ratio, reconstruction fidelity, and preservation of biological structure. Models will be evaluated on large, publicly available spatial datasets spanning different tissues and experimental conditions. Independent test cohorts will be used to assess generalisability. Learned representations will be compared against standard dimensionality reduction and classical compression baselines to define performance gains in a transparent and reproducible manner.
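A minimal sketch of one such baseline is given below, assuming principal component analysis (PCA) as the standard dimensionality reduction method and mean squared error as the fidelity measure. The function name, the way storage cost is counted (number of stored values, not bytes on disk), and the random stand-in data are all assumptions for illustration. A real benchmark would use measured spatial omics data and would also need measures of biological structure.

```python
import numpy as np


def pca_baseline(X, k):
    """Compress X (n_locations x n_channels) to k components.

    Returns (compression_ratio, reconstruction_mse).
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    Z = Xc @ components.T            # latent codes, shape (n_locations, k)
    X_hat = Z @ components + mean    # reconstruction in the original space
    mse = float(np.mean((X - X_hat) ** 2))
    # Count every value needed to reconstruct: codes, components, and mean
    stored = Z.size + components.size + mean.size
    return X.size / stored, mse


# Hypothetical stand-in data: 500 spatial locations, 40 molecular channels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))

for k in (5, 20):
    ratio, mse = pca_baseline(X, k)
    print(f"k={k}: compression ratio {ratio:.2f}, MSE {mse:.3f}")
```

Reporting the ratio and the error together, for the same stored quantities, is what allows a learned model and a classical baseline to be compared on equal terms.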
A further objective is to demonstrate that compressed latent spaces retain biological utility. The project will assess whether tasks such as cell-type classification, detection of rare cellular states, spatial niche characterisation, and tissue-state classification can be performed directly in latent space without loss of predictive accuracy. Computational efficiency and scalability will be quantified to determine whether latent representations enable faster and more resource-efficient analysis. The relationship between compression strength and task performance will be systematically characterised, providing a principled understanding of compression–utility trade-offs.
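One way such a trade-off could be characterised is sketched below, under strong simplifying assumptions: a linear PCA projection stands in for a learned encoder, a nearest-centroid rule stands in for a cell-type classifier, and the labelled data are synthetic. All function names and parameters are hypothetical. The sketch only shows the shape of the experiment, sweeping the latent dimension and recording held-out accuracy at each setting.

```python
import numpy as np


def latent_accuracy(Z_train, y_train, Z_test, y_test):
    """Nearest-centroid classification accuracy computed in latent space."""
    classes = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    d2 = ((Z_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    pred = classes[d2.argmin(axis=1)]
    return float((pred == y_test).mean())


def tradeoff_curve(X_train, y_train, X_test, y_test, dims):
    """Return (latent_dim, accuracy) pairs for each latent dimension in dims."""
    mean = X_train.mean(axis=0)
    # Projection is fitted on training data only, then applied to test data
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    curve = []
    for k in dims:
        P = Vt[:k].T
        acc = latent_accuracy((X_train - mean) @ P, y_train,
                              (X_test - mean) @ P, y_test)
        curve.append((k, acc))
    return curve


# Synthetic stand-in: 3 well-separated "cell types" measured in 20 channels
rng = np.random.default_rng(0)
centres = 5.0 * rng.normal(size=(3, 20))
y = rng.integers(0, 3, size=600)
X = centres[y] + rng.normal(size=(600, 20))

curve = tradeoff_curve(X[:300], y[:300], X[300:], y[300:], dims=[1, 2, 5, 20])
for k, acc in curve:
    print(f"latent dim {k}: accuracy {acc:.3f}")
```

Plotting accuracy against latent dimension (or against compression ratio) gives the kind of curve from which an acceptable operating point can be chosen for a given task.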
Finally, the project will deliver reproducible, well-documented software tools that enable learned compression of spatial biology data within standard workflows. These tools will be designed for interoperability and open dissemination, supporting community uptake and alignment with responsible AI and open science principles central to DFA training.