
Genomic Inferences in the Era of Black Box Predictions

Presented During: Rietz Award & Lecture

Kathryn Roeder Speaker
Carnegie Mellon University

Sunday, Aug 3: 4:00 PM - 4:50 PM
Invited Paper Session

Music City Center

Since the advent of high-throughput genomic techniques, myriad statistical challenges have arisen due to high dimensionality and missing data. To obtain sufficient sample size, it is often necessary to combine data across related studies; in the process, valuable data can be lost due to high rates of missingness and differing experimental designs. Intriguingly, however, powerful black-box models have been remarkably successful in filling in the missing data. The question that arises is: how can we adjust inferential techniques to account for the additional uncertainty induced by imputation? Here we illustrate several genomic applications in which we overcome these challenges.

While quantitative measurements produced by mass spectrometry proteomics experiments offer a direct way to explore the role of thousands of proteins in molecular mechanisms, analysis of such data is challenging due to the large proportion of missing values. A common strategy for addressing this issue is to impute the missing data, although doing so often introduces systematic bias into downstream analyses if the imputation errors are ignored. We develop a statistical framework inspired by doubly robust estimators that offers valid and efficient inferences for proteomic data. Our framework utilizes a customized variational autoencoder (VAE) to obtain excellent imputation quality, and a propensity score for debiasing imputed outcomes. Our estimator is compatible with the double machine learning framework, which allows us to gain additional, meaningful discoveries and yet maintain good control of false positives.
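
As a rough illustration of the debiasing idea, and not the framework presented in the lecture, the sketch below applies the textbook augmented inverse-probability-weighted (AIPW) form of a doubly robust mean estimate to simulated data. The function name `aipw_mean`, the simulated covariate, the deliberately biased linear imputer standing in for a VAE, and the assumption that the propensity of being observed is known are all hypothetical choices made for the example.

```python
import numpy as np

def aipw_mean(y, observed, imputed, propensity):
    """Doubly robust (AIPW) estimate of E[Y] when some outcomes are missing.

    y          : outcomes; entries where observed == 0 are ignored
    observed   : 1 if the outcome was measured, 0 if it is missing
    imputed    : model prediction of the outcome for every unit
    propensity : probability that the outcome is observed, for every unit
    """
    y = np.where(observed == 1, y, 0.0)  # missing entries never contribute
    # Inverse-probability-weighted residual corrects the imputation error.
    residual = observed * (y - imputed) / propensity
    scores = imputed + residual
    estimate = scores.mean()
    std_error = scores.std(ddof=1) / np.sqrt(len(scores))
    return estimate, std_error

# Simulated example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y_full = 1.0 + 2.0 * x + rng.normal(size=n)    # true mean of Y is 1.0
propensity = 1.0 / (1.0 + np.exp(-(0.5 + x)))  # missingness depends on x
observed = (rng.uniform(size=n) < propensity).astype(int)
y = np.where(observed == 1, y_full, np.nan)

imputed = 0.5 + 2.0 * x                        # deliberately biased imputer

# Naive approach: plug in imputations and ignore their error.
naive = np.where(observed == 1, y_full, imputed).mean()
dr_est, dr_se = aipw_mean(y, observed, imputed, propensity)

print(f"naive imputed mean: {naive:.3f}")
print(f"doubly robust mean: {dr_est:.3f} (SE {dr_se:.3f})")
```

In this simulation the plug-in mean inherits the imputer's bias, while the weighted residual term removes it; in practice both the imputation model and the propensity score would be estimated, typically with cross-fitting as in double machine learning.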

Recent advances in single-cell technologies enable joint profiling of multiple omics. These profiles can reveal the complex interplay of different regulatory layers in single cells; still, new challenges arise when integrating datasets with some features shared across experiments and others exclusive to a single source. Combining information across these sources is called mosaic integration. The missing data can be imputed with surprising success using our customized VAE, but conducting inference across these integrated samples is still challenging. We frame this problem in the context of semi-supervised learning, and assume the modality of interest is measured in a smaller supervised dataset, while it is unmeasured in the much larger unsupervised sample (i.e., vanishing overlap). We extend available theoretical results to accommodate our setting. Our methods apply to a wide range of smooth statistical targets – including means, linear coefficients, quantiles, and causal effects – and remain valid under high-dimensional nuisance estimation, distributional shift between labeled and unlabeled samples, and overlap that vanishes as sample size increases. We construct estimators that are doubly robust and asymptotically normal by deriving influence functions under this regime. A key insight is that classical root-n convergence fails under vanishing overlap; we instead provide corrected asymptotic rates that capture the impact of the decay in overlap. We apply our methods to multi-omic single-cell samples.
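
To make the semi-supervised setup concrete, the following minimal sketch estimates a mean from a small labeled sample and a much larger unlabeled sample using the basic prediction-powered form: the average prediction on the unlabeled data plus an average residual from the labeled data. It is a simplification that assumes no distributional shift between the two samples, so it does not cover the vanishing-overlap regime described above; the name `ppi_mean`, the hypothetical `predictor`, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def ppi_mean(y_labeled, pred_labeled, pred_unlabeled):
    """Prediction-powered estimate of E[Y], assuming no distribution shift.

    Combines the mean prediction on unlabeled units with a "rectifier"
    (mean labeled residual) that corrects the predictor's bias.
    """
    rectifier = y_labeled - pred_labeled
    estimate = pred_unlabeled.mean() + rectifier.mean()
    variance = (pred_unlabeled.var(ddof=1) / len(pred_unlabeled)
                + rectifier.var(ddof=1) / len(rectifier))
    return estimate, np.sqrt(variance)

def predictor(x):
    # Hypothetical black-box prediction; biased on purpose.
    return 0.3 + 1.8 * x

rng = np.random.default_rng(1)
n_lab, n_unlab = 500, 50000

x_lab = rng.normal(size=n_lab)
y_lab = 2.0 * x_lab + rng.normal(scale=0.5, size=n_lab)  # true mean is 0.0
x_unlab = rng.normal(size=n_unlab)                       # outcome unmeasured

ppi_est, ppi_se = ppi_mean(y_lab, predictor(x_lab), predictor(x_unlab))
plain_se = y_lab.std(ddof=1) / np.sqrt(n_lab)  # labeled-only standard error

print(f"prediction-powered mean: {ppi_est:.3f} (SE {ppi_se:.3f})")
print(f"labeled-only SE:         {plain_se:.3f}")
```

When the predictions track the outcome well, the residuals have small variance and the combined estimate is more precise than the labeled-only mean, even though the predictor itself is biased.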

Single-cell RNA sequencing used in conjunction with CRISPR-based perturbation (Perturb-seq) can uncover the function of genes; however, it can be costly to perform as many perturbation experiments as desired. Ideally, it would be possible to use a model to predict the outcome of perturbations related to those already performed. Despite their high dimensionality and sparsity, these data have shown themselves to be amenable to analysis by deep learning methods, which provide us with a framework for this task. We utilize a model combining VAE and denoising diffusion models to generate realistic single-cell RNA-seq data. Remarkably, we have had success in generating data for perturbation experiments that were never performed, provided we have a rich set of data from related experiments. We use ideas derived from the semiparametric inference literature to obtain inferential techniques that are somewhat successful in this challenging setting.

Keywords

genomics

machine learning

prediction-powered inference

missing values

debiasing

proteomics