Contributed Poster Presentations: Section on Statistics in Genomics and Genetics

Shirin Golchi Chair
McGill University
 
Tuesday, Aug 5: 2:00 PM - 3:50 PM
4122 
Contributed Posters 
Music City Center 
Room: CC-Hall B 

Main Sponsor

Section on Statistics in Genomics and Genetics

Presentations

41: Addressing Heterogeneous Sensitivity in Biomarker Screening with Application in NanoString nCounter

Biomarkers are measurable indicators of biological processes and have wide biomedical applications including disease screening and prognosis prediction. Candidate biomarkers can be screened in high-throughput settings, which allow simultaneous measurements of a large number of molecules. For binary biomarkers, the ability to detect a molecule may be hindered by the presence of background noise and the variable signal strength, which lower the sensitivity to a different extent for different target molecules in a sample-specific manner. This heterogeneity in detection sensitivity is often overlooked and leads to an inflated false positive rate. We propose a novel sensitivity adjusted likelihood-ratio test (SALT), which properly controls the false positives and is more powerful than the unadjusted approach. We show that sample-and-feature-specific detection sensitivity can be well estimated from NanoString nCounter data, and using the estimated sensitivity in SALT results in improved biomarker screening. 

Keywords

High-throughput biomarker screening

Binary biomarker

Detection sensitivity

Sample-and-feature-specific sensitivity

Hypothesis testing

NanoString nCounter 

Co-Author

Zhijin Wu, Brown University

First Author

Chang Yu

Presenting Author

Chang Yu

42: Application of the STAAR Framework in Detecting Rare Variant Associations with Alzheimer's Disease and Related Dementias: Insights and Implications

Introduction: Rare genetic variation is considered a potential source of heritability in individuals with sporadic Alzheimer's Disease and related dementias (ADRD). The STAAR framework leverages multiple functional annotations of genetic variants and combines association statistics from multiple variant aggregation-based methods, including burden, SKAT, and ACAT-V, into a single measure of significance.

Method: Using whole genome sequencing data from the Alzheimer's Disease Sequencing Project (ADSP), we comprehensively examined the association of rare genetic variation with ADRD in 23,455 individuals (37% ADRD cases) and with cognitively healthy elder status in 13,292 individuals (13% cognitively healthy elders) from diverse populations via the STAAR framework.

Results: We identified several genes significantly associated with ADRD or cognitively healthy status. However, our analysis revealed several limitations within the STAAR framework when incorporating ultra-rare variants with dichotomous outcomes. To enhance the robustness of the framework, we proposed several computational refinements, including creating a burden of ultra-rare variants and employing more precise annotations to match the expected mechanism. After implementing the proposed modifications, the association with ADRD for ZNF200 was no longer statistically significant (α = 1×10⁻⁷), while TBX19, PLXNB2, CARD11, and LINC01880 remained significantly associated with cognitively healthy status.

Conclusion: We identified and addressed the computational limitations in the STAAR framework that could lead to potentially spurious results for ultra-rare variant aggregates with an extremely low cumulative minor allele count. Our proposed refinements produced more robust results for associations with rare variants in the context of dichotomous outcomes.
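As an illustration of how several aggregation-test p-values can be merged into a single measure of significance, the sketch below applies the Cauchy combination rule that underlies ACAT-type tests. The abstract does not state the exact weighting used, so the equal weights, the function name `cauchy_combine`, and the three input p-values are assumptions made purely for illustration.

```python
import math

def cauchy_combine(p_values, weights=None):
    """Merge p-values with the Cauchy combination rule (inputs must lie in (0, 1))."""
    if weights is None:
        weights = [1.0] * len(p_values)  # assumption: equal weights
    total_w = sum(weights)
    # Under the null, tan((0.5 - p) * pi) is a standard Cauchy variate,
    # and a weighted average of such variates is again standard Cauchy.
    t_stat = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    ) / total_w
    # Upper-tail probability of a standard Cauchy at t_stat.
    return 0.5 - math.atan(t_stat) / math.pi

# Hypothetical p-values for burden, SKAT, and ACAT-V on one variant set.
combined = cauchy_combine([0.04, 0.002, 0.01])
print(f"combined p-value: {combined:.4g}")
```

Because the transform has heavy tails, the combined value is driven mostly by the smallest inputs, which is also why aggregates with very few minor alleles deserve the extra scrutiny described above.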
 

Keywords

Rare variant analysis

STAAR framework

Alzheimer's disease 

Co-Author(s)

Nancy Heard-Costa, Department of Medicine, Boston University School of Medicine;NHLBI Framingham Heart Study
Andy Rampersaud, Research Computing Services, Information Services & Technology, Boston University
Eden Martin, University of Miami-Miami Institute of Human Genomics
Adam Naj, Department of Biostatistics, Epidemiology, and Informatics, Department of Pathology and Laboratory
Bilcag Akgun, John P Hussman Institute for Human Genomics
Brian Kunkle, John P Hussman Institute for Human Genomics; John T Macdonald Department of Human Genetics
Gina Peloso
Anita DeStefano, Department of Biostatistics, Boston University School of Public Health
Xihao Li, University of North Carolina at Chapel Hill
Seung Hoan Choi, Department of Biostatistics, Boston University School of Public Health

First Author

Dongyu Wang, Department of Biostatistics, Boston University School of Public Health

Presenting Author

Dongyu Wang, Department of Biostatistics, Boston University School of Public Health

43: Comparison of Linear Mixed-Effect and Deep Learning Models for Predicting Phenotypes using GWAS

Large biobank studies, such as the UK Biobank, provide us with unprecedented opportunities to predict various phenotypes with their rich genome-wide association studies (GWAS) data collected from massive populations. The adoption of linear mixed models (LMMs) to predict phenotypes was a significant milestone and a major success in the history of GWAS. Nevertheless, the classic LMM-based methods for GWAS data often fail to account for the dependence structure between single nucleotide polymorphisms (SNPs). Meanwhile, deep learning has recently demonstrated remarkable success in computer vision, protein structure prediction and functional genomics. Deep learning is able to model complex non-linear relationships and can exploit the dependence structure among features. Therefore, it is of great interest to compare the predictive capabilities of classic LMM-based methods and deep learning models for GWAS data. Here, we systematically compare the performance of LMM-based methods and deep learning models in predicting a dozen phenotypes using the UK Biobank data and discuss the strengths and limitations of both approaches. 

Keywords

Genome-Wide Association Studies (GWAS)

Linear Mixed-Effect Models

Deep Learning

Biobank 

Co-Author

Yingying Wei, The Chinese University of Hong Kong

First Author

Muhammad Danish

Presenting Author

Muhammad Danish

44: Detecting and quantifying mediation of health outcomes by microbial communities

Many studies of human microbiome epidemiology have focused on the effects of health outcomes and exposures on the microbiome or the effects of the microbiome on health outcomes. However, there is increasing interest in understanding complex relationships where exposures alter microbiome composition, which in turn affects health outcomes (i.e., the microbiome "mediates" the exposure's effect on health). Such hypotheses can be tested by statistical mediation analysis, but typical methods are not appropriate for microbiome data due to zero-inflation, compositionality, and high-dimensionality. Using realistic simulated microbiome data, we compared the performance of (1) low-dimensional mediation methods, (2) high-dimensional, non-compositional mediation, and (3) specialized methods for microbiome under differing circumstances. We further compared these methods in two real-world datasets assessing the effect of diet on cardiometabolic disease. We make recommendations on best methods for total direct effect and total/component indirect effects. Notably, no one method performed the best in all tests, indicating the nuance in microbiome mediation analyses and the need for new methods. 

Keywords

human microbiome

mediation analysis

microbiome epidemiology

metagenomics 

Co-Author(s)

Emma Accorsi, Harvard T. H. Chan School of Public Health
Eric Franzosa, Harvard T. H. Chan School of Public Health
Nicole Levesque, Harvard T. H. Chan School of Public Health
Siyuan Ma, Vanderbilt University Medical Center
Curtis Huttenhower, Harvard School of Public Health

First Author

Haoyue Li

Presenting Author

Haoyue Li

45: Drug-DMsim: A Novel Pipeline for Inferring Drug MOA via Differential Module Similarity

Uncovering the mechanism of action (MOA) of molecules is a pivotal aspect of drug discovery. Current methods, which rely on gene signatures or structural similarities to predict MOA, face substantial challenges, including the intricacies of gene expression and "Activity cliffs." To overcome these hurdles, we propose a novel approach named Drug Differential Modular Similarity (Drug-DMsim), which is designed to model the effects of drugs on the gene regulatory network (GRN) and infer MOAs from known drugs. This approach involves: (1) employing mutual information and partial correlation to independently reconstruct GRNs, (2) generating differential modularity scores to quantify the division strength of a GRN into distinct modules, and (3) utilizing a dimensionality reduction technique to map molecules onto a 2D space, facilitating the identification of patterns and clusters, and enhancing the interpretability and analysis of relationships between different molecules. By applying the proposed approach to LINCS datasets, we identified potential new drug targets. This novel approach advances our understanding of the molecular mechanisms of drugs and enables faster drug discovery. 

Keywords

Drug discovery

mechanism of action (MOA)

gene regulatory network (GRN)

Differential modularity

Dimensionality reduction

LINCS 

Co-Author(s)

Komlan Atitey, National Institute of Environmental Health Sciences (NIEHS)
Benedict Anchang, NIEHS

First Author

Jiaqi Li, National Institute of Environmental Health Sciences

Presenting Author

Jiaqi Li, National Institute of Environmental Health Sciences

46: Extending Sparse CCA for Multi-Population, Multi-Feature Integration

Sparse canonical correlation analysis (SCCA) identifies sparse linear combinations between two sets of features that are highly correlated with each other. While multiple SCCA methods extend this framework to more than two datasets, they assume measurements of different features within the same population. Here, we propose an extension of SCCA designed for settings with four data matrices derived from two distinct populations, each with two different feature sets. The correlation maximization problem is reframed as a minimization problem and the original canonical weights are decomposed into two separate components that capture the shared and unique variance for each dataset. Via simulations, we demonstrate the improved performance of our method to recover the true canonical weights in comparison to naïve methods that disregard either the shared or unique components. For real data analysis, we apply our method to integrate two single-cell multiomic datasets of peripheral blood mononuclear cells with simultaneous measures of both RNA expression and chromatin accessibility, benchmarking its performance against widely used single-cell integration pipelines such as Seurat and Signac. 

Keywords

Sparse Canonical Correlation Analysis

Data Integration

Variance Decomposition

Single-Cell Multiomics 

Co-Author(s)

Quefeng Li, University of North Carolina Chapel Hill
Yuchao Jiang, Texas A&M University

First Author

Renee Ge

Presenting Author

Renee Ge

48: Inference of Heterogeneous Effects in Single-cell Genetic Perturbation Screens

Recent single-cell CRISPR screening experiments have combined the advances of genetic editing and single-cell technologies, leading to transcriptome-scale readouts of responses to perturbations at single-cell resolution. An outstanding question is how to efficiently identify heterogeneous effects of perturbations using these technologies. Here we present CausalPerturb, which leverages tools in causal analysis to dissect the heterogeneous landscape of perturbation effects. CausalPerturb disentangles transcriptome changes introduced by perturbations from those reflecting inherent cell-state variations. It provides nonparametric inferences of perturbation effects, enabling a range of downstream tasks including genetic interaction analysis, perturbation clustering and prioritization. We evaluated CausalPerturb via simulation and real datasets, and demonstrated its competence in characterizing latent confounding factors and discerning heterogeneous perturbation effects. The application of CausalPerturb unraveled novel genetic interactions between erythroid differentiation drivers. In particular, it pinpointed the role of the synergistic interaction between CBL and CNN1 in the S phase. 

Keywords

single-cell RNA-seq

genetic perturbation

causal inference

heterogeneous effects

deep learning 

Co-Author

Lin Hou, Tsinghua University

First Author

Zichu Fu

Presenting Author

Zichu Fu

49: Interpretable Ordinal Analysis for Complex Designs in Cell and Molecular Biology

Visual scoring is widely used in biomedical research to translate complex biological traits into ordered datasets suitable for hypothesis testing. Although advanced statistical methods exist for analyzing ordered data, use of ordinal methods by researchers remains limited. Parameter estimates from ordinal regression models, such as odds ratios or differences in probits, can hinder adoption due to their interpretive complexity. Recently, summary measures for ordinal regression models have been proposed to improve interpretability. In this work, we demonstrate the application of the γ (gamma) and ∆ (delta) ordinal superiority measures to more complex experimental designs, including interactions and multicategorical explanatory variables. Using an example dataset on cellular stress response phenotypes, we illustrate how these measures can be utilized in complex experimental designs to yield clear, meaningful interpretations of ordinal regression analyses. By demonstrating real-world applicability, this work provides a practical resource for biological researchers working with ordered response data and promotes broader adoption of ordinal regression techniques in biomedical studies. 
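For readers unfamiliar with the two summary measures, the sketch below computes them directly from two vectors of category probabilities, using the usual definitions γ = P(Y1 > Y2) + ½ P(Y1 = Y2) and ∆ = P(Y1 > Y2) − P(Y2 > Y1) for independent observations Y1 and Y2 from two groups. In practice the probabilities would come from a fitted ordinal regression model; the function name and the example probabilities here are hypothetical.

```python
def ordinal_superiority(p1, p2):
    """Return (gamma, delta) for two groups given their category probabilities.

    p1[k] and p2[k] are the probabilities of ordered category k in each group.
    """
    n = len(p1)
    greater = sum(p1[i] * p2[j] for i in range(n) for j in range(n) if i > j)
    less = sum(p1[i] * p2[j] for i in range(n) for j in range(n) if i < j)
    ties = sum(p1[k] * p2[k] for k in range(n))
    gamma = greater + 0.5 * ties   # 0.5 means no tendency either way
    delta = greater - less         # 0 means no tendency either way
    return gamma, delta

# Hypothetical three-level stress score (low, medium, high) in two conditions.
treated = [0.2, 0.3, 0.5]
control = [0.5, 0.3, 0.2]
gamma, delta = ordinal_superiority(treated, control)
print(f"gamma = {gamma:.3f}, delta = {delta:.3f}")
```

The two measures carry the same information, since γ = (∆ + 1) / 2; γ reads as a probability while ∆ reads as a difference centered at zero.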

Keywords

Ordinal data

Ordinal Regression

Cumulative Link Models

Interaction Terms

Proportional Odds

Ordinal Superiority Measure 

Co-Author

Jeffrey Lewis, University of Arkansas

First Author

Carson Stacy, University of Arkansas

Presenting Author

Carson Stacy, University of Arkansas

50: Joint FDR Control Under Multiple Conditions

Integrating information across correlated conditions can improve statistical power by utilizing shared underlying mechanisms. Here, we are concerned with the problem of identifying which variables, among a large number of them, respond to two different conditions. Rather than treating it as two separate multiple comparisons problems, we propose to jointly estimate three proportions: the proportion of variables responding to each of the two conditions and the proportion responding to both conditions, a scenario not uncommon in biological sciences. By utilizing the shared information, our method achieves higher statistical power. The advantage of our method will be illustrated using two examples: (1) identifying genes whose expression levels in the brain are altered by radiation exposure but restored by a treatment designed to mitigate the harm caused by radiation therapy, and (2) detecting DNA variants associated with a psychometric disorder using information from a related disorder. 
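For context, the baseline that the abstract contrasts itself with, treating the two conditions as separate multiple comparisons problems, can be sketched as two independent Benjamini-Hochberg runs followed by an intersection. This is not the proposed joint method, whose details the abstract does not give; the p-values below are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its BH threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Hypothetical p-values for the same four variables under two conditions.
p_condition_a = [0.01, 0.02, 0.03, 0.50]
p_condition_b = [0.20, 0.001, 0.004, 0.90]

hits_a = benjamini_hochberg(p_condition_a)
hits_b = benjamini_hochberg(p_condition_b)
# Naive call of "responds to both": intersect the two separate rejection sets.
hits_both = sorted(set(hits_a) & set(hits_b))
print(hits_a, hits_b, hits_both)
```

The separate analysis ignores that a variable responding to one condition is more likely to respond to the other, which is the shared information a joint approach aims to use.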

Keywords

statistical power

false discovery rate

gene expression analysis

high-dimension 

Co-Author

Zhaoxia Yu, University of California, Irvine

First Author

Sara Tyo

Presenting Author

Sara Tyo

51: Likelihood-based inference of migration surfaces

In this work, we derive a method for visualizing spatial population structure using inverse instantaneous coalescent rate (IICR) curves. Unlike traditional approaches, such as EEMS, which model genetic variation as a function of migration rates and approximate its expectation using resistance distance, our method introduces a fundamentally different perspective by focusing on the coalescent process. The IICR curve quantifies the rate at which lineages coalesce as a function of time, providing a framework for inferring population structure. Our approach is based on a stepping-stone model and we model the relationship between pairs of samples as independent Markov processes with an extended joint state space that accounts for coalescence. By utilizing efficient procedures to compute the matrix exponential, we derive the distribution of coalescent times and expected IICR curves with high computational efficiency. This enables us to infer migration surfaces and visualize population structure. 

Keywords

migration surface

demographic inference

population genetics 

Co-Author

Jonathan Terhorst, University of Michigan

First Author

Jiatong Liang

Presenting Author

Jiatong Liang

52: M6A Peak Calling Accounting for Sequencing Bias Across Regions and Samples

N6-Methyladenosine (m6A) is the most abundant type of mRNA methylation and is most widely measured by methylated RNA immunoprecipitation sequencing (MeRIP-seq). In MeRIP-seq, an immunoprecipitation (IP) sample and a pairing control (input) sample are sequenced for each biological sample. Methylated regions are identified as peaks showing increased counts in the IP sample versus the input. We report that technical bias in sequencing can vary substantially in the IP and input samples depending on the local sequence context. Current sequencing depth-based normalization does not appropriately account for the varying technical bias along the transcriptome and leads to inaccurate identification of m6A regions. We describe a method to estimate a local size factor that reflects the RNA sequence context and show that peak calling using these region-specific size factors identifies more accurate peak regions. 

Keywords

transcription

RNA methylation

m6A

MeRIP-seq 

Co-Author(s)

Zhenxing Guo
Zhaohui Qin, Emory University
Zhijin Wu, Brown University

First Author

Lanyu Zhang

Presenting Author

Lanyu Zhang

53: Nonparametric Denoising of Microbiome Metagenomics Data

We propose a nonparametric method to denoise microbiome metagenomics sequencing count matrices. The goal of denoising is to recover the non-zero expected abundances of rare taxa and reduce the variance of prevalent taxa. The count matrices are dichotomized into a series of binary matrices given a sequence of thresholds. We estimate the probability of each count matrix entry being larger than each threshold by taking products of conditional probabilities. We develop a novel matrix factorization algorithm for the low-rank representation of conditional probabilities. We calculate the denoised count based on the empirical distribution formed by the estimated probabilities. Simulations show that our method is better than parametric competitors at recovering accurate microbiome compositions. Our denoising method can improve downstream analyses such as training prediction models and microbiome network analysis. 
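The chaining of conditional probabilities described above can be sketched for a single matrix entry as follows. The sketch assumes integer thresholds 0, 1, 2, ... and takes the denoised count to be the mean of the implied distribution; both choices, the function names, and the example probabilities are illustrative assumptions, and the low-rank estimation of the conditional probabilities is not shown.

```python
def survival_from_conditionals(cond_probs):
    """Turn conditional exceedance probabilities into P(X > t_k).

    cond_probs[k] is P(X > t_k | X > t_{k-1}), with P(X > t_{-1}) taken as 1,
    so P(X > t_k) is the running product of the first k + 1 entries.
    """
    survival, running = [], 1.0
    for q in cond_probs:
        running *= q
        survival.append(running)
    return survival

def expected_count(survival):
    """Mean of a non-negative integer count with thresholds 0, 1, 2, ...

    Uses E[X] = sum over t of P(X > t), truncated at the last threshold.
    """
    return sum(survival)

# Hypothetical estimates for one entry: P(X > 0) and P(X > 1 | X > 0).
cond = [0.5, 0.5]
surv = survival_from_conditionals(cond)
print(surv, expected_count(surv))
```

Note that an observed zero can still receive a positive denoised value whenever the estimated P(X > 0) is positive, which is how non-zero expected abundances of rare taxa can be recovered.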

Keywords

Microbiome metagenomics

Denoise

Binarization

Matrix factorization

Nonparametric 

Co-Author

Gen Li, University of Michigan

First Author

Mukai Wang

Presenting Author

Mukai Wang

54: Optimal Gene Panel Selection for Targeted Spatial Transcriptomics Experiments

Spatial transcriptomics is an emerging and transformative technique that provides high-resolution insights into gene expression patterns across diverse cell populations. However, because most single-cell resolution spatial profiling methods can only measure a limited set of genes, it is crucial to select a gene panel that optimally captures the biological information. Methods for optimal gene panel design are still lacking. Here, we introduce a novel method, optimal reconstruction genes selection for spatial transcriptomics (ReconST), incorporating a specifically designed autoencoder model to identify a minimal yet highly informative set of genes. By training our model on single-cell RNA sequencing (scRNA-seq) data, we show that this selected gene panel optimally reconstructs the full transcriptome. We validate our approach on paired scRNA-seq data and MERFISH data, demonstrating improved reconstruction accuracy and a clear representation of spatial patterns. ReconST provides a practical and explainable framework for optimal gene panel selection, advancing the use of spatial transcriptomics to deepen our understanding of gene expression in tissue contexts. 

Keywords

Spatial Transcriptomics

Gene Panel Selection

Self-supervised learning

Deep learning

Regularization 

Co-Author(s)

Luyang Fang, University of Georgia
Wenxuan Zhong, University of Georgia
Guo-Cheng Yuan, Dana-Farber Cancer Institute
Ping Ma, University of Georgia

First Author

Haoran Lu, University of Georgia

Presenting Author

Haoran Lu, University of Georgia

55: Raman Spectra using wavelet-based ANOVA: Uncovering Dietary-Gene Spectral Components in Mice

We explored the effect of genotype and dose on the reaction of mice when exposed to different compounds present in various foods. To do so, Raman spectra of mice were obtained at baseline (prior to exposure) and at least two occasions post-exposure. As a first step, we fitted a functional ANOVA (FANOVA) model to the spectral responses. Challenges with this type of data include the presence of long-range dependence and high-dimensionality. To address this, we transformed the discretized FANOVA model to the wavelet domain, decorrelating and regularizing the inputs while preserving the model structure. Soft-thresholding based on median absolute deviation is used for noise reduction, and inverse wavelet transform reconstructs refined estimates in the original domain. This wavelet-based ANOVA (WANOVA) enhances the interpretability of Raman spectral data, offering a novel framework for detecting food compound interactions with genetic variations, with potential implications for personalized nutrition and biomedical research. 
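The thresholding step can be illustrated on a single discretized curve with a one-level Haar transform. This is only a sketch: the abstract does not name the wavelet family, decomposition depth, or threshold rule, so the Haar basis, the single level, and the universal threshold σ√(2 log n) with σ = MAD / 0.6745 are assumptions, and the ANOVA structure itself is not shown.

```python
import math
import statistics

def haar_step(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def inverse_haar_step(approx, detail):
    """Invert haar_step, interleaving the reconstructed pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def soft_threshold(x, lam):
    """Shrink x toward zero by lam, setting small values exactly to zero."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def denoise(signal):
    """Soft-threshold the detail coefficients with a MAD-based noise estimate."""
    approx, detail = haar_step(signal)
    med = statistics.median(detail)
    sigma = statistics.median(abs(d - med) for d in detail) / 0.6745
    lam = sigma * math.sqrt(2.0 * math.log(len(signal)))  # assumed threshold rule
    shrunk = [soft_threshold(d, lam) for d in detail]
    return inverse_haar_step(approx, shrunk)

# Hypothetical noisy intensities along eight wavenumbers.
spectrum = [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8, 5.1]
print([round(v, 3) for v in denoise(spectrum)])
```

Because the transform is orthonormal, linear model structure in the original domain carries over unchanged to the coefficients, which is what allows shrinkage to be applied without altering the model.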

Keywords

Raman Spectroscopy

Wavelet Transform

WANOVA

FANOVA 

Co-Author(s)

Brani Vidakovic, Texas A&M University, Statistics Department
Patrick Stover, Texas A&M University
Regan Bailey, Texas A&M University
Alicia Carriquiry, Iowa State University

First Author

Jaeseon Lee, Texas A&M University

Presenting Author

Jaeseon Lee, Texas A&M University

57: SpaDiff: Denoising for Sequence-based Spatial Transcriptomics via Diffusion Process

Spatial transcriptomics is revolutionizing our understanding of complex biological systems by enabling the analysis of RNA transcriptomes with precise spatial resolution. Sequence-based spatial transcriptomics technologies, such as Visium from 10X Genomics, provide critical insights into tissue architecture and cellular interactions within their native microenvironments. However, a significant challenge in spatial transcriptomics is the phenomenon of spot-swapping, where RNA molecules are not confined to their original locations on the tissue slide, introducing noise and inaccuracies into the data. To solve this problem, we propose SpaDiff, which models spot-swapping via a diffusion process. By applying Langevin MCMC, our model emulates the RNA molecules' diffusion and reverse diffusion processes, offering a more effective and generalizable solution to data denoising in spatial transcriptomics. By applying SpaDiff to multiple synthetic and real datasets, we show that it can not only retain the original UMI counts but also enhance the spatial specificity of biomarker gene expression, thereby improving the accuracy of subsequent analyses and the interpretation of biological p 
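As background on the sampler named above, the sketch below runs an unadjusted Langevin chain on a toy one-dimensional target whose score function is known in closed form. It is not SpaDiff: the standard normal target, the step size, and all names are assumptions chosen only to show the update rule x ← x + ε · score(x) + √(2ε) · z.

```python
import math
import random

def langevin_chain(score, x0, step, n_steps, rng):
    """Unadjusted Langevin dynamics driven by a score function (gradient of log density)."""
    x, samples = x0, []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Drift along the score plus injected Gaussian noise.
        x = x + step * score(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples

rng = random.Random(0)
# Toy target: standard normal, whose score is -x. Start far from the mode.
samples = langevin_chain(lambda x: -x, x0=5.0, step=0.1, n_steps=20000, rng=rng)
kept = samples[2000:]  # discard burn-in
mean = sum(kept) / len(kept)
var = sum((v - mean) ** 2 for v in kept) / len(kept)
print(f"mean ~ {mean:.2f}, variance ~ {var:.2f}")
```

Without a Metropolis correction the chain is slightly biased for any finite step size, so the sample variance sits a little above 1 here; smaller steps reduce the bias at the cost of slower mixing.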

Keywords

Sequence-based Spatial Transcriptomics

Data Denoising

Diffusion Process

Score Function

Langevin MCMC 

Co-Author(s)

Yongkai Chen
Luyang Fang, University of Georgia
Guocheng Yuan, Icahn School of Medicine at Mount Sinai
Wenxuan Zhong, University of Georgia
Ping Ma, University of Georgia

First Author

Jiazhang Cai

Presenting Author

Jiazhang Cai

58: Sparse Bayesian Partially Identified Models Enhance Differential Abundance and Expression Analyses

In genomics, differential expression and abundance analyses are challenging due to the compositional structure of the data. These data only provide information about the relative abundance of taxa or the relative expression of genes and not absolute amounts. While many authors have approached this problem through data normalizations, we have shown that such methods are flawed as they imply strong, often implausible assumptions about total microbial load or total gene expression. Even slight errors in these assumptions often lead to Type-I and/or Type-II error rates in excess of 70%. Here, we show similar flaws with currently available sparse estimators, which attempt to overcome compositional problems by assuming few taxa (or genes) are changing in abundance (or expression) between conditions. Instead, we show that a novel sparse Bayesian Partially Identified Model overcomes the limitations of existing methods by accounting for uncertainty in the sparsity assumptions themselves. We prove the consistency of our novel estimator. Moreover, through both simulated and real data analysis, we show that our methods can drastically reduce Type-I and Type-II errors compared to existing methods. 

Keywords

Compositional Data

Bayesian Partially Identified Model

Sparsity Assumption

Type-I and Type-II Errors

Uncertainty Quantification 

Co-Author

Justin Silverman, Penn State University

First Author

Won Gu

Presenting Author

Won Gu

59: Stage-Specific BIN1 Effects Link Tau to Preclinical Functional Connectivity in Alzheimer's Disease

We model mediation of the effect of BIN1 genetic risk (rs6733839) on functional connectivity (FC) through tau pathology in Alzheimer's disease, comparing cognitively normal (CN, n=104) and mild cognitive impairment (MCI, n=101) groups. Using baseline data from ADNI with temporally ordered biomarkers (preceding imaging), we identified FC components (IC1–IC10) via ICA and found IC5 (Dorsal Attention-Default Mode/Visual networks) associated with Aβ (p = 0.00027) and group-dependent tau effects (IC5×Group interaction: p = 0.002). We then tested SNP→tau→IC5 paths using multi-group mediation, allowing group-specific slopes. In CN, the BIN1 risk allele (T) was linked to reduced tau (β=−0.12, p=0.03) and marginal indirect preservation of IC5 (β=0.16, p=0.08). In MCI, direct SNP effects dominated (β=−0.39, p=0.005), with no tau mediation. Paradoxically, the T allele was associated with lower tau (β=−0.11, p=0.04) despite being an AD risk variant, suggesting stage-dependent BIN1 isoform effects (early clearance vs late aggregation). Temporal precedence (biomarkers pre-imaging) strengthens causal plausibility. Results suggest IC5 as a preclinical resilience marker and highlight shifting pathways. 

Keywords

multi-group SEM

mediation analysis

Alzheimer’s disease

functional connectivity

BIN1

tau pathology 

Co-Author(s)

Rui Chen, Vanderbilt University
Ke Xu, Vanderbilt University Medical Center
Xue Zhong, Vanderbilt University Medical Center
Yuting Tan, Vanderbilt University
Anshul Tiwari, Vanderbilt University
Zhexing Wen, Emory University
Bingshan Li, Vanderbilt University
Hakmook Kang, Vanderbilt University

First Author

Yan Yan, Vanderbilt University

Presenting Author

Yan Yan, Vanderbilt University

60: TPClust: Temporal Profile-Guided Disease Subtyping Using High-Dimensional Omics Data

Disease subtyping using unsupervised clustering of omics data often results in subtypes with limited clinical relevance, while existing supervised methods are not suitable for longitudinal data. To address this, we developed a novel latent generative model for disease subtyping that integrates longitudinal clinical data and high-dimensional omics data. Our method comprises two components: a multinomial logistic regression using omics to define subtypes and a longitudinal association model capturing time-varying relationships between clinical variables. These are integrated via a mixture regression. We incorporate omics feature selection and smooth estimation of time-varying associations into the model fitting. A multiplier bootstrap was used to construct confidence intervals for time-varying effects. We validated our method through simulations and applied it to 1,020 adults from the Religious Orders Study and Memory and Aging Project (ROS/MAP), two longitudinal cohorts for investigating Alzheimer's Disease (AD). Our approach captures the time-varying effects of AD risk factors and enables accurate inference on these effects, leading to the detection of clinically meaningful subtypes. 

Keywords

Disease subtyping

Machine learning

Semi-parametric model

High-dimensional omics

Longitudinal data

Supervised clustering 

Co-Author(s)

Badri Vardarajan, Columbia University
Philip De Jager, Columbia University
David Bennett, Rush Alzheimer Disease Center
Yuanjia Wang, Columbia University
Annie Lee, Columbia University Irving Medical Center

First Author

Boyi Hu, Columbia University

Presenting Author

Boyi Hu, Columbia University