Tuesday, Aug 5: 2:00 PM - 3:50 PM
4122
Contributed Posters
Music City Center
Room: CC-Hall B
Main Sponsor
Section on Statistics in Genomics and Genetics
Presentations
Biomarkers are measurable indicators of biological processes and have wide biomedical applications, including disease screening and prognosis prediction. Candidate biomarkers can be screened in high-throughput settings, which allow simultaneous measurement of a large number of molecules. For binary biomarkers, the ability to detect a molecule may be hindered by background noise and variable signal strength, which lower the detection sensitivity to different extents for different target molecules in a sample-specific manner. This heterogeneity in detection sensitivity is often overlooked and leads to an inflated false positive rate. We propose a novel sensitivity-adjusted likelihood-ratio test (SALT), which properly controls the false positive rate and is more powerful than the unadjusted approach. We show that sample-and-feature-specific detection sensitivity can be well estimated from NanoString nCounter data, and that using the estimated sensitivity in SALT results in improved biomarker screening.
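For illustration, here is a minimal sketch of a sensitivity-adjusted likelihood-ratio test for a single binary biomarker, assuming the probability of an observed positive call in sample i of group g is s_i * pi_g, where s_i is a known (or pre-estimated) sample-specific detection sensitivity and pi_g is the true positivity rate. The Bernoulli model and function names below are illustrative assumptions, not the SALT implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2

def neg_loglik(pi, y, s):
    """Negative Bernoulli log-likelihood when P(positive call) = s_i * pi."""
    p = np.clip(s * pi, 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def sensitivity_adjusted_lrt(y1, s1, y2, s2):
    """Two-group LRT of H0: pi1 == pi2 vs H1: pi1 != pi2, adjusting for sensitivities."""
    fit = lambda y, s: minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6),
                                       args=(y, s), method="bounded")
    ll_h1 = -(fit(y1, s1).fun + fit(y2, s2).fun)          # separate positivity rates
    ll_h0 = -fit(np.concatenate([y1, y2]),
                 np.concatenate([s1, s2])).fun            # one shared positivity rate
    stat = 2 * (ll_h1 - ll_h0)
    return stat, chi2.sf(stat, df=1)                      # asymptotic chi-square, 1 df
```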
Keywords
High-throughput biomarker screening
Binary biomarker
Detection sensitivity
Sample-and-feature-specific sensitivity
Hypothesis testing
NanoString nCounter
Introduction: Rare genetic variation is considered a potential source of heritability in individuals with sporadic Alzheimer's Disease and related dementias (ADRD). The STAAR framework leverages multiple functional annotations of genetic variants and combines association statistics from multiple variant aggregation-based methods, including burden, SKAT, and ACAT-V, into a single measure of significance.
Method: Using whole genome sequencing data from the Alzheimer's Disease Sequencing Project (ADSP), we comprehensively examined the association of rare genetic variation with ADRD in 23,455 individuals (37% ADRD cases) and with cognitively healthy elder status in 13,292 individuals (13% cognitively healthy elders) from diverse populations via the STAAR framework.
Results: We identified several genes significantly associated with ADRD or cognitively healthy status. However, our analysis revealed several limitations within the STAAR framework when incorporating ultra-rare variants with dichotomous outcomes. To enhance the robustness of the framework, we proposed several computational refinements, including creating a burden of ultra-rare variants and employing more precise annotations to match the expected mechanism. After implementing the proposed modifications, the association with ADRD for ZNF200 was no longer statistically significant (α = 1×10⁻⁷), while TBX19, PLXNB2, CARD11, and LINC01880 remained significantly associated with cognitively healthy status.
Conclusion: We identified and addressed the computational limitations in the STAAR framework that could lead to potential spurious results for ultra-rare variant aggregates with an extremely low cumulative minor allele count. Our proposed refinements produced more robust results for associations with rare variants in the context of dichotomous outcomes.
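As background for the aggregation step mentioned in the Introduction, the sketch below shows the Cauchy combination rule (ACAT) commonly used to merge p-values from burden-, SKAT-, and ACAT-V-type tests into one omnibus p-value. Equal weights are an assumption here, and this is a generic illustration rather than the STAAR package code; production implementations also use a tail approximation for extremely small p-values.

```python
import numpy as np

def cauchy_combine(pvals, weights=None):
    """Combine p-values via the Cauchy combination (ACAT) rule."""
    p = np.asarray(pvals, dtype=float)
    w = np.ones_like(p) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    # map each p-value to a standard Cauchy quantile, then take the weighted sum
    t = np.sum(w * np.tan((0.5 - p) * np.pi))
    return 0.5 - np.arctan(t) / np.pi

print(cauchy_combine([2e-4, 0.03, 0.4]))   # omnibus p-value for three component tests
```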
Keywords
Rare variant analysis
STAAR framework
Alzheimer's disease
Co-Author(s)
Nancy Heard-Costa, Department of Medicine, Boston University School of Medicine; NHLBI Framingham Heart Study
Andy Rampersaud, Research Computing Services, Information Services & Technology, Boston University
Eden Martin, University of Miami-Miami Institute of Human Genomics
Adam Naj, Department of Biostatistics, Epidemiology, and Informatics, Department of Pathology and Laboratory Medicine
Bilcag Akgun, John P Hussman Institute for Human Genomics
Brian Kunkle, John P Hussman Institute for Human Genomics; John T Macdonald Department of Human Genetics
Gina Peloso
Anita DeStefano, Department of Biostatistics, Boston University School of Public Health
Xihao Li, University of North Carolina at Chapel Hill
Seung Hoan Choi, Department of Biostatistics, Boston University School of Public Health
First Author
Dongyu Wang, Department of Biostatistics, Boston University School of Public Health
Presenting Author
Dongyu Wang, Department of Biostatistics, Boston University School of Public Health
Large biobank studies, such as the UK Biobank, provide unprecedented opportunities to predict various phenotypes from rich genome-wide association study (GWAS) data collected in massive populations. The adoption of linear mixed models (LMMs) to predict phenotypes was a significant milestone and a major success in the history of GWAS. Nevertheless, classic LMM-based methods for GWAS data often fail to account for the dependence structure between single nucleotide polymorphisms (SNPs). Meanwhile, deep learning has recently demonstrated remarkable success in computer vision, protein structure prediction, and functional genomics. Deep learning can model complex non-linear relationships and exploit the dependence structure among features. It is therefore of great interest to compare the predictive capabilities of classic LMM-based methods and deep learning models on GWAS data. Here, we systematically compare the performance of LMM-based methods and deep learning models in predicting a dozen phenotypes using the UK Biobank data and discuss the strengths and limitations of both approaches.
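As a toy version of the comparison described above, the sketch below contrasts ridge regression on genotypes (a stand-in for LMM/GBLUP prediction, to which it is equivalent for a fixed variance ratio) with a small feed-forward network on simulated SNP data. The simulation, architecture, and penalty are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 2000, 500                                   # individuals, SNPs (toy scale)
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
beta = rng.normal(0, 0.05, size=p)
y = X @ beta + rng.normal(0, 1.0, size=n)          # additive phenotype plus noise

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
lmm_like = Ridge(alpha=float(p)).fit(Xtr, ytr)     # penalty ~ p * sigma_e^2 / sigma_g^2 under ridge-BLUP equivalence
dnn = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0).fit(Xtr, ytr)
print("ridge (LMM-like) test R^2:", lmm_like.score(Xte, yte))
print("MLP              test R^2:", dnn.score(Xte, yte))
```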
Keywords
Genome-Wide Association Studies (GWAS)
Linear Mixed-Effect Models
Deep Learning
Biobank
Many studies of human microbiome epidemiology have focused on the effects of health outcomes and exposures on the microbiome, or the effects of the microbiome on health outcomes. However, there is increasing interest in understanding more complex relationships in which exposures alter microbiome composition, which in turn affects health outcomes (i.e., the microbiome "mediates" the exposure's effect on health). Such hypotheses can be tested by statistical mediation analysis, but typical methods are not appropriate for microbiome data due to zero inflation, compositionality, and high dimensionality. Using realistic simulated microbiome data, we compared the performance of (1) low-dimensional mediation methods, (2) high-dimensional, non-compositional mediation methods, and (3) specialized methods for the microbiome under differing circumstances. We further compared these methods in two real-world datasets assessing the effect of diet on cardiometabolic disease. We make recommendations on the best methods for estimating the total direct effect and the total and component indirect effects. Notably, no single method performed best in all tests, highlighting the nuance of microbiome mediation analyses and the need for new methods.
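As a concrete example of category (1) above, the sketch below runs a low-dimensional product-of-coefficients mediation analysis for a single CLR-transformed taxon, with a naive bootstrap interval for the indirect effect. The CLR pseudo-count, variable names, and bootstrap are illustrative assumptions; the specialized microbiome methods compared in the abstract are not reproduced here.

```python
import numpy as np
import statsmodels.api as sm

def clr(counts, pseudo=0.5):
    """Centered log-ratio transform of a samples x taxa count matrix."""
    x = np.log(counts + pseudo)
    return x - x.mean(axis=1, keepdims=True)

def indirect_effect(exposure, taxon_clr, outcome, n_boot=1000, seed=0):
    """Product-of-coefficients indirect effect: exposure -> taxon -> outcome."""
    rng = np.random.default_rng(seed)
    n = len(exposure)
    def ab(idx):
        a = sm.OLS(taxon_clr[idx], sm.add_constant(exposure[idx])).fit().params[1]
        X = sm.add_constant(np.column_stack([exposure[idx], taxon_clr[idx]]))
        b = sm.OLS(outcome[idx], X).fit().params[2]
        return a * b
    est = ab(np.arange(n))
    boot = [ab(rng.integers(0, n, n)) for _ in range(n_boot)]
    return est, np.percentile(boot, [2.5, 97.5])

# usage, e.g.: indirect_effect(diet, clr(otu_counts)[:, j], cardio_score)
```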
Keywords
human microbiome
mediation analysis
microbiome epidemiology
metagenomics
Uncovering the mechanism of action (MOA) of molecules is a pivotal aspect of drug discovery. Current methods, which rely on gene signatures or structural similarities to predict MOA, face substantial challenges, including the intricacies of gene expression and "activity cliffs." To overcome these hurdles, we propose a novel approach named Drug Differential Modular Similarity (Drug-DMsim), which is designed to model the effects of drugs on the gene regulatory network (GRN) and to infer MOAs from known drugs. This approach involves: (1) employing mutual information and partial correlation to independently reconstruct GRNs, (2) generating differential modularity scores to quantify how strongly a GRN divides into distinct modules, and (3) utilizing a dimensionality reduction technique to map molecules onto a 2D space, facilitating the identification of patterns and clusters and enhancing the interpretability and analysis of relationships between molecules. By applying the proposed approach to LINCS datasets, we identified potential new drug targets. This novel approach advances our understanding of the molecular mechanisms of drugs and enables faster drug discovery.
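To make steps (1) and (2) more concrete, here is a rough sketch that reconstructs a network from partial correlations (one of the two reconstruction tools named above) and scores how strongly it partitions into modules. The precision-matrix estimator, the edge threshold, and the greedy community detection are illustrative assumptions, not the Drug-DMsim differential modularity score.

```python
import numpy as np
import networkx as nx
from networkx.algorithms import community

def partial_correlation(expr):
    """expr: samples x genes. Partial correlations from the (pseudo-)inverse covariance."""
    prec = np.linalg.pinv(np.cov(expr, rowvar=False))
    d = np.sqrt(np.outer(np.diag(prec), np.diag(prec)))
    pcor = -prec / d
    np.fill_diagonal(pcor, 1.0)
    return pcor

def modularity_score(pcor, threshold=0.15):
    """Threshold |partial correlation| into an adjacency matrix and score its best partition."""
    adj = (np.abs(pcor) > threshold).astype(int)
    np.fill_diagonal(adj, 0)
    g = nx.from_numpy_array(adj)
    communities = community.greedy_modularity_communities(g)
    return community.modularity(g, communities)

rng = np.random.default_rng(1)
expr = rng.normal(size=(100, 40))          # toy samples x genes matrix
print(modularity_score(partial_correlation(expr)))
```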
Keywords
Drug discovery
mechanism of action (MOA)
gene regulatory network (GRN)
Differential modularity
Dimensionality reduction
LINCS
Co-Author(s)
Komlan Atitey, National Institute of Environmental Health Science (NIEHS)
Benedict Anchang, NIEHS
First Author
Jiaqi Li, National Institute of Environmental Health Sciences
Presenting Author
Jiaqi Li, National Institute of Environmental Health Sciences
Sparse canonical correlation analysis (SCCA) identifies sparse linear combinations between two sets of features that are highly correlated with each other. While multiple SCCA methods extend this framework to more than two datasets, they assume measurements of different features within the same population. Here, we propose an extension of SCCA designed for settings with four data matrices derived from two distinct populations, each with two different feature sets. The correlation maximization problem is reframed as a minimization problem and the original canonical weights are decomposed into two separate components that capture the shared and unique variance for each dataset. Via simulations, we demonstrate the improved performance of our method to recover the true canonical weights in comparison to naïve methods that disregard either the shared or unique components. For real data analysis, we apply our method to integrate two single-cell multiomic datasets of peripheral blood mononuclear cells with simultaneous measures of both RNA expression and chromatin accessibility, benchmarking its performance against widely used single-cell integration pipelines such as Seurat and Signac.
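For orientation, the sketch below is a bare-bones two-view sparse CCA via alternating soft-thresholded power iterations (in the spirit of penalized matrix decomposition). The four-matrix, two-population extension described above would further split each weight vector into shared and unique components; that decomposition is not shown, and the penalty values are placeholders.

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_cca(X, Y, lam_x=0.1, lam_y=0.1, n_iter=100, seed=0):
    """X: n x p, Y: n x q, both column-centered. Returns sparse weight vectors (u, v)."""
    C = X.T @ Y / X.shape[0]                       # cross-covariance matrix
    u = np.random.default_rng(seed).normal(size=C.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        v = soft_threshold(C.T @ u, lam_y)
        v /= max(np.linalg.norm(v), 1e-12)
        u = soft_threshold(C @ v, lam_x)
        u /= max(np.linalg.norm(u), 1e-12)
    return u, v
```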
Keywords
Sparse Canonical Correlation Analysis
Data Integration
Variance Decomposition
Single-Cell Multiomics
Recent single-cell CRISPR screening experiments have combined the advances of genetic editing and single-cell technologies, leading to transcriptome-scale readouts of responses to perturbations at single-cell resolution. An outstanding question is how to efficiently identify heterogeneous effects of perturbations using these technologies. Here we present CausalPerturb, which leverages tools from causal analysis to dissect the heterogeneous landscape of perturbation effects. CausalPerturb disentangles transcriptome changes introduced by perturbations from those reflecting inherent cell-state variations. It provides nonparametric inference of perturbation effects, enabling a range of downstream tasks including genetic interaction analysis, perturbation clustering, and prioritization. We evaluated CausalPerturb on simulated and real datasets and demonstrated its ability to characterize latent confounding factors and discern heterogeneous perturbation effects. Applying CausalPerturb revealed novel genetic interactions between erythroid differentiation drivers; in particular, it pinpointed the role of the synergistic interaction between CBL and CNN1 in the S phase.
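As a simple point of reference for the "disentangling" idea (not the CausalPerturb estimator), the sketch below estimates a perturbation's effect on one gene by regression adjustment for latent cell-state covariates, taken here as principal components of the expression matrix with the target gene removed. The number of components and the use of OLS with robust standard errors are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA

def adjusted_effect(expr, perturbed, gene_idx, n_state_pcs=10):
    """expr: cells x genes (log-normalized); perturbed: 0/1 indicator per cell."""
    # latent cell-state covariates: PCs of expression excluding the target gene
    state = PCA(n_components=n_state_pcs).fit_transform(np.delete(expr, gene_idx, axis=1))
    X = sm.add_constant(np.column_stack([perturbed.astype(float), state]))
    fit = sm.OLS(expr[:, gene_idx], X).fit(cov_type="HC3")   # heteroskedasticity-robust SEs
    return fit.params[1], fit.bse[1]                         # effect estimate and its SE
```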
Keywords
single-cell RNA-seq
genetic perturbation
causal inference
heterogeneous effects
deep learning
Visual scoring is widely used in biomedical research to translate complex biological traits into ordered datasets suitable for hypothesis testing. Although advanced statistical methods exist for analyzing ordered data, use of ordinal methods by researchers remains limited. Parameter estimates from ordinal regression models, such as odds ratios or differences in probits, can hinder adoption due to their interpretive complexity. Recently, summary measures for ordinal regression models have been proposed to improve interpretability. In this work, we demonstrate the application of the γ (gamma) and ∆ (delta) ordinal superiority measures to more complex experimental designs, including interactions and multicategorical explanatory variables. Using an example dataset on cellular stress response phenotypes, we illustrate how these measures can be utilized in complex experimental designs to yield clear, meaningful interpretations of ordinal regression analyses. By demonstrating real-world applicability, this work provides a practical resource for biological researchers working with ordered response data and promotes broader adoption of ordinal regression techniques in biomedical studies.
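For readers new to these summaries, the sketch below computes γ and ∆ for a two-group comparison from a fitted cumulative probit model, using the latent-variable relations γ = Φ(β/√2) and ∆ = 2γ − 1, where β is the group coefficient. The simulated scores, the fit via statsmodels' OrderedModel, and the category cut-points are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=300)                    # two experimental groups
latent = 0.8 * group + rng.normal(size=300)             # latent stress-response severity
y = np.digitize(latent, bins=[-0.5, 0.5, 1.5])          # four ordered visual-score categories

fit = OrderedModel(y, group.reshape(-1, 1), distr="probit").fit(method="bfgs", disp=False)
beta = fit.params[0]                                    # group coefficient (exog params come first)
gamma = norm.cdf(beta / np.sqrt(2))                     # P(Y_group1 > Y_group0) + 0.5 * P(equal)
delta = 2 * gamma - 1
print(f"gamma = {gamma:.3f}, delta = {delta:.3f}")
```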
Keywords
Ordinal data
Ordinal Regression
Cumulative Link Models
Interaction Terms
Proportional Odds
Ordinal Superiority Measure
Integrating information across correlated conditions can improve statistical power by exploiting shared underlying mechanisms. Here, we are concerned with the problem of identifying which variables, among a large number of them, respond to two different conditions. Rather than treating this as two separate multiple-comparison problems, we propose to jointly estimate three proportions: the proportion of variables responding to each of the two conditions and the proportion responding to both conditions, a scenario not uncommon in the biological sciences. By utilizing the shared information, our method achieves higher statistical power. The advantage of our method is illustrated with two examples: (1) identifying genes whose expression levels in the brain are altered by radiation exposure but restored by a treatment designed to mitigate the harm caused by radiation therapy, and (2) detecting DNA variants associated with a psychiatric disorder using information from a related disorder.
Keywords
statistical power
false discovery rate
gene expression analysis
high-dimension
In this work, we derive a method for visualizing spatial population structure using inverse instantaneous coalescent rate (IICR) curves. Unlike traditional approaches such as EEMS, which model genetic variation as a function of migration rates and approximate its expectation using resistance distances, our method introduces a fundamentally different perspective by focusing on the coalescent process. The IICR curve quantifies the rate at which lineages coalesce as a function of time, providing a framework for inferring population structure. Our approach is based on a stepping-stone model, and we model each pair of samples as an independent Markov process with an extended joint state space that accounts for coalescence. By utilizing efficient procedures to compute the matrix exponential, we derive the distribution of coalescent times and the expected IICR curves with high computational efficiency. This enables us to infer migration surfaces and visualize population structure.
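A minimal instance of this construction: for two lineages in a symmetric two-deme model, the coalescent Markov chain has states (same deme, different demes, coalesced), and the expected IICR at time t equals P(T > t) / f_T(t), both obtained from the matrix exponential of the generator. The time units (2N generations) and scaled migration rate m below are illustrative; the full stepping-stone machinery described in the abstract is not reproduced.

```python
import numpy as np
from scipy.linalg import expm

def iicr_two_demes(times, m=1.0):
    """Expected IICR curve for two lineages sampled in the same deme of a 2-deme model."""
    # generator over states (same deme, different demes, coalesced); coalescence rate 1 within a deme
    Q = np.array([[-(1.0 + 2 * m), 2 * m, 1.0],
                  [2 * m, -2 * m, 0.0],
                  [0.0, 0.0, 0.0]])
    out = []
    for t in times:
        p = expm(Q * t)[0]             # state distribution at time t, starting in "same deme"
        survival = p[0] + p[1]         # P(not yet coalesced) = P(T > t)
        density = p[0] * 1.0           # f_T(t): coalescence only occurs from the "same deme" state
        out.append(survival / density)
    return np.array(out)

print(iicr_two_demes(np.linspace(0.01, 5.0, 5), m=0.5))
```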
Keywords
migration surface
demographic inference
population genetics
N6-Methyladenosine (m6A) is the most abundant type of mRNA methylation and is most widely measured by methylated RNA immunoprecipitation sequencing (MeRIP-seq). In MeRIP-seq, an immunoprecipitation (IP) sample and a paired control (input) sample are sequenced for each biological sample. Methylated regions are identified as peaks showing increased counts in the IP sample versus the input. We report that technical bias in sequencing can vary substantially in the IP and input samples depending on the local sequence context. Current sequencing depth-based normalization does not appropriately account for the varying technical bias along the transcriptome and leads to inaccurate identification of m6A regions. We describe a method to estimate a local size factor that reflects the RNA sequence context and show that peak calling using these region-specific size factors identifies more accurate peak regions.
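To illustrate the idea of a region-specific size factor (not the authors' estimator), the sketch below estimates a local factor as the median input/IP count ratio over bins flanking a candidate region and then tests IP enrichment against the locally rescaled input with a simple Poisson tail probability. The window size, pseudo-counts, and Poisson test are illustrative assumptions.

```python
import numpy as np
from scipy.stats import poisson

def local_size_factor(ip_counts, input_counts, center, half_window=50):
    """Median input/IP ratio over bins around `center` (background-dominated flanks)."""
    lo, hi = max(0, center - half_window), min(len(ip_counts), center + half_window)
    ratios = (input_counts[lo:hi] + 0.5) / (ip_counts[lo:hi] + 0.5)
    return np.median(ratios)

def peak_pvalue(ip_counts, input_counts, center):
    """P(IP count >= observed) under a Poisson whose mean is the locally rescaled input."""
    sf = local_size_factor(ip_counts, input_counts, center)
    expected = input_counts[center] / sf
    return poisson.sf(ip_counts[center] - 1, mu=max(expected, 1e-8))
```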
Keywords
transcription
RNA methylation
m6A
MeRIP-seq
We propose a nonparametric method to denoise microbiome metagenomics sequencing count matrices. The goal of denoising is to recover the non-zero expected abundances of rare taxa and reduce the variance of prevalent taxa. The count matrices are dichotomized into a series of binary matrices given a sequence of thresholds. We estimate the probability of each count matrix entry being larger than each threshold by taking products of conditional probabilities. We develop a novel matrix factorization algorithm for the low-rank representation of conditional probabilities. We calculate the denoised count based on the empirical distribution formed by the estimated probabilities. Simulations show that our method is better than parametric competitors at recovering accurate microbiome compositions. Our denoising method can improve downstream analyses such as training prediction models and microbiome network analysis.
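The sketch below mimics the overall recipe at a toy level, assuming a truncated-SVD smoother in place of the conditional-probability matrix factorization described above: dichotomize at a grid of thresholds, estimate each exceedance-probability matrix with a low-rank approximation clipped to [0, 1], and recover a denoised expectation via E[X] = sum over t of P(X > t).

```python
import numpy as np

def low_rank_prob(binary, rank=3):
    """Low-rank (truncated SVD) approximation of a binary matrix, clipped to [0, 1]."""
    u, s, vt = np.linalg.svd(binary.astype(float), full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return np.clip(approx, 0.0, 1.0)

def denoise_counts(counts, rank=3):
    """Denoised expected counts via E[X] = sum_t P(X > t) over t = 0, ..., max(count) - 1."""
    tmax = int(counts.max())
    probs = [low_rank_prob(counts > t, rank=rank) for t in range(tmax)]
    return np.sum(probs, axis=0)
```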
Keywords
Microbiome metagenomics
Denoise
Binarization
Matrix factorization
Nonparametric
Spatial transcriptomics is an emerging and transformative technique that provides high-resolution insights into gene expression patterns across diverse cell populations. However, because most single-cell-resolution spatial profiling methods can only measure a limited set of genes, it is crucial to select a gene panel that optimally captures the biological information. Methods for optimal gene panel design are still lacking. Here, we introduce a novel method, optimal reconstruction gene selection for spatial transcriptomics (ReconST), which incorporates a specifically designed autoencoder model to identify a minimal yet highly informative set of genes. By training our model on single-cell RNA sequencing (scRNA-seq) data, we show that the selected gene panel optimally reconstructs the full transcriptome. We validate our approach on paired scRNA-seq and MERFISH data, demonstrating improved reconstruction accuracy and a clear representation of spatial patterns. ReconST provides a practical and explainable framework for optimal gene panel selection, advancing the use of spatial transcriptomics to deepen our understanding of gene expression in tissue contexts.
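A much simpler surrogate for the same objective (reconstructing the full transcriptome from a small panel) is greedy forward selection with ridge reconstruction, sketched below. ReconST itself uses an autoencoder; the greedy search, ridge penalty, and panel size here are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import Ridge

def greedy_panel(expr, panel_size=20, alpha=1.0):
    """expr: cells x genes (log-normalized). Greedily pick genes that best reconstruct all genes."""
    n_genes = expr.shape[1]
    selected = []
    for _ in range(panel_size):
        best_gene, best_score = None, -np.inf
        for g in range(n_genes):
            if g in selected:
                continue
            panel = expr[:, selected + [g]]
            # multi-output ridge: reconstruct every gene from the candidate panel
            score = Ridge(alpha=alpha).fit(panel, expr).score(panel, expr)
            if score > best_score:
                best_gene, best_score = g, score
        selected.append(best_gene)
    return selected
```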
Keywords
Spatial Transcriptomics
Gene Panel Selection
Self-supervised learning
Deep learning
Regularization
We explored the effect of genotype and dose on the reaction of mice when exposed to different compounds present in various foods. To do so, Raman spectra of mice were obtained at baseline (prior to exposure) and on at least two occasions post-exposure. As a first step, we fitted a functional ANOVA (FANOVA) model to the spectral responses. Challenges with this type of data include the presence of long-range dependence and high dimensionality. To address this, we transformed the discretized FANOVA model to the wavelet domain, decorrelating and regularizing the inputs while preserving the model structure. Soft-thresholding based on the median absolute deviation is used for noise reduction, and the inverse wavelet transform reconstructs refined estimates in the original domain. This wavelet-based ANOVA (WANOVA) enhances the interpretability of Raman spectral data, offering a novel framework for detecting food compound interactions with genetic variations, with potential implications for personalized nutrition and biomedical research.
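The wavelet step can be illustrated with PyWavelets: decompose each spectrum, soft-threshold the detail coefficients at the universal threshold with a MAD-based noise estimate, and reconstruct. The 'db4' wavelet, decomposition level, and universal threshold are illustrative defaults rather than the settings used in the study.

```python
import numpy as np
import pywt

def wavelet_denoise(spectrum, wavelet="db4", level=5):
    """Soft-threshold wavelet denoising of a 1D Raman spectrum."""
    coeffs = pywt.wavedec(spectrum, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745           # MAD noise estimate from the finest scale
    thresh = sigma * np.sqrt(2 * np.log(len(spectrum)))      # universal threshold
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(spectrum)]
```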
Keywords
Raman Spectroscopy
Wavelet Transform
WANOVA
FANOVA
Spatial transcriptomics is revolutionizing our understanding of complex biological systems by enabling the analysis of RNA transcriptomes with precise spatial resolution. Sequence-based spatial transcriptomics technologies, such as Visium from 10X Genomics, provide critical insights into tissue architecture and cellular interactions within their native microenvironments. However, a significant challenge in spatial transcriptomics is the phenomenon of spot-swapping, in which RNA molecules are not confined to their original locations on the tissue slide, introducing noise and inaccuracies into the data. To address this problem, we propose SpaDiff, which models spot-swapping via a diffusion process. By applying Langevin MCMC, our model emulates the RNA molecules' diffusion and reverse-diffusion processes, offering a more effective and generalizable solution to data denoising in spatial transcriptomics. By applying SpaDiff to multiple synthetic and real datasets, we show that it not only retains the original UMI counts but also enhances the spatial specificity of biomarker gene expression, thereby improving the accuracy of subsequent analyses and the interpretation of biological processes.
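The sampling step underlying this kind of score-based denoising is unadjusted Langevin dynamics, sketched below with a placeholder Gaussian score (gradient of the log density). SpaDiff's actual score model, its reverse-diffusion schedule, and its handling of UMI counts are not reproduced here.

```python
import numpy as np

def langevin_sample(x0, score_fn, step=1e-2, n_steps=500, seed=0):
    """Unadjusted Langevin dynamics: x <- x + (step/2) * score(x) + sqrt(step) * noise."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + 0.5 * step * score_fn(x) + np.sqrt(step) * rng.normal(size=x.shape)
    return x

# placeholder score for a standard normal target: grad log p(x) = -x
samples = langevin_sample(np.full(1000, 3.0), score_fn=lambda x: -x)
print(samples.mean(), samples.std())
```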
Keywords
Sequence-based Spatial Transcriptomics
Data Denoising
Diffusion Process
Score Function
Langevin MCMC
In genomics, differential expression and abundance analyses are challenging due to the compositional structure of the data. These data provide information only about the relative abundance of taxa or the relative expression of genes, not absolute amounts. While many authors have approached this problem through data normalizations, we have shown that such methods are flawed, as they imply strong, often implausible assumptions about total microbial load or total gene expression. Even slight errors in these assumptions often lead to Type-I and/or Type-II error rates in excess of 70%. Here, we show similar flaws with currently available sparse estimators, which attempt to overcome compositional problems by assuming that few taxa (or genes) change in abundance (or expression) between conditions. Instead, we show that a novel sparse Bayesian Partially Identified Model overcomes the limitations of existing methods by accounting for uncertainty in the sparsity assumptions themselves. We prove the consistency of our novel estimator. Moreover, through both simulated and real data analyses, we show that our methods can drastically reduce Type-I and Type-II errors compared to existing methods.
Keywords
Compositional Data
Bayesian Partially Identified Model
Sparsity Assumption
Type-I and Type-II Errors
Uncertainty Quantification
We model mediation of BIN1 genetic risk (rs6733839) on functional connectivity (FC) through tau pathology in Alzheimer's disease, comparing cognitively normal (CN, n=104) and mild cognitive impairment (MCI, n=101) groups. Using baseline data from ADNI with temporally ordered biomarkers (preceding imaging), we identified FC components (IC1–IC10) via ICA and found IC5 (Dorsal Attention-Default Mode/Visual networks) associated with Aβ (p = 0.00027) and with group-dependent tau effects (IC5×Group interaction: p = 0.002). We then tested SNP→tau→IC5 paths using multi-group mediation, allowing group-specific slopes. In CN, the BIN1 risk allele (T) was linked to reduced tau (β=−0.12, p=0.03) and to marginal indirect preservation of IC5 (β=0.16, p=0.08). In MCI, direct SNP effects dominated (β=−0.39, p=0.005), with no tau mediation. Paradoxically, the T allele was associated with lower tau (β=−0.11, p=0.04) despite being an AD risk variant, suggesting stage-dependent BIN1 isoform effects (early clearance vs late aggregation). Temporal precedence (biomarkers pre-imaging) strengthens causal plausibility. Results suggest IC5 as a preclinical resilience marker and highlight mediation pathways that shift with disease stage.
Keywords
multi-group SEM
mediation analysis
Alzheimer’s disease
functional connectivity
BIN1
tau pathology
Co-Author(s)
Rui Chen, Vanderbilt University
Ke Xu, Vanderbilt University Medical Center
Xue Zhong, Vanderbilt University Medical Center
Yuting Tan, Vanderbilt University
Anshul Tiwari, Vanderbilt University
Zhexing Wen, Emory University
Bingshan Li, Vanderbilt University
Hakmook Kang, Vanderbilt University
First Author
Yan Yan, Vanderbilt University
Presenting Author
Yan Yan, Vanderbilt University
Disease subtyping using unsupervised clustering of omics data often results in subtypes with limited clinical relevance, while existing supervised methods are not suitable for longitudinal data. To address this, we developed a novel latent generative model for disease subtyping that integrates longitudinal clinical data and high-dimensional omics data. Our method comprises two components: a multinomial logistic regression using omics data to define subtypes, and a longitudinal association model capturing time-varying relationships between clinical variables. These are integrated via a mixture regression. We incorporate omics feature selection and smooth estimation of time-varying associations into the model fitting, and a multiplier bootstrap is used to construct confidence intervals for the time-varying effects. We validated our method through simulations and applied it to 1,020 adults from the Religious Orders Study and Memory and Aging Project (ROS/MAP), two longitudinal cohorts for investigating Alzheimer's disease (AD). Our approach captures the time-varying effects of AD risk factors and enables accurate inference on these effects, leading to the detection of clinically meaningful subtypes.
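The multiplier bootstrap mentioned above can be illustrated in a much simpler setting: reweight each subject's estimating-equation contribution with mean-one multipliers to obtain a confidence interval for a regression coefficient. The exponential multipliers, OLS working model, and percentile interval below are illustrative assumptions, not the authors' mixture-model procedure.

```python
import numpy as np

def multiplier_bootstrap_ci(X, y, coef_idx=1, n_boot=2000, level=0.95, seed=0):
    """Percentile CI for one coefficient via a multiplier (wild) bootstrap of weighted OLS."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xc = np.column_stack([np.ones(n), X])
    beta_hat = np.linalg.lstsq(Xc, y, rcond=None)[0]
    draws = []
    for _ in range(n_boot):
        w = rng.exponential(1.0, size=n)                  # mean-one random multipliers
        wX = Xc * w[:, None]
        beta_b = np.linalg.solve(wX.T @ Xc, wX.T @ y)     # solves the reweighted normal equations
        draws.append(beta_b[coef_idx])
    lo, hi = np.percentile(draws, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return beta_hat[coef_idx], (lo, hi)
```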
Keywords
Disease subtyping
Machine learning
Semi-parametric model
High-dimensional omics
Longitudinal data
Supervised clustering