Tuesday, Aug 6: 10:30 AM - 12:20 PM
6032
Contributed Posters
Oregon Convention Center
Room: CC-Hall CD
Main Sponsor
Section on Statistics in Genomics and Genetics
Presentations
MicroRNAs (miRNAs) are promising biomarker candidates for their association with a wide range of diseases and their presence in easy-to-obtain biofluids. Since many extracellular miRNAs have concentrations that are often below or near the limit of detection, it is more appropriate to evaluate them as binary biomarkers than as continuous or count variables. As with other technologies, the binary detection of a miRNA molecule is influenced by technical variations, which we refer to as the sample-specific sensitivity. We propose a new likelihood ratio test that accounts for the sample-specific sensitivity and compare it to a binomial test that assumes all samples have the same sensitivity, equal to one. We focus on the NanoString nCounter data as an example for estimating the sample-specific sensitivities by pooling information across all features. With simulations, we demonstrate that, when the sample qualities are not balanced between comparison groups, the proposed test remains valid with stronger statistical power and controlled false discovery rate. Additionally, we provide applications of the new test procedure to publicly available nCounter data sets from the GEO database.
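The abstract does not give the form of the likelihood, so the following is only a minimal sketch under a stated assumption: sample i in group g is detected with probability s_i * p_g, where s_i is a known sample-specific sensitivity and p_g is the group-level presence probability. The function names and the ternary-search optimizer are illustrative choices, not the authors' implementation. Setting every s_i to 1 recovers an ordinary two-proportion likelihood ratio test.

```python
import math

def _loglik(p, detected, sens):
    """Bernoulli log-likelihood; sample i is detected with probability sens[i] * p."""
    total = 0.0
    for y, s in zip(detected, sens):
        q = min(max(s * p, 1e-12), 1 - 1e-12)  # clamp so the logs stay finite
        total += math.log(q) if y else math.log(1 - q)
    return total

def _max_loglik(detected, sens, iters=200):
    """Maximize the concave log-likelihood over p in (0, 1) by ternary search."""
    lo, hi = 1e-9, 1 - 1e-9
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if _loglik(m1, detected, sens) < _loglik(m2, detected, sens):
            lo = m1
        else:
            hi = m2
    return _loglik((lo + hi) / 2, detected, sens)

def lrt_detection(y1, s1, y2, s2):
    """LRT statistic and chi-square(1) p-value for H0: p1 == p2 (inputs are lists)."""
    ll_alt = _max_loglik(y1, s1) + _max_loglik(y2, s2)
    ll_null = _max_loglik(y1 + y2, s1 + s2)
    stat = max(0.0, 2 * (ll_alt - ll_null))
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square(1) survival function
    return stat, p_value
```

In this sketch a group with lower sensitivities is not penalized for fewer detections, which is the imbalance the abstract describes; how the sensitivities are estimated from nCounter data is not covered here.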
Keywords
Sample-Specific Sensitivity
Binary Biomarker
MicroRNA
NanoString nCounter
Statistical Test
Biomarker Identification
Abstracts
Observational studies have reported high comorbidity between type 2 diabetes (T2D), obesity, and severe COVID-19. However, the causality among T2D, obesity, and severe COVID-19 has not yet been fully validated. We performed genetic correlation and Mendelian randomization (MR) analyses to assess genetic relationships and potential causal associations of T2D and obesity with two COVID-19 outcomes: SARS-CoV-2 infection and COVID-19 severity. Our study incorporated two-sample MR, one-sample MR, and nonlinear MR analyses, utilizing summary-level and individual-level data from the GIANT and DIAGRAM consortia, and the UK Biobank. We identified a high genetic overlap between T2D and each of the COVID-19 outcomes. The two-sample MR analyses indicate that genetic liability to T2D confers a causal effect on COVID-19 severity (beta=0.1500, p=0.0012), and genetic liability to body mass index (BMI) exerts a causal effect on COVID-19 severity (beta=0.3958, p=4.36e-18). The results from the one-sample and nonlinear MR analyses suggest similar causal relationships of T2D and BMI with COVID-19 outcomes. Our analyses conclude that T2D and obesity are causal risk factors for COVID-19 severity.
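For readers unfamiliar with two-sample MR, a common estimator is the fixed-effect inverse-variance-weighted (IVW) combination of per-variant summary statistics. The abstract does not state which estimators were used, so the sketch below is an assumption for illustration only, and the function and argument names are hypothetical.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate of the causal effect from per-variant summary data."""
    num = sum(bx * by / se**2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx**2 / se**2 for bx, se in zip(beta_exposure, se_outcome))
    beta = num / den
    se = math.sqrt(1.0 / den)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return beta, se, p_value
```

Each variant contributes its ratio estimate (outcome effect over exposure effect), weighted by the precision of the outcome association; this ignores uncertainty in the exposure effects and any pleiotropy.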
Keywords
Mendelian Randomization
Causal Inference
Type 2 diabetes
Obesity
COVID-19
Abstracts
We introduce a Bayesian factor model to perform fast and interpretable fine-mapping on hundreds to thousands of traits simultaneously to identify causal genetic variants from genome wide association study (GWAS) summary statistics. Our model decomposes genetic effects into an indirect effect mediated by latent biological processes and a direct effect, where the indirect effect helps model the shared genetic origin of traits and the direct effect captures trait-specific genetic variation. Critically, our model and estimation pipeline facilitate the use of biologically informed priors, like metabolic pathway information in metabolomics or phylogenetic trees in microbiomics, which beget interpretable inference. We derive the statistical properties of our estimators by studying their asymptotic properties as the number of samples, traits, and genetic variants go to infinity, and apply our method to real metabolite GWAS summary statistics to jointly fine-map more than 700 metabolites. We show our method is powerful enough to recapitulate results from a study with 20 times our sample size, and is able to make inferences that would otherwise be impossible with current analysis pipelines.
Keywords
Multi-trait fine-mapping
Metabolite genome wide association study
Bayesian statistics
Factor analysis
Pleiotropic
Metabolomic analysis
Abstracts
One of the main aims of data modeling is to find the best classifier for new cases. For example, based on the gene expression of a new case, we can classify it as one of two groups. The high dimensionality of the dataset is the main restriction for finding an accurate and non-complex model. Therefore, the similar genes in the two groups are removed to reduce the dimension. Candidate genes are selected according to the family-wise error rate (FWER) and used to find the best classifier. Zhang and Deng [1] proposed an additional step in removing the genes with redundant or highly correlated information before finding the best classifier. They find more effective and non-redundant genes using the Bayes error rate (BER). They used the Bhattacharyya bound to estimate BER because BER was not computable at that time. They show that this additional step improves classification accuracy. In this work, we improve the classification accuracy by computing the exact BER [2] and using a uniformly most powerful unbiased test [3] for calculating FWER.
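To illustrate why an exact BER can matter, the sketch below compares the Bhattacharyya upper bound with the exact Bayes error in the one setting where the latter has a simple closed form: two univariate Gaussian classes with equal priors and a common standard deviation. This is an assumption chosen for illustration; the exact-BER method cited as [2] is not described in the abstract and is not reproduced here.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def bhattacharyya_bound(mu1, sd1, mu2, sd2, prior1=0.5):
    """Bhattacharyya upper bound on the Bayes error for two univariate Gaussians."""
    var_sum = sd1**2 + sd2**2
    dist = (0.25 * (mu1 - mu2)**2 / var_sum
            + 0.5 * math.log(var_sum / (2 * sd1 * sd2)))
    return math.sqrt(prior1 * (1 - prior1)) * math.exp(-dist)

def exact_ber_equal_variance(mu1, mu2, sd):
    """Exact Bayes error for two equal-prior Gaussians sharing one standard deviation."""
    return normal_cdf(-abs(mu1 - mu2) / (2 * sd))
```

For class means 0 and 2 with unit standard deviation, the bound is about 0.303 while the exact error is about 0.159, so ranking genes by the bound can differ from ranking by the true error.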
Keywords
Bayes error rate
Microarray data
Gene selection
Classification
Permutation test
Uniformly most powerful unbiased test
Abstracts
To understand heart aging at the single-cell level, we employed single-cell dual omics (scRNA and scATAC) in non-myocytes (non-CMs) from young (3m), middle-aged (12m), and elderly (24m) mice. Non-CMs, vital in heart development, physiology, and pathology, are understudied compared to cardiomyocytes. Our analysis revealed aging response heterogeneity among non-CM cell types. Immune cells, notably macrophages and neutrophils, showed significant aging alterations, while endothelial cells displayed moderate changes. We identified distinct aging signatures within the cell type, including differential gene expression and transcription factor activity, along with motif variation. Sub-cluster analysis revealed intra-cell type heterogeneity, characterized by diverse aging patterns. The senescence-associated secretory phenotype (SASP) emerged as a key aging-related phenotype. Moreover, aging significantly influenced cell-cell communication, especially impacting a fibroblast sub-cluster, Fib.Erbb4. This study elucidates the complex cellular and molecular landscape of cardiac aging in non-CMs, highlighting their importance in heart aging and offering potential therapeutic avenues.
Keywords
Cardiac Aging
Single-cell Dual omics
Non-myocytes
Aging Heterogeneity
Senescence-Associated Secretory Phenotype (SASP)
Fibroblast Sub-Cluster
Abstracts
We propose a flexible family of Bayesian multinomial logistic-normal additive Gaussian process regression (MLN) models for estimating additive linear and non-linear effects in microbiome and gene expression studies. This family has a marginally latent matrix-t process (MLTP) form, facilitating efficient and accurate inference via a particle filter with marginal Laplace approximation. We also develop a maximum marginal likelihood estimation method for model hyperparameters. We demonstrate the efficiency and utility of these models for estimating linear and non-linear effects through analyses of real and simulated sequence count data.
Keywords
Bayesian Statistics
Nonlinear Regression
Gaussian Processes
Microbiome Data
Gene Expression Data
Abstracts
CNVs are DNA gains or losses involving ≥50 base pairs. Estimating CNV association effects requires considering a few factors, e.g., 1) variations in CNV dosage and length need to be accounted for; and 2) all CNVs in a genomic region should be jointly assessed. Here we propose a penalized regression model for CNV association analysis. We model an individual's CNVs as a piecewise constant curve to naturally capture CNV length and dosage. To jointly model all CNVs in a genomic region, we use a Lasso penalty to select CNVs associated with the outcome and integrate a weighted fusion penalty to encourage similar effects of adjacent CNVs when supported by the data. Our simulations show that, compared to the baseline methods (Lasso and gBridge), the proposed model more effectively identifies causal CNVs without introducing additional false positives and yields more precise effect size estimates across simulation settings. In the real data application to identify CNVs associated with Alzheimer's Disease (AD), the CNVs identified by our method overlap genes that are significantly enriched in pathways related to neuron structure and neuron function, and they yield higher predictive accuracy.
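One generic way to write a Lasso-plus-weighted-fusion objective of the kind described is shown below. All notation is assumed for illustration: the loss, the segment index j, the weights w_j, and in particular the squared-difference form of the fusion term are not specified in the abstract.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \ell(\beta;\, y, X)
  \;+\; \lambda_1 \sum_{j=1}^{J} \lvert \beta_j \rvert
  \;+\; \lambda_2 \sum_{j=1}^{J-1} w_j \,\bigl(\beta_{j+1} - \beta_j\bigr)^2
```

Here \ell is a loss such as a negative log-likelihood, X_{ij} would encode individual i's dosage on the j-th segment of the piecewise constant curve, the first penalty performs selection, and the second shrinks effects of adjacent segments toward each other with strength governed by w_j.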
Keywords
Penalized Regression
Association
Weighted Fusion
Lasso
Effect estimation
Copy number variants
Abstracts
Co-Author(s)
Wenbin Lu, North Carolina State University
Albert Tucci
Hui Wang, Perelman School of Medicine, University of Pennsylvania
Yuhuan Cheng
Li-San Wang, Perelman School of Medicine, University of Pennsylvania
Gerard Schellenberger, Perelman School of Medicine, University of Pennsylvania
Wan-Ping Lee, Perelman School of Medicine, University of Pennsylvania
Jung-Ying Tzeng, North Carolina State University
First Author
Yaqin Si
Presenting Author
Yaqin Si
Image synthesis is an important and growing field of research fueled by rapid progress in image-based artificial intelligence and has been employed in neuroimaging, multiplexed immunofluorescence (MxIF), and imaging spatial transcriptomics. The number of publications on image synthesis has nearly tripled in the last decade, but there is no evaluation of the consequences of using it in medical research. Currently, biomedical image synthesis 1) does not include relevant clinical information, and 2) fails to provide statistical uncertainty. As postulated by multiple imputation theory, neglecting either issue can lead to invalid downstream statistical analysis. In this paper, we systematically examine these issues in state-of-the-art image synthesis algorithms with real-world imaging data. We demonstrate that 1) current imputation tools often lead to biased point estimates and anti-conservative standard errors, and 2) such issues can be alleviated by simple, post-hoc augmentation steps derived from the multiple imputation literature. This work is a pioneer in highlighting invalid findings on synthesized biomedical imaging data and in providing expeditious solutions.
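The abstract does not spell out its augmentation steps. As background, the standard way multiple imputation propagates uncertainty is Rubin's rules, sketched here under the assumption that the same analysis has been run on m independently synthesized (imputed) datasets; the function name is hypothetical.

```python
import math
from statistics import mean, variance

def rubin_pool(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)
```

Analyzing a single synthesized image as if it were observed corresponds to dropping the between-imputation term, which is one way standard errors become anti-conservative.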
Keywords
Statistical imaging
Machine learning
Spatial genomics
Multiple imputation
Abstracts
Probabilistic graphical models are powerful tools to infer, interpret, and visualize complex biological systems. However, most existing graphical models assume homogeneity across samples, limiting their application in heterogeneous contexts, e.g., tumor and spatial heterogeneity. We propose a general and flexible Bayesian approach called Graphical Regression (GraphR), which incorporates intrinsic heterogeneity at different scales (discrete, continuous, and spatial), enables sparse network estimation at the sample-specific level, has higher precision than existing approaches, and is computationally efficient for analyses of large genomic datasets. We employ GraphR to analyze four diverse multiomic and spatial transcriptomics datasets to infer inter- and intra-sample genomic networks and delineate several novel biological discoveries. We have developed the GraphR R-package and a user-friendly Shiny App for analysis and dynamic network visualization.
Keywords
Heterogeneous graphical models
Genomics
Spatial transcriptomics
Variable selection
Variational Bayes
Abstracts
With high throughput technologies, investigators can measure genetic variations in multiple forms. New methods are needed to interrogate the relationship between genomic variations and endpoints of interest. We formerly developed the POST procedure to associate gene sets/pathways with a clinical variable. We used similar dimension reduction machinery on each form of omics data at a locus/gene to collectively test the association between multiform genomic data and an endpoint of interest. The probe level signals of each form of omics data at a locus/gene are first projected to an orthogonal subspace and the corresponding eigenvalues are rescaled to sum to 1 for each form of omics data. The projected data are then subjected to a parametric association test to obtain z-statistics. The test statistic is defined as a weighted sum of squares of the individual z-statistics. The correlation structure of the z-statistics is approximated by bootstrap resampling, and the p-value is approximated using a generalized χ2 distribution. We investigated the performance in simulation studies and applied the proposed method to a gene profiling and methylation data set of 187 pediatric AML cases from NCI TARGET.
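The statistic described above can be sketched as follows, assuming (the abstract does not say so explicitly) that the weights are the rescaled eigenvalues of each omics form. The bootstrap estimate of the correlation structure and the generalized χ2 p-value are omitted; function names are hypothetical.

```python
def rescale(eigenvalues):
    """Rescale one omics form's eigenvalues so that they sum to 1."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

def weighted_sum_of_squares(z_by_form, eigen_by_form):
    """Weighted sum of squared z-statistics across omics forms at one locus/gene."""
    stat = 0.0
    for z, ev in zip(z_by_form, eigen_by_form):
        weights = rescale(ev)  # assumed weights: rescaled eigenvalues
        stat += sum(w * zk**2 for w, zk in zip(weights, z))
    return stat
```

Because each form's weights sum to 1, a form with many projected components does not dominate the statistic purely through its dimension.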
Keywords
Genomics
Gene profiling
Orthogonal projection
Data integration
Abstracts
Polygenic risk scores (PRS) predict genetic risk for complex traits by tallying cumulative risk alleles at genetic markers, using estimates from genome-wide association studies (GWAS). Biases in GWAS, like Winner's Curse where effect sizes of significant variants are overestimated, impact downstream analyses. This study assesses Winner's Curse impact on PRS and explores potential improvements by adjusting effect sizes for this bias. Using simulated GWAS summary statistics and genotype data for a million markers in linkage disequilibrium (LD), three PRS sets are derived with varying markers using clumping and p-value thresholding. PRS performance is compared between original and Winner's Curse-adjusted summary statistics, employing methods like Empirical Bayes and FDR Inverse Quantile Transformation for correction. Adding more markers in the original PRS significantly increases variance, whereas the adjusted PRS variance is more controlled, especially with over 100 markers. This study demonstrates Winner's Curse impact on PRS and underscores that adjusting for this bias enhances reliability, especially with over 100 markers.
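The p-value thresholding step of a PRS can be sketched as below: a score is the sum of risk-allele counts weighted by GWAS effect estimates, restricted to markers passing the threshold. LD clumping and the Winner's Curse adjustments are not shown, and the function and argument names are hypothetical.

```python
def polygenic_score(genotypes, betas, p_values, p_threshold):
    """Per-individual PRS using only markers with GWAS p-value <= p_threshold.

    genotypes: one list of allele counts (0, 1, or 2) per individual.
    """
    keep = [j for j, p in enumerate(p_values) if p <= p_threshold]
    return [sum(betas[j] * person[j] for j in keep) for person in genotypes]
```

Replacing `betas` with Winner's Curse-adjusted effect sizes leaves the selected markers unchanged and alters only the weights, which is the comparison the abstract describes.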
Keywords
statistical genetics
polygenic risk scores
Abstracts
Mediation analysis is a powerful tool for studying causal pathways between exposure, mediator, and outcome variables of interest. While classical mediation analysis using observational data often requires strong and sometimes unrealistic assumptions, such as unconfoundedness, Mendelian Randomization (MR) avoids unmeasured confounding bias by employing genetic variants as instrumental variables. We develop a novel MR framework for mediation analysis with genome-wide association study (GWAS) summary data, and provide solid statistical guarantees. Our framework efficiently integrates information stored in three independent GWAS summary data sets and mitigates not only the commonly encountered winner's curse and measurement error bias in MR, but also the loser's curse and the imperfect IV selection issue, which are specific to mediation analysis. Our method is also immune to measurement error bias as the estimating equations are carefully adjusted by incorporating estimated conditional variances of the Rao-Blackwellized association effects. Through our theoretical investigations, we show that the proposed method delivers consistent and asymptotically normally distributed effect estimates.
Keywords
Inverse Variance Weighting
Post-selection Inference
Instrumental Variable
Causal Mediation Analysis
Multivariable Mendelian Randomization
Abstracts
The goal of this study is to develop an assay to detect and differentiate methylated circulating tumor DNA (ctDNA) from 8 common cancer types using blood plasma. Tumor and normal tissue sample 450k CG DNA methylation data from The Cancer Genome Atlas (TCGA, n = 9,423) are used to select genomic regions where CGs are hypermethylated in cancer tissue. CGs are selected using a multinomial elastic net model on 70% of the TCGA data, where hyperparameters are selected using the harmonic mean of model accuracy and variable stability. A model built on training data using the selected CGs correctly classifies cancer types and normal tissue with an average of 93% accuracy on the remaining 30% of data. A final set of 341 genomic loci is selected for use in the assay. Preliminary assay results show that each locus yields an average of 1000 uniquely sequenced DNA molecules per sample, which is critical to detect the low levels of ctDNA expected in blood plasma in early stages of cancer. We plan to build a classification model using in silico titrations of methylated DNA into data characterized from healthy donor blood plasma, then test the model accuracy using blood plasma from cancer patients.
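The hyperparameter criterion mentioned above can be sketched as follows. How variable stability is measured is not stated in the abstract, so it is treated here as a given number in [0, 1]; the tuple layout and function names are hypothetical.

```python
def harmonic_mean(accuracy, stability):
    """Harmonic mean of two scores in [0, 1]; low if either score is low."""
    if accuracy + stability == 0:
        return 0.0
    return 2 * accuracy * stability / (accuracy + stability)

def pick_hyperparameters(candidates):
    """candidates: list of (params, accuracy, stability); return the best params."""
    return max(candidates, key=lambda c: harmonic_mean(c[1], c[2]))[0]
```

Unlike an arithmetic mean, this criterion does not let high accuracy compensate for an unstable selected variable set: a setting with accuracy 0.95 and stability 0.40 scores about 0.56, below one with 0.90 and 0.60 at 0.72.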
Keywords
cancer
liquid biopsy
bioinformatics
feature selection
classification
penalized regression
Abstracts
RNAs are versatile regulators of gene expression. RNA secondary structures are known to be important for regulatory functions by various types of RNAs. We developed a statistical algorithm to sample rigorously and exactly from the Boltzmann ensemble of secondary structures. The algorithm is the basis for our Sfold RNA folding software (http://sfold.wadsworth.org).
MicroRNAs are small non-coding RNAs that repress protein synthesis by binding to target mRNAs in multicellular eukaryotes. N6-methyladenosine (m6A) is the most prevalent modification in eukaryotic messenger RNAs. Through statistical analyses of high throughput data, we found that the level of miRNA-mediated target suppression is significantly enhanced when m6A is present on target mRNAs, suggesting functional significance of m6A modification in posttranscriptional gene regulation by microRNAs. We also found that methylated targets have more stable structure than non-methylated targets. We propose a model in which m6A alters local target secondary structure to increase accessibility for efficient binding by Argonaute proteins, leading to enhanced miRNA-mediated regulation.
Keywords
RNA
Secondary structure prediction
algorithm
Abstracts
First Author
Ye Ding, Wadsworth Center, New York State Department of Health
Presenting Author
Ye Ding, Wadsworth Center, New York State Department of Health
With the burgeoning interest in pleiotropy, where a single genetic variant affects multiple traits, the PLACO method was proposed to identify pleiotropic variants between two case-control traits, inclusive of sample overlap scenarios. We introduce the modified PLACO method, a novel scalable statistical approach based on GWAS summary statistics data for enhanced detection of pleiotropic variants across correlated quantitative or qualitative traits. By testing the composite null hypothesis that a variant is linked to at most one trait, the modified PLACO effectively controls type 1 errors and increases detection power for pleiotropy, especially in highly correlated traits. Applied to two lipid traits (triglyceride and HDL levels), it unveils shared genetic regions that were overlooked by conventional methods and later validated by larger datasets. This demonstrates its ability to discover novel associations that are often missed due to small sample sizes. This study highlights modified PLACO's potential for discovering novel genetic associations and offers a robust framework for pleiotropy analysis of two traits, regardless of their correlation or sample overlap.
Keywords
GWAS
composite null hypothesis
pleiotropy
Abstracts
Co-Author
Debashree Ray, Johns Hopkins University
First Author
Jiwon Park, Johns Hopkins Bloomberg School of Public Health
Presenting Author
Jiwon Park, Johns Hopkins Bloomberg School of Public Health
Evaluating the effect of a treatment on an outcome via a mediator has received growing attention in clinical and genetic studies. Traditional mediation effect testing methods, including the Wald-type Sobel's test and the Joint Significance test, suffer from overconservative type-I-error and low power under a large number of composite null hypotheses. The recently developed divide-aggregate-composite-null test (DACT) properly controls the type-I-error with high power when any one of its composite null cases has a proportion close to 1. But DACT's performance in other settings is unclear. We showed that under unfavorable settings, when no case has proportion close to 0 or when the effect size is large, DACT will fail to control the type-I-error, even with its default normal calibration under Efron's empirical null framework. We proposed a new calibration involving a three-component mixture model for DACT. We controlled the type-I-error while preserving high power compared with state-of-the-art testing methods under both favorable and unfavorable settings. A new procedure for estimating null proportions and a variation of DACT are proposed to boost null estimation accuracy and power.
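For reference, the two traditional tests named above can be sketched from the exposure-to-mediator effect `a` and the mediator-to-outcome effect `b` with their standard errors. These are the textbook forms, not DACT or the proposed calibration; the function names are hypothetical, and the Sobel statistic is undefined when both effects are exactly zero.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def sobel_test(a, se_a, b, se_b):
    """Sobel z-statistic and p-value for the indirect effect a * b."""
    z = (a * b) / math.sqrt(a**2 * se_b**2 + b**2 * se_a**2)
    return z, two_sided_p(z)

def joint_significance_p(a, se_a, b, se_b):
    """Joint Significance (MaxP) p-value: the larger of the two component p-values."""
    return max(two_sided_p(a / se_a), two_sided_p(b / se_b))
```

With a = 0.5, b = 0.4 and both standard errors 0.1, the Sobel p-value is larger than the Joint Significance p-value, consistent with the Sobel test being the more conservative of the two.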
Keywords
mediation effect
indirect effect
divide-aggregate composite-null test
mixture model
null proportion estimation
composite null hypothesis
Abstracts
If two haplotypes share the same alleles for an extended gene tract, these haplotypes are likely to be identical-by-descent (IBD) from a recent common ancestor. The length distribution of IBD segments can be informative about recent demographic changes and strong positive selection. The data are correlated via the unobserved ancestral tree and recombination processes, which commonly presents challenges to the derivation of theoretical results in population genetics. Under interpretable regularity conditions, we show that the proportion of detectable IBD segments at a locus (the IBD rate) is normally distributed for large sample size and large scaled population size. We use efficient and exact simulations to study the non-normality of the IBD rate in finite samples and its implications for downstream statistical inference. Specifically, we discuss our suite of IBD-based statistical methods designed to detect selection and estimate selection coefficients. We indicate that genome-wide scans and selection coefficient estimation based on the IBD rate may be subject to slightly conservative Type 1 error control and loose confidence intervals. Using samples of predominantly European ancestry from the TOPMed project, we apply our methods to model recent adaptive evolution at the LCT gene.
Keywords
adaptive evolution
recent relatedness
coalescent models
parametric bootstrap
population genetics
Abstracts
Presenting Author
Yuzheng Dun, Johns Hopkins University
We are interested in assessing ~300 maize genes, selected based on genomic data, for mutations that affect the biological fitness of maize pollen. For each gene, a 1:1 mix of mutant and wild-type pollen is crossed onto a non-mutant ear. In an offspring maize ear, any deviation from a 1:1 proportion between wild-type and mutant kernels would suggest that the associated mutation changes the fitness of the pollen. To detect genes that affect fitness, a generalized linear model (GLM) is used to test if mutations significantly deviated from the 1:1 proportion. The model assumes a quasi-binomial distribution to account for variation across maize ears. For the 30 mutations found to reduce fitness, we also investigate the idea that altered pollen fitness will result in a non-uniform spatial distribution of mutant/wild-type kernels on an ear. A spatial analysis using GLM is therefore conducted on each fitness-altering allele to test for non-random spatial patterns of mutant versus wild-type kernels on a maize ear, such as a gradient effect. Consistent with our motivating idea, results identify several alleles that produce a non-random spatial pattern.
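A minimal sketch of the 1:1 test for a single mutation, assuming an intercept-only quasi-binomial model across ears with a Pearson estimate of the dispersion. The abstract does not give the model's covariates or reference distribution; a normal reference is used here for simplicity, although a t reference is more usual when the dispersion is estimated. Names are hypothetical.

```python
import math

def quasi_binomial_test(mutant_counts, total_counts, p0=0.5):
    """Test H0: P(mutant kernel) = p0 using per-ear counts with overdispersion."""
    n_ears = len(total_counts)
    total = sum(total_counts)
    p_hat = sum(mutant_counts) / total  # pooled mutant proportion
    # Pearson chi-square divided by residual degrees of freedom estimates dispersion
    pearson = sum((y - n * p_hat)**2 / (n * p_hat * (1 - p_hat))
                  for y, n in zip(mutant_counts, total_counts))
    phi = pearson / (n_ears - 1)
    # Standard error of the logit-scale intercept, inflated by the dispersion
    se = math.sqrt(phi / (total * p_hat * (1 - p_hat)))
    z = (math.log(p_hat / (1 - p_hat)) - math.log(p0 / (1 - p0))) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_hat, phi, z, p_value
```

A dispersion estimate above 1 widens the standard error relative to a plain binomial test, which is how ear-to-ear variation is accounted for; the sketch requires at least two ears and some variation between them.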
Keywords
Generalized Linear Models
Genotype-phenotype
Quasi-binomial regression
Biological fitness
Spatial Pattern
Abstracts
Spatial transcriptomics emerges as a groundbreaking technology, enabling simultaneous profiling of gene expression and spatial orientation within biological tissues. Yet, when analyzing spatial transcriptomics data, effective integration of expression and spatial information poses considerable analytical challenges. Although many methods have been developed to address this issue, many are platform-specific and lack the general applicability to analyze diverse datasets. In this article, we propose a novel method called Weighted Ensemble method for Spatial Transcriptomics (WEST) that utilizes ensemble techniques to improve the performance and robustness of spatial transcriptomics data analytics. We compare the performance of WEST with five popular methods on both synthetic and real-world datasets. WEST represents a significant advance in detecting spatial domains, offering improved accuracy and flexibility compared to existing methods, making it a valuable tool for spatial transcriptomics data analytics.
Keywords
spatial transcriptomics
Visium
seqFISH
ensemble learning
deep learning
Abstracts