Wednesday, Aug 6: 2:00 PM - 3:50 PM
0236
Invited Paper Session
Music City Center
Room: CC-205C
Applied
Yes
Main Sponsor
Section on Statistical Computing
Co Sponsors
Section on Statistical Learning and Data Science
Section on Statistics in Genomics and Genetics
Presentations
Quantum advantage has been demonstrated in physics-oriented problems. It remains unclear whether quantum advantage can be established for modern computational biology problems. In this talk, I will introduce a new quantum machine learning algorithm for analyzing single-cell multi-omics data. The proposed algorithm takes advantage of quantum parallelism to enable fast computation. Theoretical results are derived to show the advantages of the proposed algorithm in terms of estimation error and computational complexity. Simulations suggest that our algorithm is effective in a wide range of settings.
Keywords
single-cell experiments
quantum computing
model selection
Grover's algorithm
quantum counting
bioinformatics
Speaker
Ping Ma, University of Georgia
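The keywords for this talk mention Grover's algorithm. The abstract does not describe the speaker's method, so the following is only a textbook illustration of Grover search, simulated classically in plain Python. The function name `grover_search`, the problem size, and the marked index are all hypothetical choices made for this sketch. It shows the quadratic speedup in its simplest form: roughly (pi/4)·sqrt(N/M) oracle calls suffice to find one of M marked items among N with high probability.

```python
import math


def grover_search(n_items, marked):
    """Classically simulate Grover search over n_items basis states.

    Returns the number of Grover iterations used and the probability
    of measuring one of the marked states afterwards.
    """
    marked = set(marked)
    # Start in the uniform superposition over all basis states.
    amps = [1.0 / math.sqrt(n_items)] * n_items
    # Near-optimal iteration count: floor(pi/4 * sqrt(N / M)).
    iterations = int(math.floor(math.pi / 4 * math.sqrt(n_items / len(marked))))
    for _ in range(iterations):
        # Oracle: flip the sign of every marked amplitude.
        amps = [-a if i in marked else a for i, a in enumerate(amps)]
        # Diffusion: reflect every amplitude about the mean amplitude.
        mean = sum(amps) / n_items
        amps = [2.0 * mean - a for a in amps]
    success = sum(amps[i] ** 2 for i in marked)
    return iterations, success


# Hypothetical example: 1024 items, one marked item at index 37.
iterations, success = grover_search(1024, [37])
print(iterations, success)
```

A classical scan would need on the order of N oracle queries in the worst case, whereas the simulated circuit above uses on the order of sqrt(N) iterations.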
The introduction of genome engineering technology has transformed biomedical research, making it possible to make precise changes to genetic information. However, creating an efficient gene-editing system requires a deep understanding of CRISPR technology and of the complex experimental systems under investigation. While Large Language Models (LLMs) have shown promise in various tasks, they often lack specific knowledge and struggle to accurately solve biological design problems. In this work, we introduce CRISPR-GPT, an LLM agent augmented with domain knowledge and external tools to automate and enhance the design process of CRISPR-based gene-editing experiments. CRISPR-GPT leverages the reasoning ability of LLMs to facilitate the process of selecting CRISPR systems, designing guide RNAs, recommending cellular delivery methods, drafting protocols, and designing validation experiments to confirm editing outcomes. We showcase the potential of CRISPR-GPT for assisting non-expert researchers with gene-editing experiments from scratch and validate the agent's effectiveness in a real-world use case. Furthermore, we explore the ethical and regulatory considerations associated with automated gene-editing design, highlighting the need for responsible and transparent use of these tools. Our work aims to bridge the gap between beginner biological researchers and CRISPR genome engineering techniques, and to demonstrate the potential of LLM agents in facilitating complex biological discovery tasks.
Keywords
LLM, CRISPR-GPT, genome engineering
With the rapid development of artificial intelligence, causal inference with observational data has drawn much attention in various scientific domains. A key challenge in this area is the often-violated ignorability assumption, which is critical for unbiased estimation of causal effects but very hard to check in practice. To address this challenge, we develop Double Generative Learning (DGL), a novel approach that leverages the capabilities of generative adversarial networks (GANs) for robust causal inference under the violation of ignorability. By employing a dual-GAN structure, DGL emulates data akin to randomized controlled trials (RCTs) solely based on observational studies, circumventing the biases introduced by unobserved confounders. This methodology not only offers a solution to the issue of ignorability violation by achieving minimax optimality in robustness but also handles high-dimensional and complex data structures. Theoretical analysis reveals DGL's capacity to bypass the curse of dimensionality by exploiting the inherent low-dimensional submanifold structures in the data. Extensive simulation studies and analyses of real-world datasets demonstrate DGL's empirical superiority in facilitating robust causal inference under adverse conditions.
Keywords
Average treatment effect; Observational study; Ignorability; Randomized controlled trials; Curse of dimensionality; Generative adversarial networks
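To make the ignorability problem in the abstract above concrete, here is a small plain-Python simulation. It is not DGL and uses no GANs; it only illustrates, under assumed parameters, why an unobserved confounder biases a naive difference in means while randomized assignment does not. The function name `difference_in_means`, the true effect of 2.0, the confounder weight of 3.0, and the sample size are all hypothetical values chosen for this sketch.

```python
import random


def difference_in_means(n, randomized, seed=0):
    """Simulate outcomes and return the treated-minus-control mean difference.

    Assumed data-generating process (illustrative only):
      u ~ N(0, 1)                      unobserved confounder
      y = 2.0 * t + 3.0 * u + noise    true treatment effect is 2.0
    If randomized is False, treatment depends on u, violating ignorability.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        if randomized:
            t = rng.random() < 0.5            # coin-flip assignment (RCT)
        else:
            t = u + rng.gauss(0.0, 1.0) > 0   # assignment driven by u
        y = 2.0 * t + 3.0 * u + rng.gauss(0.0, 1.0)
        (treated if t else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)


naive = difference_in_means(20000, randomized=False)
rct = difference_in_means(20000, randomized=True)
print("observational estimate:", naive)
print("randomized estimate:", rct)
```

Under these assumptions the randomized estimate lands near the true effect of 2.0, while the observational estimate is inflated because treated units tend to have larger values of the unobserved confounder.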
Recent advances in spatially resolved transcriptomics (SRT) have illuminated gene co-expression networks in spatial contexts, offering insights into disease mechanisms. However, current methods, mainly designed for single-cell studies, tend to overlook the intricate interactions between spatial location and gene expression networks, and none of them can handle the increasingly prevalent large-scale datasets. To address these limitations, we propose a novel matrix-normal-based method, spMGM, for inferring gene co-expression networks in SRT studies. spMGM accounts for intricate interactions between spatial context and gene expression. In extensive simulations, both model-based and non-model-based, spMGM accurately recovers the underlying gene co-expression network, improving accuracy by 40% to 50% compared to existing methods. Moreover, spMGM can efficiently handle large-scale datasets such as 10x Xenium, running 10 times faster than the most advanced method. Applying spMGM to breast cancer tissue demonstrates its ability to detect breast cancer-related hub genes that have not been identified by the other methods.
Keywords
Matrix normal Graphical Model
Gene Spatial co-expression
Many medical diagnoses represent heterogeneous conditions that combine a number of subtypes before clinical presentation. Clustering analyses of patients with such diagnoses may reveal these underlying subtypes and help in the development of more homogeneous clinical phenotypes, which can be targeted by more specific treatments to prevent disease progression. We present a nonparametric machine learning approach to clustering patients based on the Random Forest algorithm, which accommodates the mixed variable types and skewness of standard medical data. To illustrate the approach, we use cohort data from the Multicenter Osteoarthritis Study and from the similarly designed Osteoarthritis Initiative Study to evaluate subtypes of patients undergoing knee replacement surgery, and we compare the cluster results to those obtained by the k-means clustering algorithm. We find that the Random Forest approach produces clusters with greater interpretability and with less impact from the study design features than the k-means algorithm.
Keywords
Unsupervised Learning
Classification Trees
Biomedical Data
Osteoarthritis