Monday, Aug 4: 8:30 AM - 10:20 AM
0774
Topic-Contributed Paper Session
Music City Center
Room: CC-104E
The advent of rapid and inexpensive DNA sequencing technologies has resulted in an abundance of data that pose both statistical and computational challenges to conventional phylogenetic methods. For example, the Covid-19 pandemic led to the availability of millions of viral sequences whose rapid analysis had impacts on the practice of public health. Similarly, recent genome sequencing projects (e.g., the 100,000 Genomes Project) have curated massive data sets that hold potential for advancing our understanding of evolutionary processes and the implications of these processes for ecosystem sustainability. In this session, we discuss recent advances in statistical phylogenetic models, statistical and computational inference techniques, and numerical implementations that aim to achieve inference breakthroughs in these scientific research domains. The proposed topics range from phylogenetic tree estimation to phylogenetic network inference, and from topological convergence assessment to its application in pathogen phylodynamic inference.
Applied
Yes
Main Sponsor
Section on Statistical Computing
Co Sponsors
Biometrics Section
Section on Statistics in Genomics and Genetics
Presentations
We will introduce a new divide-and-conquer algorithm to merge large explicit networks of up to 1000 taxa. We conduct a robust simulation study showing that our algorithm matches the accuracy of SNaQ on small datasets while drastically improving runtimes, and that it substantially outperforms heuristic methods on large datasets while maintaining computational feasibility. Last, we illustrate its performance on an empirical dataset of over 1000 plants.
Keywords
networks
Phylodynamic analysis has been instrumental in elucidating the spread and evolutionary dynamics of pathogens and cells. The Bayesian approach to phylodynamics integrates out phylogenetic uncertainty, which is typically substantial in phylodynamic datasets due to low genetic diversity. Bayesian phylodynamic analysis does not, however, scale with modern datasets, partly due to difficulties in traversing tree space. Here, we set out to characterize phylodynamic tree space and assess its impacts on analysis difficulty and key biological inferences. By running extensive Bayesian analyses of 15 classic large phylodynamic datasets and carefully analyzing the posteriors, we find that the posterior landscape in tree space ("tree landscape") is diffuse yet rugged, leading to widespread tree sampling problems that usually stem from a small part of the tree. We develop clade-specific diagnostics to show that a few sequences, including putative recombinants and recurrent mutants, frequently drive the ruggedness and sampling problems, although existing data-quality tests show limited power to detect such sequences. The sampling problems can significantly impact phylodynamic inferences or even distort major biological conclusions; the impact is usually stronger on "local" estimates (e.g., introduction history of a focal clade) than on "global" parameters (e.g., demographic trajectory) that are governed by the general tree shape. In addition, we demonstrate that heterochronous sampling dates contain considerable information about tree topology, which can conflict with genetic data at a local scale, leading to further complexity in the tree space and systematic discrepancies between Bayesian and commonly used stepwise phylodynamic approaches. We evaluate existing and newly developed MCMC diagnostics, and offer strategies for optimizing MCMC settings and mitigating impacts of the sampling problems.
Our findings highlight the need for, and point to directions for developing, efficient traversal of the rugged tree landscape, ultimately advancing scalable and reliable phylodynamics.
Keywords
Bayesian phylodynamics
phylogenetic inference
Markov chain Monte Carlo
viral evolution
heterochronous sequences
single-cell sequencing
Species-level phylogenetic inference under the multispecies coalescent model remains challenging in the typical inference frameworks (e.g., the likelihood and Bayesian frameworks) due to the dimensionality of the space of both gene trees and species trees. Algebraic approaches intended to establish identifiability of species tree parameters have suggested computationally efficient inference procedures that have been widely used by empiricists and that have good theoretical properties, such as statistical consistency. However, such approaches are less powerful than approaches based on the full likelihood. In this talk, I will describe how the use of a composite likelihood approach enables computationally tractable statistical inference of the species-level phylogenetic relationships for genome-scale data. In particular, asymptotic properties of estimators obtained in the composite-likelihood framework will be derived, and the utility of the methods developed will be demonstrated with both simulated and empirical data.
Keywords
Composite likelihood
Phylogenetics
Multispecies coalescent
pseudo likelihood
DNA sequences
Because viruses belonging to lineages with higher fitness are expected to transmit rapidly to new hosts before accumulating many substitutions, large numbers of related sequences appearing in data are sometimes interpreted as a sign of elevated fitness. Tree statistics inspired by this idea, such as the local branching index, are easily calculated from a given phylogenetic tree (or a distribution of trees). However, epidemiological confounders like superspreading, host population heterogeneity, or sampling biases may introduce spurious patterns in phylogenetic trees that undermine our ability to identify emerging lineages. To address this, we use stochastic compartmental models to simulate outbreaks and generate distributions of phylogenies under a variety of epidemiological conditions and testing strategies. By characterizing the types of phylogenies expected under these situations, we identify the types of signals we can detect.
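As a rough illustration of what simulating an outbreak with a stochastic compartmental model can look like, the sketch below runs a Gillespie simulation of a closed SIR epidemic. It is not the model used in the talk, which the abstract does not specify: the SIR structure, the function name `simulate_sir`, and the parameters `beta` and `gamma` are all assumptions made for illustration, and the sketch omits superspreading, host heterogeneity, sampling, and the construction of phylogenies.

```python
import random


def simulate_sir(beta, gamma, n_susceptible, n_infected, rng):
    """Gillespie simulation of a closed SIR outbreak (illustrative assumption).

    beta  -- per-contact transmission rate (hypothetical parameter)
    gamma -- per-capita recovery rate, must be positive (hypothetical parameter)
    Returns a list of (time, S, I, R) tuples, one per event.
    """
    s, i, r = n_susceptible, n_infected, 0
    n = s + i + r
    t = 0.0
    history = [(t, s, i, r)]
    while i > 0:
        infection_rate = beta * s * i / n
        recovery_rate = gamma * i
        total_rate = infection_rate + recovery_rate
        # Waiting time to the next event is exponential with the total rate.
        t += rng.expovariate(total_rate)
        # Choose which event occurs in proportion to its rate.
        if rng.random() < infection_rate / total_rate:
            s -= 1
            i += 1
        else:
            i -= 1
            r += 1
        history.append((t, s, i, r))
    return history


if __name__ == "__main__":
    trajectory = simulate_sir(0.5, 0.2, 99, 1, random.Random(1))
    final_time, _, _, recovered = trajectory[-1]
    print(f"outbreak ended at t={final_time:.2f} with {recovered} recovered")
```

Repeating such a simulation under different parameter values yields a distribution of outbreak trajectories; recording who infected whom at each infection event would be one way to extend the sketch toward transmission trees.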