Computational Advances in Statistical Phylogenomics

Xiang Ji Chair
Tulane University
 
Xiang Ji Organizer
Tulane University
 
Guy Baele Organizer
KU Leuven
 
Monday, Aug 4: 8:30 AM - 10:20 AM
0774 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-104E 
The advent of rapid and inexpensive DNA sequencing technologies has resulted in an abundance of data that pose both statistical and computational challenges to conventional phylogenetic methods. For example, the COVID-19 pandemic led to the availability of millions of viral sequences whose rapid analysis had impacts on the practice of public health. Similarly, recent genome sequencing projects (e.g., the 100,000 Genomes Project) have curated massive data sets that hold potential for advancing our understanding of evolutionary processes and the implications of these processes for ecosystem sustainability. In this session, we discuss recent advances in statistical phylogenetic models, statistical and computational inference techniques, and numerical implementations that aim to achieve inference breakthroughs in these scientific research domains. The proposed topics range from phylogenetic tree estimation to phylogenetic network inference, and from topological convergence assessment to its application in pathogen phylodynamic inference.

Applied

Yes

Main Sponsor

Section on Statistical Computing

Co-Sponsors

Biometrics Section
Section on Statistics in Genomics and Genetics

Presentations

A statistical method for massively scalable inference of phylogenetic networks

We will introduce a new divide-and-conquer algorithm to merge large explicit networks of up to 1000 taxa. We conduct a robust simulation study showing that our algorithm matches the accuracy of SNaQ on small datasets while drastically improving runtimes, and that it substantially outperforms heuristic methods on large datasets while remaining computationally feasible. Finally, we illustrate its performance on an empirical dataset of over 1000 plants.

Keywords

networks 

Co-Author(s)

Nathan Kolbow
Claudia Solis-Lemus, University of Wisconsin-Madison

Speaker

Nathan Kolbow

Biological causes and impacts of rugged tree landscapes in phylodynamic inference

Phylodynamic analysis has been instrumental in elucidating the spread and evolutionary dynamics of pathogens and cells. The Bayesian approach to phylodynamics integrates out phylogenetic uncertainty, which is typically substantial in phylodynamic datasets due to low genetic diversity. Bayesian phylodynamic analysis does not, however, scale to modern datasets, partly due to difficulties in traversing tree space. Here, we set out to characterize phylodynamic tree space and assess its impacts on analysis difficulty and key biological inferences. By running extensive Bayesian analyses of 15 classic large phylodynamic datasets and carefully analyzing the posteriors, we find that the posterior landscape in tree space ("tree landscape") is diffuse yet rugged, leading to widespread tree sampling problems that usually stem from a small part of the tree. We develop clade-specific diagnostics to show that a few sequences, including putative recombinants and recurrent mutants, frequently drive the ruggedness and sampling problems, although existing data-quality tests show limited power to detect such sequences. The sampling problems can significantly impact phylodynamic inferences or even distort major biological conclusions; the impact is usually stronger on "local" estimates (e.g., introduction history of a focal clade) than on the "global" parameters (e.g., demographic trajectory) that are governed by the general tree shape. In addition, we demonstrate that heterochronous sampling dates contain considerable information about tree topology, which can conflict with genetic data at a local scale, leading to further complexity in the tree space and systematic discrepancies between Bayesian and commonly used stepwise phylodynamic approaches. We evaluate existing and newly developed MCMC diagnostics, and offer strategies for optimizing MCMC settings and mitigating the impacts of the sampling problems.
Our findings highlight the need for efficient traversal of the rugged tree landscape and suggest directions for developing it, ultimately advancing scalable and reliable phylodynamics.

Keywords

Bayesian phylodynamics

phylogenetic inference

Markov chain Monte Carlo

viral evolution

heterochronous sequences

single-cell sequencing 

Co-Author(s)

Jiansi Gao
Andrew Magee
Luiz Carvalho, Getulio Vargas Foundation
Marius Brusselmans, KU Leuven
Marc Suchard, University of California-Los Angeles
Guy Baele, KU Leuven
Frederick Matsen, Fred Hutchinson Cancer Research Center

Speaker

Jiansi Gao, Fred Hutch Cancer Center

Composite likelihood approaches to phylogenetic inference under the multispecies coalescent

Species-level phylogenetic inference under the multispecies coalescent model remains challenging in standard inference frameworks (e.g., the likelihood and Bayesian frameworks) due to the dimensionality of the space of both gene trees and species trees. Algebraic approaches intended to establish identifiability of species tree parameters have suggested computationally efficient inference procedures that have been widely used by empiricists and that have good theoretical properties, such as statistical consistency. However, such approaches are less powerful than approaches based on the full likelihood. In this talk, I will describe how a composite likelihood approach enables computationally tractable statistical inference of species-level phylogenetic relationships for genome-scale data. In particular, asymptotic properties of estimators obtained in the composite-likelihood framework will be derived, and the utility of the methods will be demonstrated with both simulated and empirical data.
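As a rough illustration of the composite-likelihood idea (the abstract does not specify the talk's construction, so this is not the speaker's method), the sketch below uses a standard multispecies-coalescent result for a four-taxon species tree: a gene tree matches the species-tree quartet with probability 1 - (2/3)exp(-t), and each of the two discordant topologies has probability (1/3)exp(-t), where t is the internal branch length in coalescent units. Treating quartets as if they were independent and multiplying their multinomial likelihoods gives a composite likelihood. All function names and the example counts are hypothetical.

```python
import math


def quartet_probs(t):
    """Gene-tree topology probabilities for one quartet under the MSC.

    t is the internal branch length in coalescent units (assumed >= 0).
    Returns (concordant, discordant_1, discordant_2).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    minor = math.exp(-t) / 3.0
    return (1.0 - 2.0 * minor, minor, minor)


def composite_loglik(quartets):
    """Sum multinomial log-likelihoods over quartets as if independent.

    quartets: iterable of (t, (n_concordant, n_disc_1, n_disc_2)), where the
    counts are hypothetical gene-tree topology counts for that quartet.
    """
    total = 0.0
    for t, counts in quartets:
        for n, p in zip(counts, quartet_probs(t)):
            if n > 0:
                total += n * math.log(p)
    return total


def mle_branch_length(counts):
    """Closed-form maximizer of one quartet's term, truncated at t = 0."""
    n_conc = counts[0]
    n_total = sum(counts)
    if n_conc == n_total:
        return math.inf  # no discordance observed
    # Setting the derivative to zero gives exp(-t) = 1.5 * (1 - n_conc / N).
    ratio = 1.5 * (1.0 - n_conc / n_total)
    return max(0.0, -math.log(ratio))


# Hypothetical counts: 80 concordant and 10 + 10 discordant gene trees.
t_hat = mle_branch_length((80, 10, 10))
print(t_hat, composite_loglik([(t_hat, (80, 10, 10))]))
```

In practice the quartets of a species tree share branches, so the branch lengths would be optimized jointly rather than one quartet at a time; the closed form above is only meant to show the shape of a single term.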

Keywords

Composite likelihood

Phylogenetics

Multispecies coalescent

pseudo likelihood

DNA sequences 

Speaker

Laura Kubatko, The Ohio State University

Detecting emerging pathogen lineages

Because viruses belonging to lineages with higher fitness are expected to transmit rapidly to new hosts before accumulating many substitutions, large numbers of closely related sequences appearing in data are sometimes interpreted as a sign of elevated fitness. Tree statistics inspired by this idea, such as the local branching index, are easily calculated from a given phylogenetic tree (or a distribution of trees). However, epidemiological confounders such as superspreading, host population heterogeneity, or sampling biases may introduce spurious patterns in phylogenetic trees that undermine our ability to identify emerging lineages. To address this, we use stochastic compartmental models to simulate outbreaks and generate distributions of phylogenies under a variety of epidemiological conditions and testing strategies. By characterizing the types of phylogenies expected under these conditions, we determine which signals can be detected.
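For readers unfamiliar with this kind of simulation, the following is a minimal sketch of a stochastic compartmental model: a well-mixed SIR outbreak simulated with the Gillespie algorithm, recording who infected whom so that a transmission tree could be built afterwards. It is an assumption-laden illustration, not the authors' simulator: the model structure, the function name, and the parameters `beta`, `gamma`, and `n_pop` are all hypothetical, and it omits the superspreading, population heterogeneity, and sampling schemes the abstract mentions.

```python
import random


def simulate_sir(beta, gamma, n_pop, seed=0):
    """Gillespie simulation of a well-mixed SIR outbreak (illustrative only).

    beta: transmission rate; gamma: removal rate (must be > 0);
    n_pop: population size. Host 0 is the index case at time 0.
    Returns (infections, removals), where infections is a list of
    (time, infector_id, infectee_id) and removals a list of (time, host_id).
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive so the outbreak can end")
    rng = random.Random(seed)
    t = 0.0
    susceptible = n_pop - 1
    infected = [0]  # ids of currently infectious hosts
    next_id = 1
    infections = [(0.0, None, 0)]
    removals = []
    while infected:
        inf_rate = beta * susceptible * len(infected) / n_pop
        rem_rate = gamma * len(infected)
        total = inf_rate + rem_rate
        t += rng.expovariate(total)  # waiting time to the next event
        if rng.random() < inf_rate / total:
            # Transmission: a uniformly chosen infectious host infects a new one.
            infector = rng.choice(infected)
            infections.append((t, infector, next_id))
            infected.append(next_id)
            next_id += 1
            susceptible -= 1
        else:
            # Removal: a uniformly chosen infectious host recovers or is removed.
            idx = rng.randrange(len(infected))
            infected[idx], infected[-1] = infected[-1], infected[idx]
            removals.append((t, infected.pop()))
    return infections, removals


infections, removals = simulate_sir(beta=2.0, gamma=1.0, n_pop=200, seed=1)
print(len(infections), "hosts infected in total")
```

Repeating the simulation over many seeds and parameter settings yields a distribution of transmission histories, which is the raw material from which distributions of phylogenies could be derived under a chosen sampling scheme.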

Co-Author

Caroline Colijn, Simon Fraser University

Speaker

Alex Beams, Simon Fraser University