Tuesday, Aug 5: 10:30 AM - 12:20 PM
0355
Invited Paper Session
Music City Center
Room: CC-101A
Applied
Yes
Main Sponsor
Section on Statistical Computing
Co Sponsors
Section on Statistical Learning and Data Science
Section on Statistics and Data Science Education
Presentations
In this talk, we will discuss the inference challenges in phylogenetics, a field whose overall goal is to reconstruct the Tree of Life: the graphical representation of the evolutionary process from the origin of life to the diversity we see today. This tree is estimated from genomic sequences, and the inference procedure is computationally intensive, involving unstable numerical optimization and inefficient heuristic traversal of tree space. We will highlight recent advances in phylogenetic inference via our new organization, JuliaPhylo, which encompasses several Julia packages aimed at the evolutionary biology community.
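As a flavor of the workflow these packages support, here is a minimal sketch using PhyloNetworks.jl, one of the JuliaPhylo packages. The toy tree is our own, and the function names follow the package's long-documented API in pre-1.0 releases; they may differ in newer versions.

```julia
using PhyloNetworks  # one of the JuliaPhylo packages

# A small rooted tree in Newick format (toy example, not real data)
tree = readTopology("((A:1.0,B:1.0):0.5,(C:0.8,D:0.8):0.7);")

tipLabels(tree)  # => ["A", "B", "C", "D"]
```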
Keywords
phylogenetics
networks
evolution
trees
Traditional approaches to mixed effects models using generalized least squares or expectation-maximization approaches struggle to scale to datasets with many thousands of observations and hundreds of levels of a single blocking variable. Special casing of nesting or crossing of random effects is required to achieve acceptable computational performance, but this special casing often makes it very difficult to handle less-than-idealized cases, such as partial crossing or multiple levels of nesting. In contrast, an approach based on penalized least squares can take advantage of sparse matrix methods to scale to models with millions of observations and handles nesting and crossing of random effects in a general way. This approach was initially demonstrated by the lme4 package in R.
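To make the penalized least squares idea concrete, the sketch below solves the PLS system for a single random intercept per group; the sparse indicator matrix Z keeps the system cheap to form and factor even with many levels. This is our own illustration of the general technique, not code from lme4 or MixedModels.jl.

```julia
using LinearAlgebra, SparseArrays, Random

# Minimal penalized least squares (PLS) step for a random-intercept model,
# y = Xβ + Zb + ε with b ~ N(0, σ²θ²I); simulated data, illustration only.
Random.seed!(1)
n, g = 10_000, 500                    # observations and grouping levels
grp = rand(1:g, n)
X = hcat(ones(n), randn(n))           # fixed-effects design matrix
Z = sparse(1:n, grp, ones(n), n, g)   # sparse indicator matrix of the blocking factor
y = X * [1.0, 2.0] .+ randn(g)[grp] .+ randn(n)

θ = 1.0                               # relative standard deviation of the random effect
# Blocked PLS system: [θ²Z'Z + I  θZ'X; θX'Z  X'X] [u; β] = [θZ'y; X'y]
A = [sparse(θ^2 * (Z'Z) + I)  sparse(θ * (Z'X));
     sparse(θ * (X'Z))        sparse(X'X)]
rhs = vcat(θ * (Z'y), X'y)
sol = cholesky(Symmetric(A)) \ rhs    # sparse Cholesky via CHOLMOD
β = sol[end-1:end]                    # fixed-effects estimates (given θ)
```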
More recently, the MixedModels.jl package in Julia has expanded upon this foundation. Julia helps to solve the "two language problem": the entirety of MixedModels.jl is written in Julia, unlike lme4, which mixes R and C++ code. Keeping everything in one language makes it much easier to experiment with potential computational enhancements, such as optimizing the storage of various matrices and intermediate quantities. It also makes it easier to onboard other developers, such as myself, as productive collaborators and maintainers. Finally, Julia's use of multiple dispatch is particularly useful: we are able to use specialized methods for particular patterns of sparsity that arise in the penalized least squares formulation, which offers an additional performance improvement over relying on generic sparse matrix methods.
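The multiple-dispatch point can be illustrated with a toy example (ours, not code from MixedModels.jl): one generic function gets a cheap structure-aware method alongside a generic fallback, and Julia selects the method from the argument types.

```julia
using LinearAlgebra

# Generic fallback: works for any matrix types
rankupdate!(C::AbstractMatrix, A::AbstractMatrix) = (C .+= A * A'; C)

# Specialized method when the structure is known: for diagonal A,
# A*A' is diagonal, so only the diagonal of C needs updating (O(n) work)
rankupdate!(C::Diagonal, A::Diagonal) = (C.diag .+= abs2.(A.diag); C)

C = Diagonal(ones(3))
rankupdate!(C, Diagonal([1.0, 2.0, 3.0]))  # dispatches to the O(n) method
```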
As a demonstration of the capabilities that Julia unlocks in this domain, I will present a model fit to the MovieLens data (Harper and Konstan, 2015) containing 32 million observations and partially crossed random effects with 200,948 and 87,585 levels. With scalar-valued random effects, it is possible to fit this model in a few hours on a computer with sufficient memory. It is also possible to fit the model with vector-valued random effects, although the corresponding fit time increases substantially.
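In MixedModels.jl, a model of roughly this shape is specified as below. This is a sketch with a small synthetic stand-in table; the column names for the MovieLens data are our assumption.

```julia
using MixedModels, DataFrames, Random

# Synthetic stand-in for the MovieLens ratings table (column names are our
# guess); the real data contain 32 million rows and far more levels
Random.seed!(1)
ratings = DataFrame(userId  = string.(rand(1:1_000, 50_000)),
                    movieId = string.(rand(1:500, 50_000)),
                    rating  = rand(1.0:0.5:5.0, 50_000))

# Scalar (random-intercept) model with partially crossed random effects
m = fit(MixedModel,
        @formula(rating ~ 1 + (1 | userId) + (1 | movieId)),
        ratings)
```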
Although we are already very excited to be able to fit such large models at all, we want to fit them even faster. Julia enables us to continue algorithmic development in a coherent way. For example, we can consider alternative representations of triangular matrices in order to have a more compact representation in memory, thus both reducing the memory burden and potentially increasing BLAS performance. Given the reliance on sparse matrix methods, MixedModels.jl has been developed with a focus on the CPU rather than the GPU, because of the historically poor performance of GPUs with sparse matrices. However, the Julia ecosystem provides powerful tooling capable of supporting a hybrid approach combining CPU and GPU, without the need to call out to another language.
Keywords
mixed models
penalized least squares
crossed random effects
big data
Julia, a modern open-source programming language for technical computing, delivers superior speed and productivity compared to R or Python, as there is no need to rewrite performance-critical code in a low-level language like C or Fortran. After almost a decade of active development, Julia reached its first major release, v1.0, on August 8, 2018, and is quickly gaining popularity in the scientific computing and data science communities. This talk discusses the challenges and opportunities of teaching Julia in the context of statistical computing. Examples include comparisons of Julia, R, and Python; numerical linear algebra; numerical optimization; parallel/distributed computing; and GPU computing. It draws on the presenter's extensive experience teaching statistical computing and Julia in university classrooms and at conferences.
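As one classroom example of the kind mentioned (our sketch, not material from the talk), the snippet below contrasts two ways of solving a least squares problem in Julia and times them with BenchmarkTools.jl, a natural opening for discussing the speed/stability trade-off in numerical linear algebra.

```julia
using LinearAlgebra, BenchmarkTools

X = randn(10_000, 200)
y = randn(10_000)

# Normal equations via Cholesky: fast, but squares the condition number
β1 = cholesky(Symmetric(X'X)) \ (X'y)

# QR-based solve behind the backslash operator: slower, more stable
β2 = X \ y

@btime cholesky(Symmetric($X' * $X)) \ ($X' * $y);
@btime $X \ $y;
```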
Keywords
Julia
statistical computing
high-performance computing
GPU