HIGH PERFORMANCE DATA SCIENCE USING JULIA: PROMISE, SUCCESSES, AND CHALLENGES

Hyeonju Kim Chair
NCTR
 
Saunak Sen Discussant
University of Tennessee Health Science Center
 
Saunak Sen Organizer
University of Tennessee Health Science Center
 
Gregory Farage Organizer
 
Tuesday, Aug 5: 10:30 AM - 12:20 PM
0355 
Invited Paper Session 
Music City Center 
Room: CC-101A 

Applied

Main Sponsor

Section on Statistical Computing

Co Sponsors

Section on Statistical Learning and Data Science
Section on Statistics and Data Science Education

Presentations

Julia for phylogenetic inference

In this talk, we will discuss inference challenges in the field of
phylogenetics, whose overall goal is to reconstruct the Tree of Life:
the graphical representation of the evolutionary process from the origin of
life to the diversity we see today. This tree is estimated from genomic
sequences, and the inference procedure is computationally intensive, involving
unstable numerical optimization and inefficient heuristic traversal of tree
space. We will highlight recent advances in phylogenetic inference from our new
organization, JuliaPhylo, which encompasses several Julia packages aimed at the
evolutionary biology community. 
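As a small illustration of the kind of tooling involved, the sketch below reads and inspects a tree with PhyloNetworks.jl, one of the JuliaPhylo packages. This is a hedged example, not code from the talk; the function names follow earlier PhyloNetworks releases and may have changed in newer versions.

```julia
using PhyloNetworks  # a JuliaPhylo package for phylogenetic trees and networks

# Parse a small four-taxon tree from a Newick string (with branch lengths)
tree = readTopology("((A:1.0,B:1.0):0.5,(C:0.7,D:0.7):0.8);")

# List the taxa at the tips of the tree
tipLabels(tree)
```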

Keywords

phylogenetics

networks

evolution

trees

Co-Author

Claudia Solis-Lemus, University of Wisconsin-Madison

Speaker

Joshua Justison

Massive mixed models in Julia

Traditional approaches to mixed effects models using generalized least squares or expectation-maximization struggle to scale to datasets with many thousands of observations and hundreds of levels of a single blocking variable. Special-casing of nesting or crossing of random effects is required to achieve acceptable computational performance, but this special-casing often makes it very difficult to handle less-than-idealized cases, such as partial crossing or multiple levels of nesting. In contrast, an approach based on penalized least squares can take advantage of sparse matrix methods to scale to models with millions of observations, and it handles nesting and crossing of random effects in a general way. This approach was initially demonstrated by the lme4 package in R.

More recently, the MixedModels.jl package in Julia has expanded upon this foundation. Julia helps to solve the "two-language problem": the entirety of MixedModels.jl is written in Julia, unlike lme4, which mixes R and C++ code. Keeping everything in one language makes it much easier to experiment with potential computational enhancements, such as optimizing the storage of various matrices and intermediate quantities. Moreover, it makes it much easier to onboard other developers, such as myself, as productive collaborators and maintainers. Finally, Julia's use of multiple dispatch is particularly useful: we are able to use specialized methods for particular patterns of sparsity that arise in the penalized least squares formulation, which offers an additional performance improvement over relying on generic sparse matrix methods.
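The dispatch mechanism described above can be sketched as follows. This is a minimal illustration, not code from MixedModels.jl itself; `scaleadd!` is a hypothetical function whose sparse method touches only stored entries, while the generic fallback handles any matrix type.

```julia
using LinearAlgebra, SparseArrays

# Generic fallback: works for any matrix type, but visits every entry.
scaleadd!(A::AbstractMatrix, B::AbstractMatrix, α) = (A .+= α .* B; A)

# Specialized method, selected automatically by multiple dispatch when B
# is sparse: iterate only over the stored (nonzero) entries of B.
function scaleadd!(A::AbstractMatrix, B::SparseMatrixCSC, α)
    rows, vals = rowvals(B), nonzeros(B)
    for j in axes(B, 2), k in nzrange(B, j)
        A[rows[k], j] += α * vals[k]
    end
    return A
end
```

The same call site, `scaleadd!(A, B, α)`, picks the fast method whenever the sparsity pattern allows it, with no branching in user code.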

As a demonstration of the capabilities that Julia unlocks in this domain, I will present a model fit to the MovieLens data (Harper and Konstan, 2015) containing 32 million observations and partially crossed random effects with 200,948 and 87,585 levels. With scalar-valued random effects, it is possible to fit this model in a few hours on a computer with sufficient memory. It is also possible to fit the model with vector-valued random effects, although the corresponding fit time increases substantially.
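A model of this shape can be specified concisely with MixedModels.jl. The sketch below uses a small synthetic data set standing in for the ratings data (the `ratings` data frame and its column names are illustrative assumptions, not the MovieLens schema).

```julia
using MixedModels, DataFrames, Random

Random.seed!(1)
n = 1_000
# Synthetic stand-in: one row per rating, with two partially crossed
# grouping factors (user and movie).
ratings = DataFrame(
    user   = rand(string.("u", 1:50), n),   # 50 users
    movie  = rand(string.("m", 1:40), n),   # 40 movies
    rating = randn(n) .+ 3.5,
)

# Scalar random intercepts for user and movie, crossed in a general way:
m = fit(MixedModel, @formula(rating ~ 1 + (1 | user) + (1 | movie)), ratings)
```

The same formula scales to the full data set; only the sizes of the grouping factors change.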

Although we are already very excited to be able to fit such large models at all, we want to fit them even faster. Julia enables us to continue algorithmic development in a coherent way. For example, we can consider alternative representations of triangular matrices that are more compact in memory, both reducing the memory burden and potentially increasing BLAS performance. Given its reliance on sparse matrix methods, MixedModels.jl has been developed with a focus on the CPU rather than the GPU, because of the historically poor performance of GPUs with sparse matrices. However, the Julia ecosystem provides powerful tooling capable of supporting a hybrid approach combining CPU and GPU, without needing to call out to another language.  

Keywords

mixed models

penalized least squares

crossed random effects

big data 

Co-Author

Phillip Alday

Speaker

Phillip Alday

Teaching Statistical Computing Using Julia

Julia, a modern open-source programming language for technical computing, delivers superior speed and productivity compared to R or Python, as high-performance code does not need to be wrapped in a low-level language like C or Fortran. After almost a decade of active development, Julia reached its first major release, v1.0, on Aug 8, 2018, and is quickly gaining popularity in the scientific computing and data science communities. This talk discusses the challenges and opportunities of teaching Julia in the context of statistical computing. Examples include comparisons of Julia, R, and Python; numerical linear algebra; numerical optimization; parallel/distributed computing; and GPU computing. The talk draws on the presenter's extensive experience teaching statistical computing and Julia in university classrooms and at conferences. 
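A classroom-style example of the numerical linear algebra topic mentioned above: a least-squares fit written entirely in Julia's standard library, with no wrapper code. This is a generic illustration of the kind of example such a course might use, not material from the talk itself.

```julia
using LinearAlgebra, Random

Random.seed!(1)
X = randn(1_000, 5)                    # design matrix
β = [1.0, -0.5, 2.0, 0.0, 0.25]        # true coefficients
y = X * β .+ 0.1 .* randn(1_000)       # noisy responses

# The backslash operator performs a QR-based least-squares solve
βhat = X \ y
norm(βhat - β)   # small residual error in the recovered coefficients
```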

Keywords

Julia

statistical computing

high-performance computing

GPU 

Co-Author

Hua Zhou, UCLA

Speaker

Hua Zhou, UCLA