“Big-ish” Data in R: Efficient tools for large in-memory datasets

Chair

Kelly Bodwin, California Polytechnic State University

Discussant

Michael Chirico, Google

Organizer

Kelly Bodwin, California Polytechnic State University
Tuesday, Aug 6: 2:00 PM - 3:50 PM
Session 1854: Topic-Contributed Paper Session
Oregon Convention Center
Room: CC-E146
Between the small datasets of classical statistical analysis and the massive databases of distributed systems lies "big-ish" data: datasets that can be read directly into R on a personal computer, but that are large enough to make common data operations slow. This session highlights recent work in developing and testing R tools designed to speed up analysis of such large in-memory datasets, including {arrow}, {data.table}, and {vroom}. We will share insights into the design, development, and maintenance of these tools, as well as examples of their use in real-world applications.
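
As a minimal illustration of the kind of speedup these packages target (not part of the session materials; "flights.csv" is a hypothetical large CSV file):

    # Reading a large CSV with base R vs. two packages featured in this session
    library(data.table)
    library(vroom)

    system.time(df  <- read.csv("flights.csv"))           # base R: single-threaded parser
    system.time(dt  <- data.table::fread("flights.csv"))  # multi-threaded C parser
    system.time(tbl <- vroom::vroom("flights.csv"))       # lazy, indexed reading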

Applied

Yes

Main Sponsor

Section on Statistical Computing

Co-Sponsors

Section for Statistical Programmers and Analysts

Presentations

Apache Arrow: building bridges across the modern data ecosystem

The Arrow project was founded to build bridges across the data ecosystem. The project itself is multifaceted (and still growing!), but at its core is a modern representation for data that many systems have adopted or can accept through efficient interfaces. Among its many sub-projects are the Arrow C++ library and the Arrow R package that extends it. With the Arrow format acting as a bridge (once data have been converted from R's internal representations), we can leverage the performant code in the Arrow C++ library to query data, store to and read from modern data formats, and communicate with other tools that speak Arrow (e.g., DuckDB, a Python process running pyarrow, or Spark). All of this means that R users get the benefit of a huge amount of C++ work: they can keep their favorite R data manipulation patterns (tidyverse, data.table, duckdb) while also speaking broadly to the modern data world through Parquet files and Arrow-based data connections.
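
A hedged sketch of the workflow described above, using the {arrow} R package (the "sales/" directory of Parquet files and its column names are invented for illustration):

    library(arrow)
    library(dplyr)

    ds <- open_dataset("sales/")   # scan Parquet files without loading them into RAM

    ds |>
      filter(year == 2023) |>
      group_by(region) |>
      summarise(total = sum(amount)) |>
      collect()                    # the query runs in Arrow C++; only results return to R

    # Hand the same data to DuckDB via the Arrow format (requires the duckdb package)
    ds |> to_duckdb()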

Speaker

Jonathan Keane, Posit, PBC

Creating a self-sustaining ecosystem for data.table

The R package data.table provides basic data analysis functionality (reading, writing, summarization, reshaping, etc.). Its two main features are (1) highly efficient C code, and (2) concise and expressive R syntax involving square brackets, DT[i, j, by]. The NSF POSE program has funded a project, running 2023-2025, to promote a sustainable open-source ecosystem around data.table. Project activities include creating governance and code-of-conduct documents, new documentation (including translations), outreach and teaching (including travel awards), and new testing infrastructure (for performance and reverse dependencies). The goal is a thriving open-source ecosystem, with a diverse community of contributors, that will continue to maintain and update data.table even after the NSF project ends. In this talk I will discuss our progress toward achieving this goal.
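
The DT[i, j, by] syntax mentioned above, in a minimal self-contained sketch (the column names and data are made up for illustration):

    library(data.table)

    DT <- data.table(id = rep(1:3, each = 4), x = rnorm(12))

    # i selects rows, j computes, by groups: one concise, optimized call
    DT[x > 0, .(mean_x = mean(x), n = .N), by = id]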

Co-Author

Toby Hocking, Northern Arizona University

Speaker

Anirban Chetia, Northern Arizona University

Efficient tools for your tidy workflow

Preparing "tidy" data is the process of cleaning, reshaping, and formatting data to consist of rectangular data with observations in rows and variables in columns. This format is often ideal for data analytics and statistical analysis. In R, the Tidyverse has a defined set of methods to help tidy the data that come with a grammar on how to communicate these methods. However, large data—data that have millions of rows but need to be worked on in-memory—are common, which can require other tools built for large data. In this talk, I will highlight a tidy workflow that uses the data.table R package using the grammar established by the Tidyverse. I will highlight how this package efficiently, concisely, and quickly tidy data, including grouped operations, aggregations, and pivoting on data with 10 million rows and 50 columns. This introduction will provide attendees with resources to start using these tools on their own large data and will highlight the benefits of incorporating data.table into their workflow. 

Speaker

Tyson Barrett, Utah State University