Storing, Importing, Managing, and Analyzing Large Data Locally with R

Kelly Bodwin, Instructor
California Polytechnic State University
 
Tyson Barrett, Instructor
 
Jonathan Keane, Instructor
Posit, PBC
 
Monday, Aug 4: 8:30 AM - 5:00 PM
CE_14 
Professional Development Course/CE 
Music City Center 
Room: CC-109 
It is increasingly common in academic and professional settings to encounter datasets large enough to exceed the capabilities of standard data processing tools, yet small enough to be stored on a local computer. Recent articles even claim that "the era of big data is over" and that data analysts and researchers should "think small, develop locally, ship joyfully." Such "medium" datasets are instrumental in measuring, tracking, and recording a wide array of phenomena across disciplines such as human behavior, animal studies, geology, economics, and astronomy. In this workshop, we will present modern techniques for handling large local data in R using a tidy data pipeline, encompassing stages from data storage and importing to cleaning, analysis, and exporting data and analyses. Specifically, we will teach a combination of tools from the data.table, arrow, and duckdb packages, with a focus on parquet data files for storage and transfer. By the end of the workshop, participants will understand how to integrate these tools to establish a legible, reproducible, efficient, and high-performance workflow.
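As a small taste of the kind of pipeline described above, the sketch below writes a data frame to a parquet file with arrow, opens it lazily, and queries it through duckdb with dplyr verbs. This is an illustrative example, not workshop material: the use of the built-in mtcars data and the file name are assumptions.

```r
# Minimal sketch, assuming the arrow, duckdb, and dplyr packages are installed.
library(arrow)
library(duckdb)
library(dplyr)

# Write a data frame to a parquet file for compact, portable storage
# (mtcars and the file name are illustrative choices)
write_parquet(mtcars, "mtcars.parquet")

# Open the file lazily with arrow; no data is loaded into memory yet
ds <- open_dataset("mtcars.parquet")

# Hand the arrow dataset to duckdb and query it with familiar dplyr verbs;
# the computation is pushed down to the duckdb engine
ds |>
  to_duckdb() |>
  filter(cyl == 6) |>
  summarize(mean_mpg = mean(mpg)) |>
  collect()
```

Because open_dataset() and to_duckdb() operate lazily, the same pattern scales to parquet files far larger than memory, which is the core idea behind this workflow.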