Apache Arrow: building bridges across the modern data ecosystem

Presented During: “Big-ish” Data in R: Efficient tools for large in-memory datasets

Jonathan Keane Speaker
Posit, PBC

Tuesday, Aug 6: 2:05 PM - 2:25 PM
Topic-Contributed Paper Session

Oregon Convention Center

The Arrow project was founded to build bridges across the data ecosystem. The project itself is multifaceted (and still growing!), but at its core is a modern representation for data that many systems have adopted or have efficient interfaces to accept. Two of its many sub-projects are the Arrow C++ library and the Arrow R library, which extends it. With the Arrow format acting as a bridge (once data has been converted from R's internal representations), we can leverage the performant code in the Arrow C++ library to query data, read from and write to modern data formats, and communicate with other tools that speak Arrow (e.g., DuckDB, a Python process running pyarrow, Spark). All of this means that R users benefit from a huge amount of C++ work: they can keep using their favorite R data manipulation patterns (tidyverse, data.table, duckdb) while also speaking broadly to the modern data world through Parquet files and Arrow-based data connections.