Apache Arrow: building bridges across the modern data ecosystem

Jonathan Keane, Speaker
Posit, PBC
 
Tuesday, Aug 6: 2:00 PM - 3:50 PM
Topic-Contributed Paper Session 
Oregon Convention Center 
The Arrow project was founded to build bridges across the data ecosystem. The project itself is multifaceted (and still growing!), but at its core is a modern representation for data that many systems have adopted or can accept through efficient interfaces. Two of its many sub-projects are the Arrow C++ library and the Arrow R library, which extends it. With the Arrow format acting as a bridge (once data has been converted from R's internal representations), we can leverage the performant code in the Arrow C++ library to query data, read from and write to modern data formats, and communicate with other tools that speak Arrow (e.g., DuckDB, a Python process running pyarrow, Spark). All of this means that R users benefit from a huge amount of C++ work: they can keep using their favorite R data manipulation patterns (tidyverse, data.table, duckdb) while also speaking broadly to the modern data world through Parquet files and Arrow-based data connections.