I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language.
Is an R reader available? Or is work being done on one?
You can use the arrow package for this. It is the same thing as pyarrow in Python, but nowadays it also comes packaged for R without the need for Python.
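If you just want to read files, a simpler route may be enough: assuming a recent CRAN release of arrow is available for your platform, you can skip the source build below. A minimal sketch:

install.packages("arrow")

library(arrow)
df <- read_parquet("somefile.parquet")  # returns a tibble by default
head(df)

If you prefer a development build instead, compile the C++ library first: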
git clone https://github.com/apache/arrow.git
cd arrow/cpp && mkdir release && cd release
# It is important to link statically against the Boost libraries
cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
make install
Then you can install the R arrow package:
devtools::install_github("apache/arrow/r")
And use it to load a Parquet file:
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
#> The following objects are masked from 'package:base':
#>
#> array, table
read_parquet("somefile.parquet", as_tibble = TRUE)
#> # A tibble: 10 x 2
#> x y
#>
#> …
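Spark usually writes a directory of part-*.parquet files rather than a single file. Newer releases of the arrow R package can scan such a directory lazily via open_dataset() together with the dplyr verbs. A sketch, assuming a recent arrow version and a hypothetical spark_output/ directory:

library(arrow)
library(dplyr)

ds <- open_dataset("spark_output/")  # scans all Parquet part files in the directory
df <- ds %>% collect()               # materialize the whole dataset as a tibble

Also note that in recent CRAN releases the as_tibble argument of read_parquet() appears as as_data_frame (TRUE by default), so the explicit flag above may no longer be needed.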