How do I read a Parquet in R and convert it to an R DataFrame?

北荒 2020-12-28 13:04

I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language.

Is an R reader available? Or is work being done on one?

9 Answers
  • 2020-12-28 13:35

    You can simply use the arrow package:

    install.packages("arrow")
    library(arrow)
    read_parquet("myfile.parquet")
    
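    read_parquet() returns a tibble by default; if you specifically need a base-R data.frame, you can convert the result with as.data.frame() (a minimal sketch, reusing the file name from above):

    library(arrow)

    # read_parquet() returns a tibble
    tbl <- read_parquet("myfile.parquet")

    # convert to a plain base-R data.frame
    df <- as.data.frame(tbl)
    class(df)  # "data.frame"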
  • 2020-12-28 13:35

    If your Parquet data is split across multiple part files, you might need to do something like this:

    data.table::rbindlist(lapply(Sys.glob("path_to_parquet/part-*.parquet"), arrow::read_parquet))
    
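    As an alternative to binding the parts manually, recent versions of the arrow package can treat the whole directory as a single dataset (a sketch, assuming the same directory path as above and that dplyr is installed):

    library(arrow)
    library(dplyr)

    # open the directory of part files as one dataset (reads lazily)
    ds <- open_dataset("path_to_parquet")

    # materialize the full dataset as a tibble
    df <- collect(ds)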
  • 2020-12-28 13:43

    You can use the arrow package for this. It is the same library as Python's pyarrow, but nowadays it is also packaged for R without requiring Python.

    git clone https://github.com/apache/arrow.git
    cd arrow/cpp && mkdir release && cd release
    
    # It is important to statically link to boost libraries
    cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
    make install
    

    Then you can install the R arrow package:

    devtools::install_github("apache/arrow/r")
    

    And use it to load a Parquet file

    library(arrow)
    #> 
    #> Attaching package: 'arrow'
    #> The following object is masked from 'package:utils':
    #> 
    #>     timestamp
    #> The following objects are masked from 'package:base':
    #> 
    #>     array, table
    read_parquet("somefile.parquet", as_tibble = TRUE)
    #> # A tibble: 10 x 2
    #>        x       y
    #>    <int>   <dbl>
    #> …
    