How do I read a Parquet in R and convert it to an R DataFrame?

北荒 2020-12-28 13:04

I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language.

Is an R reader available? Or is work being done on one?

9 Answers
  • 2020-12-28 13:35

    You can simply use the arrow package:

    install.packages("arrow")
    library(arrow)
    read_parquet("myfile.parquet")
    
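    read_parquet() returns a tibble by default; if you specifically need a base-R data.frame, you can convert the result with as.data.frame() (a minimal sketch, reusing the file name from above):

    library(arrow)

    # read_parquet() returns a tibble
    tbl <- read_parquet("myfile.parquet")

    # convert to a plain base-R data.frame
    df <- as.data.frame(tbl)
    class(df)  # "data.frame"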
  • 2020-12-28 13:35

    If your Parquet data is split across multiple part files, you might need to do something like this:

    data.table::rbindlist(lapply(Sys.glob("path_to_parquet/part-*.parquet"), arrow::read_parquet))
    
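    As an alternative to binding the parts manually, recent versions of the arrow package can treat the whole directory as a single dataset (a sketch, assuming the same directory path as above and that dplyr is installed):

    library(arrow)
    library(dplyr)

    # open the directory of part files as one dataset (reads lazily)
    ds <- open_dataset("path_to_parquet")

    # materialize the full dataset as a tibble
    df <- collect(ds)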
  • 2020-12-28 13:43

    You can use the arrow package for this. It is the same library as Python's pyarrow, but nowadays it is also packaged for R without requiring Python.

    git clone https://github.com/apache/arrow.git
    cd arrow/cpp && mkdir release && cd release
    
    # It is important to statically link to boost libraries
    cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
    make install
    

    Then you can install the R arrow package:

    devtools::install_github("apache/arrow/r")
    

    And use it to load a Parquet file

    library(arrow)
    #> 
    #> Attaching package: 'arrow'
    #> The following object is masked from 'package:utils':
    #> 
    #>     timestamp
    #> The following objects are masked from 'package:base':
    #> 
    #>     array, table
    read_parquet("somefile.parquet", as_tibble = TRUE)
    #> # A tibble: 10 x 2
    #>        x       y
    #>    <int>   <dbl>
    #> …
    