R read ORC file from S3

Posted by 我怕爱的太早我们不能终老 on 2019-12-11 03:06:08

Question


We will be hosting an EMR cluster (with spot instances) on AWS running on top of an S3 bucket. Data will be stored in this bucket in ORC format. However, we also want to use R as a kind of sandbox environment, reading the same data.

I've got the aws.s3 package (cloudyr) running correctly: I can read CSV files without a problem, but it does not seem to let me convert the ORC files into something readable.

The two options I found online were:
- SparkR
- dataconnector (Vertica)

Since installing dataconnector on a Windows machine was problematic, I installed SparkR, and I am now able to read a local ORC file (R running locally on my machine, ORC file stored locally on my machine). However, if I try read.orc, it normalizes my path to a local path by default. Digging into the source code, I ran the following:

sparkSession <- SparkR:::getSparkSession()
options <- SparkR:::varargsToStrEnv()
read <- SparkR:::callJMethod(sparkSession, "read")
read <- SparkR:::callJMethod(read, "options", options)
sdf <- SparkR:::handledCallJMethod(read, "orc", my_path)

But I obtained the following error:

Error: Error in orc : java.io.IOException: No FileSystem for scheme: https

Could someone help me either with this problem or by pointing to an alternative way to load ORC files from S3?


Answer 1:


Edited answer: you can now read directly from S3 instead of first downloading the file and reading it from the local file system.

At the request of mrjoseph: a possible solution via SparkR (which I initially did not want to use).

# Set the System environment variable to where Spark is installed
Sys.setenv(SPARK_HOME="pathToSpark")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "org.apache.hadoop:hadoop-aws:2.7.1" "sparkr-shell"')

# Set the library path to include path to SparkR
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))

# Set system environments to be able to load from S3
Sys.setenv("AWS_ACCESS_KEY_ID" = "myKeyID", "AWS_SECRET_ACCESS_KEY" = "myKey", "AWS_DEFAULT_REGION" = "myRegion")

# load required packages
library(aws.s3)
library(SparkR)

## Create a Spark context and a SQL context
## (sparkR.init and sparkRSQL.init are deprecated since Spark 2.0 in favour of sparkR.session)
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# Set path to file
path <- "s3n://bucketname/filename.orc"

# Set the Hadoop configuration so that s3n:// paths resolve and authenticate
hConf <- SparkR:::callJMethod(sc, "hadoopConfiguration")
SparkR:::callJMethod(hConf, "set", "fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
SparkR:::callJMethod(hConf, "set", "fs.s3n.awsAccessKeyId", "myAccessKey")
SparkR:::callJMethod(hConf, "set", "fs.s3n.awsSecretAccessKey", "mySecretKey")

# Slight adaptation to read.orc function
sparkSession <- SparkR:::getSparkSession()
options <- SparkR:::varargsToStrEnv()
# Not required: path <- normalizePath(path)
read <- SparkR:::callJMethod(sparkSession, "read")
read <- SparkR:::callJMethod(read, "options", options)
sdf <- SparkR:::handledCallJMethod(read, "orc", path)
temp <- SparkR:::dataFrame(sdf)

# Read first lines
head(temp)
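
A note on the error in the question: "No FileSystem for scheme: https" suggests that the path handed to Spark was an https:// URL, and Hadoop has no file system registered for that scheme. The path has to use a Hadoop S3 scheme such as s3n://, as in the code above. The question does not show the actual path, so the following is only a minimal sketch, assuming the URL was in the virtual-hosted style (https://<bucket>.s3.amazonaws.com/<key>). The helper name to_s3_uri is hypothetical and is not part of SparkR or aws.s3.

```r
# Hypothetical helper (not part of SparkR or aws.s3): convert a
# virtual-hosted-style S3 HTTPS URL into a Hadoop-style URI that
# Spark can resolve once the matching file system is configured.
to_s3_uri <- function(url, scheme = "s3n") {
  # Capture group 1 is the bucket, group 2 is the object key.
  # Assumption: host looks like <bucket>.s3.amazonaws.com or
  # <bucket>.s3.<region>.amazonaws.com / <bucket>.s3-<region>.amazonaws.com.
  pattern <- "^https?://([^/]+)\\.s3[.-][^/]*amazonaws\\.com/(.+)$"
  if (!grepl(pattern, url)) {
    stop("Not a recognised virtual-hosted-style S3 URL: ", url)
  }
  bucket <- sub(pattern, "\\1", url)
  key <- sub(pattern, "\\2", url)
  paste0(scheme, "://", bucket, "/", key)
}

# Placeholder bucket and file names, matching the ones used above
path <- to_s3_uri("https://bucketname.s3.amazonaws.com/filename.orc")
print(path)
```

The resulting path can then be passed to the adapted read.orc code in place of the hard-coded s3n:// string. Path-style URLs (https://s3.amazonaws.com/<bucket>/<key>) are not handled by this sketch.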


Source: https://stackoverflow.com/questions/42955469/r-read-orc-file-from-s3
