Question
We will be hosting an EMR cluster (with spot instances) on AWS, running on top of an S3 bucket. Data will be stored in this bucket in ORC format. However, we also want to use R, as some kind of sandbox environment, to read the same data.
I have the aws.s3 package (cloudyr) running correctly: I can read CSV files without a problem, but it does not seem to let me convert the ORC files into something readable.
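For context, a minimal sketch of the working CSV route via aws.s3 (bucket, object, and credential values are placeholders); s3read_using() streams an S3 object through an arbitrary reader function:
library(aws.s3)
# Placeholder credentials, region, and object key
Sys.setenv("AWS_ACCESS_KEY_ID" = "myKeyID",
           "AWS_SECRET_ACCESS_KEY" = "myKey",
           "AWS_DEFAULT_REGION" = "myRegion")
# Works for CSV: pass read.csv as the reader function
df <- s3read_using(read.csv, object = "s3://bucketname/filename.csv")
# Plain R has no equivalent reader function for ORC, hence the problem below.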
The two options I found online were:
- SparkR
- dataconnector (Vertica)
Since installing dataconnector on a Windows machine was problematic, I installed SparkR, and I am now able to read a local ORC file (R running locally on my machine, ORC file also on my machine). However, if I try read.orc, it by default normalizes my path to a local path. Digging into the source code, I ran the following:
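# Reproduce the internals of SparkR::read.orc, skipping its normalizePath() step: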
sparkSession <- SparkR:::getSparkSession()
options <- SparkR:::varargsToStrEnv()
read <- SparkR:::callJMethod(sparkSession, "read")
read <- SparkR:::callJMethod(read, "options", options)
sdf <- SparkR:::handledCallJMethod(read, "orc", my_path)
But I obtained the following error:
Error: Error in orc : java.io.IOException: No FileSystem for scheme: https
Could someone help me with this problem, or point me to an alternative way to load ORC files from S3?
Answer 1:
Edited answer: you can now read directly from S3 instead of first downloading the file and reading it from the local file system.
At the request of mrjoseph, here is a possible solution via SparkR (which I did not want to use in the first place).
# Set the System environment variable to where Spark is installed
Sys.setenv(SPARK_HOME="pathToSpark")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "org.apache.hadoop:hadoop-aws:2.7.1" "sparkr-shell"')
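# --packages pulls org.apache.hadoop:hadoop-aws onto Spark's classpath; this
# artifact supplies the s3n/s3a FileSystem implementations used further below.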
# Set the library path to include path to SparkR
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))
# Set system environments to be able to load from S3
Sys.setenv("AWS_ACCESS_KEY_ID" = "myKeyID", "AWS_SECRET_ACCESS_KEY" = "myKey", "AWS_DEFAULT_REGION" = "myRegion")
# load required packages
library(aws.s3)
library(SparkR)
## Create a spark context and a sql context
sc<-sparkR.init(master = "local")
sqlContext<-sparkRSQL.init(sc)
# Set path to file
path <- "s3n://bucketname/filename.orc"
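# The s3n:// scheme in this path must match the fs.s3n.* settings configured next.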
# Set hadoop configuration
hConf <- SparkR:::callJMethod(sc, "hadoopConfiguration")
SparkR:::callJMethod(hConf, "set", "fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
SparkR:::callJMethod(hConf, "set", "fs.s3n.awsAccessKeyId", "myAccessKey")
SparkR:::callJMethod(hConf, "set", "fs.s3n.awsSecretAccessKey", "mySecretKey")
# Slight adaptation of the read.orc function
sparkSession <- SparkR:::getSparkSession()
options <- SparkR:::varargsToStrEnv()
# Not required: path <- normalizePath(path)
read <- SparkR:::callJMethod(sparkSession, "read")
read <- SparkR:::callJMethod(read, "options", options)
sdf <- SparkR:::handledCallJMethod(read, "orc", path)
temp <- SparkR:::dataFrame(sdf)
# Read first lines
head(temp)
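If the result is small enough to fit in local memory, it can then be pulled into a plain R data.frame; a usage sketch using SparkR's standard collect():
local_df <- collect(temp)
str(local_df)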
Source: https://stackoverflow.com/questions/42955469/r-read-orc-file-from-s3