Question
I am trying to read a CSV file into SparkR (running Spark 2.0.0) and to experiment with the newly added features.
Using RStudio here.
I am getting an error while "reading" the source file.
My code:
Sys.setenv(SPARK_HOME = "C:/spark-2.0.0-bin-hadoop2.6")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", appName = "SparkR")
df <- loadDF("F:/file.csv", "csv", header = "true")
I get an error at the loadDF function.
The error:
loadDF("F:/file.csv", "csv", header = "true")
Error in invokeJava(isStatic = TRUE, className, methodName, ...) : java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
  at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
  at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
  at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
  at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
  at org.apache.spark.sql.hive.HiveSharedSt
Am I missing some specification here? Any pointers to proceed would be appreciated.
Answer 1:
I had the same problem, and the same error occurs even with this simple code:
createDataFrame(iris)
Maybe something is wrong with the installation?
UPDATE: Yes! I found a solution.
The solution is based on this question: Apache Spark MLlib with DataFrame API gives java.net.URISyntaxException when createDataFrame() or read().csv(...)
For R, just start the session with this code:
sparkR.session(sparkConfig = list(spark.sql.warehouse.dir = "file:///C:/temp"))
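For completeness, a minimal sketch of the full session from the question with this workaround applied (SPARK_HOME and the file path are taken from the question; C:/temp is assumed to be any writable local directory):

Sys.setenv(SPARK_HOME = "C:/spark-2.0.0-bin-hadoop2.6")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Point the Hive warehouse at a writable local path so the metastore
# client can be created on Windows
sparkR.session(master = "local[*]", appName = "SparkR",
               sparkConfig = list(spark.sql.warehouse.dir = "file:///C:/temp"))
df <- loadDF("F:/file.csv", "csv", header = "true")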
Answer 2:
Maybe you should try reading the CSV with this library:
https://github.com/databricks/spark-csv
Sys.setenv(SPARK_HOME = "C:/spark-2.0.0-bin-hadoop2.6")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Set the --packages argument before the session starts, so the JVM
# picks up the spark-csv package
Sys.setenv('SPARKR_SUBMIT_ARGS' = '"--packages" "com.databricks:spark-csv_2.10:1.4.0" "sparkr-shell"')
sparkR.session(master = "local[*]", appName = "SparkR")
df <- read.df("cars.csv", source = "com.databricks.spark.csv", inferSchema = "true")
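Note that Spark 2.0 also ships with a built-in csv data source, so on 2.0.0 the external package is not strictly required; a minimal sketch (same hypothetical cars.csv path as above):

# Built-in csv source in Spark 2.0; header and inferSchema are read options
df <- read.df("cars.csv", source = "csv", header = "true", inferSchema = "true")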
Source: https://stackoverflow.com/questions/38659074/spark-2-0-0-sparkr-csv-import