You can load multiple files at once by passing multiple paths to the load method, e.g. (the directory paths here are placeholders):

spark.read
  .format("com.databricks.spark.avro")
  .load("/data/src/entity1/part1", "/data/src/entity1/part2")
If the paths are in a list, you just need the splat operator (: _*) to expand the list, as in:

spark.read.format("com.databricks.spark.avro").load(paths: _*)
The load method accepts a varargs argument, not a list type, so you have to explicitly convert the list to varargs by adding : _* in the load call:
spark.read.format("com.databricks.spark.avro").load(paths: _*)
Alternatively, you can use the paths option. From the Spark source code (ResolvedDataSource.scala):
val paths = {
  if (caseInsensitiveOptions.contains("paths") &&
      caseInsensitiveOptions.contains("path")) {
    throw new AnalysisException(s"Both path and paths options are present.")
  }
  caseInsensitiveOptions.get("paths")
    .map(_.split("(?<!\\\\),").map(StringUtils.unEscapeString(_, '\\', ',')))
    .getOrElse(Array(caseInsensitiveOptions("path")))
    .flatMap { pathString =>
      val hdfsPath = new Path(pathString)
      val fs = hdfsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
      val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
      SparkHadoopUtil.get.globPathIfNecessary(qualified).map(_.toString)
    }
}
So a simple

sqlContext.read.option("paths", paths.mkString(",")).load()

will do the trick.
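A short sketch of that approach (the paths are illustrative; note from the split regex in the source above that a literal comma inside a path would need to be escaped with a backslash):

// Illustrative paths joined into the comma-separated "paths" option
val paths = Seq("/data/src/entity1/day1", "/data/src/entity1/day2")
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .option("paths", paths.mkString(","))  // the reader splits this on unescaped commas
  .load()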
You don't need to create a list at all; you can pass a glob pattern instead, like below:

val df = spark.read.format("com.databricks.spark.avro").load("/data/src/entity1/*")
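Hadoop-style globs also support alternation and character classes, so the pattern can be more selective than a bare * (the dated subdirectories here are illustrative):

// Load only two specific dated subdirectories via {} alternation
val df2 = spark.read
  .format("com.databricks.spark.avro")
  .load("/data/src/entity1/2018-01-{01,12}")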