How to pass a list of paths to spark.read.load?

梦如初夏 asked on 2021-01-06 09:31

I can load multiple files at once by passing multiple paths to the load method, e.g.

spark.read
  .format("com.databricks.spark.avro")
  .load(
    "/path/to/file1.avro",
    "/path/to/file2.avro")

What I actually have, though, is a list of paths. How can I pass that list to load?
4 Answers
  • 2021-01-06 10:02

    You just need the splat operator (: _*) applied to the paths list:

    spark.read.format("com.databricks.spark.avro").load(paths: _*)
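
    A complete, minimal sketch of this (the paths and the Avro input are placeholders, and spark is assumed to be a SparkSession):

    // a plain Scala List of input paths (illustrative values only)
    val paths = List("/data/part1.avro", "/data/part2.avro")

    // `: _*` expands the List into the varargs that load expects
    val df = spark.read
      .format("com.databricks.spark.avro")
      .load(paths: _*)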
    
  • 2021-01-06 10:15

    The load method takes a varargs argument rather than a List, so you have to explicitly convert the list to varargs by adding : _* in the load call:

    spark.read.format("com.databricks.spark.avro").load(paths: _*)
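
    For reference, DataFrameReader declares the method roughly as follows (a paraphrase of the Spark API, not a verbatim quote):

    // simplified signature sketch: varargs of path strings
    def load(paths: String*): DataFrame

    which is why a Scala List cannot be passed to it directly.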
    
  • 2021-01-06 10:16

    Alternatively, you can use the paths option. From the Spark source code (ResolvedDataSource.scala):

    val paths = {
      if (caseInsensitiveOptions.contains("paths") &&
        caseInsensitiveOptions.contains("path")) {
        throw new AnalysisException(s"Both path and paths options are present.")
      }
      caseInsensitiveOptions.get("paths")
        .map(_.split("(?<!\\\\),").map(StringUtils.unEscapeString(_, '\\', ',')))
        .getOrElse(Array(caseInsensitiveOptions("path")))
        .flatMap { pathString =>
          val hdfsPath = new Path(pathString)
          val fs = hdfsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
          val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
          SparkHadoopUtil.get.globPathIfNecessary(qualified).map(_.toString)
        }
    }
    

    So a simple:

    sqlContext.read.option("paths", paths.mkString(",")).load()

    will do the trick. Note that, as the source above shows, the "paths" value is split on unescaped commas, so any comma inside a path has to be escaped with a backslash.
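
    A minimal end-to-end sketch of this variant (the paths are placeholders, and sqlContext follows the Spark 1.x API quoted above):

    // build the comma-separated value that the "paths" option expects
    val paths = List("/data/part1.avro", "/data/part2.avro")
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .option("paths", paths.mkString(","))
      .load()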

  • 2021-01-06 10:17

    You don't need to create a list at all; you can pass a glob pattern instead:

    val df = spark.read.format("com.databricks.spark.avro").load("/data/src/entity1/*")
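
    If you do start from a list, one option (assuming Hadoop-style glob support, which is what Spark uses to resolve load paths, per the source quoted in the answer above) is to fold the list into a single brace glob:

    // brace alternation in a Hadoop glob matches any of the listed entries
    val days = List("2018-01-01", "2018-01-02")
    val df = spark.read
      .format("com.databricks.spark.avro")
      .load(s"/data/src/entity1/{${days.mkString(",")}}/*")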
    