How to list file keys in Databricks dbfs **without** dbutils

Submitted 2021-01-07 01:21:08

Question


Apparently dbutils cannot be used in command-line spark-submit jobs; you have to use JAR jobs for that. However, I MUST use spark-submit style jobs due to other requirements, and I still need to list and iterate over file keys in DBFS to decide which files to use as input to a process...

Using Scala, which library in Spark or Hadoop can I use to retrieve a list of dbfs:/ file keys matching a particular pattern?

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// List the names of the plain files directly under inputDir.
def ls(sparkSession: SparkSession, inputDir: String): Seq[String] = {
  println(s"FileUtils.ls path: $inputDir")
  val path = new Path(inputDir)
  // Resolve the FileSystem backing this path (DBFS on Databricks).
  val fs = path.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
  val fileStatuses = fs.listStatus(path)
  fileStatuses.filter(_.isFile).map(_.getPath).map(_.getName).toSeq
}

Using the above, if I pass in a partial key prefix like dbfs:/mnt/path/to/folder while the following keys are present in that "folder":

  • /mnt/path/to/folder/file1.csv
  • /mnt/path/to/folder/file2.csv

I get "dbfs:/mnt/path/to/folder is not a directory" when it hits val path = new Path(inputDir).
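A minimal sketch of the kind of pattern-based listing I'm after, assuming Hadoop's FileSystem.globStatus works against the DBFS mount and that the *.csv pattern below is only illustrative:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Sketch: list file keys under a DBFS prefix that match a glob pattern.
def lsGlob(sparkSession: SparkSession, pattern: String): Seq[String] = {
  val path = new Path(pattern)
  val fs = path.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
  // globStatus expands wildcards such as *.csv into matching FileStatus entries;
  // it can return null when nothing matches, hence the Option wrapper.
  Option(fs.globStatus(path)).getOrElse(Array.empty)
    .filter(_.isFile)
    .map(_.getPath.toString)
    .toSeq
}

// Illustrative call; the mount path is hypothetical.
// lsGlob(spark, "dbfs:/mnt/path/to/folder/*.csv")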


Answer 1:


You need to use the SparkSession to do it.

Here's how we did it:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Resolve the Hadoop FileSystem from the active SparkSession's configuration.
def getFileSystem(sparkSession: SparkSession): FileSystem =
    FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)

// List the names of the entries directly under dir.
def listContents(sparkSession: SparkSession, dir: String): Seq[String] = {
  getFileSystem(sparkSession).listStatus(new Path(dir)).toSeq.map(_.getPath).map(_.getName)
}
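
A hedged usage sketch, assuming the getFileSystem and listContents helpers above are in scope and that the spark-submit job obtains its own SparkSession; the mount path is just the example from the question:

import org.apache.spark.sql.SparkSession

// Reuse (or create) the session available to the spark-submit job.
val spark = SparkSession.builder().getOrCreate()

// Example mount path from the question; adjust to your own.
val names = listContents(spark, "dbfs:/mnt/path/to/folder")

// e.g. keep only the CSV keys before deciding which files feed the process.
names.filter(_.endsWith(".csv")).foreach(println)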


Source: https://stackoverflow.com/questions/64757059/how-to-list-file-keys-in-databricks-dbfs-without-dbutils
