Read all files in a nested folder in Spark

Asked by 无人共我 on 2020-12-30 03:15

If we have a folder folder containing only .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if the folder itself contains nested subfolders that hold the .txt files? How can I read all of them in Spark?

4 Answers
  • 2020-12-30 03:32

    Spark 3.0 provides the recursiveFileLookup option to load files recursively from nested subfolders.

    val df = sparkSession.read
      .option("recursiveFileLookup", "true")
      .option("header", "true")
      .csv("src/main/resources/nested")
    

    This recursively loads the files from src/main/resources/nested and its subfolders.
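
    For the original .txt use case, the same option works with the text reader in PySpark. A minimal sketch, assuming Spark 3.0+ and a hypothetical path folder; pathGlobFilter is added here only to keep non-.txt files out:

    # Minimal PySpark sketch (assumes Spark 3.0+); "folder" is a hypothetical path.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("recursive-read").getOrCreate()

    # recursiveFileLookup descends into every subdirectory (disabling partition
    # discovery); pathGlobFilter keeps only files matching the glob.
    df = (spark.read
          .option("recursiveFileLookup", "true")
          .option("pathGlobFilter", "*.txt")
          .text("folder"))

    df.show(truncate=False)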

  • 2020-12-30 03:46

    If the directory structure is regular, let's say something like this:

    folder
    ├── a
    │   ├── a
    │   │   └── aa.txt
    │   └── b
    │       └── ab.txt
    └── b
        ├── a
        │   └── ba.txt
        └── b
            └── bb.txt
    

    you can use the * wildcard for each level of nesting, as shown below:

    >>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
    
    [u'file:/folder/a/a/aa.txt',
     u'file:/folder/a/b/ab.txt',
     u'file:/folder/b/a/ba.txt',
     u'file:/folder/b/b/bb.txt']
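
    If you want the file contents rather than just the names, the same glob pattern can be passed to sc.textFile. A minimal sketch, assuming the same layout as above:

    # Each line of every matched .txt file becomes one element of the RDD.
    lines = sc.textFile("/folder/*/*/*.txt")
    print(lines.count())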
    
  • 2020-12-30 03:47

    If you want to read only the files whose names start with "a", you can use

    sc.wholeTextFiles("/folder/a*/*/*.txt") or sc.wholeTextFiles("/folder/a*/a*/*.txt")
    

    as well. We can use * as a wildcard.

  • 2020-12-30 03:48

    sc.wholeTextFiles("/directory/201910*/part-*.lzo"), as used above, gives you the matched file names rather than the file contents as lines.

    If you want to load the contents of all matched files in a directory, you should use

    sc.textFile("/directory/201910*/part-*.lzo")
    

    and enable recursive directory reading:

    sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
    

    Tip: Scala differs from Python here; in Scala, use the following instead:

    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
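
    Putting the two steps together in PySpark (using the directory glob from the example above):

    # Enable recursive listing on the underlying Hadoop input format,
    # then read the contents of every matched file into an RDD of lines.
    sc._jsc.hadoopConfiguration().set(
        "mapreduce.input.fileinputformat.input.dir.recursive", "true")
    rdd = sc.textFile("/directory/201910*/part-*.lzo")
    print(rdd.count())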
    