Read all files in a nested folder in Spark

Asked by 无人共我 on 2020-12-30 03:15

If we have a folder folder containing only .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if the folder itself contains nested subfolders that hold the .txt files? How can I read all of them in Spark?

4 Answers
  • 2020-12-30 03:32

    Spark 3.0 provides the recursiveFileLookup option to load files recursively from nested subfolders.

    val df = sparkSession.read
      .option("recursiveFileLookup", "true")
      .option("header", "true")
      .csv("src/main/resources/nested")
    

    This recursively loads the files from src/main/resources/nested and its subfolders.
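
    For the original .txt use case, the same option works with the text reader in PySpark. A minimal sketch, assuming Spark 3.0+ and a hypothetical path folder; pathGlobFilter is added here only to keep non-.txt files out:

    # Minimal PySpark sketch (assumes Spark 3.0+); "folder" is a hypothetical path.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("recursive-read").getOrCreate()

    # recursiveFileLookup descends into every subdirectory (disabling partition
    # discovery); pathGlobFilter keeps only files matching the glob.
    df = (spark.read
          .option("recursiveFileLookup", "true")
          .option("pathGlobFilter", "*.txt")
          .text("folder"))

    df.show(truncate=False)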

  • 2020-12-30 03:46

    If the directory structure is regular, let's say something like this:

    folder
    ├── a
    │   ├── a
    │   │   └── aa.txt
    │   └── b
    │       └── ab.txt
    └── b
        ├── a
        │   └── ba.txt
        └── b
            └── bb.txt
    

    you can use the * wildcard for each level of nesting, as shown below:

    >>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
    
    [u'file:/folder/a/a/aa.txt',
     u'file:/folder/a/b/ab.txt',
     u'file:/folder/b/a/ba.txt',
     u'file:/folder/b/b/bb.txt']
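
    If you want the file contents rather than just the names, the same glob pattern can be passed to sc.textFile. A minimal sketch, assuming the same layout as above:

    # Each line of every matched .txt file becomes one element of the RDD.
    lines = sc.textFile("/folder/*/*/*.txt")
    print(lines.count())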
    
  • 2020-12-30 03:47

    If you want to read only the files whose names start with "a", you can use

    sc.wholeTextFiles("/folder/a*/*/*.txt") or sc.wholeTextFiles("/folder/a*/a*/*.txt")
    

    as well. We can use * as a wildcard.

  • 2020-12-30 03:48

    sc.wholeTextFiles("/directory/201910*/part-*.lzo"), as used above, gives you the matched file names rather than the file contents as lines.

    If you want to load the contents of all matched files in a directory, you should use

    sc.textFile("/directory/201910*/part-*.lzo")
    

    and enable recursive directory reading:

    sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
    

    Tip: Scala differs from Python here; in Scala, use the following instead:

    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
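
    Putting the two steps together in PySpark (using the directory glob from the example above):

    # Enable recursive listing on the underlying Hadoop input format,
    # then read the contents of every matched file into an RDD of lines.
    sc._jsc.hadoopConfiguration().set(
        "mapreduce.input.fileinputformat.input.dir.recursive", "true")
    rdd = sc.textFile("/directory/201910*/part-*.lzo")
    print(rdd.count())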
    