Reading multiple files from S3 in parallel (Spark, Java)

天涯浪人 2021-02-04 10:52

I saw a few discussions on this but couldn't quite understand the right solution: I want to load a couple hundred files from S3 into an RDD. Here is how I'm doing it now:

3 answers

情歌与酒 (original poster)

2021-02-04 11:32
The underlying problem is that listing objects in S3 is really slow, and the way it is made to look like a directory tree kills performance whenever something does a treewalk (as wildcard pattern matching of paths does).

The code in the post does an all-children listing, which delivers far better performance. It is essentially what ships with Hadoop 2.8 as the s3a listFiles(path, recursive) call; see HADOOP-13208.
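To make the difference concrete, here is a rough sketch that counts LIST requests under a deliberately simplified model: a treewalk issues one delimiter-based LIST per "directory", while a flat listing issues one LIST per page of up to 1000 keys. The class name, method names, and the key layout are hypothetical, and nothing here talks to S3; it only simulates the request counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ListingCost {
    // S3 returns at most 1000 keys per LIST response.
    static final int PAGE_SIZE = 1000;

    // Flat listing: one request per page of keys under the prefix.
    static int flatRequests(List<String> keys, String prefix) {
        long n = keys.stream().filter(k -> k.startsWith(prefix)).count();
        return (int) Math.max(1, (n + PAGE_SIZE - 1) / PAGE_SIZE);
    }

    // Treewalk (simplified): one request for this "directory", then recurse
    // into every child prefix found one level down. Paging within a single
    // directory is ignored in this model.
    static int treewalkRequests(List<String> keys, String prefix) {
        Set<String> children = new TreeSet<>();
        for (String k : keys) {
            if (!k.startsWith(prefix)) continue;
            int slash = k.indexOf('/', prefix.length());
            if (slash >= 0) children.add(k.substring(0, slash + 1));
        }
        int requests = 1;
        for (String child : children) requests += treewalkRequests(keys, child);
        return requests;
    }

    public static void main(String[] args) {
        // Hypothetical layout: one file per hour for 30 days (720 keys).
        List<String> keys = new ArrayList<>();
        for (int day = 1; day <= 30; day++)
            for (int hour = 0; hour < 24; hour++)
                keys.add(String.format("logs/2021/01/%02d/%02d/part-0000.txt", day, hour));

        System.out.println(flatRequests(keys, "logs/"));     // prints 1
        System.out.println(treewalkRequests(keys, "logs/")); // prints 753
    }
}
```

With this layout the flat listing fits in a single page, whereas the treewalk pays one request for every directory level it visits.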

After getting that listing, you have the object paths as strings, which you can map to s3a/s3n paths for Spark to handle as text file inputs and then apply work to:
```
// Turn each key into a full s3a:// path and join them with commas.
val files = keys.map(key => s"s3a://$bucket/$key").mkString(",")
sc.textFile(files).map(...)
```
And as requested, here's the Java code used.
```
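// Build fully qualified s3a:// paths from the listed object keys.
String prefix = "s3a://" + properties.get("s3.source.bucket") + "/";
objectListing.getObjectSummaries().forEach(summary -> keys.add(prefix + summary.getKey()));
// repeat while objectListing is truncated
// textFile accepts a comma-separated list of paths.
JavaRDD<String> events = sc.textFile(String.join(",", keys));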
```
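The "repeat while truncated" step is left as a comment in the snippet, so here is a minimal, self-contained sketch of what that loop might look like. `Page` and `fetchPage` are hypothetical stand-ins for a paged S3 listing, not AWS SDK types; with the v1 SDK the corresponding calls would be `ObjectListing.isTruncated()` and `AmazonS3.listNextBatchOfObjects(...)`. The bucket name and keys are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class PagedKeyCollector {
    // Hypothetical stand-in for one page of a listing result.
    static final class Page {
        final List<String> keys;
        final boolean truncated; // true if more pages remain
        final int nextOffset;    // where the next page starts

        Page(List<String> keys, boolean truncated, int nextOffset) {
            this.keys = keys;
            this.truncated = truncated;
            this.nextOffset = nextOffset;
        }
    }

    // Hypothetical lister: returns at most pageSize keys starting at offset.
    static Page fetchPage(List<String> allKeys, int offset, int pageSize) {
        int end = Math.min(offset + pageSize, allKeys.size());
        return new Page(allKeys.subList(offset, end), end < allKeys.size(), end);
    }

    // Walks every page and returns one comma-separated string of s3a:// paths,
    // which is the form sc.textFile(...) is given in the snippet above.
    static String collectPaths(List<String> allKeys, String bucket, int pageSize) {
        String prefix = "s3a://" + bucket + "/";
        List<String> paths = new ArrayList<>();
        Page page = fetchPage(allKeys, 0, pageSize);
        while (true) {
            for (String key : page.keys) paths.add(prefix + key);
            if (!page.truncated) break; // last page reached
            page = fetchPage(allKeys, page.nextOffset, pageSize);
        }
        return String.join(",", paths);
    }

    public static void main(String[] args) {
        List<String> keys = List.of("logs/a.txt", "logs/b.txt", "logs/c.txt");
        // Page size 2 forces a second page for the third key.
        System.out.println(collectPaths(keys, "my-bucket", 2));
        // prints s3a://my-bucket/logs/a.txt,s3a://my-bucket/logs/b.txt,s3a://my-bucket/logs/c.txt
    }
}
```

The important part is that keys from every page are accumulated before the join; stopping after the first page would silently drop any files beyond the first batch.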
Note that I switched s3n to s3a because, provided you have the hadoop-aws and AWS SDK JARs on your classpath, the s3a connector is the one you should be using. It's better, and it's the one that gets maintained and tested against Spark workloads by people (me). See "The history of Hadoop's S3 connectors".