Reading multiple files from S3 in parallel (Spark, Java)

Asked by 天涯浪人 on 2021-02-04 10:52 · 3 answers · 1447 views · unresolved

I saw a few discussions on this but couldn't quite understand the right solution: I want to load a couple hundred files from S3 into an RDD. Here is how I'm doing it now:
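(The original snippet is missing from this page. What follows is a hypothetical sketch of the typical sequential version such a setup starts from, assuming an AWS SDK v1 client, a placeholder bucket "my-bucket", placeholder credentials, and an existing SparkContext `sc`; it is an illustration, not the asker's actual code.)

    import java.io.InputStream
    import scala.io.Source
    import scala.collection.JavaConverters._
    import com.amazonaws.auth.BasicAWSCredentials
    import com.amazonaws.services.s3.AmazonS3Client

    // Hypothetical reconstruction of a sequential read: every object is
    // fetched one after another on the driver before Spark sees any data.
    val s3 = new AmazonS3Client(new BasicAWSCredentials("accessKeyId", "secretKey"))
    val keys = s3.listObjects("my-bucket").getObjectSummaries.asScala.map(_.getKey)
    val lines = keys.flatMap { key =>
      Source.fromInputStream(s3.getObject("my-bucket", key).getObjectContent: InputStream).getLines
    }
    // Only now is the data handed to Spark, so nothing was read in parallel.
    val rdd = sc.parallelize(lines)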

3 Answers
  •  故里飘歌
    2021-02-04 11:35

    I guess if you parallelize over the list of keys, the S3 reads will run on the executors, which should definitely improve performance:

    import java.io.InputStream
    import scala.io.Source
    import scala.collection.JavaConverters._
    import com.amazonaws.auth.BasicAWSCredentials
    import com.amazonaws.services.s3.AmazonS3Client

    val bucketName = "xxx" // placeholder
    // List the keys once on the driver (listObjects returns up to 1000
    // keys per page, enough for a few hundred files).
    val keys = new AmazonS3Client(new BasicAWSCredentials("accessKeyId", "secretKey"))
      .listObjects(bucketName).getObjectSummaries.asScala.map(_.getKey).toList
    // Fetch the objects on the executors; the client is built inside the
    // closure because AmazonS3Client is not serializable.
    val rdd = sc.parallelize(keys).flatMap { key =>
      val s3 = new AmazonS3Client(new BasicAWSCredentials("accessKeyId", "secretKey"))
      Source.fromInputStream(s3.getObject(bucketName, key).getObjectContent: InputStream).getLines
    }
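
    As a side note (my own variant, not part of this answer): the closure above builds a new client for every key. A minimal `mapPartitions` sketch that reuses one client per partition, assuming the same `bucketName` and `keys` as above:

    val rdd = sc.parallelize(keys).mapPartitions { iter =>
      // One client per partition instead of one per key.
      val s3 = new AmazonS3Client(new BasicAWSCredentials("accessKeyId", "secretKey"))
      iter.flatMap { key =>
        Source.fromInputStream(s3.getObject(bucketName, key).getObjectContent: InputStream).getLines
      }
    }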
    
