I saw a few discussions on this but couldn\'t quite understand the right solution: I want to load a couple hundred files from S3 into an RDD. Here is how I\'m doing it now:
I guess if you try to parallelize while reading aws will be utilizing executor and definitely improve the performance
val bucketName=xxx
val keyname=xxx
val df=sc.parallelize(new AmazonS3Client(new BasicAWSCredentials("awsccessKeyId", "SecretKey")).listObjects(request).getObjectSummaries.map(_.getKey).toList)
.flatMap { key => Source.fromInputStream(s3.getObject(bucketName, keyname).getObjectContent: InputStream).getLines }