Reading multiple files from S3 in parallel (Spark, Java)

Asked by 天涯浪人 on 2021-02-04 10:52 · 3 answers · 1447 views · unresolved

I saw a few discussions on this but couldn't quite understand the right solution: I want to load a couple hundred files from S3 into an RDD. Here is how I'm doing it now:
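(The original snippet is missing from this page. What follows is a hypothetical sketch of the typical sequential version such a setup starts from, assuming an AWS SDK v1 client, a placeholder bucket "my-bucket", placeholder credentials, and an existing SparkContext `sc`; it is an illustration, not the asker's actual code.)

    import java.io.InputStream
    import scala.io.Source
    import scala.collection.JavaConverters._
    import com.amazonaws.auth.BasicAWSCredentials
    import com.amazonaws.services.s3.AmazonS3Client

    // Hypothetical reconstruction of a sequential read: every object is
    // fetched one after another on the driver before Spark sees any data.
    val s3 = new AmazonS3Client(new BasicAWSCredentials("accessKeyId", "secretKey"))
    val keys = s3.listObjects("my-bucket").getObjectSummaries.asScala.map(_.getKey)
    val lines = keys.flatMap { key =>
      Source.fromInputStream(s3.getObject("my-bucket", key).getObjectContent: InputStream).getLines
    }
    // Only now is the data handed to Spark, so nothing was read in parallel.
    val rdd = sc.parallelize(lines)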

3 Answers
  •  故里飘歌
    2021-02-04 11:35

    I guess if you parallelize over the list of keys, the S3 reads will run on the executors, which should definitely improve performance:

    import java.io.InputStream
    import scala.io.Source
    import scala.collection.JavaConverters._
    import com.amazonaws.auth.BasicAWSCredentials
    import com.amazonaws.services.s3.AmazonS3Client

    val bucketName = "xxx" // placeholder
    // List the keys once on the driver (listObjects returns up to 1000
    // keys per page, enough for a few hundred files).
    val keys = new AmazonS3Client(new BasicAWSCredentials("accessKeyId", "secretKey"))
      .listObjects(bucketName).getObjectSummaries.asScala.map(_.getKey).toList
    // Fetch the objects on the executors; the client is built inside the
    // closure because AmazonS3Client is not serializable.
    val rdd = sc.parallelize(keys).flatMap { key =>
      val s3 = new AmazonS3Client(new BasicAWSCredentials("accessKeyId", "secretKey"))
      Source.fromInputStream(s3.getObject(bucketName, key).getObjectContent: InputStream).getLines
    }
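
    As a side note (my own variant, not part of this answer): the closure above builds a new client for every key. A minimal `mapPartitions` sketch that reuses one client per partition, assuming the same `bucketName` and `keys` as above:

    val rdd = sc.parallelize(keys).mapPartitions { iter =>
      // One client per partition instead of one per key.
      val s3 = new AmazonS3Client(new BasicAWSCredentials("accessKeyId", "secretKey"))
      iter.flatMap { key =>
        Source.fromInputStream(s3.getObject(bucketName, key).getObjectContent: InputStream).getLines
      }
    }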
    
