Question
I am calling saveAsTextFile on an RDD[String], passing the destination as an argument (Scala).
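A minimal sketch of what the call presumably looks like; the object name, the RDD contents, and the hard-coded path are assumptions here (the path is taken from the --prd-output-path argument below):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical reconstruction of the failing write; not the asker's actual code.
object SaveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PRDPricingEngine"))
    val lines: org.apache.spark.rdd.RDD[String] = sc.parallelize(Seq("example line"))
    // saveAsTextFile refuses to write if the target directory already exists
    lines.saveAsTextFile("s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput/")
    sc.stop()
  }
}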
Even after deleting the directory before starting, the process fails with this error. I am running the job on an EMR cluster with the output location on AWS S3. Below is the command used:
spark-submit --deploy-mode cluster --class com.hotwire.hda.spark.prd.pricingengine.PRDPricingEngine --conf spark.yarn.submit.waitAppCompletion=true --num-executors 21 --executor-cores 4 --executor-memory 20g --driver-memory 8g --driver-cores 4 s3://bi-aws-users/sbatheja/hotel-shopper-0.0.1-SNAPSHOT-jar-with-dependencies.jar -d 3 -p 100 --search-bucket s3a://hda-prod-business.hotwire.hotel.search --prd-output-path s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput/
Log:
16/07/07 11:27:47 INFO BlockManagerMaster: BlockManagerMaster stopped
16/07/07 11:27:47 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/07/07 11:27:47 INFO SparkContext: Successfully stopped SparkContext
16/07/07 11:27:47 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput already exists)
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/07/07 11:27:47 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/07/07 11:27:47 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1467889642439_0001
16/07/07 11:27:47 INFO ShutdownHookManager: Shutdown hook called
16/07/07 11:27:47 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1467889642439_0001/spark-7f836950-a040-4216-9308-2bb4565c5649
It creates a "_temporary" directory at that location, which contains empty part files.
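The question does not show how the directory was deleted beforehand; a minimal sketch of one common way to do it from the driver, assuming the Hadoop FileSystem API and the same output path:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CleanOutputDir {
  def main(args: Array[String]): Unit = {
    val outputPath = "s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput/"
    // Resolve the FileSystem for the s3a URI and remove the directory recursively,
    // including any leftover _temporary subfolder from a failed attempt.
    val fs = FileSystem.get(new URI(outputPath), new Configuration())
    val dir = new Path(outputPath)
    if (fs.exists(dir)) fs.delete(dir, true)
  }
}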
Answer 1:
In short: make sure the Scala versions of spark-core and scala-library are consistent.
I encountered the same problem. When saving a file to HDFS, it threw the exception org.apache.hadoop.mapred.FileAlreadyExistsException. I then checked the HDFS directory and found an empty temporary folder: TARGET_DIR/_temporary/0.
Submit the job with verbose output (./spark-submit --verbose) and then look at the full context and logs; some other error must have caused this.
While my job was in the RUNNING state, this first error was thrown:
17/04/23 11:47:02 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
The job is then retried and re-executed. On the retry it finds that the output directory was already created by the first attempt, and so it also throws the "directory already exists" error.
This confirmed that the first error was a version compatibility issue. The Spark version was 2.1.0, so the corresponding spark-core artifact was built for Scala 2.11, but the scala-library dependency was at Scala 2.12.xx. Once the two Scala versions were made consistent (usually by changing the scala-library version), the first exception was resolved and the job could FINISH normally.
pom.xml example:
<!-- Spark -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
<!-- Scala -->
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.7</version>
</dependency>
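As a follow-up check (not part of the original answer), a tiny Scala snippet can confirm which scala-library actually ends up on the runtime classpath; its version should match the _2.11 suffix of spark-core:

object ScalaVersionCheck {
  def main(args: Array[String]): Unit = {
    // Prints something like "version 2.11.7"; a 2.12.x value here would reproduce the mismatch
    println(scala.util.Properties.versionString)
  }
}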
Source: https://stackoverflow.com/questions/38244860/spark-rdd-method-saveastextfile-throwing-exception-even-after-deleting-the-out