Spark History Server on S3A FileSystem: ClassNotFoundException

隐瞒了意图╮ 2021-02-06 12:03

Spark can use the Hadoop S3A file system org.apache.hadoop.fs.s3a.S3AFileSystem. By adding the following to conf/spark-defaults.conf, I can get Spark …

3 Answers
  • 2021-02-06 12:12

    I added the following JARs to my SPARK_HOME/jars directory and it works great:

    • hadoop-aws-*.jar (the version must match the hadoop-common JAR you have; see the sketch below)
    • aws-java-sdk-s3-*.jar (choose the one compatible with your hadoop-aws JAR)
    • aws-java-sdk-*.jar (same version as the one above)
    • aws-java-sdk-core-*.jar (same version as the one above)
    • aws-java-sdk-dynamodb-*.jar (same version as the one above; frankly, I'm not sure why it is needed, but it doesn't work for me without this JAR)
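
    If you're unsure which versions match, one way to check is against the Hadoop JARs already shipped with Spark. A minimal sketch (the versions shown are only examples, not a recommendation):

    # Find the Hadoop version bundled with your Spark distribution ...
    ls $SPARK_HOME/jars/hadoop-common-*.jar
    # ... and drop in a hadoop-aws JAR of the same version, e.g.:
    cp hadoop-aws-2.7.3.jar $SPARK_HOME/jars/
    # The matching AWS SDK version is listed in that hadoop-aws release's pom.xml.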

    Edit:

    My spark-defaults.conf has the following three parameters set:

    spark.eventLog.enabled           true
    spark.eventLog.dir               s3a://bucket_name/folder_name
    spark.history.fs.logDirectory    s3a://bucket_name/folder_name
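
    With those JARs and properties in place, you can start the history server and check that it reads the S3A log directory (a sketch; 18080 is the default UI port, the hostname is a placeholder):

    $SPARK_HOME/sbin/start-history-server.sh
    # then open http://<history-server-host>:18080 and look for your applications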
    
  • 2021-02-06 12:25

    On EMR (release emr-5.16.0):

    I've added the following to my cluster bootstrap:

    sudo cp /usr/share/aws/aws-java-sdk/aws-java-sdk-core-*.jar /usr/lib/spark/jars/
    sudo cp /usr/share/aws/aws-java-sdk/aws-java-sdk-s3-*.jar /usr/lib/spark/jars/
    sudo cp /usr/lib/hadoop/hadoop-aws.jar /usr/lib/spark/jars/
    

    Then in the config of the cluster:

            {
              'Classification': 'spark-defaults',
              'Properties': {
                'spark.eventLog.dir': 's3a://some/path',
                'spark.history.fs.logDirectory': 's3a://some/path',
                'spark.eventLog.enabled': 'true'
              }
            }
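
    (The snippet above looks like boto3/Python style; in a JSON configurations file the same keys take double quotes.) The classification can also be passed when creating the cluster from the AWS CLI, for example (a sketch; the file name is a placeholder and the other required options are omitted):

    aws emr create-cluster \
        --release-label emr-5.16.0 \
        --applications Name=Spark \
        --configurations file://spark-history-s3a.json
        # ... plus your usual instance, role and network options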
    

    If you're going to test this, first stop the spark history server:

    sudo stop spark-history-server
    

    Make the config changes:

    sudo vim /etc/spark/conf.dist/spark-defaults.conf
    

    Then copy the JARs as above.

    Then restart the spark history server:

    sudo /usr/lib/spark/sbin/start-history-server.sh
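
    To confirm it came back up, you can hit the REST API on the default port (a sketch; the log path is what I'd expect on EMR and may differ on your release):

    curl -s http://localhost:18080/api/v1/applications | head
    # if it didn't start, check the history server log, e.g.:
    sudo less /var/log/spark/spark-history-server.out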
    

    Thanks for the answers above!

  • 2021-02-06 12:30

    Did some more digging and figured it out. Here's what was wrong:

    1. The JARs necessary for S3A can be added to $SPARK_HOME/jars (as described in SPARK-15965)

    2. The line

      spark.history.provider     org.apache.hadoop.fs.s3a.S3AFileSystem
      

      in $SPARK_HOME/conf/spark-defaults.conf will cause

      Exception in thread "main" java.lang.NoSuchMethodException: org.apache.hadoop.fs.s3a.S3AFileSystem.<init>(org.apache.spark.SparkConf)
      

      exception. That line can be safely removed as suggested in this answer.
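
      For context, spark.history.provider expects a history provider class, not a FileSystem class; the default provider already reads from any Hadoop-compatible file system, so setting it explicitly to the equivalent of the default would look like this:

      spark.history.provider     org.apache.spark.deploy.history.FsHistoryProvider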

    To summarize:

    I added the following JARs to $SPARK_HOME/jars:

    • jets3t-0.9.3.jar (may already be present in your pre-built Spark binaries; any 0.9.x version seems to work)
    • guava-14.0.1.jar (may already be present in your pre-built Spark binaries; any 14.0.x version seems to work)
    • aws-java-sdk-1.7.4.jar (must be 1.7.4)
    • hadoop-aws.jar (version 2.7.3; should probably match the Hadoop version of your Spark build; see the download sketch below)
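
    If any of these are missing from your distribution, they can be fetched from Maven Central, for example (a sketch; adjust versions to your Hadoop build):

    cd $SPARK_HOME/jars
    wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
    wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar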

    I also added this line to $SPARK_HOME/conf/spark-defaults.conf:

    spark.history.fs.logDirectory     s3a://spark-logs-test/
    

    You'll need some other configuration to enable logging in the first place, but once the S3 bucket has the logs, this is the only configuration that is needed for the History Server.
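
    For reference, that other configuration is the event logging itself, set on the applications that produce the logs, e.g. (a sketch; the bucket matches the log directory above):

    spark.eventLog.enabled            true
    spark.eventLog.dir                s3a://spark-logs-test/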
