Spark can use the Hadoop S3A file system, org.apache.hadoop.fs.s3a.S3AFileSystem. By adding the following to conf/spark-defaults.conf, I can get spark
I added the following JARs to my $SPARK_HOME/jars directory and it works great:
Edit: my spark-defaults.conf has the following three parameters set:
spark.eventLog.enabled : true
spark.eventLog.dir : s3a://bucket_name/folder_name
spark.history.fs.logDirectory : s3a://bucket_name/folder_name
On EMR (release emr-5.16.0), I added the following to my cluster bootstrap:
sudo cp /usr/share/aws/aws-java-sdk/aws-java-sdk-core-*.jar /usr/lib/spark/jars/
sudo cp /usr/share/aws/aws-java-sdk/aws-java-sdk-s3-*.jar /usr/lib/spark/jars/
sudo cp /usr/lib/hadoop/hadoop-aws.jar /usr/lib/spark/jars/
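If the History Server still cannot find the S3A classes after the copy, it may help to confirm that the JARs actually landed in the Spark jars directory. Below is a minimal sketch of such a check. The patterns mirror the cp commands above; the function name missing_s3a_jars is hypothetical, and the directory is passed in so the same check works for /usr/lib/spark/jars or $SPARK_HOME/jars.

```python
import glob
import os

# Patterns taken from the bootstrap commands above.
REQUIRED_JAR_PATTERNS = [
    "aws-java-sdk-core-*.jar",
    "aws-java-sdk-s3-*.jar",
    "hadoop-aws*.jar",
]


def missing_s3a_jars(jars_dir):
    """Return the patterns that match no file in jars_dir (hypothetical helper)."""
    return [
        pattern
        for pattern in REQUIRED_JAR_PATTERNS
        if not glob.glob(os.path.join(jars_dir, pattern))
    ]
```

Calling missing_s3a_jars("/usr/lib/spark/jars") on the master node returns an empty list when a file matches every pattern. It only checks file names, not whether the JAR versions are compatible with each other.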
Then, in the cluster configuration:
{
'Classification': 'spark-defaults',
'Properties': {
'spark.eventLog.dir': 's3a://some/path',
'spark.history.fs.logDirectory': 's3a://some/path',
'spark.eventLog.enabled': 'true'
}
}
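The event log directory (where applications write) and the History Server log directory (where the server reads) point at the same path in the configuration above. A small sketch that builds this classification from a single URI keeps the two from drifting apart. The helper name spark_event_log_configuration is hypothetical, and wrapping the result in a list assumes the cluster configuration is supplied as a list of classification objects.

```python
import json


def spark_event_log_configuration(log_uri):
    """Build the spark-defaults classification with both log settings on one URI."""
    return {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.eventLog.enabled": "true",
            "spark.eventLog.dir": log_uri,
            "spark.history.fs.logDirectory": log_uri,
        },
    }


# Assumption: the configuration is passed as a list of classification objects.
configurations = [spark_event_log_configuration("s3a://some/path")]
print(json.dumps(configurations, indent=2))
```

The printed JSON can be saved to a file, or the configurations list can be passed directly to whatever tool creates the cluster.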
If you're going to test this, first stop the spark history server:
sudo stop spark-history-server
Make the config changes:
sudo vim /etc/spark/conf.dist/spark-defaults.conf
Then copy the JARs as above.
Then restart the spark history server:
sudo /usr/lib/spark/sbin/start-history-server.sh
Thanks for the answers above!
Did some more digging and figured it out. Here's what was wrong:
The JARs necessary for S3A can be added to $SPARK_HOME/jars (as described in SPARK-15965).
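The line
spark.history.provider org.apache.hadoop.fs.s3a.S3AFileSystem
in $SPARK_HOME/conf/spark-defaults.conf causes this exception:
Exception in thread "main" java.lang.NoSuchMethodException: org.apache.hadoop.fs.s3a.S3AFileSystem.<init>(org.apache.spark.SparkConf)
As the message shows, the History Server tries to construct the configured provider class from a SparkConf, and S3AFileSystem is a Hadoop file system, not a history provider, so it has no such constructor. That line can be safely removed, as suggested in this answer.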
To summarize:
I added the following JARs to $SPARK_HOME/jars:
and added this line to $SPARK_HOME/conf/spark-defaults.conf:
spark.history.fs.logDirectory s3a://spark-logs-test/
You'll need some other configuration to enable logging in the first place, but once the S3 bucket has the logs, this is the only configuration that is needed for the History Server.
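The two mistakes discussed in this thread (spark.history.provider pointing at a file system class, and a missing spark.history.fs.logDirectory) can be caught before starting the History Server. The sketch below is one way to do that. Both function names are hypothetical, and the parser is a simplified approximation of the spark-defaults.conf format: it accepts whitespace, "=" or ":" between key and value and does not handle escapes or line continuations.

```python
import re

# key, optional "=" or ":" separator, then the value (simplified format)
_LINE = re.compile(r"^([^\s=:]+)\s*[=:]?\s*(.*)$")


def parse_spark_defaults(text):
    """Parse spark-defaults.conf text into a dict (simplified, hypothetical helper)."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _LINE.match(line)
        if match:
            props[match.group(1)] = match.group(2).strip()
    return props


def history_server_problems(props):
    """Return human-readable problems for the History Server settings."""
    problems = []
    provider = props.get("spark.history.provider", "")
    if provider.endswith("FileSystem"):
        problems.append(
            "spark.history.provider is set to a file system class: " + provider
        )
    if not props.get("spark.history.fs.logDirectory"):
        problems.append("spark.history.fs.logDirectory is not set")
    return problems
```

For example, history_server_problems(parse_spark_defaults(open(path).read())) returns an empty list for a file that contains only the spark.history.fs.logDirectory line shown above, where path is wherever your spark-defaults.conf lives.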