Spark can use the Hadoop S3A filesystem (org.apache.hadoop.fs.s3a.S3AFileSystem). I got this working on EMR emr-5.16.0 by copying the required JARs and adding the settings below to conf/spark-defaults.conf.
First, I added the following to my cluster bootstrap:
sudo cp /usr/share/aws/aws-java-sdk/aws-java-sdk-core-*.jar /usr/lib/spark/jars/
sudo cp /usr/share/aws/aws-java-sdk/aws-java-sdk-s3-*.jar /usr/lib/spark/jars/
sudo cp /usr/lib/hadoop/hadoop-aws.jar /usr/lib/spark/jars/
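The three cp commands can be wrapped in a small bootstrap script. A minimal sketch, assuming the default EMR 5.x paths; the copy_spark_jars function and the stand-in dry-run directories are hypothetical and only there so the script can be checked locally before uploading it as a bootstrap action (on the cluster you would call it with sudo against the real paths):

```shell
#!/bin/bash
set -euo pipefail

# Sketch of the bootstrap step as a reusable function. On the cluster you
# would call it (with sudo) against the EMR 5.x paths:
#   copy_spark_jars /usr/share/aws/aws-java-sdk /usr/lib/hadoop /usr/lib/spark/jars
copy_spark_jars() {
  local sdk_dir="$1" hadoop_dir="$2" spark_jars="$3"
  cp "$sdk_dir"/aws-java-sdk-core-*.jar "$spark_jars"/
  cp "$sdk_dir"/aws-java-sdk-s3-*.jar "$spark_jars"/
  cp "$hadoop_dir"/hadoop-aws.jar "$spark_jars"/
}

# Dry run against stand-in directories (the JAR file names below are fakes),
# so the logic can be verified off-cluster:
demo=$(mktemp -d)
mkdir -p "$demo/sdk" "$demo/hadoop" "$demo/jars"
touch "$demo/sdk/aws-java-sdk-core-1.11.0.jar" \
      "$demo/sdk/aws-java-sdk-s3-1.11.0.jar" \
      "$demo/hadoop/hadoop-aws.jar"
copy_spark_jars "$demo/sdk" "$demo/hadoop" "$demo/jars"
ls "$demo/jars"
```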
Then in the config of the cluster:
{
    'Classification': 'spark-defaults',
    'Properties': {
        'spark.eventLog.dir': 's3a://some/path',
        'spark.history.fs.logDirectory': 's3a://some/path',
        'spark.eventLog.enabled': 'true'
    }
}
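If you launch the cluster with the AWS CLI, the same classification can be put in a JSON file and passed via --configurations. A hedged sketch: the configurations.json file name is my choice, the s3a://some/path values are placeholders from the snippet above, and the create-cluster line is illustrative only (it needs real credentials plus the rest of your cluster options, so it is left as a comment):

```shell
# Write the spark-defaults classification as a JSON configurations file.
cat > configurations.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.eventLog.dir": "s3a://some/path",
      "spark.history.fs.logDirectory": "s3a://some/path",
      "spark.eventLog.enabled": "true"
    }
  }
]
EOF

# Sanity-check the JSON before launching:
python3 -m json.tool configurations.json > /dev/null && echo "JSON OK"

# Then reference it at cluster creation time, e.g.:
# aws emr create-cluster ... --configurations file://configurations.json
```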
If you want to test this on a running cluster, first stop the Spark history server:
sudo stop spark-history-server
Make the config changes:
sudo vim /etc/spark/conf.dist/spark-defaults.conf
Then copy the JARs as above, and restart the Spark history server:
sudo /usr/lib/spark/sbin/start-history-server.sh
Thanks for the answers above!