I have a Spark EC2 cluster where I am submitting a PySpark program from a Zeppelin notebook. I have loaded hadoop-aws-2.7.3.jar and aws-java-sdk-1.11.179.jar and placed them in /opt/spark/jars.
Add the following to hadoop/etc/hadoop/core-site.xml:
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>***</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>***</value>
</property>
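Alternatively, the same keys can be set from PySpark at runtime instead of editing core-site.xml. A minimal sketch, assuming an existing SparkSession named spark (the key values are placeholders):

# Set the fs.s3 credentials on the underlying Hadoop configuration at
# runtime instead of editing core-site.xml. Assumes an existing
# SparkSession named `spark`; the key values are placeholders.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.awsAccessKeyId", "***")
hadoop_conf.set("fs.s3.awsSecretAccessKey", "***")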
Inside the Hadoop installation directory, find the AWS jars. For a macOS (Homebrew) installation the directory is /usr/local/Cellar/hadoop/:
find . -type f -name "*aws*"
sudo cp hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar hadoop/share/hadoop/common/lib/
sudo cp hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.5.jar hadoop/share/hadoop/common/lib/
If nothing above works, grep for the missing class; there is a high possibility that the jar is corrupted. For example, if you get a "class AmazonServiceException not found" error, grep the jars in the directory where the class should already be present, as shown below.
grep "AmazonServiceException" *.jar
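Since a jar is just a zip archive, the same check can be scripted. A minimal sketch in Python, assuming the jars sit in the current directory:

# Scan every jar in the current directory for the AmazonServiceException
# class, mirroring the grep above. Jars are plain zip archives, so
# zipfile can list their entries directly.
import glob
import zipfile

for jar in glob.glob("*.jar"):
    try:
        with zipfile.ZipFile(jar) as zf:
            if any("AmazonServiceException" in name for name in zf.namelist()):
                print(jar)
    except zipfile.BadZipFile:
        print(f"{jar} is corrupted")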
I was able to fix this by making sure I had the version of the hadoop-aws jar matching the Hadoop version my Spark build was running against, downloading the matching version of aws-java-sdk, and lastly downloading the jets3t dependency. In /opt/spark/jars:
sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.11.30/aws-java-sdk-1.11.30.jar
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
sudo wget https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar
Testing it out
scala> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", [ACCESS KEY ID])
scala> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", [SECRET ACCESS KEY] )
scala> val myRDD = sc.textFile("s3n://adp-px/baby-names.csv")
scala> myRDD.count()
res2: Long = 49
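The equivalent test from PySpark, assuming the same jars are on the classpath (the s3n keys and the adp-px path come from the Scala session above; the key values are placeholders):

# PySpark equivalent of the Scala test above. Assumes an existing
# SparkContext `sc` and the same jars on the classpath.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "<ACCESS KEY ID>")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "<SECRET ACCESS KEY>")
my_rdd = sc.textFile("s3n://adp-px/baby-names.csv")
print(my_rdd.count())  # 49 for this file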
The following worked for me.
My system config:
Ubuntu 16.04.6 LTS
Python 3.7.7
OpenJDK 1.8.0_252
spark-2.4.5-bin-hadoop2.7
1. Configure the PYSPARK_PYTHON path: add the following line to $SPARK_HOME/conf/spark-env.sh
export PYSPARK_PYTHON=python_env_path/bin/python
2. Start pyspark
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.11.760,org.apache.hadoop:hadoop-aws:2.7.0 --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-2.amazonaws.com
com.amazonaws:aws-java-sdk-pom:1.11.760: depends on your JDK version
org.apache.hadoop:hadoop-aws:2.7.0: depends on your Hadoop version
s3.us-west-2.amazonaws.com: depends on the region of your S3 bucket
3. Read data from S3
df2=spark.read.parquet("s3a://s3location_file_path")
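The same packages and endpoint can also be set when building the session rather than on the pyspark command line. A minimal sketch, assuming the versions from step 2; note that spark.jars.packages only takes effect if it is set before the JVM starts, i.e. before any other session exists:

# Configure the S3A packages and endpoint on the SparkSession builder
# instead of passing them to `pyspark`. Uses the same versions and
# endpoint as step 2; the parquet path is the placeholder used above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "com.amazonaws:aws-java-sdk-pom:1.11.760,"
            "org.apache.hadoop:hadoop-aws:2.7.0")
    .config("spark.hadoop.fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")
    .getOrCreate()
)
df2 = spark.read.parquet("s3a://s3location_file_path")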