How can I access S3/S3n from a local Hadoop 2.6 installation?

我是研究僧i 提交于 2019-12-17 06:35:28

问题


I am trying to reproduce an Amazon EMR cluster on my local machine. For that purpose, I have installed the latest stable version of Hadoop as of now - 2.6.0. Now I would like to access an S3 bucket, as I do inside the EMR cluster.

I have added the aws credentials in core-site.xml:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>some id</value>
</property>

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>some id</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>some key</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>some key</value>
</property>

Note: Since there are some slashes on the key, I have escaped them with %2F

If I try to list the contents of the bucket:

hadoop fs -ls s3://some-url/bucket/

I get this error:

ls: No FileSystem for scheme: s3

I edited core-site.xml again, and added information related to the fs:

<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>

<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>

This time I get a different error:

-ls: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)

Somehow I suspect the Yarn distribution does not have the necessary jars to be able to read S3, but I have no idea where to get those. Any pointers in this direction would be greatly appreciated.


回答1:


For some reason, the jar hadoop-aws-[version].jar which contains the implementation to NativeS3FileSystem is not present in the classpath of hadoop by default in the version 2.6 & 2.7. So, try and add it to the classpath by adding the following line in hadoop-env.sh which is located in $HADOOP_HOME/etc/hadoop/hadoop-env.sh:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*

Assuming you are using Apache Hadoop 2.6 or 2.7

By the way, you could check the classpath of Hadoop using:

bin/hadoop classpath



回答2:


import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'

import pyspark
sc = pyspark.SparkContext("local[*]")

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

hadoopConf = sc._jsc.hadoopConfiguration()
myAccessKey = input() 
mySecretKey = input()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)

df = sqlContext.read.parquet("s3://myBucket/myKey")



回答3:


@Ashrith's answer worked for me with one modification: I had to use $HADOOP_PREFIX rather than $HADOOP_HOME when running v2.6 on Ubuntu. Perhaps this is because it sounds like $HADOOP_HOME is being deprecated?

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${HADOOP_PREFIX}/share/hadoop/tools/lib/*

Having said that, neither worked for me on my Mac with v2.6 installed via Homebrew. In that case, I'm using this extremely cludgy export:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$(brew --prefix hadoop)/libexec/share/hadoop/tools/lib/*




回答4:


To resolve this issue I tried all the above, which failed (for my environment anyway).

However I was able to get it working by copying the two jars mentioned above from the tools dir and into common/lib.

Worked fine after that.




回答5:


If you are using HDP 2.x or greater you can try modifying the following property in the MapReduce2 configuration settings in Ambari.

mapreduce.application.classpath

Append the following value to the end of the existing string:

/usr/hdp/${hdp.version}/hadoop-mapreduce/*



来源:https://stackoverflow.com/questions/28029134/how-can-i-access-s3-s3n-from-a-local-hadoop-2-6-installation

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!