Question
I'm using Amazon EMR. I have some log data in s3, all in the same bucket, but under different subdirectories like:
"s3://bucketname/2014/08/01/abc/file1.bz"
"s3://bucketname/2014/08/01/abc/file2.bz"
"s3://bucketname/2014/08/01/xyz/file1.bz"
"s3://bucketname/2014/08/01/xyz/file3.bz"
I'm using:
Set hive.mapred.supports.subdirectories=true;
Set mapred.input.dir.recursive=true;
When trying to load all data from "s3://bucketname/2014/08/":
CREATE EXTERNAL TABLE table1(id string, at string,
custom struct<param1:string, param2:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucketname/2014/08/';
In return I get:
OK
Time taken: 0.169 seconds
When trying to query the table:
SELECT * FROM table1 LIMIT 10;
I get:
Failed with exception java.io.IOException:java.io.IOException: Not a file: s3://bucketname/2014/08/01
Does anyone have an idea on how to solve this?
Answer 1:
It's an EMR-specific problem; here is what I got from Amazon support:
Unfortunately Hadoop does not recursively check the subdirectories of Amazon S3 buckets. The input files must be directly in the input directory or Amazon S3 bucket that you specify, not in sub-directories. According to this document ("Are you trying to recursively traverse input directories?"), it looks like EMR does not support recursive directories at the moment. We are sorry about the inconvenience.
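A common workaround in that situation (not part of the original answer; the partition column names dt and src below are hypothetical) is to declare the table as partitioned and register each leaf directory explicitly, so Hive never has to recurse:
CREATE EXTERNAL TABLE table1(id string, at string,
custom struct<param1:string, param2:string>)
PARTITIONED BY (dt string, src string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
-- each leaf directory becomes one partition with its own explicit location
ALTER TABLE table1 ADD PARTITION (dt='2014-08-01', src='abc')
LOCATION 's3://bucketname/2014/08/01/abc/';
ALTER TABLE table1 ADD PARTITION (dt='2014-08-01', src='xyz')
LOCATION 's3://bucketname/2014/08/01/xyz/';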
Answer 2:
This works now (May 2018)
A global, EMR-wide fix is to set the following in the /etc/spark/conf/spark-defaults.conf file:
spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive true
hive.mapred.supports.subdirectories true
Or, it can be fixed locally as in the following PySpark code:
from pyspark.sql import SparkSession

# Enable recursive directory listing for Hive tables at the session level
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .enableHiveSupport() \
    .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true") \
    .config("hive.mapred.supports.subdirectories", "true") \
    .getOrCreate()

spark.sql("<YourQueryHere>").show()
Answer 3:
The problem is the way you have specified the location:
s3://bucketname/2014/08/
The Hive external table expects files to be present at that location, but it contains only subdirectories. Try specifying paths like
"s3://bucketname/2014/08/01/abc/,s3://bucketname/2014/08/01/xyz/"
You need to provide the path down to the files.
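For a single leaf directory, that is the table definition from the question with only the LOCATION changed (a minimal sketch):
CREATE EXTERNAL TABLE table1(id string, at string,
custom struct<param1:string, param2:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucketname/2014/08/01/abc/';  -- points directly at the files, no intermediate subdirectories
For several leaf directories, the partition-based sketch under Answer 1 keeps everything queryable as one table.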
Source: https://stackoverflow.com/questions/25708240/amazon-emr-and-hive-getting-a-java-io-ioexception-not-a-file-exception-when