Amazon EMR and Hive: Getting a “java.io.IOException: Not a file” exception when loading subdirectories to an external table


Question


I'm using Amazon EMR. I have some log data in s3, all in the same bucket, but under different subdirectories like:

"s3://bucketname/2014/08/01/abc/file1.bz"
"s3://bucketname/2014/08/01/abc/file2.bz"
"s3://bucketname/2014/08/01/xyz/file1.bz"
"s3://bucketname/2014/08/01/xyz/file3.bz"

I'm using:

set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;

When trying to load all data from "s3://bucketname/2014/08/":

CREATE EXTERNAL TABLE table1(id string, at string, 
          custom struct<param1:string, param2:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucketname/2014/08/';

In return I get:

OK
Time taken: 0.169 seconds

When trying to query the table:

SELECT * FROM table1 LIMIT 10;

I get:

Failed with exception java.io.IOException:java.io.IOException: Not a file: s3://bucketname/2014/08/01

Does anyone have an idea of how to solve this?


Answer 1:


It's an EMR-specific problem; here is what I got from Amazon support:

Unfortunately, Hadoop does not recursively check the subdirectories of Amazon S3 buckets. The input files must be directly in the input directory or Amazon S3 bucket that you specify, not in sub-directories. According to this document ("Are you trying to recursively traverse input directories?"), it looks like EMR does not support recursive directories at the moment. We are sorry about the inconvenience.
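
Under that restriction, a common workaround is to declare the table as partitioned and register each leaf directory as a partition, so Hive only ever lists directories that contain files directly. A minimal sketch, assuming the day and log-source subdirectory names (here "abc" and "xyz") can serve as partition values:

CREATE EXTERNAL TABLE table1(id string, at string,
          custom struct<param1:string, param2:string>)
PARTITIONED BY (day string, source string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucketname/2014/08/';

-- Point each partition at a directory that holds files, not subdirectories.
ALTER TABLE table1 ADD PARTITION (day='01', source='abc')
  LOCATION 's3://bucketname/2014/08/01/abc/';
ALTER TABLE table1 ADD PARTITION (day='01', source='xyz')
  LOCATION 's3://bucketname/2014/08/01/xyz/';

As a bonus, queries can then prune by partition (e.g. WHERE day='01' AND source='abc') instead of scanning everything under the bucket prefix.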



Answer 2:


This works now (May 2018)

A global, EMR-wide fix is to set the following in the /etc/spark/conf/spark-defaults.conf file:

spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive  true
hive.mapred.supports.subdirectories  true

Or it can be fixed locally, as in the following PySpark code:

from pyspark.sql import SparkSession

# Build a Hive-enabled session with recursive input-directory reading turned on.
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .enableHiveSupport() \
    .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true") \
    .config("hive.mapred.supports.subdirectories", "true") \
    .getOrCreate()

spark.sql("<YourQueryHere>").show()



Answer 3:


The problem is the way you have specified the location:

s3://bucketname/2014/08/

The Hive external table expects files to be present at this location, but it contains only folders.

Try specifying paths like:

"s3://bucketname/2014/08/01/abc/,s3://bucketname/2014/08/01/xyz/"

You need to provide the path all the way down to the files.
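
Note that the LOCATION clause of CREATE EXTERNAL TABLE accepts a single path, so covering several leaf directories with one table generally means the partition approach from Answer 1; for a single leaf directory, a minimal sketch is just:

CREATE EXTERNAL TABLE table1(id string, at string,
          custom struct<param1:string, param2:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucketname/2014/08/01/abc/';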



Source: https://stackoverflow.com/questions/25708240/amazon-emr-and-hive-getting-a-java-io-ioexception-not-a-file-exception-when
