Question
I've created an EMR cluster with the Glue Data Catalog. When I invoke the spark-shell, I can successfully list the tables stored in a Glue database via
spark.catalog.setCurrentDatabase("test")
spark.catalog.listTables
However, when I submit a job via spark-submit, I get a fatal error:
ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Database 'test' does not exist.;
I create my SparkSession inside the job submitted via spark-submit with:
SparkSession.builder.enableHiveSupport.getOrCreate
Answer 1:
Adding the hive.metastore.client.factory.class configuration to the code that creates the Spark session solved the issue for me:
SparkSession spark = SparkSession.builder()
...
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
.enableHiveSupport()
.getOrCreate();
That's the same configuration defined in the AWS docs (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html) and added to the cluster configuration when you check "Use for Hive table metadata" on cluster creation, but for some reason it doesn't work as expected (I'm using EMR 5.12.0).
Answer 2:
I had the same issue: spark-submit will not discover the AWS Glue libraries, but spark-shell running on the master node will.
It turns out that my spark-submit job uses a fat .jar that was compiled with the standard org.apache.spark and org.apache.hive libraries. The libraries bundled in the jar were being used instead of the custom classes installed on EMR.
If this is the case for you, make sure to exclude all 'org.apache.spark:', 'org.apache.hive:', and 'org.apache.hadoop:' modules from your .jar.
Here is the reference I used for Gradle: http://unethicalblogger.com/2015/07/15/gradle-goodness-excluding-depends-from-shadow.html
Adding the compileOnly keyword in front of all Spark libraries fixed it.
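To check whether a fat jar is shadowing the classes installed on EMR, one option is to print where the relevant classes were actually loaded from when the job starts. The following is only a minimal sketch: the class name ClassOrigin and the method locate are hypothetical helpers, not part of Spark or EMR, and the list of class names to inspect is an assumption based on the libraries mentioned in this thread.

```java
import java.security.CodeSource;

public class ClassOrigin {

    /**
     * Returns the jar or directory a class was loaded from, or a short
     * note when the class is missing or has no code source.
     * (Hypothetical helper for illustration only.)
     */
    public static String locate(String className) {
        try {
            // Load without running static initializers.
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                // Classes from the JDK itself usually have no code source.
                return "(no code source)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Assumed list of classes worth checking; adjust to your build.
        String[] names = {
            "org.apache.spark.sql.SparkSession",
            "org.apache.hadoop.hive.conf.HiveConf",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If the Spark or Hive classes resolve to the path of your own application jar rather than to a location on the cluster, that suggests the bundled copies are taking precedence and should be excluded from the jar.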
Answer 3:
Our issue was IAM permissions on the EMR cluster; make sure that the cluster's IAM instance profile has full access to Glue.
Answer 4:
EMR 5.9.0 has just been released - please give it a shot, it should work for you.
Relevant documentation:
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
Source: https://stackoverflow.com/questions/46291314/spark-catalog-w-aws-glue-database-not-found