Question
I have an AWS EMR cluster (v5.11.1) running Spark (v2.2.1), and I am trying to use the AWS Glue Data Catalog as its metastore. I followed the steps in the official AWS documentation (reference link below), but I am seeing a discrepancy in access to the Glue Catalog DB/tables. Both the EMR cluster and AWS Glue are in the same account, and the appropriate IAM permissions have been granted.
AWS Documentation : https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
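For reference, that page enables the Glue Data Catalog for Spark via the spark-hive-site classification at cluster creation; the configuration looks like this (values per the linked documentation):
[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]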
Observations:
- Using spark-shell (from the EMR master node):
- Works. Able to access the Glue DB/tables using the commands below:
spark.catalog.setCurrentDatabase("test_db")
spark.catalog.listTables
- Using spark-submit (from an EMR step):
- Does not work. I keep getting the error "Database 'test_db' does not exist"
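For reference, the job is submitted as an EMR step roughly like this (the main class comes from the stack trace below; the jar location is illustrative):
spark-submit --deploy-mode cluster \
  --class org.griffin_test.GriffinTest \
  s3://my-bucket/jars/griffin-test.jar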
The error trace is below:
INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs:///user/spark/warehouse
INFO HiveMetaStore: 0: get_database: default
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: default
INFO HiveMetaStore: 0: get_database: global_temp
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: global_temp
WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
INFO SessionState: Created local directory: /mnt3/yarn/usercache/hadoop/appcache/application_1547055968446_0005/container_1547055968446_0005_01_000001/tmp/6d0f6b2c-cccd-4e90-a524-93dcc5301e20_resources
INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/6d0f6b2c-cccd-4e90-a524-93dcc5301e20
INFO SessionState: Created local directory: /mnt3/yarn/usercache/hadoop/appcache/application_1547055968446_0005/container_1547055968446_0005_01_000001/tmp/yarn/6d0f6b2c-cccd-4e90-a524-93dcc5301e20
INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/6d0f6b2c-cccd-4e90-a524-93dcc5301e20/_tmp_space.db
INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs:///user/spark/warehouse
INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
INFO CodeGenerator: Code generated in 191.063411 ms
INFO CodeGenerator: Code generated in 10.27313 ms
INFO HiveMetaStore: 0: get_database: test_db
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: test_db
WARN ObjectStore: Failed to get database test_db, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Database 'test_db' does not exist.;
  at org.apache.spark.sql.internal.CatalogImpl.requireDatabaseExists(CatalogImpl.scala:44)
  at org.apache.spark.sql.internal.CatalogImpl.setCurrentDatabase(CatalogImpl.scala:64)
  at org.griffin_test.GriffinTest.ingestGriffinRecords(GriffinTest.java:97)
  at org.griffin_test.GriffinTest.main(GriffinTest.java:65)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
After a lot of research and going through many suggestions in blogs, I tried the fixes below, but to no avail; we are still facing the discrepancy.
Reference Blogs:
- https://forums.aws.amazon.com/thread.jspa?threadID=263860
- Spark Catalog w/ AWS Glue: database not found
- https://okera.zendesk.com/hc/en-us/articles/360005768434-How-can-we-configure-Spark-to-use-the-Hive-Metastore-for-metadata-
Fixes Tried:
- Enabling Hive support in spark-defaults.conf & SparkSession (Code):
Hive classes are on the CLASSPATH, and the spark.sql.catalogImplementation internal configuration property is set to hive:
spark.sql.catalogImplementation hive
- Adding the Hive metastore config:
.config("hive.metastore.connect.retries", 15)
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
Code Snippet:
import org.apache.spark.sql.SparkSession;

// Build a Hive-enabled session; the factory class routes metastore calls to Glue.
SparkSession spark = SparkSession.builder()
        .appName("Test_Glue_Catalog")
        .config("spark.sql.catalogImplementation", "hive")
        .config("hive.metastore.connect.retries", 15)
        .config("hive.metastore.client.factory.class",
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
        .enableHiveSupport()
        .getOrCreate();
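For completeness, the call that fails under spark-submit is the one at GriffinTest.java:97 in the trace above; a minimal reconstruction (hypothetical, since the original source isn't shown) would be:
// Hypothetical reconstruction of the failing call:
spark.catalog().setCurrentDatabase("test_db"); // throws AnalysisException under spark-submit
spark.catalog().listTables().show();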
Any suggestions for figuring out the root cause of this discrepancy would be really helpful.
Appreciate your help! Thank you!
Answer 1:
I will tell you what worked for me after struggling with this for an entire day.
My goal: run spark-submit commands from an EC2 instance outside the EMR cluster. The cluster uses S3 for storage (Hive tables) and the Glue Data Catalog as its metastore:
- Start your EMR cluster (with that Glue metastore config turned on, of course)
- Create an AMI image from your master node
- Boot up an EC2 instance from that image
- Make sure your network configuration allows communication between the cluster VMs and the instance from which you'll launch the job (subnets and security groups)
On the instance you just booted:
- Update /etc/hadoop/conf/yarn-site.xml with:
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>false</value>
</property>
- SSH into your EMR master node and add the user that submits your jobs to the hadoop group (sketched below). Use this guide, jumping to the Common errors section: https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
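A minimal sketch of that step, run on the master node (the user name sparkuser is illustrative):
sudo usermod -a -G hadoop sparkuser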
Now you should be able to submit your job in cluster mode. To submit in client mode, you also need to set AWS credentials on the instance you created.
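One way to provide those, sketched with placeholder values, is through the standard AWS environment variables:
export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>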
What was really missing:
Spark needs to load the jars for AWSGlueDataCatalogHiveClientFactory (check spark.driver.extraClassPath & spark.executor.extraClassPath in /etc/spark/conf/spark-defaults.conf); an illustrative example follows below.
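For illustration, the relevant entries look roughly like the following; the exact jar path varies by EMR release, so verify it on your cluster (on EMR 5.x the Glue client jar typically lives under /usr/share/aws/hmclient/lib/):
spark.driver.extraClassPath   <existing entries>:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar
spark.executor.extraClassPath <existing entries>:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar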
Check also /etc/spark/hive-site.xml:
<property>
  <name>hive.metastore.client.factory.class</name>
  <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
</property>
- This configuration tells Hive to use the Glue Data Catalog as its metastore; no need to put that in the code anymore!
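With that in place, the session builder from the question can shrink to something like this sketch (assuming hive-site.xml now carries the factory class):
SparkSession spark = SparkSession.builder()
        .appName("Test_Glue_Catalog")
        .enableHiveSupport() // hive-site.xml supplies the Glue client factory
        .getOrCreate();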
After making it work, I also cleaned up some configuration:
You can get rid of the Hive stuff (/etc/hive).
In /etc/spark/conf/spark-env.sh, I left only the line that exports HADOOP_CONF_DIR.
In /etc/spark/conf/spark-defaults.conf, I left only the following pieces of config:
- spark.driver.extraClassPath
- spark.driver.extraLibraryPath
- spark.executor.extraClassPath
- spark.executor.extraLibraryPath
I only just got this working, so I will be putting some configuration back; the important thing now is to be sure of exactly what I'm adding and why those configs are needed.
Source: https://stackoverflow.com/questions/54118523/issue-with-aws-glue-data-catalog-as-metastore-for-spark-sql-on-emr