Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-04 11:16:23

I will tell you what worked for me after struggling with this for an entire day.

My goal: run spark-submit commands from an EC2 instance outside the EMR cluster. The cluster uses S3 for storage (Hive tables) and the Glue Data Catalog as its metastore:

  • Start your EMR cluster (with the Glue Data Catalog metastore option turned on, of course)
  • Create an AMI from your master node
  • Boot up an EC2 instance from the image
    • Make sure your network configuration (subnets and security groups) allows communication between the cluster VMs and the instance from which you'll launch the job
  • On the instance you just booted:

    • Update /etc/hadoop/conf/yarn-site.xml with:
    <property>    
       <name>yarn.timeline-service.enabled</name>
       <value>false</value>
    </property>
    

Now you should be able to submit your job in cluster mode. To submit in client mode, you also need to set AWS credentials on the instance you created.
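The yarn-site.xml edit described above can also be scripted, which helps if you rebuild the instance often. This is only a minimal sketch using the Python standard library; set_hadoop_property is a hypothetical helper name, and it assumes the file is a normal Hadoop-style <configuration> document.

```python
import xml.etree.ElementTree as ET


def set_hadoop_property(path, name, value):
    """Set a property in a Hadoop-style XML config file, adding it if absent."""
    tree = ET.parse(path)
    root = tree.getroot()  # expected to be the <configuration> element
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == name:
            # Property already present: overwrite (or create) its <value>.
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No matching <property> found: append a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    tree.write(path, encoding="utf-8", xml_declaration=True)
```

For the step above the call would be set_hadoop_property("/etc/hadoop/conf/yarn-site.xml", "yarn.timeline-service.enabled", "false"), run by a user with write access to the file. Rewriting the file this way drops XML comments and processing instructions, so keep a backup of the original.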

What was really missing:

  • Spark needs to load the jars for AWSGlueDataCatalogHiveClientFactory (check spark.driver.extraClassPath & spark.executor.extraClassPath in /etc/spark/conf/spark-defaults.conf)

  • Also check hive-site.xml in the Spark configuration directory (normally /etc/spark/conf/hive-site.xml on EMR):

    <property>
       <name>hive.metastore.client.factory.class</name>
       <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
    </property>
    
    • This setting tells Hive to use the Glue Data Catalog as its metastore, so there is no need to put that in the code anymore!
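A quick way to confirm the metastore setting on the new instance is to read hive-site.xml and look at the client factory property. The property that selects the metastore client is hive.metastore.client.factory.class. The sketch below is only an illustration: read_hadoop_properties and hive_site_uses_glue are hypothetical helper names, and the caller supplies the path to whichever hive-site.xml Spark actually loads.

```python
import xml.etree.ElementTree as ET

# Factory class named in the answer above.
GLUE_FACTORY = "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
FACTORY_KEY = "hive.metastore.client.factory.class"


def read_hadoop_properties(path):
    """Return the name -> value pairs of a Hadoop-style XML config file."""
    root = ET.parse(path).getroot()
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name:
            props[name] = (prop.findtext("value") or "").strip()
    return props


def hive_site_uses_glue(path):
    """True if the file points the metastore client at the Glue factory class."""
    return read_hadoop_properties(path).get(FACTORY_KEY) == GLUE_FACTORY
```

If hive_site_uses_glue returns False, print read_hadoop_properties(path) to see which metastore settings the file really contains before changing anything.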

After making it work, I also cleaned up some configuration:

  • You can get rid of the hive stuff (/etc/hive).

  • In /etc/spark/conf/spark-env.sh I left only the line that exports HADOOP_CONF_DIR

  • In /etc/spark/conf/spark-defaults.conf I left only the following settings:

    • spark.driver.extraClassPath
    • spark.driver.extraLibraryPath
    • spark.executor.extraClassPath
    • spark.executor.extraLibraryPath
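To see what those four settings contain, and to check that the classpath entries include the Glue client jars, spark-defaults.conf can be parsed with a few lines of Python. This is a rough sketch, not a full parser: it only handles the usual whitespace-separated "key value" lines, and parse_spark_defaults, classpath_entries and trim_to_kept_keys are hypothetical helper names.

```python
# The four settings kept in spark-defaults.conf, as listed above.
KEPT_KEYS = (
    "spark.driver.extraClassPath",
    "spark.driver.extraLibraryPath",
    "spark.executor.extraClassPath",
    "spark.executor.extraLibraryPath",
)


def parse_spark_defaults(text):
    """Parse 'key value' lines, skipping blank lines and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def classpath_entries(conf, key):
    """Split a colon-separated classpath setting into its entries."""
    return [entry for entry in conf.get(key, "").split(":") if entry]


def trim_to_kept_keys(text):
    """Return file contents that keep only the four settings in KEPT_KEYS."""
    conf = parse_spark_defaults(text)
    return "".join(f"{key} {conf[key]}\n" for key in KEPT_KEYS if key in conf)
```

Reading /etc/spark/conf/spark-defaults.conf, passing its text to parse_spark_defaults and printing classpath_entries(conf, "spark.driver.extraClassPath") shows one jar or directory per line, which makes it easier to spot whether the Glue client jars are there.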

I only just got this working, so I will probably put some configuration back. The important thing now is to be sure of what I'm adding and why.
