Tables not found in Spark SQL after migrating from EMR to AWS Glue


Question


I have Spark jobs on EMR, and EMR is configured to use the Glue catalog for Hive and Spark metadata.

I create Hive external tables, which appear in the Glue catalog, and my Spark jobs can reference them in Spark SQL, e.g. spark.sql("select * from hive_table ...").

Now, when I try to run the same code in a Glue job, it fails with a "table not found" error. It appears that Glue jobs do not use the Glue catalog for Spark SQL the same way Spark SQL does when running on EMR.

I can work around this by using the Glue APIs and registering the resulting DataFrames as temp views:

create_dynamic_frame_from_catalog(...).toDF().createOrReplaceTempView(...)

but is there a way to do this automatically?
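
Spelled out, with hypothetical database and table names, that workaround might look like this in a Glue job script:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Pull the catalog table through the Glue API, convert it to a DataFrame,
# and register it under the name that Spark SQL expects.
df = glue_context.create_dynamic_frame_from_catalog(
    database="my_database",   # hypothetical database name
    table_name="hive_table"
).toDF()
df.createOrReplaceTempView("hive_table")

glue_context.spark_session.sql("select * from hive_table").show()
```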


Answer 1:


This was a long-awaited feature request (using the Glue Data Catalog with Glue ETL jobs), and it was released recently. When you create a new job, you'll find the following option:

Use Glue data catalog as the Hive metastore

You can also enable it for an existing job by editing the job and adding --enable-glue-datacatalog to the job parameters, with no value (an empty string).
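
Once that flag is set, Spark SQL inside the Glue job script should resolve catalog tables directly. A minimal sketch (table name taken from the question) might be:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# With --enable-glue-datacatalog set, the SparkSession that GlueContext
# exposes uses the Glue Data Catalog as its Hive metastore.
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# No temp-view registration needed; the catalog table is visible directly.
spark.sql("select * from hive_table limit 10").show()
```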




Answer 2:


Instead of using SparkContext.getOrCreate(), you should use SparkSession.builder().enableHiveSupport().getOrCreate(), with enableHiveSupport() being the important part that's missing. I think what's probably happening is that your Spark job is not actually creating your tables in Glue, but rather creating them in Spark's embedded Hive metastore, since you have not enabled Hive support.
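
In PySpark that might look like the following (the app name is illustrative; note that in Python, builder is a property, so it takes no parentheses):

```python
from pyspark.sql import SparkSession

# enableHiveSupport() makes Spark SQL use the configured external
# metastore (here, the Glue Data Catalog) instead of the embedded one.
spark = (
    SparkSession.builder
    .appName("glue-catalog-tables")  # illustrative app name
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("select * from hive_table").show()  # table name from the question
```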




Answer 3:


I had the same problem. It worked on my Dev endpoint but not in the actual ETL job. I fixed it by editing the job to use Spark 2.4 instead of Spark 2.2.



Source: https://stackoverflow.com/questions/54596569/tables-not-found-in-spark-sql-after-migrating-from-emr-to-aws-glue
