Question
We are using a SageMaker instance to connect to EMR in AWS. We have some PySpark scripts that unload Athena tables and process them as part of a pipeline.
We access the Athena tables through the Glue Data Catalog, but when we try to run the job via spark-submit, the job fails.
Code snippet
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession


def process_data():
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)

    # Build a Hive-enabled session backed by the AWS Glue Data Catalog.
    spark = SparkSession.builder \
        .config("spark.sql.catalogImplementation", "hive") \
        .config("hive.metastore.client.factory.class",
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
        .config("hive.metastore.schema.verification", "false") \
        .config("spark.hadoop.metastore.catalog.default", "hive") \
        .enableHiveSupport() \
        .getOrCreate()

    # Read the Athena/Glue tables and print their row counts.
    df1 = spark.read.table("db1.tb1")
    df2 = spark.read.table("db1.tb2")
    print(df1.count())
    print(df2.count())


if __name__ == "__main__":
    process_data()
Error message:
pyspark.sql.utils.IllegalArgumentException: 'Unable to locate hive jars to connect to metastore. Please set spark.sql.hive.metastore.jars.'
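The property named in the error is Spark's spark.sql.hive.metastore.jars. For reference, a minimal sketch of how it could be supplied when the session is built, alongside the Glue catalog factory already used above; the jar classpath and metastore version below are assumptions about a typical EMR layout, not values taken from our cluster:

from pyspark.sql import SparkSession

# Sketch only: the right value for spark.sql.hive.metastore.jars depends on the
# cluster. "builtin" uses the Hive client bundled with Spark; the explicit
# classpath and version here are assumptions about a typical EMR install.
spark = (
    SparkSession.builder
    .appName("app")
    .config("spark.sql.catalogImplementation", "hive")
    # Tell Spark where to find a Hive client it can load for the metastore.
    .config("spark.sql.hive.metastore.jars", "/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*")
    .config("spark.sql.hive.metastore.version", "2.3.7")
    # Route metastore calls to the AWS Glue Data Catalog.
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

The same two properties could equivalently be passed as --conf arguments to spark-submit rather than set in code.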
Request
How can we ensure that the Python script running on the SageMaker instance uses the Athena tables?
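For context, once the session starts, a quick check like the sketch below (reusing the db1 database from the snippet above) is what we would expect to see if the script is resolving Glue/Athena databases rather than falling back to a local metastore:

from pyspark.sql import SparkSession

# Reuse the already-configured session; this sketch assumes it was built with
# the Glue catalog settings shown earlier.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The Glue databases (e.g. db1) should appear here; seeing only "default"
# would suggest Spark fell back to a local Derby metastore.
for db in spark.catalog.listDatabases():
    print(db.name)

# Reading one Athena table end to end confirms the catalog wiring.
print(spark.read.table("db1.tb1").count())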
Source: https://stackoverflow.com/questions/65577519/unable-to-locate-hive-jars-to-connect-to-metastore-while-using-pyspark-job-to