Question
We are using a SageMaker instance to connect to EMR in AWS. We have some PySpark scripts that unload Athena tables and process them as part of a pipeline.
We access the Athena tables through the Glue Data Catalog, but when we try to run the job via spark-submit, the job fails.
Code snippet
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession


def process_data():
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)

    # Build a Hive-enabled session backed by the AWS Glue Data Catalog.
    spark = SparkSession.builder \
        .config("spark.sql.catalogImplementation", "hive") \
        .config("hive.metastore.client.factory.class",
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
        .config("hive.metastore.schema.verification", "false") \
        .config("spark.hadoop.metastore.catalog.default", "hive") \
        .enableHiveSupport() \
        .getOrCreate()

    # Read the Athena/Glue tables and print their row counts.
    df1 = spark.read.table("db1.tb1")
    df2 = spark.read.table("db1.tb2")
    print(df1.count())
    print(df2.count())


if __name__ == "__main__":
    process_data()
Error message:
pyspark.sql.utils.IllegalArgumentException: 'Unable to locate hive jars to connect to metastore. Please set spark.sql.hive.metastore.jars.'
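The property named in the error is Spark's spark.sql.hive.metastore.jars. For reference, a minimal sketch of how it could be supplied when the session is built, alongside the Glue catalog factory already used above; the jar classpath and metastore version below are assumptions about a typical EMR layout, not values taken from our cluster:

from pyspark.sql import SparkSession

# Sketch only: the right value for spark.sql.hive.metastore.jars depends on the
# cluster. "builtin" uses the Hive client bundled with Spark; the explicit
# classpath and version here are assumptions about a typical EMR install.
spark = (
    SparkSession.builder
    .appName("app")
    .config("spark.sql.catalogImplementation", "hive")
    # Tell Spark where to find a Hive client it can load for the metastore.
    .config("spark.sql.hive.metastore.jars", "/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*")
    .config("spark.sql.hive.metastore.version", "2.3.7")
    # Route metastore calls to the AWS Glue Data Catalog.
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

The same two properties could equivalently be passed as --conf arguments to spark-submit rather than set in code.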
Request
How can we ensure that the Python script running on the SageMaker instance uses the Athena tables?
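For context, once the session starts, a quick check like the sketch below (reusing the db1 database from the snippet above) is what we would expect to see if the script is resolving Glue/Athena databases rather than falling back to a local metastore:

from pyspark.sql import SparkSession

# Reuse the already-configured session; this sketch assumes it was built with
# the Glue catalog settings shown earlier.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The Glue databases (e.g. db1) should appear here; seeing only "default"
# would suggest Spark fell back to a local Derby metastore.
for db in spark.catalog.listDatabases():
    print(db.name)

# Reading one Athena table end to end confirms the catalog wiring.
print(spark.read.table("db1.tb1").count())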
Source: https://stackoverflow.com/questions/65577519/unable-to-locate-hive-jars-to-connect-to-metastore-while-using-pyspark-job-to