Question
I am able to execute a simple SQL statement using PySpark in Azure Databricks, but I want to execute a stored procedure instead. Below is the PySpark code I tried.
#initialize pyspark
import findspark
findspark.init(r'C:\Spark\spark-2.4.5-bin-hadoop2.7')
#import required modules
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import *
import pandas as pd
#Create spark configuration object
conf = SparkConf()
conf.setMaster("local").setAppName("My app")
#Create spark context and sparksession
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
table = "dbo.test"
#read table data into a spark dataframe
jdbcDF = spark.read.format("jdbc") \
.option("url", "jdbc:sqlserver://localhost:1433;databaseName=Demo;integratedSecurity=true;") \
.option("dbtable", table) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.load()
#show the data loaded into dataframe
#jdbcDF.show()
#attempt to run the stored procedure through Spark SQL; this is the part
#that fails, since Spark SQL parses the statement itself and cannot execute
#a T-SQL procedure on the remote server
sqlQuery = "execute testJoin"
resultDF = spark.sql(sqlQuery)
resultDF.show(resultDF.count(), False)
This doesn't work. How do I execute the stored procedure instead?
Answer 1:
Running a stored procedure through a JDBC connection from Azure Databricks is not supported as of now. But you have a few options.

First, use the pyodbc library to connect and execute your procedure. The catch is that your code then runs only on the driver node while all your workers sit idle. See this article for details: https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark. A sketch follows below.
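A minimal sketch of the pyodbc route, assuming SQL authentication against the Demo database; the server name and credentials are placeholders you must adapt, and the pyodbc package plus a SQL Server ODBC driver need to be installed on the cluster. testJoin is the procedure from the question.

import pyodbc

#connect to SQL Server from the driver node; placeholder server and credentials
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourserver.database.windows.net;"
    "DATABASE=Demo;"
    "UID=your_user;"
    "PWD=your_password"
)
cursor = conn.cursor()
#run the procedure on the SQL Server side; only the driver participates
cursor.execute("EXEC testJoin")
#pull the procedure's result set back to the driver, if it returns one
rows = cursor.fetchall()
print(len(rows))
cursor.close()
conn.close()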
Second, use a SQL table function rather than a procedure. In a sense, you can use anything that you can put in the FROM clause of a SQL query, because Spark's JDBC source can push a subquery down to the server; see the sketch below.
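A minimal sketch of that route, reusing the spark session and JDBC options from the question; dbo.testJoinFunction is a hypothetical table-valued function that would replace the stored procedure.

#wrap the function call in a subquery so the JDBC source can push it down
query = "(SELECT * FROM dbo.testJoinFunction()) AS result"

resultDF = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://localhost:1433;databaseName=Demo;integratedSecurity=true;") \
    .option("dbtable", query) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
resultDF.show()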
Finally, since you are in an Azure environment, combining Azure Data Factory (to execute your procedure) with Azure Databricks can help you build pretty powerful pipelines.
Source: https://stackoverflow.com/questions/60354376/how-to-execute-a-stored-procedure-in-azure-databricks-pyspark