Question
I want to write around 10 GB of data every day to an Azure SQL Server database using PySpark. Currently I am using the JDBC driver, which takes hours because it issues insert statements one by one.
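For context, the current write is just a plain Spark JDBC write, roughly like the following (server name, credentials and table are placeholders matching the example further down):

# df is the DataFrame produced by the daily preprocessing
jdbc_url = "jdbc:sqlserver://mysqlserver.database.windows.net:1433;databaseName=MyDatabase"

# Plain JDBC write: rows go over as (batched) INSERT statements,
# which is why loading ~10 GB takes hours.
df.write \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "dbo.Clients") \
    .option("user", "username") \
    .option("password", "*********") \
    .mode("append") \
    .save()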
I am planning to use the azure-sqldb-spark connector, which claims to turbo boost the write using bulk insert.
I went through the official doc: https://github.com/Azure/azure-sqldb-spark. The library is written in Scala and basically requires the use of two Scala classes:
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net",
  "databaseName"      -> "MyDatabase",
  "user"              -> "username",
  "password"          -> "*********",
  "dbTable"           -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"
))
df.bulkCopyToSqlDB(bulkCopyConfig)
Can it be used in PySpark like this (using sc._jvm)?
Config = sc._jvm.com.microsoft.azure.sqldb.spark.config.Config
connect = sc._jvm.com.microsoft.azure.sqldb.spark.connect._
# all config as above
df.connect.bulkCopyToSqlDB(bulkCopyConfig)
I am not an expert in Python. Can anybody help me with a complete snippet to get this done?
Answer 1:
The Spark connector currently (as of March 2019) only supports the Scala API (as documented here). So if you are working in a notebook, you could do all the preprocessing in Python, then register the DataFrame as a temp table, e.g.:
df.createOrReplaceTempView('testbulk')
and then do the final step in Scala:
%scala
//configs...
spark.table("testbulk").bulkCopyToSqlDB(bulkCopyConfig)
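For completeness, the Scala cell might look roughly like this, reusing the bulk copy config from the question (server, credentials and table name are placeholders):

%scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

// Same bulk copy settings as in the question
val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net",
  "databaseName"      -> "MyDatabase",
  "user"              -> "username",
  "password"          -> "*********",
  "dbTable"           -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"
))

// Read the temp view registered from Python and bulk-copy it into Azure SQL
spark.table("testbulk").bulkCopyToSqlDB(bulkCopyConfig)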
Source: https://stackoverflow.com/questions/53019576/how-to-use-azure-sqldb-spark-connector-in-pyspark