Spark: optimise writing a DataFrame to SQL Server

[愿得一人] 2021-02-08 19:01

I am using the code below to write a DataFrame of 43 columns and about 2,000,000 rows into a table in SQL Server:

dataFrame
  .write
  .format("jdbc")
  .mode(

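The snippet above is cut off; for reference, a plain Spark JDBC write of this shape usually looks roughly like the following, where the connection details and table name are placeholders rather than the values from the post:

dataFrame
  .write
  .format("jdbc")
  .mode("overwrite")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://myserver;databaseName=mydb")
  .option("dbtable", "dbo.myTable")
  .option("user", "***")
  .option("password", "***")
  .save()
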
3 Answers
  • 2021-02-08 19:49

    Is converting the data to CSV files and copying those CSVs an option for you? We have automated this process for bigger tables, transferring them to GCP in CSV format rather than reading them through JDBC. The Spark side of that could look like the sketch below.

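    A minimal sketch, assuming the resulting files are later bulk-loaded into SQL Server (for example with BULK INSERT or the bcp utility); the output path, partition count and options are placeholders, not from this answer:

    // Write the DataFrame out as CSV files that can be bulk-loaded into SQL Server
    // instead of being inserted row by row over JDBC.
    dataFrame
      .repartition(8)                  // a handful of reasonably sized files
      .write
      .option("header", "true")        // keep column names for the loading step
      .option("escape", "\"")          // escape embedded quotes so the files parse cleanly
      .mode("overwrite")
      .csv("/tmp/export/myTable")
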
  • 2021-02-08 19:50

    Try adding the batchsize option to your statement with a value of at least 10000 (tune this value to get better performance) and execute the write again.

    From the Spark docs:

    The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option applies only to writing. It defaults to 1000.

    It is also worth checking out (a combined sketch follows the list):

    • the numPartitions option to increase the parallelism (this also determines the maximum number of concurrent JDBC connections)

    • the queryTimeout option to increase the timeout for the write operation.

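    Putting these together on a plain JDBC write could look roughly like this; the connection details, table name and values are placeholders, not taken from the question:

    dataFrame
      .write
      .format("jdbc")
      .mode("append")
      .option("url", "jdbc:sqlserver://myserver;databaseName=mydb")
      .option("dbtable", "dbo.myTable")
      .option("user", "***")
      .option("password", "***")
      .option("batchsize", "20000")      // rows inserted per round trip (default 1000)
      .option("numPartitions", "8")      // parallel writers / max concurrent JDBC connections
      .option("queryTimeout", "300")     // seconds before a statement times out (0 = no limit)
      .save()
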
  • 2021-02-08 20:03

    We resorted to using the azure-sqldb-spark library instead of the default built-in exporting functionality of Spark. This library gives you a bulkCopyToSqlDB method, which does a real bulk insert and is a lot faster. It's a bit less convenient to use than the built-in functionality, but in my experience it's still worth it.

    We use it more or less like this:

    import com.microsoft.azure.sqldb.spark.config.Config
    import com.microsoft.azure.sqldb.spark.connect._
    import com.microsoft.azure.sqldb.spark.query._
    
    val options = Map(
      "url"          -> "***",
      "databaseName" -> "***",
      "user"         -> "***",
      "password"     -> "***",
      "driver"       -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    )
    
    // first make sure the table exists, with the correct column types
    // and is properly cleaned up if necessary
    val query = dropAndCreateQuery(df, "myTable")
    val createConfig = Config(options ++ Map("QueryCustom" -> query))
    spark.sqlContext.sqlDBQuery(createConfig)
    
    val bulkConfig = Config(options ++ Map(
      "dbTable"           -> "myTable",
      "bulkCopyBatchSize" -> "20000",
      "bulkCopyTableLock" -> "true",
      "bulkCopyTimeout"   -> "600"
    ))
    
    df.bulkCopyToSqlDB(bulkConfig)
    

    As you can see, we generate the CREATE TABLE query ourselves. You can let the library create the table, but it will just do dataFrame.limit(0).write.sqlDB(config), which can still be pretty inefficient, probably requires you to cache your DataFrame, and doesn't allow you to choose the SaveMode.

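    The dropAndCreateQuery helper above is our own code and is not shown in this answer; a hypothetical sketch, assuming a simple mapping from Spark types to SQL Server types, could look like this:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types._

    def dropAndCreateQuery(df: DataFrame, table: String): String = {
      // Very rough Spark-to-SQL-Server type mapping; adjust it to your actual schema.
      def sqlType(dt: DataType): String = dt match {
        case IntegerType    => "INT"
        case LongType       => "BIGINT"
        case DoubleType     => "FLOAT"
        case BooleanType    => "BIT"
        case TimestampType  => "DATETIME2"
        case d: DecimalType => s"DECIMAL(${d.precision},${d.scale})"
        case _              => "NVARCHAR(MAX)"   // strings and anything not mapped above
      }
      val columns = df.schema.fields
        .map(f => s"[${f.name}] ${sqlType(f.dataType)}")
        .mkString(",\n  ")
      s"""IF OBJECT_ID('$table', 'U') IS NOT NULL DROP TABLE $table;
         |CREATE TABLE $table (
         |  $columns
         |);""".stripMargin
    }
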
    Also potentially interesting: we had to use an ExclusionRule when adding this library to our sbt build, or the assembly task would fail.

    libraryDependencies += "com.microsoft.azure" % "azure-sqldb-spark" % "1.0.2" excludeAll(
      ExclusionRule(organization = "org.apache.spark")
    )
    