Spark Exception when converting a MySQL table to parquet

轻奢々 2021-01-19 02:59

I'm trying to convert a remote MySQL table to a parquet file using Spark 1.6.2.

The process runs for 10 minutes, filling up memory, then these messages start appearing:

1 answer
  •  悲哀的现实
    2021-01-19 03:34

    The problem seems to be that no partitioning was defined when you read the data through the JDBC connector.

    By default, a JDBC read is not distributed: Spark pulls the whole table through a single connection into a single partition, which is consistent with memory filling up on one executor. To distribute the read you have to set up manual partitioning. For that you need a column that makes a good partitioning key, and you have to know the distribution of its values up front.

    Apparently, this is what your data looks like:

    root 
    |-- id: long (nullable = false) 
    |-- order_year: string (nullable = false) 
    |-- order_number: string (nullable = false) 
    |-- row_number: integer (nullable = false) 
    |-- product_code: string (nullable = false) 
    |-- name: string (nullable = false) 
    |-- quantity: integer (nullable = false) 
    |-- price: double (nullable = false) 
    |-- price_vat: double (nullable = false) 
    |-- created_at: timestamp (nullable = true) 
    |-- updated_at: timestamp (nullable = true)
    

    order_year looks like a good candidate to me (according to your comments, you have about 20 years of data).

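    import org.apache.spark.sql.SQLContext
    
    val sqlContext: SQLContext = ???
    
    // Connection settings; fill these in for your environment
    val driver: String = ???
    val connectionUrl: String = ???
    // A table name, or a parenthesized subquery with an alias
    val query: String = ???
    val userName: String = ???
    val password: String = ???
    
    // Manual partitioning: the range between lowerBound and upperBound is split
    // into numPartitions strides on this column. The bounds only set the stride;
    // they do not filter rows. The Spark docs describe partitionColumn as a
    // numeric column, so check how order_year (a string in the schema) behaves.
    val partitionColumn: String = "order_year"
    
    val options: Map[String, String] = Map(
      "driver" -> driver,
      "url" -> connectionUrl,
      "dbtable" -> query,
      "user" -> userName,
      "password" -> password,
      "partitionColumn" -> partitionColumn,
      "lowerBound" -> "0",
      "upperBound" -> "3000",
      "numPartitions" -> "300"
    )
    
    // Each partition is fetched with its own JDBC query
    val df = sqlContext.read.format("jdbc").options(options).load()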
    

    PS: partitionColumn, lowerBound, upperBound, numPartitions: These options must all be specified if any of them is specified.
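
    To get a feel for how the bounds affect the split, here is a rough, Spark-free sketch. It assumes the range between lowerBound and upperBound is cut into equal strides, with the first and last ranges left open-ended so that no rows are dropped. The object and both helpers (PartitionSketch, partitionRanges, partitionIndex) are hypothetical names made up for illustration; they are not part of Spark, and Spark's actual implementation may differ in detail.

```scala
object PartitionSketch {
  // One range per partition: (inclusive lower, exclusive upper).
  // None means "unbounded" on that side.
  type Range = (Option[Long], Option[Long])

  // Assumption: equal strides between the bounds, open-ended first and last range.
  def partitionRanges(lower: Long, upper: Long, numPartitions: Int): Seq[Range] = {
    val stride = upper / numPartitions - lower / numPartitions
    (0 until numPartitions).map { i =>
      val lo = if (i == 0) None else Some(lower + i * stride)
      val hi = if (i == numPartitions - 1) None else Some(lower + (i + 1) * stride)
      (lo, hi)
    }
  }

  // Index of the range that a given column value falls into.
  def partitionIndex(value: Long, ranges: Seq[Range]): Int =
    ranges.indexWhere { case (lo, hi) =>
      lo.forall(value >= _) && hi.forall(value < _)
    }

  def main(args: Array[String]): Unit = {
    val wide = partitionRanges(0L, 3000L, 300)     // bounds used in the answer
    val tight = partitionRanges(2000L, 2020L, 20)  // bounds hugging a hypothetical year span
    for (year <- Seq(2000L, 2010L, 2020L)) {
      println(s"year $year -> wide: ${partitionIndex(year, wide)}, tight: ${partitionIndex(year, tight)}")
    }
  }
}
```

    Under this sketch, with bounds of 0 and 3000 and 300 partitions, a hypothetical span of years 2000 to 2020 lands in only three of the 300 ranges, while bounds that hug the actual values spread the same years over many more. If the real years are known, it is probably worth choosing lowerBound and upperBound close to the minimum and maximum of order_year.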

    Now you can save your DataFrame to parquet.
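
    A minimal sketch of that last step, assuming the Spark 1.6 DataFrameWriter API is on the classpath. The object name, the helper name saveAsParquet, and the outputPath parameter are placeholders chosen for illustration, and overwriting any existing output is an assumption you may want to change.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object ParquetExport {
  // Hypothetical helper: writes the DataFrame as parquet files under outputPath.
  // Each partition of the DataFrame is written as its own part file.
  def saveAsParquet(df: DataFrame, outputPath: String): Unit = {
    df.write
      .mode(SaveMode.Overwrite) // assumption: replacing existing output is acceptable
      .parquet(outputPath)
  }
}
```

    Because the write happens partition by partition, the partitioned JDBC read above also determines how many tasks write in parallel.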
