Pseudocolumn in Spark JDBC

清酒与你 2020-12-04 03:06

I am using the following query to fetch data from MySQL:

var df = spark.read.format("jdbc")
         .option("url", "jdbc:mysql://10.0.0.192:3306/retai

2 answers
  • 2020-12-04 03:17

    According to Spark's official documentation, partitionColumn can be any numeric column, not necessarily the primary key (recent Spark versions also accept date and timestamp columns). The documentation states:

    partitionColumn must be a numeric column from the table in question.

    Reference: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
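As a rough illustration of the options involved, here is a minimal sketch of the settings for a partitioned JDBC read. The option keys are the ones listed in the Spark JDBC documentation; the table name `orders`, the column `order_id`, the bounds, and the URL placeholders are all hypothetical and only stand in for your own values.

```scala
// Hypothetical connection details: replace every value with your own.
val jdbcOptions: Map[String, String] = Map(
  "url"             -> "jdbc:mysql://<host>:3306/<database>", // placeholder URL
  "dbtable"         -> "orders",    // hypothetical table name
  "user"            -> "<user>",
  "password"        -> "<password>",
  "partitionColumn" -> "order_id",  // hypothetical stable numeric column, not necessarily a key
  "lowerBound"      -> "1",         // bounds only decide the partition stride
  "upperBound"      -> "1000000",
  "numPartitions"   -> "8"
)
```

A map like this can be passed to `spark.read.format("jdbc").options(jdbcOptions).load()`. Note that, per the documentation, `lowerBound` and `upperBound` only determine the partition stride; they do not filter rows, so rows outside the bounds are still returned.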

  • 2020-12-04 03:28

    Can I use a pseudocolumn (like ROWNUM in Oracle or RRN(employeeno) in DB2)?

    TL;DR Probably no.

    While Spark doesn't consider constraints like PRIMARY KEY or UNIQUE, there is a very important requirement for partitionColumn that is not explicitly stated in the documentation: it has to be deterministic.

    Each executor fetches its own piece of data in a separate transaction. If the numeric column is not deterministic (stable, preserved between transactions), the state of the data seen by Spark might be inconsistent, and records might be duplicated or skipped.
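To make the "own piece of data" part concrete, here is a simplified sketch of how a range of values can be split into one WHERE clause per partition. It is an illustration only, not Spark's actual implementation; the function name `partitionPredicates` and the column `id` are made up for the example.

```scala
// Simplified model of range partitioning for a JDBC read.
// Illustration only: this is not Spark's actual code.
def partitionPredicates(
    column: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int): Seq[String] = {
  require(numPartitions >= 2, "need at least two partitions")
  require(upperBound - lowerBound >= numPartitions, "range too small to split")

  val stride = (upperBound - lowerBound) / numPartitions
  // Interior boundaries between consecutive partitions.
  val cuts = (1 until numPartitions).map(i => lowerBound + i * stride)

  // The first and last partitions are open-ended, so no row is filtered out.
  val first = s"$column < ${cuts.head} OR $column IS NULL"
  val middle = cuts.sliding(2).collect {
    case Seq(lo, hi) => s"$column >= $lo AND $column < $hi"
  }.toSeq
  val last = s"$column >= ${cuts.last}"

  (first +: middle) :+ last
}

// Four partitions over the hypothetical column "id" with bounds 0 and 100.
val predicates = partitionPredicates("id", 0L, 100L, 4)
```

Each of these predicates is evaluated by a different query, so they only cover the table exactly once if every query assigns the same value of the column to the same row.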

    Because ROWNUM implementations are usually volatile (they depend on unstable ordering and can be affected by features like indexing), they are not a safe choice for partitionColumn. For the same reason, you cannot use random numbers.
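A toy model may help show what goes wrong. The sketch below does not talk to any database; it only assumes that two queries scan the same four rows in different orders (the orders are invented for the example) and number them as they go.

```scala
// Toy model: a volatile row number is re-assigned by every query,
// following whatever order that query happens to scan the rows in.
def withRowNumbers(scanOrder: Seq[String]): Seq[(Int, String)] =
  scanOrder.zipWithIndex.map { case (row, i) => (i + 1, row) }

val table = Seq("a", "b", "c", "d")

// Partition 0 asks for "row number < 3" and happens to scan a, b, c, d.
val partition0 = withRowNumbers(Seq("a", "b", "c", "d"))
  .collect { case (n, row) if n < 3 => row }

// Partition 1 asks for "row number >= 3" in another transaction
// that happens to scan d, c, b, a.
val partition1 = withRowNumbers(Seq("d", "c", "b", "a"))
  .collect { case (n, row) if n >= 3 => row }

val fetched = partition0 ++ partition1

// Rows that were read by more than one partition.
val duplicated = fetched.groupBy(identity)
  .collect { case (row, hits) if hits.size > 1 => row }
  .toSet

// Rows that no partition read at all.
val skipped = table.toSet -- fetched.toSet
```

In this model rows a and b are read twice while c and d are never read, even though each query on its own looks correct.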

    Also, some vendors might further limit the operations allowed on pseudocolumns, making them unsuitable as a partitioning column. For example, the Oracle documentation says of ROWNUM:

    Conditions testing for ROWNUM values greater than a positive integer are always false.

    so range predicates on such a column might fail silently, leading to incorrect results.
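The quoted behaviour can be modelled in a few lines. This is a toy sketch, assuming (as the Oracle documentation describes) that the row number is assigned to a candidate row before the condition is checked and only advances when the row is accepted; `selectWhereRownum` is a made-up helper, not a real API.

```scala
// Toy model of Oracle-style ROWNUM: the counter advances only when a row
// is accepted, so each candidate row is tested against the next free number.
def selectWhereRownum(rows: Seq[String])(condition: Int => Boolean): Seq[String] = {
  var nextRownum = 1
  val accepted = Seq.newBuilder[String]
  for (row <- rows) {
    if (condition(nextRownum)) {
      accepted += row
      nextRownum += 1
    }
  }
  accepted.result()
}

val sampleRows = Seq("a", "b", "c", "d")

// "ROWNUM <= 2" keeps the first two rows, as expected.
val lowerPart = selectWhereRownum(sampleRows)(_ <= 2)

// "ROWNUM > 2" keeps nothing: every candidate is tested as row number 1.
val upperPart = selectWhereRownum(sampleRows)(_ > 2)
```

Under this model every partition except the first would come back empty without any error, which is the silent failure described above.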

    Can we specify a partition column which is not a primary key?

    Yes, as long as it satisfies the criteria described above.
