I am using a query to fetch data from MySQL as follows:
var df = spark.read.format("jdbc")
.option("url", "jdbc:mysql://10.0.0.192:3306/retai
As per Spark's official documentation, the partitionColumn can be any numeric column (not necessarily a primary key column).
partitionColumn must be a numeric column from the table in question.
Reference: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
can I use a pseudo column (like ROWNUM in Oracle or RRN(employeeno) in DB2)
TL;DR Probably no.
While Spark doesn't consider constraints like PRIMARY KEY or UNIQUE, there is a very important requirement for partitionColumn which is not explicitly stated in the documentation: it has to be deterministic.
Each executor fetches its own piece of data using a separate transaction. If the numeric column is not deterministic (stable, preserved between transactions), the state of the data seen by Spark might be inconsistent, and records might be duplicated or skipped.
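To make the "own piece of data" idea concrete, here is a minimal sketch in plain Scala of how a numeric range can be split into one predicate per partition. It only approximates what Spark does internally; the function name `partitionPredicates` and the example column `id` are illustrative assumptions, not part of Spark's API.

```scala
// Simplified sketch: split [lowerBound, upperBound) into numPartitions ranges
// and build one WHERE-clause predicate per partition.
// This approximates Spark's behaviour; it is not Spark's actual code.
def partitionPredicates(
    column: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int): Seq[String] = {
  require(numPartitions >= 2, "this sketch only covers two or more partitions")
  require(lowerBound < upperBound, "lowerBound must be below upperBound")
  val stride = (upperBound - lowerBound) / numPartitions
  require(stride >= 1, "range is too small for this many partitions")

  // Interior boundaries between consecutive partitions.
  val boundaries = (1 until numPartitions).map(i => lowerBound + i * stride)

  // The first and last partitions are open-ended, so rows outside the bounds
  // (and NULLs) still land somewhere.
  val first = s"$column < ${boundaries.head} OR $column IS NULL"
  val middle = boundaries.sliding(2).collect {
    case Seq(lo, hi) => s"$column >= $lo AND $column < $hi"
  }.toSeq
  val last = s"$column >= ${boundaries.last}"

  (first +: middle) :+ last
}

val predicates = partitionPredicates("id", 0, 100, 4)
predicates.foreach(println)
```

Each predicate is evaluated by a separate query. A row is fetched exactly once only if its column value falls into the same range no matter which of those queries looks at it.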
Because ROWNUM implementations are usually volatile (they depend on non-stable ordering and can be affected by features like indexing), they are not a safe choice for partitionColumn. For the same reason you cannot use random numbers.
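The effect of a volatile row number can be simulated without a database. The sketch below is a toy model in plain Scala: the six-row table, the two orderings and the helper `withRowNumbers` are all assumptions made up for illustration. Two partition queries number the same rows in different orders and each keeps its own range of row numbers.

```scala
// Hypothetical table, represented only by its stable ids.
val ids = Seq(1, 2, 3, 4, 5, 6)

// Number rows 1..n in whatever order a given query happens to return them.
def withRowNumbers(ordered: Seq[Int]): Seq[(Int, Int)] =
  ordered.zipWithIndex.map { case (id, idx) => (idx + 1, id) }

// Partition 0 keeps row numbers < 4 and happens to see ascending order.
val partition0 = withRowNumbers(ids).collect { case (rn, id) if rn < 4 => id }

// Partition 1 keeps row numbers >= 4 but happens to see descending order.
val partition1 = withRowNumbers(ids.reverse).collect { case (rn, id) if rn >= 4 => id }

val fetched = partition0 ++ partition1

// Ids returned more than once, and ids never returned at all.
val duplicated = fetched.diff(fetched.distinct).distinct.sorted
val skipped = ids.diff(fetched)

println(s"duplicated: $duplicated, skipped: $skipped")
```

In this model half of the rows are read twice and the other half are never read, even though each individual query looks perfectly valid.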
Also, some vendors might further limit the operations allowed on pseudocolumns, making them unsuitable for use as a partitioning column. For example, the Oracle documentation says of ROWNUM:
"Conditions testing for ROWNUM values greater than a positive integer are always false."
Partitioning on such a column might fail silently, leading to incorrect results.
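The quoted ROWNUM behaviour follows from when the number is assigned: a candidate row gets the next number, and the counter only advances if the row passes the filter. The following toy model in plain Scala illustrates that rule; `filterByRownum` is a hypothetical helper written for this sketch, not an Oracle or Spark API.

```scala
// Toy model of ROWNUM-style numbering: each candidate row is numbered
// (rows accepted so far + 1), and the counter advances only on acceptance.
def filterByRownum[A](rows: Seq[A])(predicate: Int => Boolean): Seq[A] =
  rows.foldLeft(Vector.empty[A]) { (accepted, row) =>
    val rownum = accepted.size + 1
    if (predicate(rownum)) accepted :+ row else accepted
  }

val rows = Seq("a", "b", "c", "d")

// "ROWNUM > 2": every candidate is numbered 1 and rejected, so nothing comes back.
val upperRange = filterByRownum(rows)(_ > 2)

// "ROWNUM <= 2": the first two rows are accepted as expected.
val lowerRange = filterByRownum(rows)(_ <= 2)

println(s"upper: $upperRange, lower: $lowerRange")
```

Under this model any partition whose predicate has a lower bound on the row number returns no rows, and no error is raised.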
can we specify a partition column which is not a primary key
Yes, as long as it satisfies the criteria described above.
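As a rough sketch of what that could look like, the snippet below reads a table partitioned on a numeric column that is not a key. The table name `orders`, the column `customer_id`, the bounds and the partition count are all hypothetical, and the JDBC URL and credentials are left to the caller.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example: `orders` and `customer_id` are made-up names.
// `customer_id` is numeric and stable between queries, but not a primary key.
def readPartitioned(spark: SparkSession, jdbcUrl: String): DataFrame =
  spark.read
    .format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", "orders")
    .option("partitionColumn", "customer_id")
    .option("lowerBound", "1")       // assumed smallest value of the column
    .option("upperBound", "100000")  // assumed largest value of the column
    .option("numPartitions", "8")
    .load()
```

The bounds only determine the partition stride and do not filter rows. With a non-unique column, heavily repeated values can make some partitions much larger than others.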