Pseudocolumn in Spark JDBC

Posted by 你离开我真会死。 on 2019-11-26 18:35:13

Question


I am using the following query to fetch data from MySQL:

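// Read the orders table over JDBC, split into 4 parallel partitions on order_id
val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://10.0.0.192:3306/retail_db")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "retail_dba")
  .option("password", "cloudera")
  .option("dbtable", "orders")
  .option("partitionColumn", "order_id") // numeric column used to split the read
  .option("lowerBound", "1")             // bounds define the partition stride
  .option("upperBound", "68883")
  .option("numPartitions", "4")          // number of parallel JDBC queries
  .load()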

The question is: can I use a pseudocolumn (like ROWNUM in Oracle or RRN(employeeno) in DB2) as the value of the partitionColumn option?

If not, can we specify a partition column that is not a primary key?


Answer 1:


can I use a pseudo column (like ROWNUM in Oracle or RRN(employeeno) in DB2)

TL;DR: Probably not.

Spark does not take constraints such as PRIMARY KEY or UNIQUE into account, but there is one very important requirement for partitionColumn that the documentation does not state explicitly: it has to be deterministic.

Each executor fetches its own piece of the data in a separate transaction. If the numeric column is not deterministic (stable, preserved between transactions), the state of the data seen by Spark might be inconsistent, and records might be duplicated or skipped.

Because ROWNUM implementations are usually volatile (they depend on unstable ordering and can be affected by features like indexing), they are not a safe choice for partitionColumn. For the same reason, you cannot use random numbers.
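To see why a stable column matters, it helps to look at the kind of per-partition filters Spark derives from partitionColumn, lowerBound, upperBound and numPartitions. The following is a simplified sketch, not Spark's actual code: the object and method names are made up for illustration, and real Spark versions differ in details such as how the stride is rounded.

```scala
object PartitionPredicates {
  // Simplified model (an assumption, not Spark's source) of how a JDBC read
  // is split into one WHERE clause per partition.
  def whereClauses(column: String, lowerBound: Long, upperBound: Long,
                   numPartitions: Int): Seq[String] = {
    require(numPartitions >= 2, "a single partition needs no predicate")
    require(lowerBound <= upperBound, "lowerBound must not exceed upperBound")

    // Width of each partition's value range (integer division).
    val stride = upperBound / numPartitions - lowerBound / numPartitions

    (0 until numPartitions).map { i =>
      val start = lowerBound + i * stride
      val end = start + stride
      if (i == 0) {
        // First partition is open below and also collects NULLs.
        s"$column < $end OR $column IS NULL"
      } else if (i == numPartitions - 1) {
        // Last partition is open above.
        s"$column >= $start"
      } else {
        s"$column >= $start AND $column < $end"
      }
    }
  }
}
```

Under this model, the options from the question produce four independent queries with filters such as `order_id >= 17221 AND order_id < 34441`. Each query is evaluated on its own, so a row only lands in exactly one partition if its order_id is the same in every query. The first and last ranges are open-ended, so the bounds shape the stride rather than filter rows.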

Also, some vendors might further limit the operations allowed on pseudocolumns, making them unsuitable for use as a partitioning column. For example, range filters on Oracle ROWNUM might fail silently, leading to incorrect results, because, as the Oracle documentation states:

Conditions testing for ROWNUM values greater than a positive integer are always false.
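The ROWNUM behaviour can be illustrated with a toy model in plain Scala. This is only a sketch of the documented semantics, not Oracle code, and `RownumModel` and `select` are hypothetical names: a candidate row is tentatively given the next number, and the counter advances only if the row passes the filter.

```scala
object RownumModel {
  // Toy model of ROWNUM: each candidate row is numbered
  // (rows returned so far + 1); the number is consumed only when the
  // row satisfies the filter on that number.
  def select[A](rows: Seq[A])(rownumFilter: Int => Boolean): Seq[A] = {
    val out = Vector.newBuilder[A]
    var next = 1
    for (row <- rows) {
      if (rownumFilter(next)) {
        out += row
        next += 1
      }
    }
    out.result()
  }
}
```

In this model, `RownumModel.select(rows)(_ <= 2)` returns the first two rows, while any filter that excludes 1, such as `_ > 2`, returns nothing because the counter never gets past 1. A range predicate on ROWNUM for the second or later partition would therefore come back empty without raising an error.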

can we specify a partition column which is not a primary key

Yes, as long as it satisfies the criteria described above.
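As a rough sketch of what that could look like, the partitioning options can be built for any stable numeric column. The column `order_customer_id` and its bounds below are assumptions made purely for illustration (they are not taken from the question), and `NonKeyPartitioning` is a hypothetical helper.

```scala
object NonKeyPartitioning {
  // Builds the four partitioning options of the JDBC source for a given column.
  def partitionOptions(column: String, lowerBound: Long, upperBound: Long,
                       numPartitions: Int): Map[String, String] = Map(
    "partitionColumn" -> column,
    "lowerBound"      -> lowerBound.toString,
    "upperBound"      -> upperBound.toString,
    "numPartitions"   -> numPartitions.toString
  )

  // Assumed example: a non-unique but stable numeric column with
  // illustrative bounds; replace with values from your own table.
  val byCustomer: Map[String, String] =
    partitionOptions("order_customer_id", 1L, 12435L, 4)
}
```

The resulting map can be passed to the reader in place of the four individual option calls, for example `spark.read.format("jdbc").option("url", url).option("dbtable", "orders").options(NonKeyPartitioning.byCustomer).load()`, where `url` stands for the connection string from the question.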




Answer 2:


According to Spark's official documentation, partitionColumn can be any numeric column; it does not have to be the primary key. Newer Spark releases also list date and timestamp columns as allowed.

partitionColumn must be a numeric column from the table in question.

Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases



Source: https://stackoverflow.com/questions/47615975/pseudocolumn-in-spark-jdbc
