I need to implement an auto-increment column in my Spark SQL table. How can I do that? Kindly guide me. I am using PySpark 2.0.
Thank you Kalyan
I would write or reuse a stateful Hive UDF and register it with PySpark, since Spark SQL has good support for Hive UDFs.
Check the line @UDFType(deterministic = false, stateful = true)
in the code below; it is what marks the UDF as stateful.
package org.apache.hadoop.hive.contrib.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;

/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true)
public class UDFRowSequence extends UDF
{
  private LongWritable result = new LongWritable();

  public UDFRowSequence() {
    result.set(0);
  }

  public LongWritable evaluate() {
    result.set(result.get() + 1);
    return result;
  }
}
// End UDFRowSequence.java
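To make the "stateful" part concrete, here is a rough pure-Python model of the same counter. It is only an illustration, not Hive or Spark API: the RowSequence class and the number_rows helper are hypothetical names, and the idea that each partition or task gets its own UDF instance is an assumption you should verify on your own cluster.

```python
class RowSequence(object):
    """Pure-Python stand-in for the Java UDFRowSequence above (illustration only)."""

    def __init__(self):
        # Mirrors result.set(0) in the Java constructor.
        self.result = 0

    def evaluate(self):
        # Mirrors result.set(result.get() + 1); the state lives in the instance.
        self.result += 1
        return self.result


def number_rows(partitions):
    """Number rows as if one UDF instance were created per partition (assumed)."""
    numbered = []
    for rows in partitions:
        seq = RowSequence()  # assumption: a fresh instance for every partition/task
        numbered.append([(seq.evaluate(), row) for row in rows])
    return numbered


# One partition: the sequence runs 1..4 without gaps.
single = number_rows([["a", "b", "c", "d"]])
# Two partitions: each instance starts again at 1, so the numbers repeat.
split = number_rows([["a", "b"], ["c", "d"]])

print(single)
print(split)
```

If that assumption holds, the generated numbers are only unique within a partition, so a truly global sequence would need the data to pass through a single partition.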
Now build the jar and pass its location when starting pyspark.
$ pyspark --jars your_jar_name.jar
Then register the function with sqlContext:
sqlContext.sql("CREATE TEMPORARY FUNCTION row_seq AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence'")
Now use row_seq() in a SELECT query:
sqlContext.sql("SELECT row_seq(), col1, col2 FROM table_name")
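The two statements above could be wrapped in small helpers, roughly like this. It is a minimal sketch: register_row_seq and select_with_row_seq are hypothetical names, the row_id alias is my own addition, and the only thing assumed about the context object is that it has the sql() method used above.

```python
# Class name taken from the Java source above.
UDF_CLASS = "org.apache.hadoop.hive.contrib.udf.UDFRowSequence"


def register_row_seq(sql_context, function_name="row_seq"):
    """Register the Hive UDF as a temporary function (call once per session)."""
    statement = "CREATE TEMPORARY FUNCTION {0} AS '{1}'".format(
        function_name, UDF_CLASS
    )
    return sql_context.sql(statement)


def select_with_row_seq(sql_context, table_name, columns, function_name="row_seq"):
    """Select the given columns with a generated sequence column in front."""
    query = "SELECT {0}() AS row_id, {1} FROM {2}".format(
        function_name, ", ".join(columns), table_name
    )
    return sql_context.sql(query)


# Usage inside a pyspark shell started with --jars your_jar_name.jar:
#   register_row_seq(sqlContext)
#   df = select_with_row_seq(sqlContext, "table_name", ["col1", "col2"])
#   df.show()
```

Keeping registration separate from the query means the function is only created once, however many SELECTs you run afterwards.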
Project to use Hive UDFs in pySpark