How to implement auto-increment in Spark SQL (PySpark)

耶瑟儿~ 2021-01-14 17:20

I need to implement an auto-increment column in my Spark SQL table. How can I do that? Kindly guide me. I am using PySpark 2.0.

Thank you, Kalyan

1 Answer
  • 2021-01-14 17:29

    I would write/reuse a stateful Hive UDF and register it with PySpark, since Spark SQL has good support for Hive.

    Check the line @UDFType(deterministic = false, stateful = true) in the code below; it is what marks the UDF as stateful.

    package org.apache.hadoop.hive.contrib.udf;
    
    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.hive.ql.udf.UDFType;
    import org.apache.hadoop.io.LongWritable;
    
    /**
     * UDFRowSequence.
     */
    @Description(name = "row_sequence",
        value = "_FUNC_() - Returns a generated row sequence number starting from 1")
    // stateful = true tells Hive the UDF keeps state between calls,
    // so it is re-evaluated for every row instead of being constant-folded.
    @UDFType(deterministic = false, stateful = true)
    public class UDFRowSequence extends UDF
    {
      // Counter shared across all evaluate() calls on this UDF instance
      private LongWritable result = new LongWritable();
    
      public UDFRowSequence() {
        result.set(0);  // start at 0 so the first call returns 1
      }
    
      public LongWritable evaluate() {
        result.set(result.get() + 1);  // increment and return the next number
        return result;
      }
    }
    
    // End UDFRowSequence.java
    

    Now build the JAR and pass its location when PySpark starts:

    $ pyspark --jars your_jar_name.jar
    

    Then register the function with sqlContext:

    sqlContext.sql("CREATE TEMPORARY FUNCTION row_seq AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence'")
    

    Now use row_seq() in a SELECT query:

    sqlContext.sql("SELECT row_seq(), col1, col2 FROM table_name")
    
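    One caveat worth noting: the counter above lives inside each executor JVM, so on a multi-partition table every task starts its own sequence and the ids will repeat across partitions. A minimal sketch of one workaround, assuming the table is small enough to collapse into a single partition (the view name table_name_single is made up for the illustration):

    # Force a single partition so one task (hence one counter) sees every row.
    df = sqlContext.table("table_name").coalesce(1)
    df.createOrReplaceTempView("table_name_single")  # hypothetical view name
    sqlContext.sql("SELECT row_seq() AS id, col1, col2 FROM table_name_single").show()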

    See also: a project to use Hive UDFs in PySpark.
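
    If adding a Hive UDF is more machinery than you want, a pure-PySpark alternative (not part of the answer above, just a common substitute in Spark 2.0) is to wrap monotonically_increasing_id() in row_number(), which yields a gapless 1..N sequence:

    # Sketch of an alternative without any Hive UDF.
    # monotonically_increasing_id() gives unique but non-contiguous ids;
    # row_number() over them produces a contiguous 1..N sequence.
    # Note the unpartitioned window pulls all rows through a single task.
    from pyspark.sql.functions import monotonically_increasing_id, row_number
    from pyspark.sql.window import Window

    df = sqlContext.table("table_name")
    w = Window.orderBy(monotonically_increasing_id())
    df.withColumn("id", row_number().over(w)).show()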
