How do I get the last item from a list using pyspark?

感情败类 2021-01-12 03:04

Why does column 1st_from_end contain null:

from pyspark.sql.functions import split
df = sqlContext.createDataFrame([('a b c d',)], ['s',])
d         


        
4 answers
  • 2021-01-12 03:36

    If you're using Spark >= 2.4.0 see jxc's answer below.

    In Spark < 2.4.0, the DataFrame API did not support -1 indexing on arrays, but you could write your own UDF or use the built-in size() function, for example:

    >>> from pyspark.sql.functions import size
    >>> splitted = df.select(split(df.s, ' ').alias('arr'))
    >>> splitted.select(splitted.arr[size(splitted.arr)-1]).show()
    +--------------------+
    |arr[(size(arr) - 1)]|
    +--------------------+
    |                   d|
    +--------------------+
    
  • 2021-01-12 03:52

    Building on jamiet's solution, we can simplify even further by removing one reverse call (reversing an array column requires Spark 2.4+):

    from pyspark.sql.functions import split, reverse
    
    df = sqlContext.createDataFrame([('a b c d',)], ['s',])
    df.select(   split(df.s, ' ')[0].alias('0th'),
                 split(df.s, ' ')[3].alias('3rd'),
                 reverse(split(df.s, ' '))[0].alias('1st_from_end')
             ).show()
    
  • 2021-01-12 03:54

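    A UDF of your own would look like this:

        from pyspark.sql import functions as F
        from pyspark.sql.functions import split

        def get_last_element(l):
            return l[-1]  # plain Python indexing: -1 is the last item

        # Wrap the function as a UDF (the return type defaults to string)
        get_last_element_udf = F.udf(get_last_element)
        df.select(get_last_element_udf(split(df.s, ' ')).alias('1st_from_end')).show()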
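
One caveat with a bare `l[-1]` in the UDF above: it raises on an empty list and on `None` (a null row). A minimal sketch of a more defensive helper follows, written in plain Python so it can be checked without a Spark session. The name `safe_last` is hypothetical and not part of the original answer.

```python
from typing import Optional, Sequence


def safe_last(items: Optional[Sequence[str]]) -> Optional[str]:
    """Return the last element, or None for a null or empty sequence."""
    # A null column value reaches a Python UDF as None; [] has no last element.
    if not items:
        return None
    return items[-1]


# Quick check in plain Python; wrapping the helper with
# pyspark.sql.functions.udf is left to the caller.
print(safe_last(["a", "b", "c", "d"]))  # d
print(safe_last([]))                    # None
print(safe_last(None))                  # None
```

Returning None here mirrors how the built-in array functions yield null rather than failing, which is usually the less surprising behaviour in a DataFrame pipeline.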
    
  • 2021-01-12 03:55

    For Spark 2.4+, use pyspark.sql.functions.element_at. From the documentation:

    element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array.

    from pyspark.sql.functions import element_at, split, col
    
    df = spark.createDataFrame([('a b c d',)], ['s',])
    
    df.withColumn('arr', split(df.s, ' ')) \
      .select( col('arr')[0].alias('0th')
             , col('arr')[3].alias('3rd')
             , element_at(col('arr'), -1).alias('1st_from_end')
         ).show()
    
    +---+---+------------+
    |0th|3rd|1st_from_end|
    +---+---+------------+
    |  a|  d|           d|
    +---+---+------------+
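
The indexing rules quoted from the documentation are easy to mix up with the 0-based `col('arr')[i]` syntax used for the `0th` and `3rd` columns. The sketch below models the quoted rules in plain Python. It is only an illustration, not Spark's implementation: `element_at_model` is a hypothetical name, and rejecting index 0 is an assumption on my part, since the quoted text does not say what happens there.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def element_at_model(arr: Sequence[T], index: int) -> Optional[T]:
    """Plain-Python model of the quoted element_at rules (not Spark code)."""
    if index == 0:
        # Assumption: with 1-based indexing, 0 is not a valid position.
        raise ValueError("index must be non-zero (1-based)")
    if abs(index) > len(arr):
        # "Returns NULL if the index exceeds the length of the array."
        return None
    # Positive: shift to 0-based. Negative: Python already counts from the end.
    return arr[index - 1] if index > 0 else arr[index]


arr = "a b c d".split(" ")
print(element_at_model(arr, 1))   # a
print(element_at_model(arr, -1))  # d
print(element_at_model(arr, 5))   # None
```

So `element_at(col('arr'), 4)` and `col('arr')[3]` point at the same element, while only `element_at` accepts negative positions.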
    