PySpark alter column with substring

耶瑟儿~ 2021-01-04 08:20

PySpark n00b... How do I replace a column with a substring of itself? I'm trying to remove a set number of characters from the start and end of a string.

4 answers
  •  执笔经年
    2021-01-04 08:50

    The accepted answer uses a udf (user-defined function), which is usually much slower than native Spark code. Grant Shannon's answer does use native Spark code, but as citynorman noted in the comments, it is not entirely clear how it handles strings of varying length.

    Answer with native Spark code (no udf) and variable string length

    According to the PySpark documentation for substr, the arguments startPos and length can each be either an int or a Column, but both must be the same type. So we just need to create a column that contains the string length and use it to build the length argument.

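    import pyspark.sql.functions as sf
    
    result = (
        df
        # helper column holding each string's length
        .withColumn('length', sf.length('COLUMN_NAME'))
        # substr is 1-based: start at position 2 and keep length - 2 characters
        .withColumn('fixed_in_spark', sf.col('COLUMN_NAME').substr(sf.lit(2), sf.col('length') - sf.lit(2)))
    )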
    
    # result:
    +----------------+---------------+----+--------------+
    |     COLUMN_NAME|COLUMN_NAME_fix|size|fixed_in_spark|
    +----------------+---------------+----+--------------+
    |        _string_|         string|   8|        string|
    |_another string_| another string|  16|another string|
    +----------------+---------------+----+--------------+
    

    Note:

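    • We use length - 2 because substr positions are 1-based: starting at position 2 skips the first character, and keeping length - 2 characters stops just before the last one.
    • We need sf.lit(2) for the start position because substr requires both arguments to be the same type; the length argument is a Column, so the start must be a Column too. The sf.lit inside the subtraction is optional, since subtracting a plain int from a Column already yields a Column.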
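
The index arithmetic in the notes above generalizes to dropping any number of leading and trailing characters. Here is a minimal sketch in plain Python (no Spark session needed) for checking that arithmetic. The helper names substr_args and emulate_substr are hypothetical, not part of PySpark, and the emulation assumes a start position of at least 1 and that a non-positive length gives an empty string.

```python
def substr_args(n_start, n_end, total_length):
    """Return (start_pos, length) for a 1-based substr call that drops
    n_start leading and n_end trailing characters."""
    start_pos = n_start + 1                   # 1-based: skip n_start characters
    length = total_length - n_start - n_end   # characters left after trimming both ends
    return start_pos, length


def emulate_substr(text, start_pos, length):
    """Pure-Python stand-in for a 1-based substr(start_pos, length).

    Assumptions (not taken from the original answer): start_pos >= 1,
    and a non-positive length yields an empty string.
    """
    if length <= 0:
        return ""
    begin = start_pos - 1                     # convert to Python's 0-based index
    return text[begin:begin + length]


# Same sample values as in the result table of the answer.
for value in ["_string_", "_another string_"]:
    start_pos, length = substr_args(1, 1, len(value))
    print(repr(value), "->", repr(emulate_substr(value, start_pos, length)))
```

With n_start = n_end = 1 this gives start position 2 and length - 2, matching the answer. In PySpark the same arithmetic would presumably read sf.col(c).substr(sf.lit(n_start + 1), sf.length(c) - sf.lit(n_start + n_end)), where c is the column name.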
