PySpark: Create New Column And Fill In Based on Conditions of Two Other Columns

Frontend · Unresolved · 1 answer · 1648 views
迷失自我 2021-01-03 10:40

I have the following data frame:

+---+---+------+
| id| ts|days_r|
+---+---+------+
|123|  T|    32|
|342|  I|     3|
|349|  L|    10|
+---+---+------+
1 Answer
  • 2021-01-03 11:03

    Your code has a bug: you are missing a set of parentheses on the third line. Here is a way to fix your code, using chained when() statements instead of multiple otherwise() statements:

    import pyspark.sql.functions as F

    df = df.withColumn(
        '0to2_count',
        F.when((F.col('ts') == 'I') & (F.col('days_r') >= 0) & (F.col('days_r') <= 2), 1)
        .when((F.col('ts') == 'T') & (F.col('days_r') >= 0) & (F.col('days_r') <= 48), 1)
        .when((F.col('ts') == 'L') & (F.col('days_r') >= 0) & (F.col('days_r') <= 7), 1)
        .otherwise(0)
    )
    

    An even better way to write this logic is to use pyspark.sql.Column.between():

    df = df.withColumn(
        '0to2_count',
        F.when((F.col('ts') == 'I') & F.col('days_r').between(0, 2), 1)
        .when((F.col('ts') == 'T') & F.col('days_r').between(0, 48), 1)
        .when((F.col('ts') == 'L') & F.col('days_r').between(0, 7), 1)
        .otherwise(0)
    )
    df.show()
    #+---+---+------+----------+
    #| id| ts|days_r|0to2_count|
    #+---+---+------+----------+
    #|123|  T|    32|         1|
    #|342|  I|     3|         0|
    #|349|  L|    10|         0|
    #+---+---+------+----------+
    

    Of course, since all three conditions return the same value, you could simplify this further into a single Boolean condition.
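
    To illustrate that simplification, here is a minimal, self-contained sketch: the three when() branches collapse into one Boolean expression (the branches OR-ed together), and the Boolean column is cast to an integer to get 1/0. The column name '0to2_count' and the sample data are taken from the answer above; the local SparkSession setup is added only to make the example runnable.

    ```python
    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    # Local session just for the example
    spark = SparkSession.builder.master("local[1]").appName("example").getOrCreate()

    df = spark.createDataFrame(
        [(123, "T", 32), (342, "I", 3), (349, "L", 10)],
        ["id", "ts", "days_r"],
    )

    # All three branches returned 1, so OR them into one condition
    in_range = (
        ((F.col("ts") == "I") & F.col("days_r").between(0, 2))
        | ((F.col("ts") == "T") & F.col("days_r").between(0, 48))
        | ((F.col("ts") == "L") & F.col("days_r").between(0, 7))
    )

    # Cast the Boolean to int: True -> 1, False -> 0
    df = df.withColumn("0to2_count", in_range.cast("int"))
    df.show()
    ```

    This produces the same 0to2_count values as the chained when() version, in one expression.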
