I have the following data frame:
+---+---+------+
| id| ts|days_r|
+---+---+------+
|123|  T|    32|
|342|  I|     3|
|349|  L|    10|
+---+---+------+
Your code has a bug: you are missing a set of parentheses on the third line. Here is a way to fix your code, using chained when() statements instead of multiple otherwise() statements:
from pyspark.sql import functions as F

df = df.withColumn(
    '0to2_count',
    F.when((F.col('ts') == 'I') & (F.col('days_r') >= 0) & (F.col('days_r') <= 2), 1)
    .when((F.col('ts') == 'T') & (F.col('days_r') >= 0) & (F.col('days_r') <= 48), 1)
    .when((F.col('ts') == 'L') & (F.col('days_r') >= 0) & (F.col('days_r') <= 7), 1)
    .otherwise(0)
)
An even better way to write this logic is to use pyspark.sql.Column.between():
df = df.withColumn(
    '0to2_count',
    F.when((F.col('ts') == 'I') & F.col('days_r').between(0, 2), 1)
    .when((F.col('ts') == 'T') & F.col('days_r').between(0, 48), 1)
    .when((F.col('ts') == 'L') & F.col('days_r').between(0, 7), 1)
    .otherwise(0)
)
df.show()
#+---+---+------+----------+
#| id| ts|days_r|0to2_count|
#+---+---+------+----------+
#|123| T| 32| 1|
#|342| I| 3| 0|
#|349| L| 10| 0|
#+---+---+------+----------+
Of course, since the first three conditions all return the same value, you could simplify this further into a single Boolean condition.