I have a input dataframe: df_input (updated df_input)
|comment|inp_col|inp_val|
|11 |a |a1 |
|12 |a |a2 |
Try this, self-join
with collected list
on rlike join condition
is the way to go.
df.show() #sampledataframe
#+-------+---------+---------+
#|comment|input_col|input_val|
#+-------+---------+---------+
#| 11| a| 1|
#| 12| a| 2|
#| 15| b| 5|
#| 16| b| 6|
#| 17| c| &b|
#| 17| c| 7|
#+-------+---------+---------+
df.join(df.groupBy("input_col").agg(F.collect_list("input_val").alias("y1"))\
.withColumnRenamed("input_col","x1"),F.expr("""input_val rlike x1"""),'left')\
.withColumn("new_col", F.when(F.col("input_val").cast("int").isNotNull(), F.array("input_val"))\
.otherwise(F.col("y1"))).drop("x1","y1").show()
#+-------+---------+---------+-------+
#|comment|input_col|input_val|new_col|
#+-------+---------+---------+-------+
#| 11| a| 1| [1]|
#| 12| a| 2| [2]|
#| 15| b| 5| [5]|
#| 16| b| 6| [6]|
#| 17| c| &b| [5, 6]|
#| 17| c| 7| [7]|
#+-------+---------+---------+-------+