PySpark DataFrame withColumn command not working

独厮守ぢ 2021-01-16 17:29

I have an input DataFrame, df_input (updated df_input):

|comment|inp_col|inp_val|
|11     |a      |a1     |
|12     |a      |a2     |
         


        
3 answers
  •  有刺的猬
    2021-01-16 18:07

    Try this: a self-join against the per-column collected list, using an rlike join condition, is the way to go.

    from pyspark.sql import functions as F

    df.show()  # sample DataFrame
    
    #+-------+---------+---------+
    #|comment|input_col|input_val|
    #+-------+---------+---------+
    #|     11|        a|        1|
    #|     12|        a|        2|
    #|     15|        b|        5|
    #|     16|        b|        6|
    #|     17|        c|       &b|
    #|     17|        c|        7|
    #+-------+---------+---------+
    
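    # Collect every input_val per input_col, then left-join that list onto rows
    # whose input_val contains the column name (e.g. "&b" matches x1 = "b").
    lookup = df.groupBy("input_col").agg(F.collect_list("input_val").alias("y1"))\
               .withColumnRenamed("input_col", "x1")

    # Integer-like values become a one-element array; references take the joined list.
    df.join(lookup, F.expr("input_val rlike x1"), "left")\
      .withColumn("new_col", F.when(F.col("input_val").cast("int").isNotNull(), F.array("input_val"))\
                              .otherwise(F.col("y1")))\
      .drop("x1", "y1").show()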
    
    #+-------+---------+---------+-------+
    #|comment|input_col|input_val|new_col|
    #+-------+---------+---------+-------+
    #|     11|        a|        1|    [1]|
    #|     12|        a|        2|    [2]|
    #|     15|        b|        5|    [5]|
    #|     16|        b|        6|    [6]|
    #|     17|        c|       &b| [5, 6]|
    #|     17|        c|        7|    [7]|
    #+-------+---------+---------+-------+
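
To see why the join yields these rows, here is a rough pure-Python sketch of the same per-row rule, with no Spark involved. The names `values_by_col`, `is_int`, and `resolve` are made up for illustration, and `is_int` only approximates `cast("int").isNotNull()`. Unlike the left join, which would emit one row per matching group, this sketch returns only the first match.

```python
import re
from collections import defaultdict

# Sample rows copied from the answer: (comment, input_col, input_val).
rows = [
    (11, "a", "1"),
    (12, "a", "2"),
    (15, "b", "5"),
    (16, "b", "6"),
    (17, "c", "&b"),
    (17, "c", "7"),
]

# Rough equivalent of groupBy("input_col").agg(collect_list("input_val")).
values_by_col = defaultdict(list)
for _, col, val in rows:
    values_by_col[col].append(val)


def is_int(text):
    """Approximate stand-in for cast("int").isNotNull() on a string value."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def resolve(val):
    """Return [val] for integer-like values, else the list of the group val refers to."""
    if is_int(val):
        return [val]
    # rlike is an unanchored regex match: the group name must occur inside val.
    for col, collected in values_by_col.items():
        if re.search(col, val):
            return collected
    return None  # no group matched: the left join would leave new_col null


new_col = [resolve(val) for _, _, val in rows]
print(new_col)
```

Because `rlike` is unanchored, a group name that happens to be a substring of an unrelated value would also match, so real column names may need a stricter pattern than the bare name.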
    
