PySpark: multiple conditions in when clause

一生所求 2020-12-01 00:46

I would like to modify the cell values of a dataframe column (Age) where currently it is blank and I would only do it if another column (Survived) has the value 0 for the c

4 Answers
  • 2020-12-01 01:17

    You get a SyntaxError because Python has no && operator. It has the "and" keyword and the "&" operator, and the latter is the correct choice for building boolean expressions on Column (use | for logical disjunction and ~ for logical negation).
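To make the difference between the three spellings concrete, here is a minimal sketch that runs without Spark. ToyColumn is a hypothetical stand-in written for this illustration, not PySpark's Column class; it only mimics two behaviours of the real one: operators such as & build a new expression, and conversion to bool is refused, which is what makes the "and" keyword fail.

```python
class ToyColumn:
    """Hypothetical stand-in for a PySpark Column; it only records expression text."""

    def __init__(self, expr: str):
        self.expr = expr

    def __and__(self, other: "ToyColumn") -> "ToyColumn":
        # `a & b` builds a new expression instead of evaluating anything.
        return ToyColumn(f"({self.expr} AND {other.expr})")

    def __or__(self, other: "ToyColumn") -> "ToyColumn":
        return ToyColumn(f"({self.expr} OR {other.expr})")

    def __invert__(self) -> "ToyColumn":
        return ToyColumn(f"(NOT {self.expr})")

    def __bool__(self) -> bool:
        # The `and` / `or` / `not` keywords call bool() on their operands,
        # and a column expression has no single truth value.
        raise ValueError("Cannot convert column into bool; use &, | or ~ instead")


age_blank = ToyColumn("(Age = )")
not_survived = ToyColumn("(Survived = 0)")

# 1) `&` works: it dispatches to __and__.
combined = age_blank & not_survived
print(combined.expr)  # ((Age = ) AND (Survived = 0))

# 2) The `and` keyword fails at run time because it needs a bool.
try:
    age_blank and not_survived
    keyword_and_worked = True
except ValueError:
    keyword_and_worked = False

# 3) `&&` is not Python at all, so it fails when the source is compiled.
try:
    compile("a && b", "<demo>", "eval")
    double_ampersand_compiles = True
except SyntaxError:
    double_ampersand_compiles = False

print(keyword_and_worked, double_ampersand_compiles)  # False False
```

The error message in the sketch is made up; only the behaviour (a ValueError on bool conversion) is meant to resemble the real class.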

    The condition you wrote is also invalid because it ignores operator precedence. In Python, & binds more tightly than ==, so each comparison has to be parenthesized.

    (col("Age") == "") & (col("Survived") == "0")
    ## Column<b'((Age = ) AND (Survived = 0))'>
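The precedence point can be checked without Spark by asking Python how it parses each spelling. The following is a small sketch using the standard ast module; the names age and survived are placeholders chosen for this illustration, and nothing is evaluated.

```python
import ast

# Parse both spellings without evaluating them.
unparenthesized = ast.parse('age == "" & survived == "0"', mode="eval").body
parenthesized = ast.parse('(age == "") & (survived == "0")', mode="eval").body

# Without parentheses Python sees a chained comparison:
#   age == ("" & survived) == "0"
print(type(unparenthesized).__name__)                 # Compare
print(type(unparenthesized.comparators[0]).__name__)  # BinOp, i.e. "" & survived

# With parentheses the top-level node is the & itself, joining two comparisons.
print(type(parenthesized).__name__)                   # BinOp
print(type(parenthesized.left).__name__)              # Compare
```

So the unparenthesized form is not the conjunction of two comparisons that was intended, which is why the parentheses are required.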
    

    As a side note, the when function is equivalent to a SQL CASE expression, not a WHEN clause. The same rules still apply. Conjunction:

    df.where((col("foo") > 0) & (col("bar") < 0))
    

    Disjunction:

    df.where((col("foo") > 0) | (col("bar") < 0))
    

    You can of course define conditions separately to avoid brackets:

    cond1 = col("Age") == "" 
    cond2 = col("Survived") == "0"
    
    cond1 & cond2
    
  • 2020-12-01 01:17

    In PySpark, multiple conditions in when can be combined using & (for and) and | (for or).

    Note: In PySpark it is important to enclose each expression that forms part of the condition in parentheses ().

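    %pyspark
    from pyspark.sql.functions import col, when

    # Assumes an existing SparkSession named spark (as in a Zeppelin %pyspark paragraph).
    dataDF = spark.createDataFrame([(66, "a", "4"),
                                    (67, "a", "0"),
                                    (70, "b", "4"),
                                    (71, "d", "4")],
                                    ("id", "code", "amt"))
    # Branches are checked in order; the first matching when() wins.
    dataDF.withColumn("new_column",
           when((col("code") == "a") | (col("code") == "d"), "A")
          .when((col("code") == "b") & (col("amt") == "4"), "B")
          .otherwise("A1")).show()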
    

    In Spark Scala code, && and || can be used to combine conditions inside the when function:

    //scala
    import org.apache.spark.sql.functions.{col, when}
    import spark.implicits._  // needed for toDF; assumes a SparkSession named spark

    val dataDF = Seq(
          (66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")
          ).toDF("id", "code", "amt")
    dataDF.withColumn("new_column",
           when(col("code") === "a" || col("code") === "d", "A")
          .when(col("code") === "b" && col("amt") === "4", "B")
          .otherwise("A1")).show()
    

    =======================

    Output:
    +---+----+---+----------+
    | id|code|amt|new_column|
    +---+----+---+----------+
    | 66|   a|  4|         A|
    | 67|   a|  0|         A|
    | 70|   b|  4|         B|
    | 71|   d|  4|         A|
    +---+----+---+----------+
    

    This code snippet is copied from sparkbyexamples.com
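As a sanity check on the output table above, the branching logic can be mirrored in plain Python without Spark. This is only a sketch of the semantics (branches are tried in order, the first match wins, and otherwise supplies the fallback); the function name new_column is chosen here for illustration and is not part of any API.

```python
# Same sample rows as in the DataFrame: (id, code, amt).
rows = [(66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")]


def new_column(code: str, amt: str) -> str:
    """Plain-Python mirror of the when(...).when(...).otherwise(...) chain."""
    if code == "a" or code == "d":    # when((code == "a") | (code == "d"), "A")
        return "A"
    if code == "b" and amt == "4":    # .when((code == "b") & (amt == "4"), "B")
        return "B"
    return "A1"                       # .otherwise("A1")


labels = [new_column(code, amt) for _id, code, amt in rows]
print(labels)  # ['A', 'A', 'B', 'A']
```

On scalar values the keywords and/or are the right tools; the & and | operators are needed only when the operands are Column expressions.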

  • This should work, at least in PySpark 2.4:

    tdata = tdata.withColumn("Age",  when((tdata.Age == "") & (tdata.Survived == "0") , "NewValue").otherwise(tdata.Age))
    
  • 2020-12-01 01:25

    It should be:

    when(((tdata.Age == "") & (tdata.Survived == "0")), mean_age_0)
    