Multiple condition filter on dataframe

栀梦 2020-12-09 09:39

Can anyone explain to me why I am getting different results for these two expressions? I am trying to filter between two dates:

df.filter(\"act_date <=\'2017         


        
2 Answers
  • 2020-12-09 10:08

    TL;DR To pass multiple conditions to filter or where, use Column objects and logical operators (&, |, ~). See Pyspark: multiple conditions in when clause.

    from pyspark.sql.functions import col

    df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))
    

    You can also use a single SQL string:

    df.filter("act_date >='2016-10-01' AND act_date <='2017-04-01'")
    

    In practice it makes more sense to use between:

    df.filter(col("act_date").between("2016-10-01", "2017-04-01"))
    df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'")
    
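    Note that between is inclusive on both bounds. As a quick sanity check, all three forms select the same rows. A minimal sketch (the toy data and SparkSession setup are illustrative assumptions, not from the original post):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    # One date inside the range, one outside.
    df = spark.createDataFrame([("2016-12-15",), ("2017-06-01",)], ["act_date"])

    a = df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))
    b = df.filter("act_date >= '2016-10-01' AND act_date <= '2017-04-01'")
    c = df.filter(col("act_date").between("2016-10-01", "2017-04-01"))

    assert a.count() == b.count() == c.count() == 1
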

    The first approach is not even remotely valid. In Python, the and operator returns:

    • The last element if all expressions are "truthy".
    • The first "falsy" element otherwise.

    As a result

    "act_date <='2017-04-01'" and "act_date >='2016-10-01'"
    

    is evaluated to (any non-empty string is truthy):

    "act_date >='2016-10-01'"
    
  • 2020-12-09 10:16

    In the first case

    df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
      .select("col1","col2").distinct().count()
    

    the result is all values with act_date greater than or equal to 2016-10-01, which includes the values above 2017-04-01 as well, because Python's and discards the first condition string.

    Whereas in the second case

    df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
      .select("col1","col2").distinct().count()
    

    the result is the values between 2016-10-01 and 2017-04-01.
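    A minimal reproducible sketch of the difference (the toy rows and SparkSession setup are assumptions for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("2016-12-15", 1, "a"), ("2017-06-01", 2, "b")],
        ["act_date", "col1", "col2"],
    )

    # Python's `and` collapses the two strings into the second one,
    # so only the lower bound is applied: both rows match.
    n1 = df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'") \
           .select("col1", "col2").distinct().count()

    # Chained filters apply both bounds: only the 2016-12-15 row matches.
    n2 = df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'") \
           .select("col1", "col2").distinct().count()

    print(n1, n2)  # 2 1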
