Filtering a PySpark DataFrame with a SQL-like IN clause

清酒与你 2020-11-27 02:54

I want to filter a PySpark DataFrame with a SQL-like IN clause, as in

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my


        
5 Answers
  • 2020-11-27 03:39

    The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable, you'll have to do it explicitly using string formatting:

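    df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
    # registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
    df.registerTempTable("df")
    # str.format renders the tuple as ('foo', 'bar'), which reads as a SQL IN list
    sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
    ## 2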
    

    Obviously, this is not something you would use in a "real" SQL environment because of security considerations (SQL injection), but it shouldn't matter here.
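One caveat with formatting a tuple directly: a one-element tuple renders as ('foo',), and the trailing comma is not valid SQL. A minimal sketch of a helper that avoids this is shown here. The name sql_in_list is hypothetical (it is not part of PySpark), and the helper is deliberately conservative: it accepts only ints and plain strings, and it refuses strings containing quotes or backslashes instead of guessing at escaping rules.

```python
def sql_in_list(values):
    """Render a non-empty sequence of ints or plain strings as a SQL IN list.

    Hypothetical helper for illustration; not part of PySpark.
    """
    items = list(values)
    if not items:
        raise ValueError("IN list must contain at least one value")

    rendered = []
    for item in items:
        # bool is a subclass of int, so reject it explicitly
        if isinstance(item, bool) or not isinstance(item, (int, str)):
            raise TypeError("unsupported value: {!r}".format(item))
        if isinstance(item, int):
            rendered.append(str(item))
        elif "'" in item or "\\" in item:
            # Conservative: refuse rather than guess at escaping rules
            raise ValueError("refusing to quote: {!r}".format(item))
        else:
            rendered.append("'{}'".format(item))
    return "({})".format(", ".join(rendered))


# Works for a single value as well as for several
query = "SELECT * FROM df WHERE v IN {0}".format(sql_in_list(["foo"]))
print(query)
```

The resulting string can be passed to sqlContext.sql(...) in the same way as the formatted tuple above. For values that come from users, the DataFrame DSL is still the safer route.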

    In practice, the DataFrame DSL is a much better choice when you want to create dynamic queries:

    from pyspark.sql.functions import col
    
    df.where(col("v").isin({"foo", "bar"})).count()
    ## 2
    

    It is easy to build and compose and handles all details of HiveQL / Spark SQL for you.

  • 2020-11-27 03:47

    Reiterating what @zero323 mentioned above: we can do the same thing using a list as well (not only a set), like this:

    from pyspark.sql.functions import col
    
    df.where(col("v").isin(["foo", "bar"])).count()
    
  • 2020-11-27 03:48

    A slightly different approach that worked for me is to filter with a custom filter function.

    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import BooleanType

    def filter_func(a):
        """Wrapper that passes the broadcast variable `a` into a UDF."""
        def filter_func_(value):
            """Return True if the column value is among the broadcast values."""
            return value in a.value

        return udf(filter_func_, BooleanType())

    # Broadcasting lets large variables be shipped to the executors efficiently
    a = sc.broadcast((1, 2, 3))
    df = my_df.filter(filter_func(a)(col('field1')))
    
  • 2020-11-27 03:52

    Just a little addition/update:

    choice_list = ["foo", "bar", "jack", "joan"]
    

    If you want to filter your DataFrame "df" so that it keeps only the rows whose column "v" takes a value from choice_list, then

    df_filtered = df.where(col("v").isin(choice_list))
    
  • 2020-11-27 03:59

    You can also do this for integer columns:

    df_filtered = df.filter("field1 in (1,2,3)")
    

    or this for string columns:

    df_filtered = df.filter("field1 in ('a','b','c')")
    