pyspark dataframe filter or include based on list

自闭症患者 2020-11-29 03:42

I am trying to filter a dataframe in pyspark using a list. I want to either exclude records whose value is in the list, or include only those records with a value in the list. My attempt was df.filter(df.score in l), which fails with an error.

3 Answers
  • 2020-11-29 04:23

    What the error says is that "df.score in l" cannot be evaluated, because df.score gives you a Column, and "in" is not defined on the Column type; use "isin" instead.

    The code should be like this:

    # define a dataframe
    rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
    df = sqlContext.createDataFrame(rdd, ["id", "score"])
    
    # define a list of scores
    l = [10,18,20]
    
    # exclude records whose score is in list l
    records = df.filter(~df.score.isin(l))
    # expected: (0,1), (0,1), (0,2), (1,2)
    
    # include only records with these scores in list l
    df.where(df.score.isin(l))
    # expected: (1,10), (1,20), (3,18), (3,18), (3,18)
    
  • 2020-11-29 04:30

    Based on @user3133475's answer, it is also possible to call the isin() method on F.col(), like this:

    import pyspark.sql.functions as F
    
    
    l = [10,18,20]
    df.filter(F.col("score").isin(l))
    
  • 2020-11-29 04:33

    I found the join implementation to be significantly faster than where with isin for large dataframes:

    from pyspark.sql import SparkSession

    def filter_spark_dataframe_by_list(df, column_name, filter_list):
        """ Returns subset of df where df[column_name] is in filter_list """
        spark = SparkSession.builder.getOrCreate()
        # a one-column dataframe built from a plain list gets the default column name "value"
        filter_df = spark.createDataFrame(filter_list, df.schema[column_name].dataType)
        return df.join(filter_df, df[column_name] == filter_df["value"])
    