Filter by whether column value equals a list in Spark

一个人的身影 · 2020-12-06 06:38

I'm trying to filter a Spark DataFrame based on whether the values in a column equal a list. I would like to do something like this:

filtered_df = df.where(df.a == ['list', 'of', 'stuff'])
3 Answers
  • 2020-12-06 06:57

    You could create a UDF. For example:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    def test_in(x):
        return x == ['list', 'of', 'stuff']

    # Wrap the Python function as a boolean-returning column function
    f = udf(test_in, BooleanType())
    filtered_df = df.where(f(df.a))
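
    On newer Spark versions the same UDF can also be declared with the decorator form; a minimal sketch, assuming the same df:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    @udf(returnType=BooleanType())
    def test_in(x):
        # Runs row-by-row in a Python worker
        return x == ['list', 'of', 'stuff']

    filtered_df = df.where(test_in(df.a))

    Keep in mind that a Python UDF ships every row to a Python worker; the pure column-expression approaches in the other answers avoid that overhead.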
    
  • 2020-12-06 07:08

    Update:

    With current versions, you can compare the column directly to an array of literals:

    from pyspark.sql.functions import array, lit

    df.where(df.a == array(*[lit(x) for x in ['list', 'of', 'stuff']]))
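
    This comparison is element-wise and order-sensitive; run against the toy DataFrame from the quick test below, it keeps only the exact match:

    df.where(df.a == array(*[lit(x) for x in ['list', 'of', 'stuff']])).show()
    ## +---+-----------------+
    ## | id|                a|
    ## +---+-----------------+
    ## |  1|[list, of, stuff]|
    ## +---+-----------------+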
    

    Original answer:

    A slightly hacky way to do it, which doesn't require round-tripping the data through a Python UDF, is something like this:

    from pyspark.sql.functions import col, lit, size
    from functools import reduce
    from operator import and_
    
    def array_equal(c, an_array):
        same_size = size(c) == len(an_array)  # Check if the same size
        # Check if all items equal
        same_items = reduce(
            and_, 
            (c.getItem(i) == an_array[i] for i in range(len(an_array)))
        )
        return and_(same_size, same_items)
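
    For intuition, the reduce simply chains & over per-index comparisons; for ['list', 'of', 'stuff'] the same_items expression is equivalent to:

    same_items = (
        (c.getItem(0) == 'list') &
        (c.getItem(1) == 'of') &
        (c.getItem(2) == 'stuff')
    )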
    

    Quick test:

    df = sc.parallelize([
        (1, ['list', 'of', 'stuff']),
        (2, ['foo', 'bar']),
        (3, ['foobar']),
        (4, ['list', 'of', 'stuff', 'and', 'foo']),
        (5, ['a', 'list', 'of', 'stuff']),
    ]).toDF(['id', 'a'])

    df.where(array_equal(col('a'), ['list', 'of', 'stuff'])).show()
    ## +---+-----------------+
    ## | id|                a|
    ## +---+-----------------+
    ## |  1|[list, of, stuff]|
    ## +---+-----------------+
    
  • 2020-12-06 07:16

    You can use a combination of the array, lit and array_except functions to achieve this.

    1. Create the target array column with array(lit("list"), lit("of"), lit("stuff")).
    2. Use the array_except function to get the values present in the first array but not in the second.
    3. Filter for an empty result array, which means every element of column a also appears in ["list", "of", "stuff"] (see the caveat after the output below).

    Note: the array_except function is available from Spark 2.4.0.

    Here is the code:

    # Import the required functions explicitly
    from pyspark.sql.functions import array, array_except, lit, size

    # Create DataFrame
    df = sc.parallelize([
        (1, ['list', 'of', 'stuff']),
        (2, ['foo', 'bar']),
        (3, ['foobar']),
        (4, ['list', 'of', 'stuff', 'and', 'foo']),
        (5, ['a', 'list', 'of', 'stuff']),
    ]).toDF(['id', 'a'])

    # Solution: keep rows where no element of `a` falls outside the target array
    df1 = df.filter(size(array_except(df["a"], array(lit("list"), lit("of"), lit("stuff")))) == 0)

    # Display result
    df1.show()
    

    Output

    +---+-----------------+
    | id|                a|
    +---+-----------------+
    |  1|[list, of, stuff]|
    +---+-----------------+
    

    I hope this helps.
