Vectorized “in” function in Julia?


I often want to loop over a long array or column of a dataframe, and for each item, see if it is a member of another array. Rather than doing

giant_list =


                      
5 answers

余生分开走 (2020-11-28 14:13)

findin() doesn't give you a boolean mask, but you can easily use it to subset an array/DataFrame for values that are contained in another array:

julia> giant_list[findin(giant_list, good_letters)]
1-element Array{String,1}:
 "a"
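
For readers on Julia 1.x: findin was removed in 1.0, and findall(in(b), a) returns the same indices. The sketch below assumes small sample values for giant_list and good_letters, since the question's own definitions are cut off; they are chosen only to be consistent with the outputs shown in the answers.

```julia
# Hypothetical sample data (the question's definitions are truncated).
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# Indices of giant_list whose values occur in good_letters.
idx = findall(in(good_letters), giant_list)

# Use the indices to subset the original array.
subset = giant_list[idx]
```

As with findin, this yields indices rather than a boolean mask.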

                                                                        
                                                        
            
            
              
                
心在旅途 (2020-11-28 14:19)

You can vectorize in quite easily in Julia v0.6, using the unified broadcasting syntax.

julia> in.(giant_list, (good_letters,))
3-element Array{Bool,1}:
  true
 false
 false


Note the scalarification of good_letters by using a one-element tuple. Alternatively, you can use a Scalar type such as the one introduced in StaticArrays.jl.

Julia v0.5 supports the same syntax, but requires a specialized function for scalarification (or the Scalar type mentioned earlier):

scalar(x) = setindex!(Array{typeof(x)}(), x)


after which

julia> in.(giant_list, scalar(good_letters))
3-element Array{Bool,1}:
  true
 false
 false
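
To see why the wrapper is needed, it may help to compare the wrapped and unwrapped calls. This is a minimal sketch written for Julia 1.x, again with assumed sample values: without a wrapper, broadcasting tries to pair the elements of both arrays, and arrays of length 3 and 2 cannot be paired.

```julia
# Hypothetical sample data (the question's definitions are truncated).
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# Unwrapped: broadcast treats good_letters as a second array to iterate over,
# so the mismatched lengths raise a DimensionMismatch.
unwrapped_fails = try
    in.(giant_list, good_letters)
    false
catch err
    err isa DimensionMismatch
end

# Wrapped in a one-element tuple: good_letters is passed whole to every call.
mask = in.(giant_list, (good_letters,))
```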

                                                                        
                                                        
            
            
              
                
醉酒成梦 (2020-11-28 14:21)

Performance Review
The other answers neglect one important aspect: performance. So let me briefly review it. To make this realistic, I create two integer vectors with 100,000 elements each.
using StatsBase

a = sample(1:1_000_000, 100_000)
b = sample(1:1_000_000, 100_000)

In order to know what a decent performance would be, I did the same thing in R, leading to a median performance of 4.4 ms:
# R code

a <- sample.int(1000000, 100000)
b <- sample.int(1000000, 100000)

microbenchmark::microbenchmark(a %in% b)

Unit: milliseconds
     expr     min       lq     mean   median       uq      max neval
 a %in% b 4.09538 4.191653 5.517475 4.376034 5.765283 65.50126   100

The performant Solution
findall(in(b),a)

5.039 ms (27 allocations: 3.63 MiB)

Slower than R, but not by much. The syntax, however, could really use some improvement.
The slow Solutions
a .∈ Ref(b)
in.(a,Ref(b))
findall(x -> x in b, a)

3.879468 seconds (6 allocations: 16.672 KiB)
3.866001 seconds (6 allocations: 16.672 KiB)
3.936978 seconds (178.88 k allocations: 5.788 MiB)

Roughly 800 times slower than the performant solution, and almost 1000 times slower than R; this is really nothing to write home about. In my opinion the syntax of these three also isn't very good, but at least the first one looks better to me than the 'performant solution'.
The is-not-a Solution
This one here
indexin(a,b)

5.287 ms (38 allocations: 6.53 MiB)

is performant, but for me it is not a solution. Its result contains nothing wherever an element is not in the other vector. In my opinion the main application is to subset a vector, and the result cannot be used directly for that:
a[indexin(b,a)]

ERROR: ArgumentError: unable to check bounds for indices of type Nothing
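
A likely explanation for the gap: in on a Vector is a linear scan, so broadcasting it over a does on the order of length(a) * length(b) comparisons, whereas the fast variants hash b first (as far as I know, findall(in(b), a) does this internally). The sketch below applies the same idea by hand to get a boolean mask. It uses rand from Base instead of StatsBase.sample to stay dependency-free, the seed is arbitrary, and I have not benchmarked it, so no timing is claimed.

```julia
using Random  # standard library, used only to make the example reproducible

Random.seed!(1)
a = rand(1:1_000_000, 100_000)
b = rand(1:1_000_000, 100_000)

# Build the hash set once, then do one expected-constant-time lookup per element.
bset = Set(b)
mask = in.(a, Ref(bset))

# The mask can be used directly for subsetting.
a_in_b = a[mask]
```

Unlike indexin, the mask contains only Bool values, so a[mask] works without any filtering of nothing entries.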

                                                                        
                                                        
            
            
              
                
无人及你 (2020-11-28 14:23)

There are a handful of modern (i.e. Julia v1.0) solutions to this problem:

First, an update to the scalar strategy. Rather than using a 1-element tuple or array, scalar broadcasting can be achieved using a Ref object:

julia> in.(giant_list, Ref(good_letters))
3-element BitArray{1}:
  true
 false
 false


This same result can be achieved by broadcasting the infix ∈ operator (typed as \in followed by TAB):

julia> giant_list .∈ Ref(good_letters)
3-element BitArray{1}:
  true
 false
 false


Additionally, calling in with one argument creates a Base.Fix2, which may later be applied via a broadcasted call. This seems to have limited benefits compared to simply defining a function, though.

julia> is_good1 = in(good_letters);
       is_good2(x) = x in good_letters;

julia> is_good1.(giant_list)
3-element BitArray{1}:
  true
 false
 false

julia> is_good2.(giant_list)
3-element BitArray{1}:
  true
 false
 false


All in all, using .∈ with a Ref will probably lead to the shortest, cleanest code.
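
If the goal is to subset rather than to obtain the mask itself, the one-argument form of in also works as a predicate for filter. A minimal sketch with assumed sample values (the question's own definitions are truncated):

```julia
# Hypothetical sample data.
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# Subset via a boolean mask.
mask = giant_list .∈ Ref(good_letters)
by_mask = giant_list[mask]

# Subset directly, using in(good_letters) as the predicate.
by_filter = filter(in(good_letters), giant_list)
```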
                                                                        
                                                        
            
            
              
                
孤街浪徒 (2020-11-28 14:38)

The indexin function does something similar to what you want:

indexin(a, b)
Returns a vector containing the highest index in b for each value in a that is a member of b. The output vector contains 0 wherever a is not a member of b.

Since you want a boolean for each element in your giant_list (instead of the index in good_letters), you can simply do:
julia> indexin(giant_list, good_letters) .> 0
3-element BitArray{1}:
  true
 false
 false
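
The description and the .> 0 comparison above match older Julia versions. On Julia 1.x, indexin returns the first matching index and uses nothing instead of 0 for missing values, so the comparison has to change. A minimal sketch with assumed sample values:

```julia
# Hypothetical sample data.
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# On Julia 1.x, each entry is an index into good_letters or nothing.
idx = indexin(giant_list, good_letters)

# Turn the index vector into a boolean mask.
mask = idx .!== nothing
```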

The implementation of indexin is very straightforward, and points the way to how you might optimize this if you don't care about the indices in b:
function vectorin(a, b)
    bset = Set(b)
    [i in bset for i in a]
end

Only a limited set of names may be used as infix operators, so it's not possible to use vectorin as an infix operator.
                                                                        
                                                        
            
            
              
                