Vectorized “in” function in Julia?

抹茶落季 2020-11-28 13:53

I often want to loop over a long array or column of a dataframe, and for each item, see if it is a member of another array. Rather than doing

giant_list =          


        
5 Answers
  • 2020-11-28 14:13

    findin() doesn't give you a boolean mask, but you can easily use it to subset an array/DataFrame for values that are contained in another array:

    julia> giant_list[findin(giant_list, good_letters)]
    1-element Array{String,1}:
     "a"
    
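    findin was removed in Julia 1.0. A minimal sketch of the same subsetting on current Julia, assuming sample data shaped like the outputs in this thread (only "a" is confirmed by those outputs; the other values are made up for illustration):

    ```julia
    # Hypothetical sample data: only "a" is confirmed by the outputs in this thread.
    giant_list = ["a", "c", "j"]
    good_letters = ["a", "b"]

    # findall(in(good_letters), giant_list) returns the indices of giant_list
    # whose values occur in good_letters, which is what findin used to return.
    idx = findall(in(good_letters), giant_list)
    subset = giant_list[idx]
    ```

    As with findin, the result is a vector of indices rather than a boolean mask.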
  • 2020-11-28 14:19

    You can vectorize the in function quite easily in Julia v0.6 using the unified broadcasting syntax.

    julia> in.(giant_list, (good_letters,))
    3-element Array{Bool,1}:
      true
     false
     false
    

    Note the scalarification of good_letters by using a one-element tuple. Alternatively, you can use a Scalar type such as the one introduced in StaticArrays.jl.
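
    To see why the wrapping matters, here is a small sketch with made-up integer vectors xs and ys (not from the question). It compares plain broadcasting with the one-element-tuple form:

    ```julia
    xs = [1, 2, 3]
    ys = [3, 2, 1]

    # Without wrapping, broadcasting pairs the arrays element by element:
    # 1 in 3, 2 in 2, 3 in 1.
    pairwise = in.(xs, ys)

    # Wrapping ys in a one-element tuple makes broadcast treat it as a single
    # value, so every element of xs is tested against the whole of ys.
    membership = in.(xs, (ys,))
    ```

    Here pairwise is [false, true, false] while membership is [true, true, true]. With vectors of different lengths the unwrapped call would instead fail with a DimensionMismatch.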

    Julia v0.5 supports the same syntax, but requires a specialized function for scalarification (or the Scalar type mentioned earlier):

    scalar(x) = setindex!(Array{typeof(x)}(), x)
    

    after which

    julia> in.(giant_list, scalar(good_letters))
    3-element Array{Bool,1}:
      true
     false
     false
    
  • 2020-11-28 14:21

    Performance Review

    The other answers neglect one important aspect: performance. So, let me briefly review that. To make this realistic, I create two integer vectors with 100,000 elements each.

    using StatsBase
    
    a = sample(1:1_000_000, 100_000)
    b = sample(1:1_000_000, 100_000)
    

    In order to know what a decent performance would be, I did the same thing in R, leading to a median performance of 4.4 ms:

    # R code
    
    a <- sample.int(1000000, 100000)
    b <- sample.int(1000000, 100000)
    
    microbenchmark::microbenchmark(a %in% b)
    
    Unit: milliseconds
         expr     min       lq     mean   median       uq      max neval
     a %in% b 4.09538 4.191653 5.517475 4.376034 5.765283 65.50126   100
    

    The performant Solution

    findall(in(b),a)
    
    5.039 ms (27 allocations: 3.63 MiB)
    

    Slower than R, but not by much. The syntax, however, could really use some improvement.

    The non-performant Solutions

    a .∈ Ref(b)
    in.(a,Ref(b))
    findall(x -> x in b, a)
    
    3.879468 seconds (6 allocations: 16.672 KiB)
    3.866001 seconds (6 allocations: 16.672 KiB)
    3.936978 seconds (178.88 k allocations: 5.788 MiB)
    

    Roughly 800 times slower than the findall version (and almost 900 times slower than R) - this is really nothing to write home about. In my opinion the syntax of these three also isn't very good, but at least the first solution looks better to me than the 'performant solution'.
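
    The broadcast versions are slow because in on a plain Vector scans it linearly for every element of a. A minimal sketch of one way to keep the broadcast syntax, assuming it is acceptable to build a Set from b first. Here rand stands in for StatsBase.sample so the snippet needs only Base, and no timings are claimed:

    ```julia
    # Stand-in data; rand is used instead of StatsBase.sample (an assumption
    # made to keep the snippet free of package dependencies).
    a = rand(1:1_000_000, 100_000)
    b = rand(1:1_000_000, 100_000)

    # Build the hash set once, then broadcast `in` against it.
    bset = Set(b)
    mask = in.(a, Ref(bset))    # equivalent to a .∈ Ref(bset)

    # The boolean mask can be used directly for subsetting.
    common = a[mask]
    ```

    Each lookup in a Set is a hash probe rather than a scan, so the total work grows with length(a) + length(b) instead of their product. Measuring with BenchmarkTools on your own data is the way to confirm the gain.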

    The is-not-a Solution

    This one here

    indexin(a,b)
    
    5.287 ms (38 allocations: 6.53 MiB)
    

    is performant, but for me it is not a solution. Its result contains nothing wherever an element of the first vector is not found in the second. In my opinion the main application is to subset a vector, and this does not work with this solution.

    a[indexin(b,a)]
    
    ERROR: ArgumentError: unable to check bounds for indices of type Nothing
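
    One possible workaround, sketched with small made-up vectors, is to turn the indexin result into a boolean mask before indexing. This assumes Julia 1.1 or later, where isnothing is available:

    ```julia
    # Small illustrative vectors (not the benchmark data).
    a = [10, 20, 30, 40]
    b = [40, 20, 99]

    idx = indexin(a, b)          # index into b for each element of a, or nothing
    mask = .!isnothing.(idx)     # true where the element of a occurs in b
    a_in_b = a[mask]
    ```

    Here idx is [nothing, 2, nothing, 1], so a_in_b is [20, 40].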
    
  • 2020-11-28 14:23

    There are a handful of modern (i.e. Julia v1.0) solutions to this problem:

    First, an update to the scalar strategy. Rather than using a 1-element tuple or array, scalar broadcasting can be achieved using a Ref object:

    julia> in.(giant_list, Ref(good_letters))
    3-element BitArray{1}:
      true
     false
     false
    

    This same result can be achieved by broadcasting the infix ∈ operator (typed as \in followed by TAB):

    julia> giant_list .∈ Ref(good_letters)
    3-element BitArray{1}:
      true
     false
     false
    

    Additionally, calling in with one argument creates a Base.Fix2, which may later be applied via a broadcasted call. This seems to have limited benefits compared to simply defining a function, though.

    julia> is_good1 = in(good_letters);
           is_good2(x) = x in good_letters;
    
    julia> is_good1.(giant_list)
    3-element BitArray{1}:
      true
     false
     false
    
    julia> is_good2.(giant_list)
    3-element BitArray{1}:
      true
     false
     false
    

    All in all, using .∈ with a Ref will probably lead to the shortest, cleanest code.
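
    One place where the one-argument form of in is convenient is as a predicate for higher-order functions. A brief sketch, assuming sample data shaped like the outputs above (the values other than "a" are hypothetical):

    ```julia
    # Hypothetical sample data: only "a" is confirmed by the outputs above.
    giant_list = ["a", "c", "j"]
    good_letters = ["a", "b"]

    kept = filter(in(good_letters), giant_list)       # elements that are members
    dropped = filter(!in(good_letters), giant_list)   # elements that are not
    n_good = count(in(good_letters), giant_list)      # number of members
    ```

    No Ref is needed here because nothing is being broadcast; the predicate is applied to one element at a time.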

  • 2020-11-28 14:38

    The indexin function does something similar to what you want:

    indexin(a, b)

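    Returns a vector containing the highest index in b for each value in a that is a member of b. The output vector contains 0 wherever a is not a member of b. (This describes Julia versions before 1.0; from 1.0 on, indexin returns the first matching index and nothing instead of 0.)

    Since you want a boolean for each element in your giant_list (instead of the index in good_letters), on those older versions you can simply do: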

    julia> indexin(giant_list, good_letters) .> 0
    3-element BitArray{1}:
      true
     false
     false
    

    The implementation of indexin is very straightforward, and points the way to how you might optimize this if you don't care about the indices in b:

    function vectorin(a, b)
        bset = Set(b)
        [i in bset for i in a]
    end
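
    A short usage sketch for vectorin, with the definition repeated so the snippet runs on its own. The sample values other than "a" are hypothetical:

    ```julia
    # vectorin as defined above, repeated so this snippet is self-contained.
    function vectorin(a, b)
        bset = Set(b)                # build the lookup set once
        [i in bset for i in a]       # one Bool per element of a
    end

    giant_list = ["a", "c", "j"]     # hypothetical sample data
    good_letters = ["a", "b"]

    mask = vectorin(giant_list, good_letters)
    selected = giant_list[mask]
    ```

    The comprehension returns a Vector{Bool}, which works as a mask for indexing just like the BitArray results shown earlier.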
    

    Only a limited set of names may be used as infix operators, so vectorin cannot be used as an infix operator.
