I often want to loop over a long array or column of a dataframe, and for each item, see if it is a member of another array. Rather than doing
giant_list =
findin() doesn't give you a boolean mask, but you can easily use it to subset an array/DataFrame for values that are contained in another array:
julia> giant_list[findin(giant_list, good_letters)]
1-element Array{String,1}:
"a"
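For readers on Julia 0.7 or later, where findin is no longer available, the same subsetting can be sketched with findall and the one-argument form of in. The values of giant_list and good_letters below are assumptions chosen to match the output shown in this thread, since the original definitions are not shown above.

```julia
# Hypothetical example data; the original definitions are not shown above.
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# findall(in(collection), array) returns the indices of the elements of
# array that are members of collection, the role findin used to play.
idx = findall(in(good_letters), giant_list)

# Indexing with those indices keeps only the members of good_letters.
subset = giant_list[idx]
```

Like findin, this yields indices rather than a boolean mask.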
You can vectorize the in function quite easily in Julia v0.6, using the unified broadcasting syntax.
julia> in.(giant_list, (good_letters,))
3-element Array{Bool,1}:
true
false
false
Note the scalarification of good_letters by using a one-element tuple. Alternatively, you can use a Scalar type such as the one introduced in StaticArrays.jl.
Julia v0.5 supports the same syntax, but requires a specialized function for scalarification (or the Scalar type mentioned earlier):
scalar(x) = setindex!(Array{typeof(x)}(), x)
after which
julia> in.(giant_list, scalar(good_letters))
3-element Array{Bool,1}:
true
false
false
The other answers are neglecting one important aspect: performance. So, let me briefly review that. To make this realistic I create two Integer vectors with 100,000 elements each.
using StatsBase
a = sample(1:1_000_000, 100_000)
b = sample(1:1_000_000, 100_000)
In order to know what a decent performance would be, I did the same thing in R, leading to a median performance of 4.4 ms:
# R code
a <- sample.int(1000000, 100000)
b <- sample.int(1000000, 100000)
microbenchmark::microbenchmark(a %in% b)
Unit: milliseconds
expr min lq mean median uq max neval
a %in% b 4.09538 4.191653 5.517475 4.376034 5.765283 65.50126 100
findall(in(b),a)
5.039 ms (27 allocations: 3.63 MiB)
Slower than R, but not by much. The syntax, however, could really use some improvement.
a .∈ Ref(b)
in.(a,Ref(b))
findall(x -> x in b, a)
3.879468 seconds (6 allocations: 16.672 KiB)
3.866001 seconds (6 allocations: 16.672 KiB)
3.936978 seconds (178.88 k allocations: 5.788 MiB)
About 800 times slower (nearly 900 times slower than R). This is really nothing to write home about. In my opinion the syntax of these three also isn't very good, but at least the first solution looks better to me than the 'performant solution'.
This one here
indexin(a,b)
5.287 ms (38 allocations: 6.53 MiB)
is performant, but for me it is not a solution. Its result contains nothing wherever an element is not in the other vector. In my opinion the main application is to subset a vector, and this does not work with this solution.
a[indexin(b,a)]
ERROR: ArgumentError: unable to check bounds for indices of type Nothing
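One way around the error above, assuming the goal is to keep the elements of a that also occur in b, is to subset with findall or filter and the same in(b) predicate. The small vectors below are hypothetical stand-ins for the benchmark data, so the snippet says nothing about timing.

```julia
# Small stand-in vectors (hypothetical values, not the benchmark data).
a = [3, 7, 1, 9, 4]
b = [9, 3, 5]

# Indices of the elements of a that occur in b.
idx = findall(in(b), a)

# Subset by index, or filter directly with the same predicate.
by_index = a[idx]
by_filter = filter(in(b), a)
```

Both forms keep the elements in the order they appear in a.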
There are a handful of modern (i.e. Julia v1.0) solutions to this problem:
First, an update to the scalar strategy. Rather than using a 1-element tuple or array, scalar broadcasting can be achieved using a Ref object:
julia> in.(giant_list, Ref(good_letters))
3-element BitArray{1}:
true
false
false
This same result can be achieved by broadcasting the infix ∈ (\in TAB) operator:
julia> giant_list .∈ Ref(good_letters)
3-element BitArray{1}:
true
false
false
Additionally, calling in with one argument creates a Base.Fix2, which may later be applied via a broadcasted call. This seems to have limited benefits compared to simply defining a function, though.
julia> is_good1 = in(good_letters);
is_good2(x) = x in good_letters;
julia> is_good1.(giant_list)
3-element BitArray{1}:
true
false
false
julia> is_good2.(giant_list)
3-element BitArray{1}:
true
false
false
All in all, using .∈ with a Ref will probably lead to the shortest, cleanest code.
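To illustrate why the Ref wrapper matters, here is a minimal sketch on Julia 1.x. The example values are assumptions picked to match the outputs above. Without Ref, broadcasting tries to pair up the elements of both arrays and fails when their lengths differ; with Ref, good_letters is treated as a single value.

```julia
# Hypothetical example data matching the outputs shown in this thread.
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# Without Ref, broadcasting pairs elements of both arrays and throws,
# because the lengths (3 and 2) cannot be broadcast together.
mismatch = try
    in.(giant_list, good_letters)
    false
catch err
    err isa DimensionMismatch
end

# With Ref, every element of giant_list is tested against all of good_letters.
mask = giant_list .∈ Ref(good_letters)

# The mask can be used directly for subsetting.
kept = giant_list[mask]
```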
The indexin function does something similar to what you want:
indexin(a, b)
Returns a vector containing the highest index in b for each value in a that is a member of b. The output vector contains 0 wherever a is not a member of b.
Since you want a boolean for each element in your giant_list (instead of the index in good_letters), you can simply do:
julia> indexin(giant_list, good_letters) .> 0
3-element BitArray{1}:
true
false
false
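On Julia 0.7 and later, indexin marks non-members with nothing instead of 0, so the comparison against 0 no longer works there. A minimal sketch of the same idea for those versions, with assumed example values, compares against nothing instead:

```julia
# Hypothetical example data matching the outputs shown in this thread.
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# Each entry is an index into good_letters, or nothing for non-members.
idx = indexin(giant_list, good_letters)

# Broadcasting !== against nothing turns the result into a boolean mask.
mask = idx .!== nothing
```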
The implementation of indexin is very straightforward, and points the way to how you might optimize this if you don't care about the indices in b:
function vectorin(a, b)
    bset = Set(b)
    [i in bset for i in a]
end
Only a limited set of names may be used as infix operators, so a function such as vectorin cannot be used as an infix operator.
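As a usage sketch for vectorin as defined above (repeated here so the snippet runs on its own, with assumed example values): converting b to a Set makes each membership test a hash lookup rather than a scan of the whole vector. Broadcasting in over a Ref of the Set should give the same mask.

```julia
# vectorin as defined above, repeated so the snippet is self-contained.
function vectorin(a, b)
    bset = Set(b)              # hash-based membership tests
    [i in bset for i in a]     # one Bool per element of a
end

# Hypothetical example data matching the outputs shown in this thread.
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

mask = vectorin(giant_list, good_letters)

# The same mask written as a broadcast over a Ref-wrapped Set.
same = in.(giant_list, Ref(Set(good_letters)))
```

Building the Set costs one pass over b, so this is mainly worthwhile when both vectors are large.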