问题
Consider a generator in Julia that if collected will take a lot of memory
g=(x^2 for x=1:9999999999999999)
I want to take a random small subsample (Say 1%) of it, but I do not want to collect() the object because will take a lot of memory
Until now the trick I was using was this
temp=collect((( rand()>0.01 ? nothing : x ) for x in g))
random_sample= temp[temp.!=nothing]
But this is not efficient for generators with a lot of elements, collecting something with so many nothing elements doesnt seem right
Any idea is highly appreciated. I guess the trick is to be able to get random elements from the generator without having to allocate memory for all of it.
Thank you very much
回答1:
You can use a generator with if
condition like this:
[v for v in g if rand() < 0.01]
or if you want a bit faster, but more verbose approach (I have hardcoded 0.01
and element type of g
and I assume that your generator supports length
- otherwise you can remove sizehint!
line):
function collect_sample(g)
r = Int[]
sizehint!(r, round(Int, length(g) * 0.01))
for v in g
if rand() < 0.01
push!(r, v)
end
end
r
end
EDIT
Here you have examples of self avoiding sampler and reservoir sampler giving you fixed output size. The smaller fraction of the input you want to get the better it is to use self avoiding sampler:
function self_avoiding_sampler(source_size, ith, target_size)
rng = 1:source_size
idx = rand(rng)
x1 = ith(idx)
r = Vector{typeof(x1)}(undef, target_size)
r[1] = x1
s = Set{Int}(idx)
sizehint!(s, target_size)
for i = 2:target_size
while idx in s
idx = rand(rng)
end
@inbounds r[i] = ith(idx)
push!(s, idx)
end
r
end
function reservoir_sampler(g, target_size)
r = Vector{Int}(undef, target_size)
for (i, v) in enumerate(g)
if i <= target_size
@inbounds r[i] = v
else
j = rand(1:i)
if j < target_size
@inbounds r[j] = v
end
end
end
r
end
来源:https://stackoverflow.com/questions/56201002/efficient-way-to-extract-and-collect-a-random-subsample-of-a-generator-in-julia