Efficient way to extract and collect a random subsample of a generator in Julia

此生再无相见时 提交于 2021-01-28 02:45:37

问题


Consider a generator in Julia that if collected will take a lot of memory

g=(x^2 for x=1:9999999999999999)

I want to take a random small subsample (Say 1%) of it, but I do not want to collect() the object because will take a lot of memory

Until now the trick I was using was this

temp=collect((( rand()>0.01 ? nothing : x ) for x in g))
random_sample= temp[temp.!=nothing]

But this is not efficient for generators with a lot of elements, collecting something with so many nothing elements doesnt seem right

Any idea is highly appreciated. I guess the trick is to be able to get random elements from the generator without having to allocate memory for all of it.

Thank you very much


回答1:


You can use a generator with if condition like this:

[v for v in g if rand() < 0.01]

or if you want a bit faster, but more verbose approach (I have hardcoded 0.01 and element type of g and I assume that your generator supports length - otherwise you can remove sizehint! line):

function collect_sample(g)
    r = Int[]
    sizehint!(r, round(Int, length(g) * 0.01))
    for v in g
        if rand() < 0.01
           push!(r, v)
        end
    end
    r
end

EDIT

Here you have examples of self avoiding sampler and reservoir sampler giving you fixed output size. The smaller fraction of the input you want to get the better it is to use self avoiding sampler:

function self_avoiding_sampler(source_size, ith, target_size)
    rng = 1:source_size
    idx = rand(rng)
    x1 = ith(idx)
    r = Vector{typeof(x1)}(undef, target_size)
    r[1] = x1
    s = Set{Int}(idx)
    sizehint!(s, target_size)
    for i = 2:target_size
        while idx in s
            idx = rand(rng)
        end
        @inbounds r[i] = ith(idx)
        push!(s, idx)
    end
    r
end

function reservoir_sampler(g, target_size)
    r = Vector{Int}(undef, target_size)
    for (i, v) in enumerate(g)
        if i <= target_size
            @inbounds r[i] = v
        else
            j = rand(1:i)
            if j < target_size
                @inbounds r[j] = v
            end
        end
    end
    r
end


来源:https://stackoverflow.com/questions/56201002/efficient-way-to-extract-and-collect-a-random-subsample-of-a-generator-in-julia

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!