Generate ngrams with Julia

Posted by 你说的曾经没有我的故事 on 2019-12-23 12:42:57

Question


To generate word bigrams in Julia, I could simply zip through the original list and a list that drops the first element, e.g.:

julia> s = split("the lazy fox jumps over the brown dog")
8-element Array{SubString{String},1}:
 "the"  
 "lazy" 
 "fox"  
 "jumps"
 "over" 
 "the"  
 "brown"
 "dog"  

julia> collect(zip(s, drop(s,1)))
7-element Array{Tuple{SubString{String},SubString{String}},1}:
 ("the","lazy")  
 ("lazy","fox")  
 ("fox","jumps") 
 ("jumps","over")
 ("over","the")  
 ("the","brown") 
 ("brown","dog") 

To generate a trigram I could use the same collect(zip(...)) idiom to get:

julia> collect(zip(s, drop(s,1), drop(s,2)))
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox")  
 ("lazy","fox","jumps")
 ("fox","jumps","over")
 ("jumps","over","the")
 ("over","the","brown")
 ("the","brown","dog") 

But then I have to add the third list to zip by hand. Is there an idiomatic way to generate n-grams of any order?

E.g. I'd like to avoid writing this to extract 5-grams:

julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4)))
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox","jumps","over") 
 ("lazy","fox","jumps","over","the") 
 ("fox","jumps","over","the","brown")
 ("jumps","over","the","brown","dog")

Answer 1:


Here's a clean one-liner for n-grams of any length.

ngram(s, n) = collect(zip((drop(s, k) for k = 0:n-1)...))

It uses a generator expression to iterate over the number of elements, k, to drop. Then the splat (...) operator unpacks the resulting Drop iterators as separate arguments to zip, and collect turns the Zip into an Array.

julia> ngram(s, 2)
7-element Array{Tuple{SubString{String},SubString{String}},1}:
 ("the","lazy")  
 ("lazy","fox")  
 ("fox","jumps") 
 ("jumps","over")
 ("over","the")  
 ("the","brown") 
 ("brown","dog") 

julia> ngram(s, 5)
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
 ("the","lazy","fox","jumps","over") 
 ("lazy","fox","jumps","over","the") 
 ("fox","jumps","over","the","brown")
 ("jumps","over","the","brown","dog")

As you can see, this is very similar to your solution. The only addition is a generator that iterates over the number of elements to drop, so that the n-gram length can be chosen at run time.
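The transcripts above come from an older Julia release in which drop could be called unqualified. As a minimal sketch, assuming Julia 1.x where the function lives in Base.Iterators, the same idea can be written as follows. The name ngram_tuples is only illustrative and does not appear in the original answer.

```julia
# Sketch for Julia 1.x: `drop` is reached as `Iterators.drop`.
# `ngram_tuples` is a hypothetical name chosen for this example.
ngram_tuples(s, n) = collect(zip((Iterators.drop(s, k) for k in 0:n-1)...))

words = split("the lazy fox jumps over the brown dog")

bigrams = ngram_tuples(words, 2)     # 7 tuples of 2 words each
fivegrams = ngram_tuples(words, 5)   # 4 tuples of 5 words each
```

zip stops as soon as its shortest argument is exhausted, which is why no explicit bounds arithmetic is needed here.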




Answer 2:


Another way is to use partition() from Iterators.jl (the package has since been renamed IterTools.jl), with a window of n and a step of 1:

ngram(s,n) = collect(partition(s, n, 1))



Answer 3:


By changing the output slightly and using SubArrays instead of Tuples, little is lost, and it becomes possible to avoid allocations and memory copying. As long as the underlying word list is not modified afterwards, this is safe, and it was also faster in my benchmarks. The code:

ngram(s,n) = [view(s,i:i+n-1) for i=1:length(s)-n+1]

and the output:

julia> ngram(s,5)
 SubString{String}["the","lazy","fox","jumps","over"] 
 SubString{String}["lazy","fox","jumps","over","the"] 
 SubString{String}["fox","jumps","over","the","brown"]
 SubString{String}["jumps","over","the","brown","dog"]

julia> ngram(s,5)[1][3]
"fox"

For larger word lists, the memory requirements are also substantially smaller.
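The caveat about a static word list follows from how views work: each n-gram refers to the original array rather than holding its own copy. A small sketch of that behaviour, assuming Julia 1.x; ngram_views is an illustrative name with the same body as the one-liner in this answer.

```julia
# `ngram_views` is a hypothetical name; the body matches the view-based one-liner.
ngram_views(s, n) = [view(s, i:i+n-1) for i in 1:length(s)-n+1]

words = ["the", "lazy", "fox", "jumps"]
grams = ngram_views(words, 2)

first_before = collect(grams[1])   # independent copy of the first bigram
words[1] = "a"                     # mutate the underlying word list
first_after = collect(grams[1])    # the view now reflects the change
```

If the word list may change later, collecting each view, or using the tuple-based version, keeps the n-grams independent at the cost of the extra copies.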

Also note that a generator lets you process the n-grams one at a time, which is faster and uses less memory, and that may be all the downstream code needs (counting something, or passing each n-gram through a hash). For example, take the partition-based solution from Answer 2 without the collect, i.e. just partition(s, n, 1).
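To illustrate the one-at-a-time processing described here without depending on an external package, the sketch below uses a plain Base generator as a stand-in for the lazy partition iterator; it is not the package function itself. It assumes Julia 1.x, and lazy_ngrams and count_ngrams are hypothetical names.

```julia
# Lazily yield each n-gram as a tuple; nothing is materialised up front.
# `lazy_ngrams` is a hypothetical Base-only stand-in for a lazy partition.
lazy_ngrams(s, n) = (ntuple(j -> s[i + j - 1], n) for i in 1:length(s)-n+1)

# Count occurrences by consuming the generator one n-gram at a time.
function count_ngrams(s, n)
    counts = Dict{Tuple, Int}()
    for gram in lazy_ngrams(s, n)
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

words = String.(split("the lazy fox jumps over the lazy dog"))
bigram_counts = count_ngrams(words, 2)
```

Only the dictionary of counts stays in memory; the full list of n-grams is never stored.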



Source: https://stackoverflow.com/questions/42360957/generate-ngrams-with-julia
