Question
I wrote an algorithm that extracts n-grams (bigrams, trigrams, ... up to 5-grams) from a list of 50,000 street addresses. My goal is to obtain, for each address, a boolean vector indicating whether or not each n-gram is present in that address. Each address is then characterized by a vector of attributes, on which I can run a clustering.
The algorithm works as follows: I start with the bigrams and enumerate all combinations of the characters a-z, 0-9, "/" and whitespace, for example: aa, ab, ac, ..., a8, a9, a/, "a ", ba, bb, ... Then I loop over every address and record, for each bigram, a 0 or 1 (bigram absent or present). Afterwards, I build the trigrams only from the bigrams that occur most often, and so on.
My problem is the time the algorithm takes to run. A second problem is that R hits its memory limit once there are more than 10,000 n-grams, which is no surprise, because a 50,000 x 10,000 matrix is huge. I need your ideas for optimizing the algorithm or replacing it. Thank you.
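For reference, here is a minimal base R sketch (my own illustration, not the asker's actual code) of the approach described above, applied to one address and one n-gram size: extract the address's bigrams and turn them into a 0/1 vector over the full candidate set.
address    <- tolower("1780 wemmel")                          # one example address
n          <- 2                                               # bigram pass
ngrams     <- substring(address, 1:(nchar(address) - n + 1), n:nchar(address))
alphabet   <- c(letters, 0:9, "/", " ")                       # a-z, 0-9, "/" and space
candidates <- as.vector(outer(alphabet, alphabet, paste0))    # all 38^2 candidate bigrams
presence   <- as.integer(candidates %in% ngrams)              # 0/1 vector for this address
Looping this over 50,000 addresses while keeping a dense candidate-by-address matrix is exactly what becomes slow and memory-hungry, which is what the answers below address.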
Answer 1:
Try the quanteda package, using the method below. If you just want tokenized texts, replace the dfm() call with tokenize().
I'd be very interested to know how it works on your 50,000 street addresses. We've put a lot of effort into making dfm() very fast and robust.
myDfm <- dfm(c("1780 wemmel", "2015 schlemmel"), what = "character",
ngram = 1:5, concatenator = "",
removePunct = FALSE, removeNumbers = FALSE,
removeSeparators = FALSE, verbose = FALSE)
t(myDfm) # for easier viewing
# docs
# features text1 text2
# 1 1
# s 0 1
# sc 0 1
# sch 0 1
# schl 0 1
# w 1 0
# we 1 0
# wem 1 0
# wemm 1 0
# 0 1 1
# 0 1 0
# 0 w 1 0
# 0 we 1 0
# 0 wem 1 0
# 01 0 1
# 015 0 1
# 015 0 1
# 015 s 0 1
# 1 1 1
# 15 0 1
# 15 0 1
# 15 s 0 1
# 15 sc 0 1
# 17 1 0
# 178 1 0
# 1780 1 0
# 1780 1 0
# 2 0 1
# 20 0 1
# 201 0 1
# 2015 0 1
# 2015 0 1
# 5 0 1
# 5 0 1
# 5 s 0 1
# 5 sc 0 1
# 5 sch 0 1
# 7 1 0
# 78 1 0
# 780 1 0
# 780 1 0
# 780 w 1 0
# 8 1 0
# 80 1 0
# 80 1 0
# 80 w 1 0
# 80 we 1 0
# c 0 1
# ch 0 1
# chl 0 1
# chle 0 1
# chlem 0 1
# e 2 2
# el 1 1
# em 1 1
# emm 1 1
# emme 1 1
# emmel 1 1
# h 0 1
# hl 0 1
# hle 0 1
# hlem 0 1
# hlemm 0 1
# l 1 2
# le 0 1
# lem 0 1
# lemm 0 1
# lemme 0 1
# m 2 2
# me 1 1
# mel 1 1
# mm 1 1
# mme 1 1
# mmel 1 1
# s 0 1
# sc 0 1
# sch 0 1
# schl 0 1
# schle 0 1
# w 1 0
# we 1 0
# wem 1 0
# wemm 1 0
# wemme 1 0
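As a follow-up sketch (my addition, not part of the original answer), the counts in myDfm can be thresholded into the 0/1 presence vectors the question asks for. Note that as.matrix() densifies the result, which may be too large for 50,000 addresses; when memory is tight, it is preferable to keep working with the sparse dfm object itself.
boolMat <- as.matrix(myDfm) > 0      # TRUE/FALSE: is the n-gram present in the address?
storage.mode(boolMat) <- "integer"   # convert TRUE/FALSE to 1/0
boolMat[, 1:5]                       # first few n-gram columns for both example texts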
Answer 2:
Some of these problems are, to an extent, already solved by the tm library and RWeka (for n-gram tokenization). Have a look at those; they might make your task easier.
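As an illustration (my own sketch, not code from this answer): RWeka's NGramTokenizer produces word-level n-grams, so for character n-grams you can plug a small custom tokenizer into tm, whose document-term matrices are stored sparsely.
library(tm)
# Hypothetical helper (not part of tm): character n-grams of sizes 2 to 5
charNgramTokenizer <- function(x, ns = 2:5) {
  txt <- tolower(paste(as.character(x), collapse = " "))
  unlist(lapply(ns, function(n) {
    if (nchar(txt) < n) return(character(0))
    substring(txt, 1:(nchar(txt) - n + 1), n:nchar(txt))
  }))
}
corp <- VCorpus(VectorSource(c("1780 wemmel", "2015 schlemmel")))
dtm  <- DocumentTermMatrix(corp, control = list(
  tokenize    = charNgramTokenizer,
  wordLengths = c(1, Inf)    # the default c(3, Inf) would drop bigrams
))
inspect(dtm)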
For running out of memory I see two options:
1. tm uses sparse matrices, which are an efficient way of storing matrices with many zero elements; a small sketch of the sparse-matrix idea follows this list.
2. You could also look at the bigmemory package, although I've never used it: http://cran.r-project.org/web/packages/bigmemory/index.html
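To make the sparse-matrix point concrete, here is a hand-rolled sketch using the Matrix package (my own illustration, not tm or bigmemory): only the (address, n-gram) pairs that actually occur are stored, instead of a dense 50,000 x 10,000 matrix of mostly zeros.
library(Matrix)
# grams: one character vector of n-grams per address (e.g. from a tokenizer as above);
# the two toy vectors below are just the first few bigrams of the example addresses
grams <- list(c("17", "78", "80", "0 ", " w"),
              c("20", "01", "15", "5 ", " s"))
vocab <- sort(unique(unlist(grams)))
i <- rep(seq_along(grams), lengths(grams))   # row index of each n-gram occurrence
j <- match(unlist(grams), vocab)             # column index of each n-gram occurrence
m <- sparseMatrix(i = i, j = j, x = 1,
                  dims = c(length(grams), length(vocab)),
                  dimnames = list(NULL, vocab))
m <- m > 0   # presence/absence; the result stays sparse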
There are lots of ways of speeding up R code. Here's a guide to some of them: http://www.r-bloggers.com/faster-higher-stonger-a-guide-to-speeding-up-r-code-for-busy-people/
Source: https://stackoverflow.com/questions/31424687/cpu-and-memory-efficient-ngram-extraction-with-r