mining

HashSet handling to avoid getting stuck in a loop during iteration

折月煮酒 submitted on 2019-12-02 15:08:45
Question: I'm working on an image-mining project, and I used a HashSet instead of an array to avoid adding duplicate URLs while gathering them. I've reached the point in the code where I iterate over the HashSet containing the main URLs; within the iteration I download the page of each main URL, add the URLs found there to the HashSet, and go on. During iteration I should exclude every already-scanned URL, and also exclude (remove) every URL that ends with jpg, until the HashSet's URL count reaches 0. The question is
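The question's language isn't stated, so here is a minimal Python sketch of the worklist pattern being described: rather than iterating a set while mutating it (which is what gets the loop stuck), pop one URL at a time until the set is empty. `fetch_page` and `extract_urls` are hypothetical stand-ins for the downloading and link-extraction steps.

```python
def crawl(seed_urls, fetch_page, extract_urls):
    """Worklist pattern: never mutate a set while iterating over it."""
    pending = set(seed_urls)   # URLs still to process (the HashSet)
    scanned = set()            # URLs already processed, excluded from re-scanning
    images = set()             # URLs ending in .jpg, removed from crawling
    while pending:             # loop until the worklist count reaches 0
        url = pending.pop()    # pop removes the URL, so the set shrinks safely
        if url in scanned:
            continue
        scanned.add(url)
        if url.endswith(".jpg"):   # collect image URLs instead of crawling them
            images.add(url)
            continue
        for found in extract_urls(fetch_page(url)):
            if found not in scanned:   # only enqueue URLs not yet scanned
                pending.add(found)
    return scanned, images
```

Popping from `pending` instead of `for url in pending:` is the key move; it sidesteps the "collection was modified during enumeration" class of errors entirely and guarantees termination once every reachable URL has been seen.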

Converting a Document Term Matrix into a Matrix with lots of data causes overflow

蹲街弑〆低调 submitted on 2019-11-30 05:11:34
Let's do some text mining. Here I stand with a term-document matrix (from the tm package):

    dtm <- TermDocumentMatrix(myCorpus, control = list(
      weight = weightTfIdf, tolower = TRUE, removeNumbers = TRUE,
      minWordLength = 2, removePunctuation = TRUE,
      stopwords = stopwords("german")))

When I do typeof(dtm) I see that it is a "list", and the structure looks like:

          Docs
    Terms   1 2 ...
      lorem 0 0 ...
      ipsum 0 0 ...

So I try:

    wordMatrix = as.data.frame(t(as.matrix(dtm)))

That works for 1000 documents, but when I try to use 40000 it doesn't anymore. I get this error: Fehler in vector(typeof(x
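The overflow happens because as.matrix() materialises a fully dense matrix with one cell per term-document pair, which explodes at 40000 documents. A tm TermDocumentMatrix is stored sparsely as a slam simple_triplet_matrix (slots i, j, v, nrow, ncol, dimnames), so a common workaround (a sketch, not from the question itself) is to build a sparse matrix directly and never make the dense copy:

```r
library(Matrix)  # sparse matrix classes

# dtm's underlying triplet slots feed sparseMatrix() directly,
# so memory use stays proportional to the non-zero entries
sparse_dtm <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                           dims = c(dtm$nrow, dtm$ncol),
                           dimnames = dtm$dimnames)
```

Alternatively, tm's removeSparseTerms(dtm, 0.99) drops rarely occurring terms first, which can shrink the matrix enough that the dense conversion fits in memory.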
