qdap

R: How to prevent memory overflow when using mgsub in vector mode?

落爺英雄遲暮 submitted on 2019-12-02 03:20:17
Question: I have a long character vector (e.g. "Hello World", etc.), 1.7M elements, and I need to substitute words in it using a map between two vectors, saving the result in the same vector. Here's a simple example: library(qdap) line = c("one", "two one", "four phones") e = c("one", "two") r = c("ONE", "TWO") line = mgsub(e,r,line) Result: [1] "ONE" "TWO ONE" "four phONEs" As you can see, each instance of e[j] in line gets substituted with r[j] and only r[j]. It works fine on a relatively small "line
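The substitution described in this question can also be sketched in base R, processing the vector in chunks so that only one slice is rewritten at a time. This is a minimal sketch, not qdap's implementation: `replace_in_chunks` and `chunk_size` are hypothetical names, the patterns are treated as literal strings, and whether this avoids the overflow on 1.7M elements is untested here.

```r
# Hypothetical helper (not part of qdap): literal multi-pattern replacement,
# applied one chunk of the vector at a time.
replace_in_chunks <- function(x, patterns, replacements, chunk_size = 100000L) {
  stopifnot(length(patterns) == length(replacements))
  if (length(x) == 0L) return(x)
  starts <- seq(1L, length(x), by = chunk_size)
  for (s in starts) {
    idx <- s:min(s + chunk_size - 1L, length(x))
    chunk <- x[idx]
    # Apply each pattern/replacement pair in order; fixed = TRUE means
    # the pattern is matched literally, not as a regular expression.
    for (j in seq_along(patterns)) {
      chunk <- gsub(patterns[j], replacements[j], chunk, fixed = TRUE)
    }
    x[idx] <- chunk
  }
  x
}

line <- c("one", "two one", "four phones")
e <- c("one", "two")
r <- c("ONE", "TWO")
line <- replace_in_chunks(line, e, r, chunk_size = 2L)
print(line)
# [1] "ONE"         "TWO ONE"     "four phONEs"
```

Because the pairs are applied sequentially, a later pattern can match text produced by an earlier replacement, so the order of `e` and `r` matters.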

R qdap::mgsub, how to pass a pattern with a regular expression?

我们两清 submitted on 2019-11-29 17:47:20
In a previous question (replace string in R giving a vector of patterns and vector of replacements) I found that mgsub takes the pattern as a literal string that does not need to be escaped. That is good when you want to replace text like '[%.+%]' as a literal string, but it is a problem if you need to pass a real regular expression like: library('stringr') library('qdap') tt_ori <- 'I have VAR1 and VAR2' ttl <- list(ttregex='VAR([12])', val="val-\\1") ttl # OK stringr::str_replace_all( tt_ori, perl( ttl$ttregex), ttl$val) # [1] "I have val-1 and val-2" # OK mapply(gsub, ttl$ttregex, ttl$val, tt
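For comparison, the regular-expression case from this question can be sketched with base R's `gsub` alone, looping over pattern/replacement pairs. The helper name `multi_gsub_regex` is hypothetical and this says nothing about how qdap handles regular expressions; it only shows that backreferences such as `\\1` work when the pattern is not treated as literal.

```r
# Hypothetical helper: apply several regex replacements in sequence.
multi_gsub_regex <- function(x, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  for (j in seq_along(patterns)) {
    # perl = TRUE selects PCRE; "\\1" refers to the first capture group.
    x <- gsub(patterns[j], replacements[j], x, perl = TRUE)
  }
  x
}

tt_ori <- "I have VAR1 and VAR2"
result <- multi_gsub_regex(tt_ori, patterns = "VAR([12])", replacements = "val-\\1")
print(result)
# [1] "I have val-1 and val-2"
```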

R break corpus into sentences

江枫思渺然 submitted on 2019-11-28 20:46:47
I have a number of PDF documents, which I have read into a corpus with library tm. How can one break the corpus into sentences? It can be done by reading the file with readLines followed by sentSplit from package qdap [*]. That function requires a dataframe. It would also require abandoning the corpus and reading all files individually. How can I pass function sentSplit {qdap} over a corpus in tm? Or is there a better way? Note: there was a function sentDetect in library openNLP, which is now Maxent_Sent_Token_Annotator - the same question applies: how can this be combined with a
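As a rough illustration of sentence splitting that needs neither a dataframe nor qdap, the sketch below splits plain character strings with a base R regular expression. It assumes the document text has already been extracted from the corpus into a character vector (that step is not shown), and `split_sentences` is a hypothetical name. The rule is naive: it breaks after `.`, `!` or `?` followed by whitespace, so abbreviations such as "Dr." will be split incorrectly.

```r
# Hypothetical helper: naive sentence splitter for a character vector of documents.
split_sentences <- function(docs) {
  # Split on whitespace that follows sentence-ending punctuation.
  # The lookbehind keeps the punctuation attached to its sentence.
  strsplit(docs, "(?<=[.!?])\\s+", perl = TRUE)
}

docs <- c("Hello world. How are you? Fine.", "One sentence only.")
sentences <- split_sentences(docs)
print(sentences[[1]])
# [1] "Hello world." "How are you?" "Fine."
```

The result is a list with one character vector of sentences per input document.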

More efficient means of creating a corpus and DTM with 4M rows

大憨熊 submitted on 2019-11-28 16:35:24
My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document-term matrix so that I can pass it to a Bayesian classifier. Consider the following code: library(tm) GetCorpus <-function(textVector) { doc.corpus <- Corpus(VectorSource(textVector)) doc.corpus <- tm_map(doc.corpus, tolower) doc.corpus <- tm_map(doc.corpus, removeNumbers) doc.corpus <- tm_map(doc.corpus, removePunctuation) doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english")) doc.corpus <- tm_map(doc.corpus, stemDocument, "english") doc.corpus <- tm_map(doc.corpus,
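One possible direction, sketched here under stated assumptions rather than taken from the question, is to skip the corpus object and build a sparse document-term matrix directly with the Matrix package. This swaps tm for a different technique: `build_dtm` is a hypothetical helper, tokenization is a plain split on non-letters, and stop-word removal and stemming are left out.

```r
library(Matrix)

# Hypothetical helper: sparse document-term matrix from a character vector.
build_dtm <- function(docs) {
  # Lower-case, split on anything that is not a letter, drop empty tokens.
  tokens <- strsplit(tolower(docs), "[^a-z]+")
  tokens <- lapply(tokens, function(t) t[nzchar(t)])
  doc_id <- rep(seq_along(tokens), lengths(tokens))
  terms <- unlist(tokens)
  vocab <- sort(unique(terms))
  # Repeated (document, term) pairs are summed, giving term counts.
  sparseMatrix(i = doc_id, j = match(terms, vocab), x = 1,
               dims = c(length(docs), length(vocab)),
               dimnames = list(NULL, vocab))
}

dtm <- build_dtm(c("The cat sat", "the cat and the dog"))
print(dim(dtm))
# [1] 2 5
print(dtm[2, "the"])
# [1] 2
```

Only non-zero counts are stored, which is the property that matters for millions of short rows; how it compares with tm on 4M rows is not measured here.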

R break corpus into sentences

☆樱花仙子☆ submitted on 2019-11-27 20:50:26
Question: I have a number of PDF documents, which I have read into a corpus with library tm. How can one break the corpus into sentences? It can be done by reading the file with readLines followed by sentSplit from package qdap [*]. That function requires a dataframe. It would also require abandoning the corpus and reading all files individually. How can I pass function sentSplit {qdap} over a corpus in tm? Or is there a better way? Note: there was a function sentDetect in library openNLP,

More efficient means of creating a corpus and DTM with 4M rows

落花浮王杯 submitted on 2019-11-27 19:56:41
Question: My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document-term matrix so that I can pass it to a Bayesian classifier. Consider the following code: library(tm) GetCorpus <-function(textVector) { doc.corpus <- Corpus(VectorSource(textVector)) doc.corpus <- tm_map(doc.corpus, tolower) doc.corpus <- tm_map(doc.corpus, removeNumbers) doc.corpus <- tm_map(doc.corpus, removePunctuation) doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english")

Convert written number to number in R

旧街凉风 submitted on 2019-11-27 01:58:23
Does anybody know a function to convert a text representation of a number into an actual number, e.g. 'twenty thousand three hundred and five' into 20305? I have written numbers in dataframe rows and want to convert them to numbers. In package qdap, you can replace numerals with words (e.g., 1001 becomes one thousand one), but not the other way around: library(qdap) replace_number("I like 346457 ice cream cones.") [1] "I like three hundred forty six thousand four hundred fifty seven ice cream cones." Here's a start that should get you to hundreds of thousands. word2num <-
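The conversion asked about here can be sketched in base R as follows. This is not the `word2num` function from the excerpt, whose body is cut off above; `words_to_number` is a hypothetical name, and the sketch assumes English number words up to the millions, ignores "and", and stops on any word it does not recognise.

```r
# Hypothetical helper: convert English number words to a numeric value.
words_to_number <- function(text) {
  units <- c(zero = 0, one = 1, two = 2, three = 3, four = 4, five = 5,
             six = 6, seven = 7, eight = 8, nine = 9, ten = 10,
             eleven = 11, twelve = 12, thirteen = 13, fourteen = 14,
             fifteen = 15, sixteen = 16, seventeen = 17, eighteen = 18,
             nineteen = 19)
  tens <- c(twenty = 20, thirty = 30, forty = 40, fifty = 50,
            sixty = 60, seventy = 70, eighty = 80, ninety = 90)
  scales <- c(thousand = 1e3, million = 1e6)

  # Split on non-letters so hyphenated forms like "twenty-one" also work.
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words) & words != "and"]

  total <- 0    # sum of completed groups (thousands, millions)
  current <- 0  # value of the group being read
  for (w in words) {
    if (w %in% names(units)) {
      current <- current + units[[w]]
    } else if (w %in% names(tens)) {
      current <- current + tens[[w]]
    } else if (w == "hundred") {
      current <- max(current, 1) * 100
    } else if (w %in% names(scales)) {
      total <- total + max(current, 1) * scales[[w]]
      current <- 0
    } else {
      stop("Unrecognised number word: ", w)
    }
  }
  total + current
}

print(words_to_number("twenty thousand three hundred and five"))
# [1] 20305
```

To convert a dataframe column, the helper can be applied element by element, for example with `vapply(column, words_to_number, numeric(1))`.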

Convert written number to number in R

坚强是说给别人听的谎言 submitted on 2019-11-26 09:47:35
Question: Does anybody know a function to convert a text representation of a number into an actual number, e.g. 'twenty thousand three hundred and five' into 20305? I have written numbers in dataframe rows and want to convert them to numbers. In package qdap, you can replace numerals with words (e.g., 1001 becomes one thousand one), but not the other way around: library(qdap) replace_number("I like 346457 ice cream cones.") [1] "I like three hundred forty six thousand four