qdap | 易学教程

R: How to prevent memory overflow when using mgsub in vector mode?

Question: I have a long character vector (e.g. "Hello World", etc.), 1.7M rows, and I need to substitute words in it using a map between two vectors, saving the result in the same vector. Here's a simple example:

    library(qdap)
    line = c("one", "two one", "four phones")
    e = c("one", "two")
    r = c("ONE", "TWO")
    line = mgsub(e, r, line)

Result:

    [1] "ONE" "TWO ONE" "four phONEs"

As you can see, each instance of e[j] in line gets substituted with r[j] and only r[j]. It works fine on a relatively small "line
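One way to keep memory bounded on a very long vector is to process it in fixed-size chunks. The following is a minimal sketch in base R, not qdap's implementation: `multi_gsub_fixed` and `replace_in_chunks` are hypothetical helper names, the chunk size is an arbitrary assumption, and replacements are applied one pattern after another (so a replacement that itself contains a later pattern would be rewritten again). Whether this avoids the overflow in the original setting is untested.

```r
# Hypothetical helper (not part of qdap): apply literal replacements in turn.
multi_gsub_fixed <- function(patterns, replacements, x) {
  stopifnot(length(patterns) == length(replacements))
  # Longest patterns first, so a short pattern cannot break up a longer one.
  ord <- order(nchar(patterns), decreasing = TRUE)
  for (j in ord) {
    x <- gsub(patterns[j], replacements[j], x, fixed = TRUE)
  }
  x
}

# Hypothetical helper: run the replacement over slices of the vector,
# so only one chunk's worth of intermediate copies exists at a time.
replace_in_chunks <- function(patterns, replacements, x, chunk_size = 100000L) {
  if (length(x) == 0L) return(x)
  starts <- seq(1L, length(x), by = chunk_size)
  for (s in starts) {
    idx <- s:min(s + chunk_size - 1L, length(x))
    x[idx] <- multi_gsub_fixed(patterns, replacements, x[idx])
  }
  x
}

line <- c("one", "two one", "four phones")
e <- c("one", "two")
r <- c("ONE", "TWO")
line <- replace_in_chunks(e, r, line, chunk_size = 2L)
print(line)
```

On the three-element example this reproduces the result quoted in the question; for real data the chunk size would need tuning against available memory.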

R qdap::mgsub, how to pass a pattern with a regular expression?

In a previous question (replace string in R giving a vector of patterns and vector of replacements) I found that mgsub takes its pattern as a literal string that does not need to be escaped. That is good when you want to replace text like '[%.+%]' as a literal string, but it is a problem if you need to pass a real regular expression, like:

    library('stringr')
    library('qdap')
    tt_ori <- 'I have VAR1 and VAR2'
    ttl <- list(ttregex='VAR([12])', val="val-\\1")
    ttl
    # OK
    stringr::str_replace_all( tt_ori, perl( ttl$ttregex), ttl$val)
    # [1] "I have val-1 and val-2"
    # OK
    mapply(gsub, ttl$ttregex, ttl$val, tt
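For comparison, a vector of true regular expressions with backreferences can be applied using base R alone. This is a minimal sketch under the assumption that patterns may be applied sequentially; `regex_multi_gsub` is a hypothetical name and does not come from qdap or stringr.

```r
# Hypothetical helper: apply each regex/replacement pair in order with gsub().
regex_multi_gsub <- function(patterns, replacements, x) {
  stopifnot(length(patterns) == length(replacements))
  for (j in seq_along(patterns)) {
    # perl = TRUE enables PCRE syntax; "\\1" refers to the first capture group.
    x <- gsub(patterns[j], replacements[j], x, perl = TRUE)
  }
  x
}

tt_ori <- "I have VAR1 and VAR2"
result <- regex_multi_gsub("VAR([12])", "val-\\1", tt_ori)
print(result)
```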

R break corpus into sentences

I have a number of PDF documents, which I have read into a corpus with the tm library. How can one break the corpus into sentences? It can be done by reading the file with readLines followed by sentSplit from package qdap [*]. That function requires a data frame. It would also require abandoning the corpus and reading all files individually. How can I pass the function sentSplit {qdap} over a corpus in tm? Or is there a better way? Note: there was a function sentDetect in library openNLP, which is now Maxent_Sent_Token_Annotator - the same question applies: how can this be combined with a
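As a rough illustration of the splitting step on its own, here is a base R sketch that works on a plain character vector standing in for the document texts. It is not sentSplit or the openNLP annotator: `split_sentences` is a hypothetical helper, and the rule (split on whitespace that follows `.`, `!` or `?`) is a naive assumption that will mis-split abbreviations such as "Dr. Smith".

```r
# Hypothetical helper: naive sentence splitter for a character vector of documents.
split_sentences <- function(docs) {
  # Split on whitespace that is preceded by sentence-ending punctuation.
  pieces <- strsplit(docs, "(?<=[.!?])\\s+", perl = TRUE)
  lapply(pieces, trimws)
}

docs <- c("This is one. This is two! Is this three?", "Single sentence.")
sentences <- split_sentences(docs)

# Long format: one row per sentence, keeping track of the source document.
sentence_df <- data.frame(
  doc = rep(seq_along(sentences), lengths(sentences)),
  sentence = unlist(sentences, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(sentence_df)
```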

More efficient means of creating a corpus and DTM with 4M rows

My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document term matrix such that I can pass it to a Bayesian classifier. Consider the following code:

    library(tm)
    GetCorpus <-function(textVector) {
      doc.corpus <- Corpus(VectorSource(textVector))
      doc.corpus <- tm_map(doc.corpus, tolower)
      doc.corpus <- tm_map(doc.corpus, removeNumbers)
      doc.corpus <- tm_map(doc.corpus, removePunctuation)
      doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
      doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
      doc.corpus <- tm_map(doc.corpus,
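One possible direction, offered only as a sketch, is to do the cheap cleaning steps on the raw character vector with vectorised base R calls before any corpus object is built. `clean_text` below is a hypothetical helper covering just lower-casing, number removal and punctuation removal; stop-word removal and stemming are left out, and no claim is made about how much time this saves on 4M rows.

```r
# Hypothetical helper: vectorised pre-cleaning of raw text in base R.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:digit:]]+", "", x)   # drop numbers
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace
  trimws(x)
}

raw <- c("Hello, World 123!", "R is   GREAT.")
cleaned <- clean_text(raw)
print(cleaned)
```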

Convert written number to number in R

Does anybody know a function to convert a text representation of a number into an actual number, e.g. 'twenty thousand three hundred and five' into 20305? I have written numbers in data frame rows and want to convert them to numbers. In package qdap, you can replace numerically represented numbers with words (e.g., 1001 becomes one thousand one), but not the other way around:

    library(qdap)
    replace_number("I like 346457 ice cream cones.")
    [1] "I like three hundred forty six thousand four hundred fifty seven ice cream cones."

Here's a start that should get you to hundreds of thousands. word2num <-
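The answer's own code is cut off above, so the following is an independent base R sketch of the general idea rather than a reconstruction of it. `words_to_number` is a hypothetical name; it assumes English number words, ignores "and", treats hyphens as spaces, and only handles scales up to millions.

```r
# Hypothetical helper: convert English number words to a numeric value.
words_to_number <- function(text) {
  small <- c(zero = 0, one = 1, two = 2, three = 3, four = 4, five = 5,
             six = 6, seven = 7, eight = 8, nine = 9, ten = 10,
             eleven = 11, twelve = 12, thirteen = 13, fourteen = 14,
             fifteen = 15, sixteen = 16, seventeen = 17, eighteen = 18,
             nineteen = 19, twenty = 20, thirty = 30, forty = 40,
             fifty = 50, sixty = 60, seventy = 70, eighty = 80, ninety = 90)
  scales <- c(thousand = 1e3, million = 1e6)

  tokens <- strsplit(tolower(gsub("-", " ", text)), "\\s+")[[1]]
  tokens <- tokens[tokens != "and" & tokens != ""]

  total <- 0    # completed groups (thousands, millions)
  current <- 0  # the group being built, always below 1000
  for (tok in tokens) {
    if (tok %in% names(small)) {
      current <- current + small[[tok]]
    } else if (tok == "hundred") {
      current <- max(current, 1) * 100
    } else if (tok %in% names(scales)) {
      total <- total + max(current, 1) * scales[[tok]]
      current <- 0
    } else {
      stop("Unrecognised number word: ", tok)
    }
  }
  total + current
}

print(words_to_number("twenty thousand three hundred and five"))
```

To apply it to a data frame column, something like `sapply(df$written, words_to_number)` would work, where `df$written` is a placeholder for the column holding the written numbers.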
