qdap | 易学教程

Matching a list of phrases to a corpus of documents and returning phrase frequency

Question: I have a list of phrases and a corpus of documents. There are 100k+ phrases and 60k+ documents in the corpus. The phrases may or may not be present in the corpus. I want to find the term frequency of each phrase present in the corpus. An example dataset:
    Phrases <- c("just starting", "several kilometers", "brief stroll", "gradually boost", "5 miles", "dark night", "cold morning")
    Doc1 <- "If you're just starting with workout, begin slow."
    Doc2 <- "Don't jump in brain initial and
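One possible way to get such counts in base R, shown only as a minimal sketch: count literal occurrences of each phrase across all documents with `gregexpr(..., fixed = TRUE)`. The `docs` vector and the helper `count_phrase` are hypothetical and made up for illustration (the question's own documents are cut off above), and this plain substring match ignores word boundaries and may be too slow for 100k+ phrases.

```r
# Hypothetical miniature data; not the asker's corpus.
phrases <- c("just starting", "brief stroll", "dark night")
docs <- c(
  "If you're just starting with workout, begin slow.",
  "A brief stroll after a brief stroll.",
  "Nothing relevant here."
)

# Count literal (non-regex) occurrences of one phrase over all texts.
count_phrase <- function(phrase, texts) {
  hits <- gregexpr(phrase, texts, fixed = TRUE)
  # gregexpr returns -1 for "no match", so only positive positions count.
  sum(vapply(hits, function(m) sum(m > 0), integer(1)))
}

freq <- vapply(phrases, count_phrase, integer(1), texts = docs)
print(freq)
```

The result is a named integer vector with one entry per phrase, including zero counts for phrases that never occur.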

Using mgsub function with word boundaries for replacement values

Question: I am trying to replace substrings of string elements within a vector with blank spaces. Below are the vectors we are considering:
    test <- c("PALMA DE MALLORCA", "THE RICH AND THE POOR", "A CAMEL IN THE DESERT", "SANTANDER SL", "LA")
    lista <- c("EL", "LA", "ES", "DE", "Y", "DEL", "LOS", "S.L.", "S.A.", "S.C.", "LAS", "DEL", "THE", "OF", "AND", "BY", "S", "L", "A", "C", "SA", "SC", "SL")
Then if we apply the mgsub function as it is, we get the following output:
    library(qdap)
    mgsub(lista, "",
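A base-R sketch of whole-token removal, offered as one possible approach rather than as the answer to the question: it does not use `mgsub` at all. It assumes that "word boundary" means the token is delimited by whitespace or by the start or end of the string, which is why it uses lookarounds instead of `\b` (entries such as "S.L." end in a non-word character). The names `quoted`, `pattern`, `removed` and `cleaned` are illustrative.

```r
test <- c("PALMA DE MALLORCA", "THE RICH AND THE POOR",
          "A CAMEL IN THE DESERT", "SANTANDER SL", "LA")
lista <- c("EL", "LA", "ES", "DE", "Y", "DEL", "LOS", "S.L.", "S.A.", "S.C.",
           "LAS", "DEL", "THE", "OF", "AND", "BY", "S", "L", "A", "C",
           "SA", "SC", "SL")

# \Q...\E makes PCRE treat each entry literally, so "." is not a wildcard.
quoted <- paste0("\\Q", lista, "\\E")

# Match an entry only when it is not glued to other non-space characters.
pattern <- paste0("(?<!\\S)(?:", paste(quoted, collapse = "|"), ")(?!\\S)")

removed <- gsub(pattern, "", test, perl = TRUE)

# Collapse the whitespace left behind by the removed tokens.
cleaned <- trimws(gsub("\\s+", " ", removed))
print(cleaned)
```

With this pattern "PALMA" keeps its "LA" and "SANTANDER" keeps its "SA", because neither is a standalone token.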

replace string in R giving a vector of patterns and vector of replacements

Question: Given a string with different placeholders I want to replace, does R have a function that replaces all of them given a vector of patterns and a vector of replacements? I have managed to accomplish that with a list and a loop:
    > library(stringr)
    > tt_ori <- 'I have [%VAR1%] and [%VAR2%]'
    > tt_out <- tt_ori
    > ttlist <- list('\\[%VAR1%\\]'="val-1", '\\[%VAR2%\\]'="val-2")
    > ttlist
    $`\\[%VAR1%\\]`
    [1] "val-1"
    $`\\[%VAR2%\\]`
    [1] "val-2"
    > for(var in names(ttlist)) {
    + print(paste0(var," -> ",ttlist
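The loop in the question can be wrapped into a small reusable helper. The sketch below is base R only; `replace_all_fixed` is a hypothetical name, and it assumes the placeholders should be matched literally (`fixed = TRUE`), so the brackets need no escaping. Replacements are applied in order, so a later pattern can match text produced by an earlier replacement.

```r
tt_ori <- "I have [%VAR1%] and [%VAR2%]"
patterns <- c("[%VAR1%]", "[%VAR2%]")
replacements <- c("val-1", "val-2")

# Apply each literal pattern/replacement pair in turn.
replace_all_fixed <- function(x, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  for (i in seq_along(patterns)) {
    x <- gsub(patterns[i], replacements[i], x, fixed = TRUE)
  }
  x
}

tt_out <- replace_all_fixed(tt_ori, patterns, replacements)
print(tt_out)  # "I have val-1 and val-2"
```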

Count number of times a word-wildcard appears in text (in R)

Question: I have a vector of either regular words ("activated") or wildcard words ("activat*"). I want to:
1) Count the number of times each word appears in a given text (i.e., if "activated" appears in the text, the "activated" frequency would be 1).
2) Count the number of times each wildcard word appears in a text (i.e., if "activated" and "activation" appear in the text, the "activat*" frequency would be 2).
I'm able to achieve (1), but not (2). Can anyone please help? Thanks.
    library(tm)
    library(qdap)
    text <-
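Both kinds of count can be sketched in base R by turning each term into an anchored regular expression with `utils::glob2rx()` and matching it against the tokens of the text. This is only an illustration under stated assumptions: the sample `text`, the simple alphabetic tokenizer and the helper `count_term` are made up here, since the question's own text is cut off, and neither tm nor qdap is used.

```r
# Hypothetical sample text; not the asker's data.
text <- "The enzyme was activated and its activation was activated again."
terms <- c("activated", "activat*")

# Crude tokenizer: split on anything that is not a letter, lower-case the rest.
tokens <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
tokens <- tokens[nzchar(tokens)]

# glob2rx("activated") gives "^activated$" and glob2rx("activat*") gives
# "^activat", so plain words match exactly and wildcards match by prefix.
count_term <- function(term, tokens) {
  sum(grepl(utils::glob2rx(term), tokens))
}

counts <- vapply(terms, count_term, integer(1), tokens = tokens)
print(counts)
```

Here every occurrence is counted, so "activat*" counts repeated tokens too; counting distinct matching words instead would require `unique(tokens)`.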

Replace the string value with value in the find list in R

Question: I have a dataset that has a column like
    string<-c('lib1_Rstudio_case1','lib2_Rstudio_case1and2','lib5_python_notthe correct_language','lib3_Jupyter_really_good','lib1_spyder_nice','lib1_R_the_core')
    replacement<-c('Rstudio','Jupyter','spyder','R')
I want to replace the string values if they match a value in replacement. I am using the following code right now:
    gsub(paste(replacement, collapse = "|"), replacement = replacement, x = string)
This is another piece of code which I am using to find
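The excerpt is cut off before the desired output is shown, so the following base-R sketch rests on a guess: that each element should become the first `replacement` value it contains, and stay unchanged when none is found. The variable `result` is illustrative. Note that `gsub()` uses only the first element of a vector passed as `replacement`, which is why a different approach is taken here.

```r
string <- c('lib1_Rstudio_case1', 'lib2_Rstudio_case1and2',
            'lib5_python_notthe correct_language', 'lib3_Jupyter_really_good',
            'lib1_spyder_nice', 'lib1_R_the_core')
replacement <- c('Rstudio', 'Jupyter', 'spyder', 'R')

# With perl = TRUE the alternatives are tried left to right, so "Rstudio"
# is preferred over the shorter "R" because it is listed first.
pattern <- paste(replacement, collapse = "|")
m <- regexpr(pattern, string, perl = TRUE)

# Assumption: keep the original value where nothing matched.
result <- string
result[m > 0] <- regmatches(string, m)
print(result)
```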

Estimating document polarity using R's qdap package without sentSplit

Question: I'd like to apply qdap's polarity function to a vector of documents, each of which could contain multiple sentences, and obtain the corresponding polarity for each document. For example:
    library(qdap)
    polarity(DATA$state)$all$polarity
    # Results:
    [1] -0.8165 -0.4082 0.0000 -0.8944 0.0000 0.0000 0.0000 -0.5774 0.0000
    [10] 0.4082 0.0000
    Warning message:
    In polarity(DATA$state) :
    Some rows contain double punctuation. Suggested use of `sentSplit` function.
This warning can't be ignored, as it seems to add the polarity scores of each sentence in the document. This can result in document-level

R qdap::mgsub, how to pass a pattern with a regular expression?

Question: In a previous question (replace string in R giving a vector of patterns and vector of replacements) I found that mgsub takes its pattern as a literal string that does not need to be escaped. That is good when you want to replace text like '[%.+%]' as a literal string, but it is a bad thing if you need to pass a real regular expression like:
    library('stringr')
    library('qdap')
    tt_ori <- 'I have VAR1 and VAR2'
    ttl <- list(ttregex='VAR([12])', val="val-\\1")
    ttl
    # OK
    stringr::str_replace_all( tt_ori,
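For comparison, the regular-expression replacement the question is after can be written with base R's `gsub()`, which treats the pattern as a regex by default and supports backreferences such as `\\1` in the replacement. This sketch only shows that base-R behaviour; it makes no claim about how `qdap::mgsub` should be called, and the variable `tt_out` is illustrative.

```r
tt_ori <- "I have VAR1 and VAR2"

# The capture group ([12]) is reused in the replacement via \\1.
tt_out <- gsub("VAR([12])", "val-\\1", tt_ori)
print(tt_out)  # "I have val-1 and val-2"

# The same call with fixed = TRUE treats the pattern literally,
# so nothing in this string matches and it is returned unchanged.
tt_literal <- gsub("VAR([12])", "val-\\1", tt_ori, fixed = TRUE)
print(tt_literal)
```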