In the source code of the tm text-mining R package, the file transform.R contains the removePunctuation()
function, currently defined as:
functi
As much as I like Susana's answer, it breaks the corpus in newer versions of tm: the documents are no longer PlainTextDocument objects and their metadata is destroyed.
You will get a list and the following error:
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
Using
tm_map(your_corpus, PlainTextDocument)
will give you back your corpus, but with a broken $meta (in particular, document ids will be missing).
Solution
Use content_transformer
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
your_corpus <- tm_map(your_corpus, toSpace, "„")
Source: Hands-On Data Science with R, Text Mining, Graham.Williams@togaware.com http://onepager.togaware.com/
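To illustrate the content_transformer approach, here is a minimal sketch on a toy corpus. The sample documents and the names toy_corpus and cleaned are made up for illustration, and it assumes tm 0.6 or later is installed.

```r
library(tm)

# Hypothetical two-document corpus, for illustration only.
toy_corpus <- VCorpus(VectorSource(c("either/or", "input/output devices")))

# Wrap the plain character function so that tm_map updates only the
# document content and leaves class and metadata alone.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

cleaned <- tm_map(toy_corpus, toSpace, "/")

content(cleaned[[1]])     # text with "/" replaced by a space
class(cleaned[[1]])       # still includes "PlainTextDocument"
meta(cleaned[[1]], "id")  # document id is still present
```

The same wrapper should work for any function that maps a character vector to a character vector, so other gsub-based helpers can be passed through content_transformer in the same way.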
This function removes everything that is not alphanumeric or whitespace (e.g. UTF-8 emoticons):
removeNonAlnum <- function(x) {
  # Negated class: drop anything that is neither alphanumeric nor whitespace.
  gsub("[^[:alnum:][:space:]]", "", x)
}
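One detail worth checking: a caret that is not the first character of a bracket expression is a literal character, so a pattern written as "[^[:alnum:]^[:space:]]" also keeps carets in the text. A small base-R check, using a made-up sample string:

```r
# Hypothetical sample text, for illustration only.
sample_text <- "Hello, world! 2^3 = 8 :-)"

# Second ^ inside the brackets is a literal caret, so "^" survives.
with_caret <- gsub("[^[:alnum:]^[:space:]]", "", sample_text)

# Without the second ^, carets are removed like any other symbol.
without_caret <- gsub("[^[:alnum:][:space:]]", "", sample_text)

grepl("^", with_caret, fixed = TRUE)     # is a caret still present?
grepl("^", without_caret, fixed = TRUE)  # is a caret still present?
```

To use either version with tm_map in newer versions of tm, it would need to be wrapped in content_transformer like any other plain character function.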
I had the same problem: my custom function was not working. It turns out that the first line below, which defines the generic, has to be added.
Regards
Susana
replaceExpressions <- function(x) UseMethod("replaceExpressions", x)
replaceExpressions.PlainTextDocument <- replaceExpressions.character <- function(x) {
  x <- gsub(".", " ", x, ignore.case = FALSE, fixed = TRUE)
  x <- gsub(",", " ", x, ignore.case = FALSE, fixed = TRUE)
  x <- gsub(":", " ", x, ignore.case = FALSE, fixed = TRUE)
  return(x)
}
notes_pre_clean <- tm_map(notes, replaceExpressions, useMeta = FALSE)