How to write a custom removePunctuation() function to better deal with Unicode chars?

闹比i 2021-01-02 11:29

In the source code of the tm text-mining R-package, in file transform.R, there is the removePunctuation() function, currently defined as:

functi         


        
2 Answers
  • 2021-01-02 11:46

    As much as I like Susana's answer, it breaks the corpus in newer versions of tm: the documents are no longer PlainTextDocument objects and their metadata is destroyed.

    You will get a list and the following error:

    Error in UseMethod("meta", x) : 
    no applicable method for 'meta' applied to an object of class "character"
    

    Using

    tm_map(your_corpus, PlainTextDocument)
    

    will give you back your corpus, but with a broken $meta (in particular, the document ids will be missing).

    Solution

    Use content_transformer

    toSpace <- content_transformer(function(x,pattern)
        gsub(pattern," ", x))
    your_corpus <- tm_map(your_corpus,toSpace,"„")
    

    Source: Hands-On Data Science with R, Text Mining, Graham.Williams@togaware.com http://onepager.togaware.com/
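
    Since the question is about Unicode punctuation in general rather than one character, the same content_transformer approach can wrap a Unicode-aware pattern. A minimal sketch, assuming PCRE's \p{P} property class (which needs perl = TRUE); punctToSpace is a hypothetical helper name, not part of tm, and the tm part only runs if the package is installed. Note that \p{P} does not cover symbols such as $ or +, which belong to the \p{S} categories.

    ```r
    # Hypothetical helper (not part of tm): replace every Unicode punctuation
    # character with a space. The \p{P} class is only available with perl = TRUE.
    punctToSpace <- function(x) gsub("\\p{P}", " ", x, perl = TRUE)

    # \u escapes keep the example independent of the file encoding.
    sample_text <- "\u201eHallo\u201c \u2013 Welt!"
    print(punctToSpace(sample_text))

    # Optional: apply it to a corpus without losing the document metadata.
    if (requireNamespace("tm", quietly = TRUE)) {
      library(tm)
      corpus <- VCorpus(VectorSource(c(sample_text, "plain text")))
      corpus <- tm_map(corpus, content_transformer(punctToSpace))
      # The documents should still be PlainTextDocument objects with ids.
      print(inherits(corpus[[1]], "PlainTextDocument"))
      print(meta(corpus[[1]], "id"))
    }
    ```

    Wrapping the helper in content_transformer is what keeps tm_map from returning bare character vectors.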

    Update

    This function removes everything that is neither alphanumeric nor whitespace (e.g. UTF-8 emoticons etc.)

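    removeNonAlnum <- function(x){
      # Only the first ^ negates the class; a second ^ inside the brackets
      # would be a literal caret and would be kept in the output.
      gsub("[^[:alnum:][:space:]]","",x)
    }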
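
    POSIX classes such as [:alnum:] are interpreted according to the current locale, so results can differ between systems. A minimal sketch of a locale-independent variant, assuming PCRE's Unicode property classes (perl = TRUE); keepAlnumUnicode is a hypothetical name, not from the original answer:

    ```r
    # Hypothetical variant: keep Unicode letters (\p{L}), Unicode digits (\p{N})
    # and whitespace (\s); drop everything else, including emoji and carets.
    keepAlnumUnicode <- function(x) {
      gsub("[^\\p{L}\\p{N}\\s]", "", x, perl = TRUE)
    }

    # \u and \U escapes keep the example independent of the file encoding.
    x <- "na\u00efve caf\u00e9 \U0001F600 100% ^ok^"
    print(keepAlnumUnicode(x))
    ```

    Accented letters survive because they are in the letter category, while the emoji, the percent sign and the carets are removed.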
    
  • 2021-01-02 11:59

    I had the same problem: my custom function was not working. It turns out that the first line below, which defines the generic, has to be added.

    Regards

    Susana

    replaceExpressions <- function(x) UseMethod("replaceExpressions", x)
    
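    # One implementation serves both the PlainTextDocument and character methods.
    replaceExpressions.PlainTextDocument <- replaceExpressions.character <- function(x) {
        # fixed = TRUE matches the pattern literally, so "." is not a wildcard
        x <- gsub(".", " ", x, ignore.case = FALSE, fixed = TRUE)
        x <- gsub(",", " ", x, ignore.case = FALSE, fixed = TRUE)
        x <- gsub(":", " ", x, ignore.case = FALSE, fixed = TRUE)
        return(x)
    }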
    
    notes_pre_clean <- tm_map(notes, replaceExpressions, useMeta = FALSE)
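
    The role of that first line can be seen without tm at all. A minimal base-R sketch, where replacePunct is a hypothetical name used only for illustration: UseMethod() creates the generic, and the method with the .character suffix is picked based on the class of the argument.

    ```r
    # Generic: dispatches on the class of x. Without this line there is no
    # function called replacePunct, only a method that nothing ever calls.
    replacePunct <- function(x) UseMethod("replacePunct")

    # Method for plain character vectors; a single character class stands in
    # for the three separate gsub() calls.
    replacePunct.character <- function(x) gsub("[.,:]", " ", x)

    res <- replacePunct("a.b,c:d")
    print(res)
    ```

    Inside the brackets the dot is literal, so no fixed = TRUE is needed in this variant.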
    