Removing non-English text from Corpus in R using tm()

Asked by 孤城傲影 on 2020-12-28 20:35

I am using tm() and wordcloud() for some basic data mining in R, but am running into difficulties because there are non-English characters in my data.

2 Answers
  • 2020-12-28 21:33

    You can also use the package "stringi".

    Using the same example string as in the other answer:

    library(stringi)
    dat <- "Special,  satisfação, Happy, Sad, Potential, für"
    stringi::stri_trans_general(dat, "latin-ascii")
    

    Output:

    [1] "Special,  satisfacao, Happy, Sad, Potential, fur"  
    
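    Since the goal is ultimately a tm corpus, a minimal sketch of applying the same transliteration inside a tm pipeline via tm_map() and content_transformer() might look like this (the vector dat below is assumed for illustration):

    library(tm)
    library(stringi)

    # assumed example input: one document per element
    dat <- c("Special, satisfação, Happy", "Sad, Potential, für")

    # build the corpus, then transliterate every document to plain ASCII
    corp <- Corpus(VectorSource(dat))
    corp <- tm_map(corp, content_transformer(function(x) stri_trans_general(x, "latin-ascii")))

    inspect(corp)

    Unlike dropping whole words, this keeps the non-English words but strips their accents, so "für" survives as "fur".
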
  • 2020-12-28 21:38

    Here's a method to remove words with non-ASCII characters before making a corpus:

    # remove words with non-ASCII characters
    # assuming you read your txt file in as a vector, e.g.
    # dat <- readLines('~/temp/dat.txt')
    dat <- "Special,  satisfação, Happy, Sad, Potential, für"
    # convert string to vector of words
    dat2 <- unlist(strsplit(dat, split=", "))
    # find indices of words with non-ASCII characters:
    # iconv() replaces each non-ASCII character with the marker string "dat2",
    # and grep() then returns the indices of the words containing that marker
    dat3 <- grep("dat2", iconv(dat2, "latin1", "ASCII", sub="dat2"))
    # subset original vector of words to exclude words with non-ASCII char
    dat4 <- dat2[-dat3]
    # convert vector back to a string
    dat5 <- paste(dat4, collapse = ", ")
    # make corpus
    require(tm)
    words1 <- Corpus(VectorSource(dat5))
    inspect(words1)
    
    A corpus with 1 text document
    
    The metadata consists of 2 tag-value pairs and a data frame
    Available tags are:
      create_date creator 
    Available variables in the data frame are:
      MetaID 
    
    [[1]]
    Special, Happy, Sad, Potential
    
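    If the end goal is a word cloud, a minimal sketch continuing from the words1 corpus built above (parameter values are illustrative) could be:

    library(tm)
    library(wordcloud)

    # term-document matrix from the cleaned corpus
    tdm <- TermDocumentMatrix(words1, control = list(tolower = TRUE, removePunctuation = TRUE))

    # word frequencies, then the cloud itself
    freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    wordcloud(names(freqs), freqs, min.freq = 1)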