Remove unicode <+f0b7> from Corpus text

你的背包 2021-01-16 21:24

I'm having a pretty stubborn issue... I can't seem to remove the <+f0b7> and <+f0a0> strings from Corpora that have loaded from

1 Answer
  • 2021-01-16 22:09

    OK. The problem is that your data contains an unusual Unicode character. In R, this character is typically written with the escape "\uf0b7". But when inspect() prints its data, it encodes the character as "<U+F0B7>". Observe:

    sample<-c("Crazy \uf0b7 Character")
    cp<-Corpus(VectorSource(sample))
    inspect(DocumentTermMatrix(cp))
    
    # A document-term matrix (1 documents, 3 terms)
    # 
    # Non-/sparse entries: 3/0
    # Sparsity           : 0%
    # Maximal term length: 9 
    # Weighting          : term frequency (tf)
    # 
    #     Terms
    # Docs <U+F0B7> character crazy
    #    1        1         1     1
    

    (Actually, I had to create this output on a Windows machine running R 3.0.2; it worked fine on my Mac running R 3.1.0.)
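
To confirm which character is actually in the text, it can help to look at its code point. This is a minimal sketch in base R; the variable names `code_points`, `odd`, and `labels` are illustrative and not from the question:

```r
# Illustrative only: inspect the code points in the sample string.
sample <- c("Crazy \uf0b7 Character")

# utf8ToInt() returns one integer code point per character.
code_points <- utf8ToInt(sample)

# Keep the non-ASCII code points and format them in U+XXXX notation.
odd <- code_points[code_points > 127]
labels <- sprintf("U+%04X", odd)
print(labels)
```

U+F0B7 lies in the Unicode Private Use Area (U+E000 to U+F8FF), so it has no standard glyph. That is consistent with the output looking different on Windows and on a Mac.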

    Unfortunately, you will not be able to remove this with removeWords, because the regular expression used in that function requires word boundaries on both sides of the "word", and this character does not seem to be treated as a word character, so no boundary is found next to it. See

    gsub("\uf0b7","",sample)
    # [1] "Crazy  Character"
    gsub("\\b\uf0b7\\b","",sample)
    # [1] "Crazy <U+F0B7> Character"
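
Why the `\\b` version fails can be checked directly. The sketch below assumes `perl = TRUE` (the setting the removeWords-style pattern in this answer uses), where `\\b` matches only between a word character and a non-word character. The names `between_spaces` and `between_letters` are illustrative:

```r
x <- "Crazy \uf0b7 Character"

# Surrounded by spaces: neither neighbour is a word character,
# so there is no boundary on either side of the symbol.
between_spaces <- grepl("\\b\uf0b7\\b", x, perl = TRUE)

# Glued between letters: each letter/symbol transition is a boundary.
between_letters <- grepl("\\b\uf0b7\\b", "a\uf0b7b", perl = TRUE)

print(c(between_spaces, between_letters))
```

The first check reports no match and the second reports a match, so the boundary condition only fails when the symbol stands alone between spaces, which is exactly the case in this corpus.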
    

    So we can write our own function to use with tm_map. Consider

    removeCharacters <- function(x, characters) {
      gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
    }
    

    which is basically the removeWords function just without the boundary conditions. Then we can run

    cp2 <- tm_map(cp, removeCharacters, c("\uf0b7","\uf0a0"))
    inspect(DocumentTermMatrix(cp2))
    
    # A document-term matrix (1 documents, 2 terms)
    # 
    # Non-/sparse entries: 2/0
    # Sparsity           : 0%
    # Maximal term length: 9 
    # Weighting          : term frequency (tf)
    # 
    #     Terms
    # Docs character crazy
    #    1         1     1
    

    and we see those unicode characters are no longer there.
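
The same function also works on a plain character vector, which makes it easy to test outside tm. As a hedged alternative, since both U+F0B7 and U+F0A0 sit in the Unicode Private Use Area, the PCRE property class `\\p{Co}` could strip every such character at once. `removePrivateUse` is a hypothetical helper name, not part of tm, and the second document in `docs` is made up for illustration:

```r
# Same function as in the answer, applied to plain strings.
removeCharacters <- function(x, characters) {
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}

# Hypothetical helper: drop every private-use character (Unicode category Co).
removePrivateUse <- function(x) {
  gsub("\\p{Co}", "", x, perl = TRUE)
}

docs <- c("Crazy \uf0b7 Character", "Another \uf0a0 one")

cleaned_listed <- removeCharacters(docs, c("\uf0b7", "\uf0a0"))
cleaned_class <- removePrivateUse(docs)

print(cleaned_listed)
print(identical(cleaned_listed, cleaned_class))
```

Listing the characters explicitly is the safer choice when only known debris should go; the property class is broader and would also remove private-use characters you might want to keep. With a VCorpus in more recent tm versions, a plain string function usually has to be wrapped, for example `tm_map(cp, content_transformer(removeCharacters), c("\uf0b7","\uf0a0"))`; treat that as something to verify against your installed tm version.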
