I'm having a pretty stubborn issue... I can't seem to remove the "\uf0b7" and "\uf0a0" strings from Corpora that have loaded from
Ok. The problem is that your data has an unusual Unicode character in it. In R, we typically escape this character as "\uf0b7". But when inspect() prints its data, it encodes it as "" (the character has no visible glyph). Observe:
library(tm)

sample <- c("Crazy \uf0b7 Character")
cp <- Corpus(VectorSource(sample))
inspect(DocumentTermMatrix(cp))
# A document-term matrix (1 documents, 3 terms)
#
# Non-/sparse entries: 3/0
# Sparsity : 0%
# Maximal term length: 9
# Weighting : term frequency (tf)
#
# Terms
# Docs  character crazy
#    1 1         1     1
(The matrix really does have three terms; one of the column headers is the \uf0b7 character itself, which prints as nothing because it has no visible glyph.)
(Actually I had to create this output on a Windows machine running R 3.0.2; it worked fine on my Mac running R 3.1.0.)
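If you are not sure exactly which code point you are dealing with, base R can identify it for you; this is a quick check (no tm required), using the same example string as above:

```r
# "\uf0b7" is a single character whose code point is 0xf0b7 (61623).
# That code point lies in the Unicode Private Use Area, so fonts
# typically render it as a blank or a box -- which is why it is so
# hard to spot in printed output.
nchar("\uf0b7")                      # 1
utf8ToInt("\uf0b7")                  # 61623
sprintf("%x", utf8ToInt("\uf0b7"))   # "f0b7"
```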
Unfortunately you will not be able to remove this with removeWords, because the regular expression used in that function requires that word boundaries appear on both sides of the "word", and this character doesn't seem to be recognized as a word character, so no boundary exists next to it. See
gsub("\uf0b7", "", sample)
# [1] "Crazy  Character"
gsub("\\b\uf0b7\\b", "", sample)
# [1] "Crazy  Character"  (looks identical, but the invisible character is actually still there)
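To see why the boundary version finds nothing, recall that \b only matches at a transition between a word character and a non-word character. Neither \uf0b7 nor the space next to it is a word character, so no boundary ever exists around it. A minimal illustration (using grepl rather than gsub, with perl = TRUE as removeWords uses):

```r
# No word/non-word transition exists next to \uf0b7 (the space and
# \uf0b7 are both non-word characters), so the boundary never matches.
grepl("\\b\uf0b7\\b", "Crazy \uf0b7 Character", perl = TRUE)   # FALSE

# An ordinary word has boundaries on both sides, so this does match.
grepl("\\bCrazy\\b", "Crazy \uf0b7 Character", perl = TRUE)    # TRUE
```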
So we can write our own function to use with tm_map. Consider
removeCharacters <- function(x, characters) {
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}
which is basically the removeWords function, just without the boundary conditions. (In more recent versions of tm you may need to wrap a custom function like this in content_transformer() when passing it to tm_map.) Then we can run
cp2 <- tm_map(cp, removeCharacters, c("\uf0b7", "\uf0a0"))
inspect(DocumentTermMatrix(cp2))
# A document-term matrix (1 documents, 2 terms)
#
# Non-/sparse entries: 2/0
# Sparsity : 0%
# Maximal term length: 9
# Weighting : term frequency (tf)
#
# Terms
# Docs character crazy
#    1         1     1
and we see those Unicode characters are no longer there.
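If you would rather not enumerate the offending characters one at a time, a blunter alternative (assuming the corpus should contain only ASCII text; this is a sketch, not part of the original answer) is to drop everything non-ASCII with iconv():

```r
# Drop every character that cannot be represented in ASCII.
# sub = "" tells iconv() to delete unconvertible characters rather
# than fail, so stray private-use code points like \uf0b7 disappear.
removeNonASCII <- function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = "")

removeNonASCII("Crazy \uf0b7 Character")
# [1] "Crazy  Character"
```

Inside tm this would be applied as tm_map(cp, content_transformer(removeNonASCII)). Be aware it also strips legitimate accented letters, so the targeted removeCharacters above is safer when the corpus contains real non-ASCII text.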