Question
I created a corpus in R using the tm package, specifying the language and encoding as follows:
de_DE.corpus <- Corpus(VectorSource(de_DE.sample),
                       readerControl = list(language = "de_DE", encoding = "UTF_8"))
de_DE.corpus[36]$content
de_DE.dtm <- DocumentTermMatrix(de_DE.corpus, control = list(encoding = 'UTF-8'))
inspect(de_DE.dtm[, grepl("grÃ", de_DE.dtm$dimnames$Terms)])
inspect(de_DE.dtm[36, ])
When I look at document 36, which contains 'ü', with de_DE.corpus[36]$content, the text is shown correctly, e.g. " ...Single ist so die Begründung der Behörde Eine... "
But when I create the DocumentTermMatrix (I tried several options for encoding and language), I get terms like "begrÃ" where the word should be "Begründung". This is the result of running inspect(de_DE.dtm[36, ]):
<<DocumentTermMatrix (documents: 1, terms: 21744)>>
Non-/sparse entries: 102/21642
Sparsity : 100%
Maximal term length: 43
Weighting : term frequency (tf)
Sample :
    Terms
Docs begrà das dem der die eine einen jobcenter und zum
  36     3   4   2   4   8    2     2         4   3   3
I would appreciate it if someone could tell me how to fix the problem. Thanks in advance :)
Answer 1:
Can you check your input data? Your code works for me, so I think the problem already occurs when you load the data into de_DE.sample.
doc <- c("Single ist so die Begründung der Behörde Eine", "Single Begründung Behörde ")
de_DE.corpus <- Corpus(VectorSource(doc),
                       readerControl = list(language = "de_DE", encoding = "UTF_8"))
de_DE.dtm <- DocumentTermMatrix(de_DE.corpus, control = list(encoding = 'UTF-8'))
inspect(de_DE.dtm[1, ])
<<DocumentTermMatrix (documents: 1, terms: 7)>>
Non-/sparse entries: 7/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
    Terms
Docs begründung behörde der die eine ist single
   1          1       1   1   1    1   1      1
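One way to check the input data, as suggested above, is to look at how R has marked the strings before they go into the corpus. The following is only a sketch using base R functions (Encoding, validUTF8, enc2utf8). The name sample_text is a hypothetical stand-in for de_DE.sample, and the assumption that the garbled terms come from UTF-8 bytes being treated as Latin-1 is a guess based on the "Ã" in the output, not something the question confirms.

```r
# Minimal sketch in base R; `sample_text` is a hypothetical stand-in for de_DE.sample.
# \u escapes keep the example independent of the script's own file encoding.
sample_text <- c("Single ist so die Begr\u00fcndung der Beh\u00f6rde Eine")

# 1. Inspect how R has marked the strings and whether the bytes are valid UTF-8.
Encoding(sample_text)   # declared encoding: "UTF-8", "latin1", "bytes" or "unknown"
validUTF8(sample_text)  # TRUE when the bytes form valid UTF-8

# 2. Reproduce the symptom: UTF-8 bytes that are declared as Latin-1.
mislabeled <- sample_text
Encoding(mislabeled) <- "latin1"   # same bytes, wrong declaration
garbled <- enc2utf8(mislabeled)    # each umlaut (2 bytes) turns into 2 characters
has_mojibake <- grepl("\u00c3", garbled, fixed = TRUE)  # looks for a capital A with tilde

# 3. Repair by restoring the correct declaration before building the corpus.
repaired <- mislabeled
Encoding(repaired) <- "UTF-8"
```

If the text comes from a file, passing encoding = "UTF-8" to readLines when loading it is one option for getting the strings marked correctly from the start. Whether that applies here depends on how de_DE.sample was actually created.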
Source: https://stackoverflow.com/questions/45555294/issue-in-documenttermmatrix-with-corpus-in-german