Question
Using the text mining package tm for R, the following works in tm version 0.6.2 (R version 3.4.3):
library(tm)
a = "This is the first document."
b = "This is the second document."
c = "This is the third document."
d = "This is the fourth document."
docs1 = VectorSource(c(a,b))
docs2 = VectorSource(c(c,d))
corpus1 = Corpus(docs1)
corpus2 = Corpus(docs2)
corpus3 = c(corpus1,corpus2)
inspect(corpus3)
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 4
However, the same code in tm version 0.7.3 (R version 3.4.2) gives an error:
Error in UseMethod("inspect", x) :
no applicable method for 'inspect' applied to an object of class "list"
According to vignette("tm", package = "tm"), the c() function is overloaded:
Many standard operators and functions ([, [<-, [[, [[<-, c(), lapply()) are available for corpora with semantics similar to standard R routines. E.g., c() concatenates two (or more) corpora. Applied to several text documents it returns a corpus. The metadata is automatically updated, if corpora are concatenated (i.e., merged).
However, for the new version this is apparently no longer the case. How can two corpora be combined in tm 0.7.3? An obvious solution is to combine the documents first and create the corpus afterwards, but I'm looking for a way to combine two already existing corpora.
Answer 1:
I do not have much experience with the tm package, so my answer may miss some nuance in how SimpleCorpus, VCorpus, and the other tm object classes differ.
The inputs to your call to c() are of class SimpleCorpus, and it doesn't look like tm comes with a c method specifically for this class. Method dispatch therefore does not reach a c that combines the corpora the way you'd want, and the result comes back as a plain list, which is why inspect() fails. However, there is a c method for the VCorpus class (tm:::c.VCorpus).
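The dispatch explanation above can be checked on your own installation. The following is a minimal sketch, not part of the original answer: it prints the class vector of a corpus built with Corpus() and reports, for each class, whether a c() method is registered. The result depends on the installed tm version, so no expected output is shown.

```r
library(tm)

corpus1 <- Corpus(VectorSource(c("This is the first document.",
                                 "This is the second document.")))

# The class vector that c() dispatches on
print(class(corpus1))

# For each class, look up a registered c() method; NULL means none exists
for (cls in class(corpus1)) {
  method <- getS3method("c", cls, optional = TRUE)
  status <- if (is.null(method)) "no c() method" else "c() method found"
  cat(cls, ":", status, "\n")
}

# The VCorpus method referred to above
has_vcorpus_method <- is.function(getS3method("c", "VCorpus", optional = TRUE))
print(has_vcorpus_method)
```

If no class of corpus1 reports a c() method, combining two such corpora with c() will not produce a corpus.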
There are two ways to avoid corpus3 being coerced to a list, but they seem to result in different structures. I present both below and leave it to you to decide whether they accomplish your end goal.
1) You can call tm:::c.VCorpus directly when defining corpus3:
> library(tm)
>
> a = "This is the first document."
> b = "This is the second document."
> c = "This is the third document."
> d = "This is the fourth document."
> docs1 = VectorSource(c(a,b))
> docs2 = VectorSource(c(c,d))
> corpus1 = Corpus(docs1)
> corpus2 = Corpus(docs2)
>
> corpus3 = tm:::c.VCorpus(corpus1,corpus2)
>
> inspect(corpus3)
<<VCorpus>>
Metadata: corpus specific: 2, document level (indexed): 0
Content: documents: 4
[1] This is the first document. This is the second document. This is the third document.
[4] This is the fourth document.
2) You can use VCorpus when defining corpus1 and corpus2:
> library(tm)
>
> a = "This is the first document."
> b = "This is the second document."
> c = "This is the third document."
> d = "This is the fourth document."
> docs1 = VectorSource(c(a,b))
> docs2 = VectorSource(c(c,d))
> corpus1 = VCorpus(docs1)
> corpus2 = VCorpus(docs2)
>
> corpus3 = c(corpus1,corpus2)
>
> inspect(corpus3)
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 4
[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 27
[[2]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 28
[[3]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 27
[[4]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 28
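To confirm that the second approach really yields one corpus rather than a list, a quick check along these lines may help. This is a sketch under the assumption that both inputs are built with VCorpus(), as in approach 2; the `texts` variable is introduced here purely for illustration.

```r
library(tm)

corpus1 <- VCorpus(VectorSource(c("This is the first document.",
                                  "This is the second document.")))
corpus2 <- VCorpus(VectorSource(c("This is the third document.",
                                  "This is the fourth document.")))

# Both inputs are VCorpus objects, so c() dispatches to the VCorpus method
corpus3 <- c(corpus1, corpus2)

# The result should still be a corpus holding all four documents
print(inherits(corpus3, "VCorpus"))
print(length(corpus3))

# Extract the text of each document to confirm the order was preserved
texts <- vapply(seq_along(corpus3),
                function(i) paste(content(corpus3[[i]]), collapse = " "),
                character(1))
print(texts)
```

The same checks can be run on the result of approach 1 to see how its structure differs.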
Source: https://stackoverflow.com/questions/48224166/combine-corpora-in-tm-0-7-3