Combine corpora in tm 0.7.3

Submitted by 蓝咒 on 2019-12-13 03:54:11

Question


Using the text mining package tm for R, the following works in version 0.6.2, R version 3.4.3:

library(tm)
a = "This is the first document."
b = "This is the second document."
c = "This is the third document."
d = "This is the fourth document."
docs1 = VectorSource(c(a,b))
docs2 = VectorSource(c(c,d))
corpus1 = Corpus(docs1)
corpus2 = Corpus(docs2)
corpus3 = c(corpus1,corpus2)
inspect(corpus3)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 4

However, the same code in tm version 0.7.3 (R version 3.4.2) gives an error:

Error in UseMethod("inspect", x) :
  no applicable method for 'inspect' applied to an object of class "list"

According to vignette("tm",package="tm"), the c() function is overloaded:

Many standard operators and functions ([, [<-, [[, [[<-, c(), lapply()) are available for corpora with semantics similar to standard R routines. E.g., c() concatenates two (or more) corpora. Applied to several text documents it returns a corpus. The metadata is automatically updated, if corpora are concatenated (i.e., merged).

However, for the new version this is apparently no longer the case. How can two corpora be combined in tm 0.7.3? An obvious solution is to combine the documents first and create the corpus afterwards, but I'm looking for a solution to combine two already existing corpora.


Answer 1:


I do not have much experience with the tm package so my answer may lack some nuance in understanding of SimpleCorpus vs VCorpus vs other tm object classes.

The inputs to your call to c are of class SimpleCorpus, and it doesn't look like tm comes with a c method for this class. Method dispatch therefore never reaches a c that combines the corpora the way you'd want, and you get a plain list back. However, there is a c method for the VCorpus class (tm:::c.VCorpus).
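One way to check this on your own installation is to look at the class of the corpus and ask R which c methods are registered. This is only a diagnostic sketch: the output depends on the tm version you have installed, so no expected output is shown.

```r
library(tm)

# Same kind of corpus as in the question
corpus1 <- Corpus(VectorSource(c("This is the first document.",
                                 "This is the second document.")))

# Class vector of the object; S3 dispatch for c() walks through it in order
print(class(corpus1))

# With optional = TRUE, getS3method() returns NULL instead of raising an
# error when no method is registered for that class
print(is.null(getS3method("c", "SimpleCorpus", optional = TRUE)))
print(is.null(getS3method("c", "VCorpus", optional = TRUE)))
```

If the first is.null() call prints TRUE, no c method is registered for SimpleCorpus, which is consistent with c(corpus1, corpus2) coming back as a plain list.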

There are two ways to keep corpus3 from ending up as a plain list, but they produce differently structured results. I present both below and leave it to you to decide whether they accomplish your end goal.

1) You can call tm:::c.VCorpus directly when defining corpus3:

> library(tm)
> 
> a = "This is the first document."
> b = "This is the second document."
> c = "This is the third document."
> d = "This is the fourth document."
> docs1 = VectorSource(c(a,b))
> docs2 = VectorSource(c(c,d))
> corpus1 = Corpus(docs1)
> corpus2 = Corpus(docs2)
> 
> corpus3 = tm:::c.VCorpus(corpus1,corpus2)
> 
> inspect(corpus3)
<<VCorpus>>
Metadata:  corpus specific: 2, document level (indexed): 0
Content:  documents: 4

[1] This is the first document.  This is the second document. This is the third document. 
[4] This is the fourth document.

2) You can use VCorpus when defining corpus1 & corpus2:

> library(tm)
> 
> a = "This is the first document."
> b = "This is the second document."
> c = "This is the third document."
> d = "This is the fourth document."
> docs1 = VectorSource(c(a,b))
> docs2 = VectorSource(c(c,d))
> corpus1 = VCorpus(docs1)
> corpus2 = VCorpus(docs2)
> 
> corpus3 = c(corpus1,corpus2)
> 
> inspect(corpus3)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 4

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 27

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 28

[[3]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 27

[[4]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 28
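If corpus1 and corpus2 already exist as SimpleCorpus objects and cannot be recreated from their sources, a third possibility (not one of the two approaches above, and only a sketch) is to pull the text back out with content() and build a new VCorpus from it. This assumes that content() returns a character vector for a SimpleCorpus, and it discards any metadata the original corpora carried.

```r
library(tm)

# Two existing corpora, built as in the question
corpus1 <- Corpus(VectorSource(c("This is the first document.",
                                 "This is the second document.")))
corpus2 <- Corpus(VectorSource(c("This is the third document.",
                                 "This is the fourth document.")))

# Assumption: content() yields the documents as a character vector here,
# so the two vectors can be concatenated before rebuilding the corpus.
# unname() drops any element names so the result is a plain text vector.
texts <- unname(c(content(corpus1), content(corpus2)))

# Rebuild as a VCorpus; corpus-level and document-level metadata are not
# carried over
corpus3 <- VCorpus(VectorSource(texts))

inspect(corpus3)
```

Because the result is a VCorpus, later calls to c() on it dispatch to the VCorpus method, as in approach 2.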


Source: https://stackoverflow.com/questions/48224166/combine-corpora-in-tm-0-7-3
