Use tm's Corpus function with big data in R

百般思念 submitted on 2019-12-21 04:48:06

Question


I'm trying to do text mining on big data in R with tm.

I frequently run into memory issues (such as "cannot allocate vector of size ...") and have tried the established methods of troubleshooting them, such as

  • using 64-bit R
  • trying different operating systems (Windows, Linux, Solaris, etc.)
  • setting memory.limit() to its maximum
  • making sure that sufficient RAM and compute is available on the server (which there is)
  • making liberal use of gc()
  • profiling the code for bottlenecks
  • breaking up big operations into multiple smaller operations

However, when I try to run Corpus on a vector of a million or so text fields, I get a slightly different memory error than usual, and I'm not sure how to work around the problem. The error is:

> ds <- Corpus(DataframeSource(dfs))
Error: memory exhausted (limit reached?)

Can (and should) I run Corpus incrementally on blocks of rows from that source dataframe then combine the results? Is there a more efficient way to run this?

The size of the data that will produce this error depends on the computer running it, but if you take the built-in crude dataset and replicate the documents until it's large enough, then you can replicate the error.
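A minimal sketch of that reproduction, in case it helps anyone trying this at home. The helper name inflate_texts and the multiplier of 500 are my own placeholders, not anything from tm. I also use VectorSource here rather than DataframeSource, because as far as I know the column layout DataframeSource expects differs between tm versions; that is a swap on my part, not what the question ran.

```r
# Hypothetical helper (not part of tm): repeat a character vector of
# documents `times` times to inflate the data set.
inflate_texts <- function(texts, times) {
  rep(unname(texts), times = times)
}

# The tm part only runs if the package is installed.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  data("crude")  # the 20 Reuters articles shipped with tm

  # Flatten each document to a single string.
  texts <- vapply(crude,
                  function(d) paste(as.character(d), collapse = "\n"),
                  character(1))

  # 20 * 500 = 10,000 documents. Raise `times` until your machine
  # runs out of memory; the threshold depends on the hardware.
  big <- inflate_texts(texts, times = 500)
  ds <- VCorpus(VectorSource(big))
  print(ds)
}
```

How large `times` has to be before the error appears will vary from machine to machine, so treat the number above as a starting point only.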

UPDATE

I've been experimenting with combining smaller corpora, i.e.

test1 <- dfs[1:10000,]
test2 <- dfs[10001:20000,]

ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))

and while I haven't been successful, I did discover tm_combine, which is supposed to solve this exact problem. The only catch is that, for some reason, my 64-bit build of R 3.1.1 with the newest version of tm can't find the function tm_combine. Perhaps it was removed from the package? I'm investigating...

> require(tm)
> ds.12 <- tm_combine(ds.1,ds.2)
Error: could not find function "tm_combine"

Answer 1:


I don't know whether tm_combine was deprecated or why it's not found in the tm namespace, but I did find a solution: run Corpus on smaller chunks of the dataframe, then combine the results.

This StackOverflow post had a simple way to do that without tm_combine:

test1 <- dfs[1:100000,]
test2 <- dfs[100001:200000,]

ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))

#ds.12 <- tm_combine(ds.1,ds.2) ##Error: could not find function "tm_combine"
ds.12 <- c(ds.1,ds.2)

which gives you:

ds.12

<<VCorpus (documents: 200000, metadata (corpus/indexed): 0/0)>>

Sorry not to figure this out on my own before asking. I tried and failed with other ways of combining objects.
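For more than two blocks, the same idea can be wrapped in a loop. This is only a rough sketch under my own assumptions: chunk_indices and corpus_in_chunks are hypothetical helper names, the chunk size is arbitrary, and I pass in a character vector with VectorSource instead of slicing the dataframe for DataframeSource. I have not measured whether it avoids the memory error at every scale.

```r
# Hypothetical helper: split 1..n into consecutive blocks of at most
# `chunk_size` indices.
chunk_indices <- function(n, chunk_size) {
  idx <- seq_len(n)
  unname(split(idx, ceiling(idx / chunk_size)))
}

# Hypothetical helper: build one corpus per block, then concatenate
# the pieces with c(), as in ds.12 <- c(ds.1, ds.2).
# `build` is any function that maps a character vector to a corpus.
corpus_in_chunks <- function(texts, chunk_size, build) {
  parts <- lapply(chunk_indices(length(texts), chunk_size),
                  function(i) build(texts[i]))
  do.call(c, parts)
}

# The tm part only runs if the package is installed.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  # Stand-in data; in practice this would be the text column of dfs.
  texts <- sprintf("document number %d", 1:25)
  ds <- corpus_in_chunks(texts, chunk_size = 10,
                         build = function(x) VCorpus(VectorSource(x)))
  print(ds)
}
```

Passing the corpus constructor in as `build` keeps the chunking logic separate from tm, so the same helper could be pointed at a different source type if that suits your data better.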



Source: https://stackoverflow.com/questions/25533594/use-tms-corpus-function-with-big-data-in-r
