Working with text classification and big sparse matrices in R

Submitted by 不问归期 on 2019-12-21 22:22:49

Question


I'm working on a multi-class text classification project, and I need to build the document-term matrices and then train and test models in R.

My datasets already exceed the size limits of R's base matrix class, so I need to build large sparse matrices to classify, for example, 100k tweets. I am using the quanteda package, which so far has been more useful and reliable than tm, where creating a DocumentTermMatrix with a dictionary makes the process extremely memory hungry even with small datasets. Currently I use quanteda to build the equivalent document-term matrix container, which I then convert into a data.frame for training.

I want to know whether there is a way to build such big matrices. I have been reading about the bigmemory package, which provides this kind of container, but I am not sure it will work with caret for the later classification. Overall, I want to understand the problem and find a workaround that lets me work with bigger datasets. RAM is not a (big) problem (32GB), but I have not found a way to do it and feel completely lost.


Answer 1:


At what point did you hit the RAM constraints?

quanteda is a good package for NLP on medium-sized datasets, but I also suggest trying my text2vec package. It is generally considerably more memory friendly and doesn't require loading all the raw text into RAM (for example, it can create a DTM for a Wikipedia dump on a 16GB laptop).
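As a rough illustration of the text2vec workflow, here is a minimal sketch that builds a sparse DTM from an iterator over tokens. The three toy documents and the `doc*` ids are made up for the example; a real corpus would be read from disk rather than typed inline.

```r
library(text2vec)

# Hypothetical toy corpus standing in for the real tweets.
texts <- c("Good movie", "Bad movie", "Good good plot")

# itoken() wraps the texts in an iterator that lowercases and tokenizes
# each chunk as it is consumed.
it <- itoken(texts,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = paste0("doc", seq_along(texts)),
             progressbar = FALSE)

# First pass over the iterator: collect the vocabulary.
vocab <- create_vocabulary(it)

# Second pass: map tokens to columns and build the sparse DTM.
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)

dim(dtm)    # documents x terms
class(dtm)  # a sparse matrix class from the Matrix package
```

The result stays sparse from start to finish, so it can be passed directly to modelling functions that accept Matrix objects.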

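Second, I strongly recommend against converting the data into a data.frame. A data.frame stores every cell, zeros included, so the conversion throws away the memory savings of the sparse format. Try to work with sparseMatrix objects directly.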
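To make "working with sparseMatrix objects directly" concrete, here is a small sketch using only the Matrix package. The toy documents and the whitespace tokenization are assumptions for illustration; the point is that the matrix is built, subset, and split without ever becoming dense.

```r
library(Matrix)

# Hypothetical toy corpus; real triplets would come from a proper tokenizer.
docs <- c("good movie", "bad movie", "good good plot")
tokens <- strsplit(docs, " ", fixed = TRUE)

vocab <- sort(unique(unlist(tokens)))
i <- rep(seq_along(tokens), lengths(tokens))  # document index of each token
j <- match(unlist(tokens), vocab)             # term index of each token

# sparseMatrix() sums duplicated (i, j) pairs, which yields term counts.
dtm <- sparseMatrix(i = i, j = j, x = rep(1, length(i)),
                    dims = c(length(docs), length(vocab)),
                    dimnames = list(NULL, vocab))

# Row subsetting (e.g. a train/test split) keeps the sparse representation.
train <- dtm[c(1, 3), , drop = FALSE]
test  <- dtm[2, , drop = FALSE]

nnzero(dtm)   # number of stored non-zero counts
class(train)  # still a sparse Matrix class
```

Only the non-zero counts are stored, which is what keeps a 100k-document matrix manageable where the dense equivalent would not be.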

The following methods work well for text classification:

  1. Logistic regression with an L1 penalty (see the glmnet package)
  2. Linear SVM (see LiblineaR, but it is worth searching for alternatives)
  3. xgboost is also worth trying. I would prefer linear models, so you can try its linear booster.
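The first option can be sketched as follows. This is only an illustration under stated assumptions: the sparse matrix is random data standing in for a real DTM, the three class labels are synthetic, and the fold count and the 200/100 split are arbitrary choices.

```r
library(Matrix)
library(glmnet)

set.seed(1)
n <- 300
p <- 50

# Hypothetical stand-in for a document-term matrix: sparse random counts.
x <- rsparsematrix(n, p, density = 0.1,
                   rand.x = function(k) rpois(k, 1) + 1)

# Synthetic labels: the class follows whichever of the first three
# columns (plus noise) is largest.
scores <- as.matrix(x[, 1:3]) + matrix(rnorm(n * 3, sd = 0.5), n, 3)
y <- factor(c("a", "b", "c")[max.col(scores)])

train_idx <- 1:200
test_idx  <- 201:300

# alpha = 1 selects the pure L1 (lasso) penalty; cv.glmnet() picks lambda
# by cross-validation and accepts the sparse matrix as-is.
fit <- cv.glmnet(x[train_idx, ], y[train_idx],
                 family = "multinomial", alpha = 1, nfolds = 5)

pred <- predict(fit, newx = x[test_idx, ], s = "lambda.min", type = "class")
mean(pred == y[test_idx])  # held-out accuracy on the synthetic data
```

Because glmnet takes Matrix objects directly, no conversion to a data.frame or dense matrix is needed at any step.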


Source: https://stackoverflow.com/questions/38755207/working-with-text-classification-and-big-sparse-matrices-in-r
