Is scikit-learn suitable for big data tasks?

爱一瞬间的悲伤 2021-01-31 05:25

I'm working on a TREC task involving the use of machine learning techniques, where the dataset consists of more than 5 terabytes of web documents, from which bag-of-words vectors a

1 answer
  •  庸人自扰
    2021-01-31 05:50

    HashingVectorizer will work if you iteratively chunk your data into batches that fit in memory, for instance 10k or 100k documents each.

    You can then pass the batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier) and then iterate on new batches.
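A minimal sketch of that batch loop, assuming a tiny in-memory toy corpus in place of documents streamed from disk. The `iter_batches` helper, the toy data, and the batch size are illustrative assumptions, not part of the original answer.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_batches(docs, labels, batch_size):
    """Yield successive (documents, labels) slices of at most batch_size items."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size], labels[start:start + batch_size]


# Toy corpus standing in for a stream read from disk (assumption).
docs = ["cheap pills buy now", "meeting agenda for monday",
        "buy cheap watches now", "project status meeting notes"] * 50
labels = [1, 0, 1, 0] * 50

# HashingVectorizer is stateless: there is no fit step, so each batch can be
# transformed independently and no vocabulary has to be held in memory.
vectorizer = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)
clf = SGDClassifier(random_state=0)
all_classes = [0, 1]  # partial_fit needs the full set of classes on the first call

for batch_docs, batch_labels in iter_batches(docs, labels, batch_size=40):
    X_batch = vectorizer.transform(batch_docs)
    clf.partial_fit(X_batch, batch_labels, classes=all_classes)

predictions = clf.predict(vectorizer.transform(["buy pills now", "monday meeting"]))
```

On a real corpus the generator would read and yield documents lazily, so only one batch is in memory at a time.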

    You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go, to monitor the accuracy of the partially trained model without waiting until it has seen all the samples.
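One way this monitoring could look, as a sketch only: the toy documents, the label names, and the `history` list are assumptions made for illustration, and the validation set is far smaller than the 10k documents suggested above.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy data standing in for a large document stream (assumption).
train_docs = ["stock market rally", "football match result",
              "market shares fall", "team wins football cup"] * 100
train_labels = ["finance", "sports", "finance", "sports"] * 100
val_docs = ["market rally continues", "football team loses"]
val_labels = ["finance", "sports"]

vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier(random_state=0)
classes = ["finance", "sports"]

# The validation set is small, so it is vectorized once and kept in memory.
X_val = vectorizer.transform(val_docs)

batch_size = 50
history = []  # (documents seen so far, validation accuracy)
for start in range(0, len(train_docs), batch_size):
    X_batch = vectorizer.transform(train_docs[start:start + batch_size])
    y_batch = train_labels[start:start + batch_size]
    clf.partial_fit(X_batch, y_batch, classes=classes)
    seen = min(start + batch_size, len(train_docs))
    history.append((seen, clf.score(X_val, val_labels)))

for seen, accuracy in history:
    print(f"{seen:5d} documents seen, validation accuracy {accuracy:.2f}")
```

If the accuracy curve flattens early, training can be stopped before the whole dataset has been processed.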

    You can also do this in parallel on several machines, each working on a partition of the data, and then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
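The averaging step might be sketched as follows. The two toy partitions are trained in one process here, standing in for separate machines, and `train_on_partition` is a hypothetical helper. Averaging is only meaningful because every worker uses the same HashingVectorizer settings, so feature indices line up across models.

```python
import copy

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2 ** 16)
classes = np.array([0, 1])


def train_on_partition(docs, labels, seed):
    """Train one linear model on a single partition (one 'machine')."""
    clf = SGDClassifier(random_state=seed)
    clf.partial_fit(vectorizer.transform(docs), labels, classes=classes)
    return clf


# Two toy partitions standing in for shards on separate machines (assumption).
partitions = [
    (["good great film", "bad awful film"] * 20, [1, 0] * 20),
    (["great wonderful plot", "awful boring plot"] * 20, [1, 0] * 20),
]
models = [train_on_partition(docs, labels, seed)
          for seed, (docs, labels) in enumerate(partitions)]

# Start from a copy of one fitted model and overwrite its parameters
# with the element-wise mean over all partition models.
final = copy.deepcopy(models[0])
final.coef_ = np.mean([m.coef_ for m in models], axis=0)
final.intercept_ = np.mean([m.intercept_ for m in models], axis=0)

predictions = final.predict(vectorizer.transform(["great film", "awful plot"]))
```

In a real setup each worker would send back only its `coef_` and `intercept_` arrays, which are small compared to the data it trained on.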

    I discuss this in this talk I gave in March 2013 at PyData: http://vimeo.com/63269736

    There is also sample code in this tutorial on parallelizing scikit-learn with IPython.parallel: https://github.com/ogrisel/parallel_ml_tutorial
