I am using the scikit-learn library for SVM. I have a huge amount of data that I can't read in all at once to pass to the fit() function.
I want to iterate over my data in chunks and fit the model incrementally. Is there a way to do this?
Support Vector Machines (at least as implemented in libsvm, which scikit-learn wraps) are fundamentally a batch algorithm: they need access to all the data in memory at once. Hence they are not scalable to data that does not fit in memory.
Instead you should use a model that supports incremental learning via the `partial_fit` method. For instance, some linear models such as `sklearn.linear_model.SGDClassifier` support `partial_fit`. You can slice your dataset and load it as a sequence of minibatches with shape `(batch_size, n_features)`. `batch_size` can be 1, but that is inefficient because of the Python interpreter overhead (plus the data-loading overhead), so it is recommended to load samples in minibatches of at least 100.
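For illustration, here is a minimal sketch of such a `partial_fit` loop. The `iter_minibatches` generator and the random data it yields are placeholders; substitute your own code that reads chunks of your dataset from disk:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Placeholder loader: replace with logic that reads your data in chunks
    # of shape (batch_size, n_features) together with the matching labels.
    def iter_minibatches(n_batches=10, batch_size=100, n_features=20):
        rng = np.random.RandomState(0)
        for _ in range(n_batches):
            X = rng.randn(batch_size, n_features)
            y = rng.randint(0, 2, size=batch_size)
            yield X, y

    clf = SGDClassifier(loss="hinge")   # hinge loss gives a linear SVM-like model
    all_classes = np.array([0, 1])      # partial_fit needs the full set of classes up front

    for X_batch, y_batch in iter_minibatches():
        clf.partial_fit(X_batch, y_batch, classes=all_classes)

Note that the `classes` argument must list every class that can appear in the data, because any single minibatch may not contain all of them.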