I have been blowing my brains out over the past 2-3 weeks on this problem. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels.<
My argument for scalability is that instead of using OneVsRest which is just a simplest of simplest baselines, you should use a more advanced ensemble of problem-transformation methods. In my paper I provide a scheme for dividing label space into subspaces and transforming the subproblems into multi-class single-label classifications using Label Powerset. To try this, just use the following code that utilizes a multi-label library built on top of scikit-learn - scikit-multilearn:
from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import IGraphLabelCooccurenceClusterer
from skmultilearn.problem_transform import LabelPowerset
from sklearn.linear_model import SGDClassifier
# base multi-class classifier SGD
base_classifier = SGDClassifier(loss='log', penalty='l2', n_jobs=-1)
# problem transformation from multi-label to single-label multi-class
transformation_classifier = LabelPowerset(base_classifier)
# clusterer dividing the label space using fast greedy modularity maximizing scheme
clusterer = IGraphLabelCooccurenceClusterer('fastgreedy', weighted=True, include_self_edges=True)
# ensemble
clf = LabelSpacePartitioningClassifier(transformation_classifier, clusterer)
clf.fit(x_train, y_train)
prediction = clf.predict(x_test)