Scalable or online out-of-core multi-label classifiers

無奈伤痛 2021-02-04 08:39

I have been blowing my brains out over the past 2-3 weeks on this problem. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels.

4 Answers
  • 2021-02-04 09:09

    I would do the multi-label part by hand. The OneVsRestClassifier treats the labels as independent problems anyhow. You can just create n_labels classifiers and then call partial_fit on each of them. You can't use a pipeline if you only want to hash once (which I would advise), though. I'm not sure about speeding up HashingVectorizer. You gotta ask @Larsmans and @ogrisel for that ;)

    Having partial_fit on OneVsRestClassifier would be a nice addition, and I don't see a particular problem with it, actually. You could also try to implement that yourself and send a PR.
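
    A minimal sketch of the "by hand" approach described above, assuming a hypothetical toy stream of mini-batches (the documents, the three labels and the `predict` helper are made up for illustration). Each batch is hashed once and the same matrix is fed to one SGDClassifier per label:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy stream: (documents, label-indicator matrix) mini-batches.
batches = [
    (["cheap flights to rome", "python pandas tutorial"],
     np.array([[1, 0, 0], [0, 1, 0]])),
    (["rome travel guide in python", "stock market news"],
     np.array([[1, 1, 0], [0, 0, 1]])),
]
n_labels = 3

# HashingVectorizer is stateless, so it needs no fit and works out-of-core.
vectorizer = HashingVectorizer(n_features=2**18)

# One independent binary classifier per label.
classifiers = [SGDClassifier(random_state=0) for _ in range(n_labels)]

for docs, Y in batches:
    X = vectorizer.transform(docs)  # hash once, reuse for every label
    for j, clf in enumerate(classifiers):
        # classes is required on the first call because a batch may
        # contain only negatives (or only positives) for a given label.
        clf.partial_fit(X, Y[:, j], classes=[0, 1])


def predict(docs):
    """Return an (n_docs, n_labels) 0/1 indicator matrix."""
    X = vectorizer.transform(docs)
    return np.column_stack([clf.predict(X) for clf in classifiers])


pred = predict(["weekend trip to rome"])
```

    The inner loop over labels is where the cost goes; the hashing itself happens only once per batch.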

  • 2021-02-04 09:10

    My argument for scalability is that instead of using OneVsRest, which is just the simplest of baselines, you should use a more advanced ensemble of problem-transformation methods. In my paper I provide a scheme for dividing the label space into subspaces and transforming the subproblems into multi-class single-label classification problems using Label Powerset. To try this, just use the following code, which relies on scikit-multilearn, a multi-label library built on top of scikit-learn:

    from skmultilearn.ensemble import LabelSpacePartitioningClassifier
    from skmultilearn.cluster import IGraphLabelCooccurenceClusterer
    from skmultilearn.problem_transform import LabelPowerset
    
    from sklearn.linear_model import SGDClassifier
    
    # base multi-class classifier: SGD with logistic loss
    # (scikit-learn 1.1+ names this loss 'log_loss'; 'log' was removed in 1.3)
    base_classifier = SGDClassifier(loss='log', penalty='l2', n_jobs=-1)
    
    # problem transformation from multi-label to single-label multi-class
    transformation_classifier = LabelPowerset(base_classifier)
    
    # clusterer dividing the label space using fast greedy modularity maximizing scheme
    clusterer = IGraphLabelCooccurenceClusterer('fastgreedy', weighted=True, include_self_edges=True) 
    
    # ensemble
    clf = LabelSpacePartitioningClassifier(transformation_classifier, clusterer)
    
    clf.fit(x_train, y_train)
    prediction = clf.predict(x_test)
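
    To make the Label Powerset step concrete, here is a plain-Python sketch of the transformation itself. It is not scikit-multilearn's implementation, and the helper names `label_powerset_encode` and `label_powerset_decode` are hypothetical; the idea is only that every distinct combination of labels becomes one class of a single-label problem:

```python
def label_powerset_encode(Y):
    """Map each distinct label combination (row of Y) to one class id."""
    combo_to_class = {}
    classes = []
    for row in Y:
        combo = tuple(row)
        if combo not in combo_to_class:
            # First time this combination is seen: give it the next id.
            combo_to_class[combo] = len(combo_to_class)
        classes.append(combo_to_class[combo])
    return classes, combo_to_class


def label_powerset_decode(classes, combo_to_class):
    """Turn predicted class ids back into label-indicator rows."""
    class_to_combo = {c: combo for combo, c in combo_to_class.items()}
    return [list(class_to_combo[c]) for c in classes]


# Toy indicator matrix: 4 samples, 3 labels.
Y = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]]
classes, mapping = label_powerset_encode(Y)
decoded = label_powerset_decode(classes, mapping)
```

    With L labels there can be up to 2^L distinct combinations, which is why it helps to split the label space into smaller subspaces before applying the transformation.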
    
  • 2021-02-04 09:27
    1. The algorithm that OneVsRestClassifier implements is very simple: it just fits K binary classifiers when there are K classes. You can do this in your own code instead of relying on OneVsRestClassifier. You can also do this on at most K cores in parallel: just run K processes. If you have more classes than processors in your machine, you can schedule training with a tool such as GNU parallel.
    2. Multi-core support in scikit-learn is a work in progress; fine-grained parallel programming in Python is quite tricky. There are potential optimizations for HashingVectorizer, but I (one of the hashing code's authors) haven't gotten around to them yet.
    3. If you follow my (and Andreas') advice to do your own one-vs-rest, this shouldn't be a problem anymore.
    4. The trick in (1.) applies to any classification algorithm.
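
    Point (1.) can be sketched as follows. This is only an illustration: it uses joblib to run the per-label jobs instead of separate OS processes or GNU parallel, and the toy data and the `fit_one_label` helper are hypothetical:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import SGDClassifier


def fit_one_label(X, y_column):
    """Train the binary 'this label vs. rest' classifier for one label."""
    return SGDClassifier(random_state=0).fit(X, y_column)


# Hypothetical toy data: 200 samples, 20 features, K = 4 labels.
rng = np.random.RandomState(0)
X = rng.randn(200, 20)
Y = (X[:, :4] > 0).astype(int)  # label k depends only on feature k

# One independent job per label; n_jobs caps how many run at once.
models = Parallel(n_jobs=2)(
    delayed(fit_one_label)(X, Y[:, k]) for k in range(Y.shape[1])
)

# Stack the K binary predictions back into an indicator matrix.
predictions = np.column_stack([m.predict(X) for m in models])
```

    Because the K problems share nothing but the input matrix, any other estimator can be dropped into `fit_one_label` in place of SGDClassifier.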

    As for the number of features, it depends on the problem, but for large-scale text classification 2^10 = 1024 seems very small. I'd try something between 2^18 and 2^22. If you train a model with an L1 penalty, you can call sparsify on the trained model to convert its weight matrix to a more space-efficient format.
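
    A small sketch of those two suggestions together, on a hypothetical four-document toy corpus (the documents, the labels and the choice of 2^20 features are only for illustration):

```python
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy corpus with one binary label.
docs = [
    "cheap flights to rome",
    "hotel deals in paris",
    "python pandas tutorial",
    "numpy array indexing",
]
labels = [0, 0, 1, 1]

# A hash space within the suggested 2^18 to 2^22 range.
vectorizer = HashingVectorizer(n_features=2**20)
X = vectorizer.transform(docs)  # sparse matrix, 2**20 columns

# The L1 penalty drives most weights to exactly zero ...
clf = SGDClassifier(penalty="l1", random_state=0).fit(X, labels)

# ... so the dense coef_ array can be swapped for a scipy.sparse matrix.
clf.sparsify()
coef_is_sparse = sparse.issparse(clf.coef_)

# Prediction still works after the conversion.
pred = clf.predict(X)
```

    The saving matters most when you keep one such model per label, since each dense weight vector would otherwise hold 2^20 floats.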

  • 2021-02-04 09:31

    A partial_fit() method for OneVsRestClassifier was recently added to sklearn, so hopefully it will be available in the upcoming release (it's in the master branch already).

    The size of your problem makes it attractive to tackle with neural networks. Have a look at magpie; I would expect it to give much better results than linear classifiers.
