Scalable or online out-of-core multi-label classifiers

無奈伤痛 2021-02-04 08:39

I have been blowing my brains out over the past 2-3 weeks on this problem. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels.

4 answers
  •  谎友^ (original poster)
    2021-02-04 09:27

    1. The algorithm that OneVsRestClassifier implements is very simple: it just fits K binary classifiers when there are K classes. You can do this in your own code instead of relying on OneVsRestClassifier. You can also do this on at most K cores in parallel: just run K processes. If you have more classes than processors in your machine, you can schedule training with a tool such as GNU parallel.
    2. Multi-core support in scikit-learn is a work in progress; fine-grained parallel programming in Python is quite tricky. There are potential optimizations for HashingVectorizer, but I (one of the hashing code's authors) haven't gotten around to them yet.
    3. If you follow my (and Andreas') advice to do your own one-vs-rest, this shouldn't be a problem anymore.
    4. The trick in (1) applies to any classification algorithm.
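
A minimal sketch of the do-it-yourself one-vs-rest from (1), assuming scikit-learn's SGDClassifier as the binary learner and its partial_fit method for out-of-core training. The toy documents, the two label names, and the helper names train_on_batch and predict are made up for illustration; they are not from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy data standing in for a stream of documents (hypothetical).
docs = [
    "python pandas dataframe merge",
    "python numpy array broadcasting",
    "java spring dependency injection",
    "java python interop jni",
    "numpy pandas python performance",
    "spring java rest controller",
]
labels = ["python", "java"]
# Binary indicator matrix: one column per label; a row may contain several 1s.
Y = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [1, 1],
    [1, 0],
    [0, 1],
])

# Stateless vectorizer, so it never needs to see the whole corpus.
vectorizer = HashingVectorizer(n_features=2**18)

# One independent binary classifier per label (K classifiers for K labels).
classifiers = [SGDClassifier(random_state=0) for _ in labels]


def train_on_batch(batch_docs, batch_Y):
    """Update every per-label classifier on one mini-batch."""
    X = vectorizer.transform(batch_docs)
    for k, clf in enumerate(classifiers):
        # classes must be given because a batch may contain only one class.
        clf.partial_fit(X, batch_Y[:, k], classes=np.array([0, 1]))


def predict(new_docs):
    """Return an indicator matrix with one column per label."""
    X = vectorizer.transform(new_docs)
    return np.column_stack([clf.predict(X) for clf in classifiers])


# Several passes over the data in mini-batches of two documents.
for epoch in range(20):
    for start in range(0, len(docs), 2):
        train_on_batch(docs[start:start + 2], Y[start:start + 2])

pred = predict(docs)
```

The K classifiers never share state, so the loop over labels could be split across K processes (or scheduled with GNU parallel) without changing the result for any single label.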

    As for the number of features, it depends on the problem, but for large-scale text classification 2^10 = 1024 seems very small. I'd try something between 2^18 and 2^22. If you train a model with an L1 penalty, you can call sparsify on the trained model to convert its weight matrix to a more space-efficient format.
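
As a rough illustration of the two suggestions above, the sketch below hashes text into 2^18 features and calls sparsify on a linear model trained with an L1 penalty. SGDClassifier is an assumed choice of model here, and the toy documents and the alpha value are placeholders, not tuned settings.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy single-label data (hypothetical); 1 = about Python, 0 = not.
docs = [
    "python pandas dataframe merge",
    "python numpy array broadcasting",
    "java spring dependency injection",
    "java python interop jni",
    "numpy pandas python performance",
    "spring java rest controller",
]
y = [1, 1, 0, 1, 1, 0]

# 2**18 hashed features instead of 2**10.
vectorizer = HashingVectorizer(n_features=2**18)
X = vectorizer.transform(docs)

# The L1 penalty pushes many weights to exactly zero.
clf = SGDClassifier(penalty="l1", alpha=1e-4, random_state=0)
clf.fit(X, y)

# Convert the dense coef_ array to a scipy.sparse matrix in place.
clf.sparsify()

print("coef_ shape:", clf.coef_.shape)
print("non-zero weights:", clf.coef_.nnz)
```

The saving only pays off when most weights really are zero; with few zeros the sparse format can end up larger than the dense array.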
