Scalable or online out-of-core multi-label classifiers


無奈伤痛 2021-02-04 08:39

I have been blowing my brains out over the past 2-3 weeks on this problem. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels.

4 answers

谎友^ (original poster)

2021-02-04 09:27
1. The algorithm that OneVsRestClassifier implements is very simple: it just fits K binary classifiers when there are K classes. You can do this in your own code instead of relying on OneVsRestClassifier. You can also do this on at most K cores in parallel: just run K processes. If you have more classes than processors in your machine, you can schedule training with a tool such as GNU parallel.
2. Multi-core support in scikit-learn is a work in progress; fine-grained parallel programming in Python is quite tricky. There are potential optimizations for HashingVectorizer, but I (one of the hashing code's authors) haven't gotten around to them yet.
3. If you follow my (and Andreas') advice to do your own one-vs-rest, this shouldn't be a problem anymore.
4. The trick in (1.) applies to any classification algorithm.
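To make point 1 concrete, here is a minimal sketch of a hand-rolled one-vs-rest loop that trains out of core. It assumes SGDClassifier as the binary learner (any estimator with partial_fit would do) and uses made-up toy data; the names docs, label_sets, iter_batches and predict_labels are hypothetical and only there for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data: each document carries a *set* of labels.
docs = [
    "python pandas dataframe merge",
    "python numpy array broadcasting",
    "java spring dependency injection",
    "java python interop with jython",
]
label_sets = [{"python"}, {"python"}, {"java"}, {"java", "python"}]
labels = ["java", "python"]

# HashingVectorizer is stateless, so each mini-batch can be transformed
# on its own without a fitted vocabulary.
vectorizer = HashingVectorizer(n_features=2**18)

# One independent binary classifier per label.
models = {label: SGDClassifier(random_state=0) for label in labels}


def iter_batches(docs, label_sets, batch_size):
    """Yield mini-batches; in a real setup this would stream from disk."""
    for start in range(0, len(docs), batch_size):
        stop = start + batch_size
        yield docs[start:stop], label_sets[start:stop]


for epoch in range(20):
    for batch_docs, batch_labels in iter_batches(docs, label_sets, batch_size=2):
        X = vectorizer.transform(batch_docs)
        for label, model in models.items():
            # Binary target: does this sample carry the label or not?
            y = np.array([int(label in s) for s in batch_labels])
            model.partial_fit(X, y, classes=np.array([0, 1]))


def predict_labels(text):
    """Return every label whose binary classifier fires on the text."""
    X = vectorizer.transform([text])
    return {label for label, model in models.items() if model.predict(X)[0] == 1}


print(predict_labels("python numpy"))
```

Since the per-label models never talk to each other, the inner loop over models is the part you could split across processes, one label (or a group of labels) per process.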
As for the number of features, it depends on the problem, but for large-scale text classification 2^10 = 1024 seems very small. I'd try something around 2^18 to 2^22. If you train a model with an L1 penalty, you can call sparsify on the trained model to convert its weight matrix to a more space-efficient format.
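A rough sketch of the larger hash space combined with sparsify, again on made-up toy data. The choice of 2**20 features and of SGDClassifier with penalty="l1" is just one assumption within the range suggested above, not a tuned setting.

```python
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data for a single binary label.
docs = [
    "python pandas dataframe merge",
    "python numpy array broadcasting",
    "java spring dependency injection",
    "java maven build lifecycle",
]
y = [1, 1, 0, 0]

# 2**20 hash buckets instead of 2**10, to reduce collisions.
vectorizer = HashingVectorizer(n_features=2**20)
X = vectorizer.transform(docs)

# The L1 penalty drives most weights to exactly zero.
clf = SGDClassifier(penalty="l1", random_state=0)
clf.fit(X, y)

dense_bytes = clf.coef_.nbytes  # size of the dense weight array

# Convert coef_ in place to a scipy.sparse matrix.
clf.sparsify()

print(sparse.issparse(clf.coef_), clf.coef_.nnz, dense_bytes)

# Prediction still works on the sparsified model.
predictions = clf.predict(X)
```

One thing to watch: after sparsify, partial_fit will not work until you call densify on the model, so do this once training is finished.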