information-gain

Python Information gain implementation

瘦欲@ submitted on 2019-12-24 02:13:19
Question: I am currently using scikit-learn for text classification on the 20 Newsgroups (20ng) dataset. I want to calculate the information gain for a vectorized dataset. It has been suggested to me that this can be accomplished using mutual_info_classif from sklearn. However, that function is very slow, so I tried to implement information gain myself based on this post. I came up with the following solution:

    from scipy.stats import entropy
    import numpy as np

    def information_gain(X, y):
        def _entropy(labels):
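The excerpt above is cut off, so the poster's own solution is not reproduced here. As a reference point, the quantity being asked about can be sketched without any dependencies: information gain is the entropy of the labels minus the weighted entropy of the labels within each feature group. The binarization into "term present" and "term absent", the helper name `entropy`, and the toy data are assumptions made for illustration; they are not taken from the original question.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """H(y) minus the weighted entropy of y within each feature group.

    Assumption: the feature is binarized into "term present" (value > 0)
    and "term absent", one common choice for document-term data.
    """
    present = [y for x, y in zip(feature, labels) if x > 0]
    absent = [y for x, y in zip(feature, labels) if x <= 0]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


# Toy document-term counts (hypothetical): rows are documents, columns are terms.
X = [
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
    [0, 1, 0],
]
y = ["sci", "sci", "rec", "rec"]

# One information-gain value per term (column).
gains = [information_gain([row[j] for row in X], y) for j in range(3)]
print(gains)
```

The first two terms each separate the classes perfectly, while the third appears equally often in both classes, so it should receive no gain. A slow, transparent version like this is mainly useful as a check against a faster vectorized implementation on a small sample.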

Feature importance 'gain' in XGBoost

落爺英雄遲暮 submitted on 2019-12-14 03:56:23
Question: I want to understand how the feature importance in xgboost is calculated by 'gain'. From https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7: ‘Gain’ is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branch is
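The quoted article describes gain informally as an improvement in accuracy. In the formulation used by the XGBoost paper, the gain of a single split is the reduction in the regularized objective, computed from the sums of first-order gradients (G) and second-order gradients (H) on each side of the split; the 'gain' importance is then, as far as the library documentation describes it, the average of this value over all splits that use the feature. The sketch below only computes the per-split quantity. The function names, the squared-error setup, and the toy numbers are assumptions for illustration, not XGBoost's internal code.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(grads, hessians, goes_left, reg_lambda=1.0, gamma=0.0):
    """Objective reduction from splitting one node into left and right children.

    gain = 0.5 * [score(left) + score(right) - score(parent)] - gamma
    """
    g_left = sum(g for g, left in zip(grads, goes_left) if left)
    h_left = sum(h for h, left in zip(hessians, goes_left) if left)
    g_right = sum(grads) - g_left
    h_right = sum(hessians) - h_left
    children = leaf_score(g_left, h_left, reg_lambda) \
        + leaf_score(g_right, h_right, reg_lambda)
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    return 0.5 * (children - parent) - gamma


# Hypothetical node with four samples under squared-error loss:
# labels [1, 1, 0, 0], current prediction 0.5, so grad = pred - label, hess = 1.
grads = [-0.5, -0.5, 0.5, 0.5]
hessians = [1.0, 1.0, 1.0, 1.0]

# A split that separates the two label groups versus one that mixes them.
informative = split_gain(grads, hessians, [True, True, False, False])
uninformative = split_gain(grads, hessians, [True, False, True, False])
print(informative, uninformative)
```

Here the split that separates the label groups yields a positive gain, while the split that mixes them yields zero, which matches the intuition in the quote without relying on classification accuracy as the measure.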