Classification: skewed data within a class


Question


I'm trying to build a multilabel classifier to predict, for some input data, the probability of each label being 0 or 1. I'm using a neural network with TensorFlow + Keras (maybe a CNN later).

The problem is the following: the data is highly skewed. There are many more negative examples than positive ones, maybe 90:10. So my neural network nearly always outputs very low probabilities for positive examples. Thresholded to binary output, it would predict 0 in most cases.

The accuracy is > 95% for nearly all classes, but this is only because the network nearly always predicts zero. As a result, the number of false negatives is very high.

Any suggestions on how to fix this?

Here are the ideas I considered so far:

  1. Penalizing false negatives more heavily with a customized loss function (my first attempt failed), i.e. weighting positive examples within a class more than negative ones; similar to class weights, but applied within a class. How would you implement this in Keras?

  2. Oversampling positive examples by cloning them so that positive and negative examples are balanced, and then fitting the neural network on the balanced set (rough sketch of what I mean below).
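
A rough sketch of what I mean by cloning, assuming the data lives in a pandas DataFrame with a binary 'label' column (the column name and toy data are made up):

import pandas as pd

# Toy stand-in for the real data: features plus a binary 'label' column.
df = pd.DataFrame({'feature': range(10),
                   'label': [0] * 9 + [1]})
positives = df[df['label'] == 1]
negatives = df[df['label'] == 0]
# Clone positive rows (sampling with replacement) until classes are balanced.
oversampled = positives.sample(n=len(negatives), replace=True, random_state=0)
df_balanced = pd.concat([negatives, oversampled]).sample(frac=1, random_state=0)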

Thanks in advance!


Answer 1:


You're on the right track.

Usually, you would balance your data set before training, i.e. either reduce the over-represented class or generate artificial (augmented) data for the under-represented class to boost its occurrence.

  1. Reduce the over-represented class. This one is simpler: you just randomly pick as many samples as there are in the under-represented class, discard the rest, and train with the new subset. The disadvantage, of course, is that you lose some learning potential, depending on how complex your task is (how many features it has).

  2. Augment data. Depending on the kind of data you're working with, you can "augment" it. That just means taking existing samples from your data, slightly modifying them, and using them as additional samples. This works very well with image and sound data: you could flip/rotate, scale, crop, add noise, or in-/decrease brightness, etc. The important thing is to stay within the bounds of what could happen in the real world. If, for example, you want to recognize a "70 mph speed limit" sign, flipping it doesn't make sense; you will never encounter an actual mirrored 70 mph sign. If you want to recognize a flower, flipping or rotating it is permissible. The same goes for sound: slightly changing volume/frequency won't matter much, but reversing an audio track changes its "meaning", and you won't have to recognize backwards-spoken words in the real world.
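
For image data, a minimal sketch using Keras's ImageDataGenerator (the parameter values are illustrative, not tuned):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Only transformations that could plausibly occur in the real world.
datagen = ImageDataGenerator(rotation_range=15,            # slight rotations
                             width_shift_range=0.1,        # small translations
                             height_shift_range=0.1,
                             zoom_range=0.1,               # mild scaling
                             brightness_range=(0.7, 1.3),  # lighting changes
                             horizontal_flip=True)         # only if mirroring is valid!

# Then train on the augmented stream, e.g.:
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=10)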

Now, if you have to augment tabular data like sales data, metadata, etc., that's much trickier, as you have to be careful not to implicitly feed your own assumptions into the model.




Answer 2:


I think your two suggestions are already quite good. You can also simply undersample the negative class, of course:

import numpy as np

def balance_occurences(dataframe, zielspalte, faktor=1):
    # zielspalte = name of the target column, faktor = how many samples to
    # keep per class, as a multiple of the rarest class's size.
    least_frequent_observation = dataframe[zielspalte].value_counts().idxmin()
    bottleneck = (dataframe[zielspalte] == least_frequent_observation).sum()
    # Keep every sample of the rarest class ...
    balanced_indices = dataframe.index[dataframe[zielspalte] == least_frequent_observation].tolist()
    # ... plus a random subset of size bottleneck*faktor from every other class.
    for value in set(dataframe[zielspalte]) - {least_frequent_observation}:
        full_list = dataframe.index[dataframe[zielspalte] == value].tolist()
        selection = np.random.choice(a=full_list, size=bottleneck * faktor, replace=False)
        balanced_indices = np.append(balanced_indices, selection)
    df_balanced = dataframe[dataframe.index.isin(balanced_indices)]
    return df_balanced
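
Assuming a DataFrame df with a target column named 'label' (names made up), usage would look like:

df_balanced = balance_occurences(df, zielspalte='label', faktor=1)

With faktor=1, every class is cut down to the size of the rarest class.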

Your loss function could also take the recall of the positive class into account, combined with some other measure.
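
For idea 1 from the question (penalizing false negatives more), a minimal sketch of a positively-weighted binary cross-entropy in Keras; the function name and the 9.0 weight are assumptions matching the 90:10 split, not an official API:

import tensorflow.keras.backend as K

def weighted_binary_crossentropy(pos_weight):
    # Multiplies the positive term of the cross-entropy by pos_weight, so a
    # missed positive example (false negative) costs pos_weight times more.
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        bce = -(pos_weight * y_true * K.log(y_pred)
                + (1.0 - y_true) * K.log(1.0 - y_pred))
        return K.mean(bce)
    return loss

# With a roughly 90:10 split, weighting positives ~9x is a starting point:
# model.compile(optimizer='adam', loss=weighted_binary_crossentropy(9.0))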



Source: https://stackoverflow.com/questions/48880273/classification-skewed-data-within-a-class
