Subsampling an unbalanced dataset in tensorflow

跟風遠走 提交于 2021-01-28 10:55:45

问题


Tensorflow beginner here. This is my first project and I am working with pre-defined estimators.

I have an extremely unbalanced dataset where positive outcomes represent roughly 0.1% of the total data and I suspect this imbalance to considerably affect the performance of my model. As a first attempt to solve the issue, since I have tons of data, I would like to throw away most of my negatives in order to create a balanced dataset. I can see two ways of doing it: preprocess the data to keep only a thousandth of the negatives then save it in a new file before passing it to tensorflow, for example with pyspark; and asking tensorflow to use only one negative out of a thousand it finds.

I tried to code this last idea but didn't manage. I modified my input function to read like

def train_input_fn(data_file="../data/train_input.csv", shuffle_size=100_000, batch_size=128):
    """Generate an input function for the Estimator."""

    dataset = tf.data.TextLineDataset(data_file)  # Extract lines from input files using the Dataset API.
    dataset = dataset.map(parse_csv, num_parallel_calls=3)
    dataset = dataset.shuffle(shuffle_size).repeat().batch(batch_size)

    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()

    # TRY TO IMPLEMENT THE SELECTION OF NEGATIVES
    thrown = 0
    flag = np.random.randint(1000)
    while labels == 0 and flag != 0:
        features, labels = iterator.get_next()
        thrown += 1
        flag = np.random.randint(1000)
    print("I've thrown away {} negative examples before going for label {}!".format(thrown, labels))
    return features, labels

This, of course, doesn't work because iterators don't know what's inside them, so the labels==0 condition is never satisfied. Also, there is only one print in the stdout, meaning that this function is only called once (and meaning that I still don't understand how tensorflow really works). Anyways, is there a way to implement what I want?

PS: I suspect that the previous code, even if it worked as intended, would return less than a thousandth of the initial negatives due to the count restarting every time it finds a positive. This is a minor issue, and so far I could even find a magic number inside the flag that gives me the expected result without worrying too much about the mathematical beauty of it.


回答1:


You will probably get better results by oversampling your under-represented class rather than throwing away data in your over-represented class. This way you keep the variance in the over-represented class. You might as well use the data you have.

The easiest way to achieve this is probably to create two Datasets, one for each class. Then you can use Dataset.interleave to sample equally from both datasets.

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#interleave



来源:https://stackoverflow.com/questions/49735127/subsampling-an-unbalanced-dataset-in-tensorflow

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!