Balancing an imbalanced dataset with Keras image generator

南旧 2021-02-01 15:55

The Keras ImageDataGenerator can be used to "Generate batches of tensor image data with real-time data augmentation".

The tutorial he

1 answer
  • 2021-02-01 16:43

    This would not be a standard approach to deal with unbalanced data. Nor do I think it would be really justified - you would be significantly changing the distributions of your classes, where the smaller class is now much less variable. The larger class would have rich variation, the smaller would be many similar images with small affine transforms. They would live on a much smaller region in image space than the majority class.

    The more standard approaches would be:

    • the class_weight argument in model.fit, which you can use to make the model learn more from the minority class.
    • reducing the size of the majority class.
    • accepting the imbalance. Deep learning can cope with this; it just needs lots more data (the solution to everything, really).
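
    As a rough illustration of the first two options, here is a minimal sketch in plain Python. The helper names (balanced_class_weight, undersample_indices) and the 90/10 label split are made up for the example, and the n_samples / (n_classes * class_count) weighting is just one common heuristic, not the only choice. The model.fit call is left as a comment because model, x_train and y_train are hypothetical.

```python
import random
from collections import Counter


def balanced_class_weight(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


def undersample_indices(labels, seed=0):
    """Return sorted indices keeping as many samples per class as the rarest class has."""
    rng = random.Random(seed)
    by_class = {}
    for i, cls in enumerate(labels):
        by_class.setdefault(cls, []).append(i)
    keep = min(len(idx) for idx in by_class.values())
    chosen = []
    for idx in by_class.values():
        chosen.extend(rng.sample(idx, keep))
    return sorted(chosen)


# Hypothetical integer labels: 90 majority samples, 10 minority samples.
labels = [0] * 90 + [1] * 10

# Option 1: a dict mapping class index to weight (minority class gets 5.0 here).
class_weight = balanced_class_weight(labels)

# Option 2: indices of a balanced subset (10 samples from each class).
kept = undersample_indices(labels)

# With a compiled Keras model (hypothetical model, x_train, y_train):
# model.fit(x_train, y_train, class_weight=class_weight)
```

    The weights only rescale each sample's contribution to the loss, so they do nothing about the low variability of the minority class.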

    The first two options are really kind of hacks, which may harm your ability to cope with real-world (imbalanced) data. Neither really solves the problem of low variability, which is inherent in having too little data. If applying the model to a real-world dataset after training isn't a concern and you just want good results on the data you have, then these options are fine (and much easier than making generators for a single class).

    The third option is the right way to go if you have enough data (as an example, the recent paper from Google about detecting diabetic retinopathy achieved high accuracy in a dataset where positive cases were between 10% and 30%).

    If you truly want to generate a variety of augmented images for one class over another, it would probably be easiest to do it in pre-processing. Take the images of the minority class, generate some augmented versions, and just call it all part of your data. As I said, this is all pretty hacky.
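
    A minimal sketch of that pre-processing step, assuming images are held in memory as nested lists. The function names are made up for the example, and hflip is a toy stand-in for a real augmentation. In a Keras setup the augment callable could instead wrap something like ImageDataGenerator.random_transform.

```python
import random
from collections import Counter


def hflip(image, rng):
    """Toy augmentation: mirror a nested-list image left to right (rng is unused)."""
    return [row[::-1] for row in image]


def oversample_with_augmentation(images, labels, augment, seed=0):
    """Pad every class up to the size of the largest class with augmented copies."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_images, out_labels = list(images), list(labels)
    for cls, n in counts.items():
        # Originals of this class, used as the source for augmented copies.
        pool = [img for img, lab in zip(images, labels) if lab == cls]
        for _ in range(target - n):
            out_images.append(augment(rng.choice(pool), rng))
            out_labels.append(cls)
    return out_images, out_labels


# Hypothetical data: three 1x2 "images" of class 0 and one of class 1.
images = [[[0, 1]], [[2, 3]], [[4, 5]], [[6, 7]]]
labels = [0, 0, 0, 1]

aug_images, aug_labels = oversample_with_augmentation(images, labels, hflip)
# aug_images now holds the originals plus two flipped copies of the class-1 image.
```

    The result is a class-balanced list, but the added samples are still just transforms of the same few originals, which is the variability problem described above.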
