Question
I want to construct a Keras model for a dataset with a main target and an auxiliary target. I have data for the auxiliary target for all entries in my dataset, but for the main target I have data only for a subset of all data points. Consider the following example, which is supposed to predict max(min(x1, x2), x3), but for some values is only given my auxiliary target, min(x1, x2).
from keras.models import Model
from keras.optimizers import Adadelta
from keras.losses import mean_squared_error
from keras.layers import Input, Dense
import tensorflow as tf
import numpy
input = Input(shape=(3,))
hidden = Dense(2)(input)
min_pred = Dense(1)(hidden)
max_min_pred = Dense(1)(hidden)

model = Model(inputs=[input],
              outputs=[min_pred, max_min_pred])
model.compile(
    optimizer=Adadelta(),
    loss=mean_squared_error,
    loss_weights=[0.2, 1.0])

def random_values(n, missing=False):
    for i in range(n):
        x = numpy.random.random(size=(4, 3))
        _min = numpy.minimum(x[..., 0], x[..., 1])
        if missing:
            _max_min = numpy.full((len(x), 1), numpy.nan)
        else:
            _max_min = numpy.maximum(_min, x[..., 2]).reshape((-1, 1))
        yield x, [numpy.array(_min).reshape((-1, 1)), numpy.array(_max_min)]

model.fit_generator(random_values(50, False),
                    steps_per_epoch=50)
model.fit_generator(random_values(5, True),
                    steps_per_epoch=5)
model.fit_generator(random_values(50, False),
                    steps_per_epoch=50)
Obviously, the code above does not work: a target of NaN means a loss of NaN, which means a weight update of NaN, so the weights go to NaN and the model becomes useless. (Also, instantiating the entire NaN array is wasteful, but in principle my missing data can be part of any batch alongside present data, so for the sake of having homogeneous arrays it seems reasonable.)
My code does not have to work with all keras backends; tensorflow-only code is fine. I have tried changing the loss function,
def loss_0_where_nan(loss_function):
    def filtered_loss_function(y_true, y_pred):
        with_nans = loss_function(y_true, y_pred)
        nans = tf.is_nan(with_nans)
        return tf.where(nans, tf.zeros_like(with_nans), with_nans)
    return filtered_loss_function
and using loss_0_where_nan(mean_squared_error) as the new loss function, but it still introduces NaNs.
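For reference, a minimal sketch of why zeroing the loss after the fact is not enough (written against the TensorFlow 2.x eager API, which postdates the snippet above): the gradient of tf.where still flows through the NaN branch, and 0 * NaN is NaN, so the NaN reaches the weight update anyway:

```python
import tensorflow as tf

x = tf.constant([1.0, float("nan")])
with tf.GradientTape() as tape:
    tape.watch(x)
    sq = tf.square(x)  # second entry is NaN
    # zero out the NaN entries of the "loss", as loss_0_where_nan does
    masked = tf.where(tf.math.is_nan(sq), tf.zeros_like(sq), sq)
    loss = tf.reduce_sum(masked)  # the forward value is clean: 1.0
grad = tape.gradient(loss, x)
# the chain rule multiplies the masked-out gradient (0) by d(sq)/dx (NaN),
# and 0 * NaN = NaN, so grad is [2.0, nan]
```

This is why masking the loss value alone does not rescue the model; the NaN has to be kept out of the gradient path entirely.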
How should I handle missing target data for the main prediction output where I have auxiliary target data? Will masking help?
Answer 1:
In your question, you present the case where missing data comes in predictable chunks in your dataset. If you can separate out missing data and existing data like that, you can use
truncated_model = Model(inputs=[input],
                        outputs=[min_pred])
truncated_model.compile(
    optimizer=Adadelta(),
    loss=[mean_squared_error])
to define a model that shares some layers with your complete model, and then replace
model.fit_generator(random_values(5, True),
                    steps_per_epoch=5)
with
def partial_data(entry):
    x, (y0, y1) = entry
    return x, y0

truncated_model.fit_generator(map(partial_data, random_values(5, True)),
                              steps_per_epoch=5)
to train the truncated model on the non-missing data.
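A quick way to convince yourself that this works: because both models are built from the same layer objects, fitting the truncated model moves the full model's shared weights. A self-contained sketch (assuming the tf.keras API and Model.fit rather than the older fit_generator, and an arbitrary "adam"/"mse" configuration in place of the original one):

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inp = Input(shape=(3,))
hidden = Dense(2)(inp)
min_pred = Dense(1)(hidden)
max_min_pred = Dense(1)(hidden)

# both models reference the *same* layer objects
model = Model(inputs=[inp], outputs=[min_pred, max_min_pred])
truncated_model = Model(inputs=[inp], outputs=[min_pred])
truncated_model.compile(optimizer="adam", loss="mse")

before = model.get_weights()[0].copy()  # kernel of the shared hidden layer
x = np.random.random((32, 3))
y = np.minimum(x[:, 0], x[:, 1]).reshape(-1, 1)
truncated_model.fit(x, y, epochs=1, verbose=0)
after = model.get_weights()[0]

# training the truncated model updated the full model's shared weights
assert not np.allclose(before, after)
```

So alternating between model.fit_generator on complete batches and truncated_model.fit_generator on batches with missing main targets trains one set of shared weights throughout.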
Given this level of control over your input data providers, you can obviously adapt your random_values method so that it does not even generate the data that partial_data immediately throws away again, but I thought this would be the clearer way to present the necessary changes.
Source: https://stackoverflow.com/questions/52106360/handling-missing-data-for-the-main-loss-which-is-present-for-auxiliary-loss