Predictions from a model become very small. The loss is either 0 or a positive constant

Submitted by 对着背影说爱祢 on 2020-05-17 06:04:52


I am implementing the following architecture in TensorFlow.

Dual Encoder LSTM

During the first few iterations the loss stays around 0.69, but after that, as the output below shows, it alternates between -0.0 and a positive constant whose value depends on the hyperparameters, no matter how many iterations I run. This happens because my model's predictions become either very small (close to 0) or very large (close to 1), so the model cannot be trained. What could cause such extreme predictions, and what can I do to correct it?

Input_C = (160,1)

Input_R = (160,1)

Batch Size = 1

C = (Batch_Size,256)

R = (Batch_Size,256)

Below is my model along with input shapes: [image: model diagram with input shapes]

Below is sample output:

Training loss (for one batch) at step 0: 0.691542387008667
Seen so far: 1 samples
Training loss (for one batch) at step 200: 0.6671515703201294
Seen so far: 201 samples
Training loss (for one batch) at step 400: -0.0
Seen so far: 401 samples
Training loss (for one batch) at step 600: -0.0
Seen so far: 601 samples
Training loss (for one batch) at step 800: -0.0
Seen so far: 801 samples
Training loss (for one batch) at step 1000: -0.0
Seen so far: 1001 samples
Training loss (for one batch) at step 1200: -0.0
Seen so far: 1201 samples
Training loss (for one batch) at step 1400: -0.0
Seen so far: 1401 samples
Training loss (for one batch) at step 1600: 15.424948692321777
Seen so far: 1601 samples
Training loss (for one batch) at step 1800: -0.0
Seen so far: 1801 samples
Training loss (for one batch) at step 2000: 15.424948692321777
Seen so far: 2001 samples
Training loss (for one batch) at step 2200: -0.0
Seen so far: 2201 samples
Training loss (for one batch) at step 2400: -0.0
Seen so far: 2401 samples
Training loss (for one batch) at step 2600: -0.0
Seen so far: 2601 samples
Training loss (for one batch) at step 2800: -0.0
Seen so far: 2801 samples
Training loss (for one batch) at step 3000: -0.0
Seen so far: 3001 samples
Training loss (for one batch) at step 3200: 15.424948692321777
Seen so far: 3201 samples
Training loss (for one batch) at step 3400: 15.424948692321777
Seen so far: 3401 samples
Training loss (for one batch) at step 3600: -0.0
Seen so far: 3601 samples
Training loss (for one batch) at step 3800: 15.424948692321777
Seen so far: 3801 samples
Training loss (for one batch) at step 4000: 15.424948692321777
Seen so far: 4001 samples
Training loss (for one batch) at step 4200: -0.0
Seen so far: 4201 samples
Training loss (for one batch) at step 4400: 15.424948692321777
Seen so far: 4401 samples
Training loss (for one batch) at step 4600: -0.0
Seen so far: 4601 samples
Training loss (for one batch) at step 4800: 15.424948692321777
Seen so far: 4801 samples
Training loss (for one batch) at step 5000: 15.424948692321777
Seen so far: 5001 samples
Training loss (for one batch) at step 5200: -0.0
Seen so far: 5201 samples
Training loss (for one batch) at step 5400: -0.0
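
The two values the loss alternates between in the output above can be reproduced by hand. This is a minimal sketch, assuming that `binary_crossentropy` clips the probability to [eps, 1 - eps] with eps = 1e-7 and adds eps again inside each logarithm. That assumption matches the logged constant, but it is not confirmed by the post, and the helper name `clipped_bce` is made up for illustration:

```python
import math

EPS = 1e-7  # assumed clipping constant (the default Keras epsilon)

def clipped_bce(label, prob, eps=EPS):
    """Binary cross-entropy on a probability, with the assumed clipping."""
    p = min(max(prob, eps), 1.0 - eps)          # clip to [eps, 1 - eps]
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))

loss_start = clipped_bce(1, 0.50066364)     # an early, undecided prediction
loss_wrong = clipped_bce(1, 1.4857496e-12)  # saturated prediction, label 1
loss_right = clipped_bce(0, 1.0175745e-11)  # saturated prediction, label 0

print(loss_start, loss_wrong, loss_right)
```

Under that assumption, a saturated prediction with the wrong label always gives -log(2e-7), about 15.4249, and one with the right label gives essentially zero. The exact prediction value no longer matters once it falls below eps, which is why the loss looks like a constant.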

Below are the prediction values, sigmoid(CMR). You can see that they suddenly collapse towards zero after a few iterations.

Prediction : tf.Tensor([[0.50066364]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.49867386]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.49919522]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.4999423]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.49848711]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.499426]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.49959162]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.49965566]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.50021386]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.4996987]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.49993336]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.49861637]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.50016826]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.49728978]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.49540216]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.49112904]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.49182785]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.44881523]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[0.01220286]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[9.062928e-13]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[1.7185716e-13]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[1.3001763e-13]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[6.934234e-14]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[1.2812477e-13]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[1.1744075e-13]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[1.306665e-13]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[6.9167836e-14]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[1.2757072e-13]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[1.403139e-13]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[6.9167836e-14]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[1.3142985e-13]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[6.916903e-14]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[1.3480556e-13]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[1.2885927e-13]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[6.9167836e-14]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[1.2939568e-13]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[6.9167836e-14]], shape=(1, 1), dtype=float32)
Prediction : tf.Tensor([[6.916797e-14]], shape=(1, 1), dtype=float32)
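
To get a sense of how far the pre-sigmoid score CMR has drifted, the logged probabilities can be inverted. A small sketch follows; the function names `logit` and `sigmoid_slope` are introduced here for illustration, and the two values are copied from the output above:

```python
import math

def logit(p):
    """Inverse of the sigmoid: the raw score z such that sigmoid(z) == p."""
    return math.log(p / (1.0 - p))

def sigmoid_slope(p):
    """Derivative of the sigmoid with respect to its input, expressed via p."""
    return p * (1.0 - p)

early = 0.49182785    # prediction shortly before the collapse
late = 6.9167836e-14  # prediction after the collapse

for p in (early, late):
    print(f"p={p:.3e}  score={logit(p):8.3f}  sigmoid slope={sigmoid_slope(p):.3e}")
```

The collapsed prediction corresponds to a raw score of roughly -30, where the slope of the sigmoid is on the order of 1e-14, so very little gradient can pass back through the sigmoid once the model reaches this state.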

Below are the predictions (sigmoid(CMR)), labels and loss values printed to the console:

Prediction : tf.Tensor([[1.4857496e-12]], shape=(1, 1), dtype=float32)
Label : tf.Tensor([1], shape=(1,), dtype=int64)
Loss : tf.Tensor([15.424949], shape=(1,), dtype=float32)
Prediction : tf.Tensor([[1.0175745e-11]], shape=(1, 1), dtype=float32)
Label : tf.Tensor([0], shape=(1,), dtype=int64)
Loss : tf.Tensor([-0.], shape=(1,), dtype=float32)
Prediction : tf.Tensor([[1.9670995e-10]], shape=(1, 1), dtype=float32)
Label : tf.Tensor([0], shape=(1,), dtype=int64)
Loss : tf.Tensor([-0.], shape=(1,), dtype=float32)
Prediction : tf.Tensor([[1.7731953e-10]], shape=(1, 1), dtype=float32)
Label : tf.Tensor([1], shape=(1,), dtype=int64)
Loss : tf.Tensor([15.424949], shape=(1,), dtype=float32)
Prediction : tf.Tensor([[1.986521e-10]], shape=(1, 1), dtype=float32)
Label : tf.Tensor([0], shape=(1,), dtype=int64)
Loss : tf.Tensor([-0.], shape=(1,), dtype=float32)
Prediction : tf.Tensor([[1.6696887e-13]], shape=(1, 1), dtype=float32)
Label : tf.Tensor([1], shape=(1,), dtype=int64)
Loss : tf.Tensor([15.424949], shape=(1,), dtype=float32)
Prediction : tf.Tensor([[1.9859603e-10]], shape=(1, 1), dtype=float32)
Label : tf.Tensor([0], shape=(1,), dtype=int64)
Loss : tf.Tensor([-0.], shape=(1,), dtype=float32)
Prediction : tf.Tensor([[1.9074237e-12]], shape=(1, 1), dtype=float32)
Label : tf.Tensor([1], shape=(1,), dtype=int64)
Loss : tf.Tensor([15.424949], shape=(1,), dtype=float32)
Prediction : tf.Tensor([[1.9804261e-10]], shape=(1, 1), dtype=float32)
Label : tf.Tensor([0], shape=(1,), dtype=int64)
Loss : tf.Tensor([-0.], shape=(1,), dtype=float32)
Prediction : tf.Tensor([[1.9823462e-10]], shape=(1, 1), dtype=float32)
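
One option often suggested for this kind of saturation is to compute the loss from the raw score instead of from the sigmoid output. It is offered here only as a sketch, not as a confirmed fix for this model. The formula below is the one documented for `tf.nn.sigmoid_cross_entropy_with_logits`; the function name `bce_from_score` is hypothetical, and the score of -30 is an illustrative value in the range implied by the predictions above:

```python
import math

def bce_from_score(label, z):
    """Cross-entropy from the pre-sigmoid score z:
    max(z, 0) - z * label + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * label + math.log1p(math.exp(-abs(z)))

z = -30.0  # illustrative raw score for a collapsed prediction

print("label 1:", bce_from_score(1, z))    # confidently wrong
print("label 0:", bce_from_score(0, z))    # confidently right
print("z = 0  :", bce_from_score(1, 0.0))  # undecided prediction
```

In this form the loss keeps growing with the size of the error instead of flattening at a clipped constant, and its derivative with respect to the score is sigmoid(z) - label, which stays close to -1 for a confidently wrong prediction. In Keras the corresponding change would be to output the raw projection and pass `from_logits=True` to the binary cross-entropy loss.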

Below is my code.

    encoder = Sequential()
    encoder.add(Embedding(input_dim = MAX_NB_WORDS,output_dim = EMBEDDING_DIM,input_length = MAX_SENTENCE_LENGTH))
    encoder.add(LSTM(units = 256))

    # Create tensors for Context and Utterance
    context_input = Input(shape=(MAX_SENTENCE_LENGTH,),dtype='float32')
    utterance_input = Input(shape=(MAX_SENTENCE_LENGTH,),dtype='float32')

    # Encode Context and Utterance through LSTM
    encoded_context = encoder(context_input)            # Shape = (None,256)
    encoded_utterance = encoder(utterance_input)        # Actual response encoding (None,256) --> Need to take its transpose to make dimensions add up

    """Use Custom layer to make GradientTape work"""
    custom_layer = CustomLayer(256,256)
    generated_response = custom_layer(encoded_context)

    projection = tf.linalg.matmul(generated_response,tf.transpose(encoded_utterance))
    probability = tf.math.sigmoid(projection)

    dual_encoder = Model(inputs=[context_input,utterance_input],outputs = probability)
    print("Trainable variables :",dual_encoder.trainable_weights)
    plot_model(dual_encoder, os.path.join(OUTPUT_PATH,'my_first_model.png'),show_shapes = True)

    #dual_encoder.compile(loss = 'binary_crossentropy', optimizer = 'rmsprop',metrics=['accuracy'])
    print("Summary of Dual Encoder LSTM :")
    dual_encoder.summary()  # summary() prints the table itself and returns None

    def create_batched_dataset(data_path):
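        # NOTE: the right-hand sides of the next two assignments were truncated in the
        # original post. The calls below are an assumed reconstruction, and
        # `parse_example` is a hypothetical name for the record-parsing function.
        tfrecord_dataset = tf.data.TFRecordDataset(os.path.join(data_path,"train.tfrecords"))
        parsed_dataset = tfrecord_dataset.map(parse_example,num_parallel_calls = 8)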
        parsed_dataset = parsed_dataset.repeat()
        parsed_dataset = parsed_dataset.shuffle(SHUFFLE_BUFFER)
        parsed_dataset = parsed_dataset.batch(BATCH_SIZE)
        # iterator =
        # batched_context,batched_utterance,batched_labels = iterator.get_next()
        return parsed_dataset

    parsed_dataset = create_batched_dataset(OUTPUT_PATH)

    ''' Attempting GradientTape '''

    # reference -
    optimizer = RMSprop(learning_rate=0.001, rho=0.9, momentum=0.1, epsilon=1e-07, centered=False)

    epochs = 10
    for epoch in range(epochs):
        print('Start of epoch %d' % (epoch,))

        # Iterate over the batches of the dataset.
        for step, row in enumerate(parsed_dataset):
            input_batch_context,input_batch_utterance,input_batch_label = row
            #print("Context :",input_batch_context)
            with tf.GradientTape() as tape:

                # Run the forward pass of the layer. The operations that the layer applies to its inputs are going to be recorded on the GradientTape.
                pred = dual_encoder([input_batch_context, input_batch_utterance])
                #print("Prediction :",pred)
                #print("Label :",input_batch_label)
                # Compute the loss value for this minibatch.
                loss_value = binary_crossentropy(input_batch_label, pred)
                #print("Loss :",loss_value)

            # Use the gradient tape to automatically retrieve the gradients of the trainable variables with respect to the loss.
            grads = tape.gradient(loss_value, dual_encoder.trainable_weights)

            # Run one step of gradient descent by updating the value of the variables to minimize the loss.
            optimizer.apply_gradients(zip(grads, dual_encoder.trainable_weights))

            # Log every 200 batches.
            if step % 200 == 0:
                print('Training loss (for one batch) at step %s: %s' % (step, float(loss_value)))
                print('Seen so far: %s samples' % ((step + 1) * BATCH_SIZE))
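
A side note on shapes, sketched with NumPy stand-ins: the arrays `generated_response` and `encoded_utterance` below are random placeholders, not outputs of the real model. Multiplying `generated_response` by the transpose of `encoded_utterance` gives one score per example only while `BATCH_SIZE` is 1. With a larger batch it produces a `(batch, batch)` matrix that pairs every context with every utterance, whereas a row-wise dot product keeps one score per pair:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, hidden = 4, 256  # illustrative sizes; the post uses a batch size of 1

# Random placeholders standing in for the two encoded inputs.
generated_response = rng.normal(size=(batch_size, hidden))
encoded_utterance = rng.normal(size=(batch_size, hidden))

# Same pattern as the model code: matmul with the transposed utterance encoding.
all_pairs = generated_response @ encoded_utterance.T

# Row-wise dot product: one score per (context, utterance) pair.
per_example = np.sum(generated_response * encoded_utterance, axis=1, keepdims=True)

print(all_pairs.shape)
print(per_example.shape)
```

The diagonal of `all_pairs` holds the same values as `per_example`, so the two forms agree for a batch of one and diverge only when the batch grows.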

