What is the difference between a sigmoid followed by the cross entropy and sigmoid_cross_entropy_with_logits in TensorFlow?

不知归路 2020-11-27 11:07

When trying to get cross-entropy with a sigmoid activation function, there is a difference between

  1. loss1 = -tf.reduce_sum(p*tf.log(q), 1)
  2. loss2 = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q), 1)
2 Answers
  • 2020-11-27 11:16

    You can understand the difference between softmax and sigmoid cross-entropy in the following way:

    1. softmax cross-entropy operates on a single probability distribution over all classes
    2. sigmoid cross-entropy operates on multiple independent binary probability distributions; each of them can be treated as a two-class distribution

    In both cases the cross-entropy has the form

       p * -tf.log(q)
    

    For softmax cross-entropy it looks exactly like the formula above.

    For sigmoid cross-entropy it looks a little different, because each independent binary distribution contributes both of its classes:

    p * -tf.log(q) + (1-p) * -tf.log(1-q)
    

    Here p and (1-p) are the two class probabilities within each binary distribution.
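
    For concreteness, here is a minimal NumPy sketch of the two formulas above (a sketch of my own; the variable names and numbers are illustrative, not from the answer):

    import numpy as np

    p = np.array([0., 0., 1.])           # one-hot label for the softmax case
    logits = np.array([0.2, 0.3, 0.5])

    # softmax cross-entropy: one distribution over all classes, one number per example
    q_soft = np.exp(logits) / np.exp(logits).sum()
    softmax_ce = np.sum(p * -np.log(q_soft))

    # sigmoid cross-entropy: one independent Bernoulli per class, so each class
    # contributes both the p term and the (1 - p) term
    q_sig = 1.0 / (1.0 + np.exp(-logits))
    sigmoid_ce = p * -np.log(q_sig) + (1 - p) * -np.log(1 - q_sig)

    print(softmax_ce)   # a single number
    print(sigmoid_ce)   # one value per class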

  • 2020-11-27 11:19

    You're confusing the cross-entropy for binary and multi-class problems.

    Multi-class cross-entropy

    The formula that you use is correct and it directly corresponds to tf.nn.softmax_cross_entropy_with_logits:

    -tf.reduce_sum(p * tf.log(q), axis=1)
    

    p and q are expected to be probability distributions over N classes. In particular, N can be 2, as in the following example:

    p = tf.placeholder(tf.float32, shape=[None, 2])
    logit_q = tf.placeholder(tf.float32, shape=[None, 2])
    q = tf.nn.softmax(logit_q)
    
    feed_dict = {
      p: [[0, 1],
          [1, 0],
          [1, 0]],
      logit_q: [[0.2, 0.8],
                [0.7, 0.3],
                [0.5, 0.5]]
    }
    
    prob1 = -tf.reduce_sum(p * tf.log(q), axis=1)
    prob2 = tf.nn.softmax_cross_entropy_with_logits(labels=p, logits=logit_q)
    print(prob1.eval(feed_dict))  # [ 0.43748799  0.51301527  0.69314718]
    print(prob2.eval(feed_dict))  # [ 0.43748799  0.51301527  0.69314718]
    

    Note that q is computed with tf.nn.softmax, i.e. it outputs a probability distribution. So it's still the multi-class cross-entropy formula, only for N = 2.
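
    As a side note (my own illustration, not part of the original answer), one way to see that each row above is still a single two-class distribution is that a two-class softmax is just a sigmoid applied to the difference of the two logits:

    import numpy as np

    a, b = 0.2, 0.8                                   # the two logits of one row
    softmax_b = np.exp(b) / (np.exp(a) + np.exp(b))   # softmax probability of the second class
    sigmoid_diff = 1.0 / (1.0 + np.exp(-(b - a)))     # sigmoid of the logit difference
    print(softmax_b, sigmoid_diff)                    # both ≈ 0.6457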

    Binary cross-entropy

    This time the correct formula is

    p * -tf.log(q) + (1 - p) * -tf.log(1 - q)
    

    Though mathematically it's a special case of the multi-class formula, the meaning of p and q is different. In the simplest case, each p and q is a single number, corresponding to the probability of class A.

    Important: Don't get confused by the common p * -tf.log(q) part and the sum. The previous p was a one-hot vector, now it's a number, zero or one. Same for q - it was a probability distribution, now it's a number (a probability).
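
    A tiny scalar sketch of that case (my own, with numbers chosen purely for illustration):

    import numpy as np

    p = 1.0          # the true label: the class is "on"
    q = 0.7          # the predicted probability of the class being "on"
    ce = p * -np.log(q) + (1 - p) * -np.log(1 - q)
    print(ce)        # ≈ 0.3567, i.e. just -log(0.7), since the second term vanishes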

    If p is a vector, each individual component is considered an independent binary classification. See this answer that outlines the difference between softmax and sigmoid functions in tensorflow. So the definition p = [0, 0, 0, 1, 0] doesn't mean a one-hot vector, but 5 different features, 4 of which are off and 1 is on. The definition q = [0.2, 0.2, 0.2, 0.2, 0.2] means that each of 5 features is on with 20% probability.

    This explains the use of the sigmoid function before the cross-entropy: its goal is to squash the logit into the [0, 1] interval.

    The formula above still holds for multiple independent features, and that's exactly what tf.nn.sigmoid_cross_entropy_with_logits computes:

    p = tf.placeholder(tf.float32, shape=[None, 5])
    logit_q = tf.placeholder(tf.float32, shape=[None, 5])
    q = tf.nn.sigmoid(logit_q)
    
    feed_dict = {
      p: [[0, 0, 0, 1, 0],
          [1, 0, 0, 0, 0]],
      logit_q: [[0.2, 0.2, 0.2, 0.2, 0.2],
                [0.3, 0.3, 0.2, 0.1, 0.1]]
    }
    
    prob1 = -p * tf.log(q)
    prob2 = p * -tf.log(q) + (1 - p) * -tf.log(1 - q)
    prob3 = p * -tf.log(tf.sigmoid(logit_q)) + (1-p) * -tf.log(1-tf.sigmoid(logit_q))
    prob4 = tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q)
    print(prob1.eval(feed_dict))
    print(prob2.eval(feed_dict))
    print(prob3.eval(feed_dict))
    print(prob4.eval(feed_dict))
    

    You should see that the last three tensors are equal, while prob1 is only part of the cross-entropy, so it contains the correct value only where p is 1:

    [[ 0.          0.          0.          0.59813893  0.        ]
     [ 0.55435514  0.          0.          0.          0.        ]]
    [[ 0.79813886  0.79813886  0.79813886  0.59813887  0.79813886]
     [ 0.5543552   0.85435522  0.79813886  0.74439669  0.74439669]]
    [[ 0.7981388   0.7981388   0.7981388   0.59813893  0.7981388 ]
     [ 0.55435514  0.85435534  0.7981388   0.74439663  0.74439663]]
    [[ 0.7981388   0.7981388   0.7981388   0.59813893  0.7981388 ]
     [ 0.55435514  0.85435534  0.7981388   0.74439663  0.74439663]]
    

    Now it should be clear that taking a sum of -p * tf.log(q) along axis=1 doesn't make sense in this setting, though it would be a valid formula in the multi-class case.
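
    If you do want a single loss value per example (or a scalar training loss) in the sigmoid setting, the usual approach is to reduce the per-feature losses afterwards. A sketch, reusing p and logit_q from the example above and assuming all features are weighted equally:

    # one loss value per (example, feature) pair
    per_feature = tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q)
    # one loss value per example
    per_example = tf.reduce_sum(per_feature, axis=1)   # or tf.reduce_mean
    # scalar loss for training
    loss = tf.reduce_mean(per_example)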
