Why is the input scaled in tf.nn.dropout in TensorFlow?

自闭症患者 2021-01-30 13:31

I can't understand why dropout works like this in TensorFlow. The CS231n notes say that "dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise." So why does tf.nn.dropout additionally scale the kept values by 1/keep_prob instead of just masking them?
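
For context, here is a minimal sketch of the behaviour the question is about, assuming TensorFlow 2.x, where tf.nn.dropout takes a rate (the drop probability) instead of the older keep_prob argument:

    import tensorflow as tf

    x = tf.ones([10])
    y = tf.nn.dropout(x, rate=0.5)  # kept entries are scaled up by 1/(1 - rate) = 2.0
    print(y.numpy())                # a mix of 0.0 and 2.0, so E[y] still equals x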

4 Answers

    一向 (OP) 2021-01-30 13:51

    Here is a quick experiment to dispel any remaining confusion.

    Statistically, the weights of a neural-network layer follow a distribution that is usually close to normal (though not necessarily), and even when you deliberately sample from a perfect normal distribution, in practice there is always some sampling error.

    Then consider the following experiment:

    import numpy as np
    from collections import defaultdict

    DIM = 1_000_000                      # dimensionality of the weights and the input
    x = np.ones((DIM, 1))                # our input vector
    #x = np.random.rand(DIM, 1)*2 - 1.0  # or a more realistic normalized input
    print("x-mean = ", x.mean())

    probs = [1.0, 0.7, 0.5, 0.3]         # keep probabilities to test (TF's keep_prob)

    W = np.random.normal(size=(DIM, 1))  # sample normally distributed weights
    print("W-mean = ", W.mean())         # the mean is not exactly zero --> sampling error!

    # DO THE DRILL
    h = defaultdict(list)
    for i in range(1000):
        for p in probs:
            M = (np.random.rand(DIM, 1) < p).astype(int)  # Bernoulli mask: keep each weight with prob p
            Wp = W * M                                    # weights after dropout
            a = np.dot(Wp.T, x)                           # linear activation of the masked layer
            h[str(p)].append(a)

    for k, v in h.items():
        print("For drop-out prob %r the average linear activation is %r (unscaled) and %r (scaled)"
              % (k, np.mean(v), np.mean(v) / float(k)))

    Sample output:

    x-mean =  1.0
    W-mean =  -0.001003985674840264
    For drop-out prob '1.0' the average linear activation is -1003.985674840258 (unscaled) and -1003.985674840258 (scaled)
    For drop-out prob '0.7' the average linear activation is -700.6128015029908 (unscaled) and -1000.8754307185584 (scaled)
    For drop-out prob '0.5' the average linear activation is -512.1602655283492 (unscaled) and -1024.3205310566984 (scaled)
    For drop-out prob '0.3' the average linear activation is -303.21194422742315 (unscaled) and -1010.7064807580772 (scaled)
    

    Notice how the unscaled activations shrink roughly in proportion to the keep probability, while the scaled activations all stay close to the full-network value. They are non-zero at all only because the sampled normal distribution is statistically imperfect, i.e. its mean is not exactly zero.

    Can you spot the obvious relationship between the W-mean and the average linear activations? Each scaled activation is roughly W-mean × DIM ≈ -0.001004 × 1,000,000 ≈ -1004, which is what the full, un-dropped layer produces; the unscaled activations are that value multiplied by the keep probability.
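
    If you want a quick back-of-the-envelope check of that relationship, here is a small sketch using the W-mean printed above (with x all ones, the activation is just the sum of the kept weights):

    DIM = 1_000_000
    W_mean = -0.001003985674840264            # the sampled W-mean printed above

    for p in [1.0, 0.7, 0.5, 0.3]:
        expected_unscaled = p * DIM * W_mean  # E[sum of kept weights] = p * DIM * mean(W)
        expected_scaled = expected_unscaled / p
        print(p, round(expected_unscaled, 1), round(expected_scaled, 1))

    # Prints roughly -1004.0, -702.8, -502.0, -301.2 unscaled and -1004.0 scaled each time,
    # which matches the simulated averages above up to the noise of the random masks.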
