Using a nonlinear activation function is important in neural networks, especially in deep NNs trained with backpropagation. Following the question posed in the topic, I will first explain why a nonlinear activation function is needed for backpropagation.
Simply put: if a linear activation function is used, its derivative with respect to (w.r.t.) the input is a constant, so the value of the input (to the neurons) has no effect on the weight update through the activation. The activation contributes the same constant factor to every update, so it gives the network no information about which neurons' inputs were most responsible for a good or bad result.
Deeper: In general, weights are updated as follows:
W_new = W_old - Learn_rate * D_loss
This means the new weight equals the old weight minus the learning rate times the derivative of the cost function w.r.t. that weight (D_loss). If the activation function is linear, its derivative w.r.t. its input is a constant, so the input value has no direct effect on the weight update through the activation.
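To make the update rule concrete, here is a minimal sketch of a single gradient-descent step in Python; the learning rate, old weight, and gradient are made-up numbers used purely for illustration:

```python
# One gradient-descent step for a single weight (illustrative numbers only).
learn_rate = 0.1   # Learn_rate in the formula above
w_old = 0.5        # W_old
d_loss = 0.8       # D_loss: gradient of the cost w.r.t. this weight

w_new = w_old - learn_rate * d_loss
print(w_new)       # 0.5 - 0.1 * 0.8 = 0.42
```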
For example, suppose we want to update the weights of the last-layer neurons using backpropagation. We need to calculate the gradient of the cost function w.r.t. the weight. With the chain rule (for a single neuron with squared-error cost Loss = 1/2 * (h - y)^2 and pre-activation z = W * x) we have:

D_loss = dLoss/dW = dLoss/dh * dh/dz * dz/dW = (h - y) * grad(f) * x

Here h and y are the (estimated) neuron output and the actual output value, respectively, and x is the input of the neuron. grad(f) is the derivative of the activation function w.r.t. its input z. The value calculated above (times the learning rate) is subtracted from the current weight to obtain the new weight. We can now compare the two types of activation functions more clearly.
1- If the activation function is a linear function, such as:

F(x) = 2 * x

then its derivative is a constant:

grad(f) = 2

and the new weight will be:

W_new = W_old - Learn_rate * (h - y) * 2 * x

As you can see, the factor contributed by the activation is the same constant 2 for every neuron and every input; the update carries no information about where the neuron was operating, no matter what the input value is!
2- But if we use a non-linear activation function like Tanh(x), then:

grad(f) = 1 - Tanh(W * x)^2

and:

W_new = W_old - Learn_rate * (h - y) * (1 - Tanh(W * x)^2) * x

Now we can see the direct effect of the input on the weight update: different input values lead to different weight changes.
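To see the difference numerically, here is a small NumPy sketch of the single-neuron update derived above; the two activations match the cases just discussed, while the loss, initial weight, target, and inputs are assumptions chosen only for illustration:

```python
import numpy as np

def weight_update(w_old, x, y, activation, activation_grad, learn_rate=0.1):
    """One backpropagation step for a single neuron with squared-error loss."""
    z = w_old * x                                # neuron pre-activation
    h = activation(z)                            # neuron output
    d_loss = (h - y) * activation_grad(z) * x    # chain rule from the text
    return w_old - learn_rate * d_loss

# Case 1: linear activation F(x) = 2 * x, whose derivative is always 2.
linear = lambda z: 2.0 * z
linear_grad = lambda z: 2.0

# Case 2: Tanh activation, whose derivative 1 - tanh(z)^2 depends on the input.
tanh_grad = lambda z: 1.0 - np.tanh(z) ** 2

for x in (0.1, 2.0, 5.0):
    print(x,
          weight_update(0.5, x, y=1.0, activation=linear, activation_grad=linear_grad),
          weight_update(0.5, x, y=1.0, activation=np.tanh, activation_grad=tanh_grad))
```

With the linear activation, grad(f) is 2 for every input; with Tanh, that same factor shrinks as the input grows and the neuron saturates, so the input directly shapes the size of the update.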
I think the above is enough to answer the question of the topic, but it is useful to mention the other benefits of using a non-linear activation function.
As mentioned in other answers, non-linearity is what makes additional hidden layers, and therefore deeper NNs, worthwhile. A sequence of layers with linear activation functions can be merged into a single layer (the composition of the previous linear functions), so it is effectively a single-layer network and does not gain the benefits of a deep NN.
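A quick NumPy check of this collapse (bias terms are omitted for brevity; the layer sizes and random weights are arbitrary, chosen only to illustrate the point):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked layers with purely linear (identity) activations.
W1 = rng.normal(size=(4, 3))   # first layer: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))   # second layer: 4 hidden units -> 2 outputs
x = rng.normal(size=3)

two_layer_output = W2 @ (W1 @ x)   # "deep" network with linear activations
W_merged = W2 @ W1                 # both layers merged into one weight matrix
one_layer_output = W_merged @ x

print(np.allclose(two_layer_output, one_layer_output))   # True
```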
Saturating non-linear activation functions such as sigmoid and Tanh can also produce a normalized (bounded) output, squashing any input into (0, 1) or (-1, 1), respectively.
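For example, a tiny check of that squashing behaviour (the sample inputs are arbitrary):

```python
import numpy as np

z = np.array([-100.0, -2.0, 0.0, 2.0, 100.0])
print(np.tanh(z))            # all outputs stay within [-1, 1]
print(1 / (1 + np.exp(-z)))  # sigmoid: all outputs stay within [0, 1]
```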