Using a nonlinear activation function is important in neural networks, especially in deep NNs trained with backpropagation. Following the question posed in the topic, I will first explain why a nonlinear activation function is needed for backpropagation.
Simply put: if a linear activation function is used, its derivative with respect to (w.r.t.) the input is a constant, so the value of the input (to the neurons) has no effect on the weight update through the activation. The activation contributes the same constant factor to every update, so it gives the network no information about which neurons' inputs were most responsible for a good or bad result.
Deeper: In general, weights are updated as follows:
W_new = W_old - Learn_rate * D_loss
This means the new weight equals the old weight minus the learning rate times the derivative of the cost function w.r.t. that weight (D_loss). If the activation function is linear, its derivative w.r.t. its input is a constant, so the input value has no direct effect on the weight update through the activation.
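To make the update rule concrete, here is a minimal sketch of a single gradient-descent step in Python; the learning rate, old weight, and gradient are made-up numbers used purely for illustration:

```python
# One gradient-descent step for a single weight (illustrative numbers only).
learn_rate = 0.1   # Learn_rate in the formula above
w_old = 0.5        # W_old
d_loss = 0.8       # D_loss: gradient of the cost w.r.t. this weight

w_new = w_old - learn_rate * d_loss
print(w_new)       # 0.5 - 0.1 * 0.8 = 0.42
```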
For example, suppose we want to update the weights of the last-layer neurons using backpropagation. We need to calculate the gradient of the cost function w.r.t. the weight. With the chain rule (for a single neuron with squared-error cost Loss = 1/2 * (h - y)^2 and pre-activation z = W * x) we have:

D_loss = dLoss/dW = dLoss/dh * dh/dz * dz/dW = (h - y) * grad(f) * x

Here h and y are the (estimated) neuron output and the actual output value, respectively, and x is the input of the neuron. grad(f) is the derivative of the activation function w.r.t. its input z. The value calculated above (times the learning rate) is subtracted from the current weight to obtain the new weight. We can now compare the two types of activation functions more clearly.
1- If the activation function is a linear function, such as:

F(x) = 2 * x

then its derivative is a constant:

grad(f) = 2

and the new weight will be:

W_new = W_old - Learn_rate * (h - y) * 2 * x

As you can see, the factor contributed by the activation is the same constant 2 for every neuron and every input; the update carries no information about where the neuron was operating, no matter what the input value is!
2- But if we use a non-linear activation function like Tanh(x), then:

grad(f) = 1 - Tanh(W * x)^2

and:

W_new = W_old - Learn_rate * (h - y) * (1 - Tanh(W * x)^2) * x

Now we can see the direct effect of the input on the weight update: different input values lead to different weight changes.
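To see the difference numerically, here is a small NumPy sketch of the single-neuron update derived above; the two activations match the cases just discussed, while the loss, initial weight, target, and inputs are assumptions chosen only for illustration:

```python
import numpy as np

def weight_update(w_old, x, y, activation, activation_grad, learn_rate=0.1):
    """One backpropagation step for a single neuron with squared-error loss."""
    z = w_old * x                                # neuron pre-activation
    h = activation(z)                            # neuron output
    d_loss = (h - y) * activation_grad(z) * x    # chain rule from the text
    return w_old - learn_rate * d_loss

# Case 1: linear activation F(x) = 2 * x, whose derivative is always 2.
linear = lambda z: 2.0 * z
linear_grad = lambda z: 2.0

# Case 2: Tanh activation, whose derivative 1 - tanh(z)^2 depends on the input.
tanh_grad = lambda z: 1.0 - np.tanh(z) ** 2

for x in (0.1, 2.0, 5.0):
    print(x,
          weight_update(0.5, x, y=1.0, activation=linear, activation_grad=linear_grad),
          weight_update(0.5, x, y=1.0, activation=np.tanh, activation_grad=tanh_grad))
```

With the linear activation, grad(f) is 2 for every input; with Tanh, that same factor shrinks as the input grows and the neuron saturates, so the input directly shapes the size of the update.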
I think the above is enough to answer the question of the topic, but it is useful to mention the other benefits of using a non-linear activation function.
As mentioned in other answers, non-linearity is what makes additional hidden layers, and therefore deeper NNs, worthwhile. A sequence of layers with linear activation functions can be merged into a single layer (the composition of the previous linear functions), so it is effectively a single-layer network and does not gain the benefits of a deep NN.
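A quick NumPy check of this collapse (bias terms are omitted for brevity; the layer sizes and random weights are arbitrary, chosen only to illustrate the point):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked layers with purely linear (identity) activations.
W1 = rng.normal(size=(4, 3))   # first layer: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))   # second layer: 4 hidden units -> 2 outputs
x = rng.normal(size=3)

two_layer_output = W2 @ (W1 @ x)   # "deep" network with linear activations
W_merged = W2 @ W1                 # both layers merged into one weight matrix
one_layer_output = W_merged @ x

print(np.allclose(two_layer_output, one_layer_output))   # True
```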
Saturating non-linear activation functions such as sigmoid and Tanh can also produce a normalized (bounded) output, squashing any input into (0, 1) or (-1, 1), respectively.
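For example, a tiny check of that squashing behaviour (the sample inputs are arbitrary):

```python
import numpy as np

z = np.array([-100.0, -2.0, 0.0, 2.0, 100.0])
print(np.tanh(z))            # all outputs stay within [-1, 1]
print(1 / (1 + np.exp(-z)))  # sigmoid: all outputs stay within [0, 1]
```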