In TensorFlow, can you use a non-smooth function as a loss function, such as a piece-wise function (or one with if-else)? If you can't, why can you use ReLU?
TensorFlow does not compute gradients for all functions automatically, even if one uses some backend functions. Please see Errors when Building up a Custom Loss Function, a question I posted for a task I was working on and eventually answered myself.
That being said, one can only approximate non-differentiable functions, such as piece-wise constant/step functions, with piece-wise differentiable surrogates. The following is my MATLAB implementation of such an idea. One may easily extend it to cases with more thresholds (junctures) and the desired boundary conditions.
function [s, ds] = QPWC_Neuron(z, sharp)
% A special case of a (quadruple) piece-wise constant neuron composed of three sigmoid functions
% There are three thresholds (junctures): 0.25, 0.5, and 0.75
% sharp determines how steep the steps between two junctures are
% The closer a point is to one of the junctures, the smaller its gradient becomes; gradients at the junctures are zero
% It handles 1D signals only, and it must be preceded by another activation function whose output falls within [0, 1]
% Example:
%   z = 0:0.001:1;
%   sharp = 100;
    LZ = length(z);
    s = zeros(size(z));
    ds = s;
    for l = 1:LZ
        if z(l) <= 0
            s(l) = 0;
            ds(l) = 0;
        elseif (z(l) > 0) && (z(l) <= 0.25)
            s(l) = 0.25 ./ (1 + exp(-sharp*((z(l)-0.125)./0.25)));
            ds(l) = sharp/0.25 * (s(l)-0) * (1 - (s(l)-0)/0.25);
        elseif (z(l) > 0.25) && (z(l) <= 0.5)
            s(l) = 0.25 ./ (1 + exp(-sharp*((z(l)-0.375)./0.25))) + 0.25;
            ds(l) = sharp/0.25 * (s(l)-0.25) * (1 - (s(l)-0.25)/0.25);
        elseif (z(l) > 0.5) && (z(l) <= 0.75)
            s(l) = 0.25 ./ (1 + exp(-sharp*((z(l)-0.625)./0.25))) + 0.5;
            ds(l) = sharp/0.25 * (s(l)-0.5) * (1 - (s(l)-0.5)/0.25);
        elseif (z(l) > 0.75) && (z(l) < 1)
            % The last segment is wider, so the step towards 1 is steeper than in the other cases
            s(l) = 0.5 ./ (1 + exp(-sharp*((z(l)-1)./0.5))) + 0.75;
            ds(l) = sharp/0.5 * (s(l)-0.75) * (1 - (s(l)-0.75)/0.5);
        else
            s(l) = 1;
            ds(l) = 0;
        end
    end
    figure;
    subplot(1, 2, 1), plot(z, s); xlim([0, 1]); grid on;
    subplot(1, 2, 2), plot(z, ds); xlim([0, 1]); grid on;
end
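For reference, here is a rough TensorFlow sketch of the same smoothing idea (the function name qpwc and the plain sum of steep sigmoids are my own choices, not a line-by-line port of the MATLAB code). Because it is built only from differentiable ops, TensorFlow can backpropagate through it without a hand-written ds:

import tensorflow as tf

def qpwc(z, sharp=100.0):
    # Smooth approximation of a 4-level staircase on [0, 1]: a sum of
    # steep sigmoids centred at the midpoints of the four segments.
    # Assumes z has already been squashed into [0, 1].
    z = tf.clip_by_value(z, 0.0, 1.0)
    s = 0.25 * tf.sigmoid(sharp * (z - 0.125) / 0.25)
    s += 0.25 * tf.sigmoid(sharp * (z - 0.375) / 0.25)
    s += 0.25 * tf.sigmoid(sharp * (z - 0.625) / 0.25)
    s += 0.25 * tf.sigmoid(sharp * (z - 0.875) / 0.25)
    return s

Since every operation here is differentiable, the gradients (the MATLAB ds) come for free from automatic differentiation.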
The problem is not with the loss being piece-wise or non-smooth. The problem is that we need a loss function that can send back a non-zero gradient to the network parameters (dloss/dparameter) when there is an error between the output and the expected output. This applies to almost any function used inside the model (e.g. loss functions, activation functions, attention functions).
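As a concrete illustration, a piece-wise ("if/else") loss is perfectly fine as long as a non-zero gradient survives. Here is a minimal sketch, essentially a hand-rolled Huber loss written with tf.where (the name huber_like_loss is mine; tf.keras.losses.Huber already provides this behaviour):

import tensorflow as tf

def huber_like_loss(y_true, y_pred, delta=1.0):
    # Quadratic near zero error, linear far away: two branches joined by
    # tf.where. Autodiff differentiates both branches and selects the
    # right one per element, so a non-zero gradient reaches the weights.
    err = y_true - y_pred
    abs_err = tf.abs(err)
    quadratic = 0.5 * tf.square(err)
    linear = delta * (abs_err - 0.5 * delta)
    return tf.reduce_mean(tf.where(abs_err <= delta, quadratic, linear))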
For example, perceptrons use a unit step H(x) as an activation function (H(x) = 1 if x > 0, else 0). Since the derivative of H(x) is always zero (and undefined at x = 0), no gradient coming from the loss will pass through it back to the weights (chain rule), so no weights before that function in the network can be updated by gradient descent. Because of that, gradient descent can't be used for perceptrons, but it can be used for conventional neurons that use the sigmoid activation function (since the gradient is not zero for all x).
For ReLU, the derivative is 1 for x > 0 and 0 otherwise. While the derivative is undefined at x = 0, we can still back-propagate the loss gradient through it when x > 0. That's why it can be used (see the quick check below).
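A quick check of the last two paragraphs with tf.GradientTape (assuming TF 2.x eager execution):

import tensorflow as tf

x = tf.Variable([-1.0, 0.5, 2.0])
with tf.GradientTape(persistent=True) as tape:
    step = tf.cast(x > 0.0, tf.float32)  # unit step H(x)
    sig  = tf.sigmoid(x)                 # smooth everywhere
    relu = tf.nn.relu(x)                 # kinked at 0, differentiable elsewhere

print(tape.gradient(step, x))  # None: the comparison/cast has no gradient path
print(tape.gradient(sig, x))   # non-zero for every x
print(tape.gradient(relu, x))  # [0., 1., 1.] -- zero only where x <= 0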
That is why we need a loss function with a non-zero gradient. Functions like accuracy and F1 have gradients that are zero everywhere (or undefined at some values of x), so they can't be used, while functions like cross-entropy, L2, and L1 have non-zero gradients, so they can be used. (Note that the L1 "absolute difference" loss is piece-wise and not smooth at x = 0 but can still be used.)
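The same tape trick makes the accuracy-versus-L1/cross-entropy point visible (a small sketch; the numbers are arbitrary):

import tensorflow as tf

y_true = tf.constant([1.0, 0.0, 1.0])
logits = tf.Variable([0.3, -0.2, 0.8])
with tf.GradientTape(persistent=True) as tape:
    probs = tf.sigmoid(logits)
    acc  = tf.reduce_mean(tf.cast(tf.round(probs) == y_true, tf.float32))
    l1   = tf.reduce_mean(tf.abs(y_true - probs))
    xent = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, probs))

print(tape.gradient(acc, logits))   # None: rounding/comparison kill the gradient
print(tape.gradient(l1, logits))    # non-zero despite the kink of |.| at 0
print(tape.gradient(xent, logits))  # non-zero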
In case you must use a function that doesn't meet the above criteria, try reinforcement learning methods instead (e.g. Policy gradient).
As far as Question #3 of the OP goes, you actually don't have to implement the gradient computations yourself. TensorFlow will do that for you, which is one of the things I love about it!
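For example, you can hand Keras a hand-written piece-wise loss and never touch a gradient (a minimal sketch; piecewise_mae and the random data are made up purely for illustration):

import tensorflow as tf

def piecewise_mae(y_true, y_pred):
    # An arbitrary piece-wise loss: plain |err| below 1, twice as steep above.
    err = tf.abs(y_true - y_pred)
    return tf.reduce_mean(tf.where(err < 1.0, err, 2.0 * err - 1.0))

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss=piecewise_mae)

# TensorFlow derives dloss/dweights by automatic differentiation;
# no manual gradient code is needed anywhere.
x = tf.random.normal((32, 4))
y = tf.random.normal((32, 1))
model.fit(x, y, epochs=1, verbose=0)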