In TensorFlow, can you use a non-smooth function as a loss function, such as a piece-wise one (or one with if-else)? If you can't, why can you use ReLU?
The problem is not that the loss is piece-wise or non-smooth. The problem is that we need a loss function that can send a non-zero gradient back to the network parameters (dloss/dparameter) whenever there is an error between the output and the expected output. This applies to almost any function used inside the model (e.g. loss functions, activation functions, attention functions).
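As a concrete illustration, here is a minimal sketch of a piece-wise (if-else style) loss built with `tf.where`; it is essentially the Huber loss, and the names `piecewise_loss` and `delta` are just illustrative. Because gradients flow through whichever branch is selected, a Keras model trains on it without any special handling:

```python
import tensorflow as tf

# Piece-wise ("if-else") loss: quadratic for small errors, linear for large ones.
# This is essentially the Huber loss; tf.where selects a branch per element,
# and gradients flow through whichever branch is active.
def piecewise_loss(y_true, y_pred, delta=1.0):
    err = tf.abs(y_true - y_pred)
    return tf.reduce_mean(
        tf.where(err <= delta,
                 0.5 * tf.square(err),            # smooth branch
                 delta * err - 0.5 * delta ** 2)  # linear branch
    )

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss=piecewise_loss)

x = tf.random.normal((32, 3))
y = tf.random.normal((32, 1))
model.fit(x, y, epochs=1, verbose=0)  # trains normally despite the if-else
```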
For example, perceptrons use the unit step H(x) as an activation function (H(x) = 1 if x > 0 else 0). Since the derivative of H(x) is zero everywhere (and undefined at x = 0), no gradient coming from the loss can pass through it back to the weights (chain rule), so no weights before that function in the network can be updated using gradient descent. Because of that, gradient descent can't be used for perceptrons, but it can be used for conventional neurons that use the sigmoid activation function (since its gradient is non-zero for all x).
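You can check this directly with `tf.GradientTape` (a small sketch; the exact numbers are illustrative). The unit step produces no usable gradient, while sigmoid does:

```python
import tensorflow as tf

x = tf.Variable([-1.0, 0.5, 2.0])

# Unit step H(x): the comparison/cast has no gradient registered,
# so nothing can flow back to x.
with tf.GradientTape() as tape:
    step_out = tf.cast(x > 0, tf.float32)
print(tape.gradient(step_out, x))  # None: no gradient path at all

# Sigmoid: gradient is non-zero for every finite x.
with tf.GradientTape() as tape:
    sig_out = tf.sigmoid(x)
print(tape.gradient(sig_out, x))   # roughly [0.197, 0.235, 0.105]
```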
For ReLU, the derivative is 1 for x > 0 and 0 otherwise. While the derivative is undefined at x = 0, we can still back-propagate the loss gradient through it wherever x > 0. That's why it can be used.
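The same check shows how TensorFlow handles ReLU in practice: the gradient is 1 where x > 0 and, by convention in its registered gradient, 0 at exactly x = 0 (a minimal sketch):

```python
import tensorflow as tf

x = tf.Variable([-2.0, 0.0, 3.0])
with tf.GradientTape() as tape:
    y = tf.nn.relu(x)
# Gradient is 1 where x > 0, 0 where x < 0, and TensorFlow's
# registered gradient also returns 0 at exactly x = 0.
print(tape.gradient(y, x))  # [0., 0., 1.]
```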
That is why we need a loss function with a non-zero gradient. Functions like accuracy and F1 have zero gradients almost everywhere (and undefined ones at some values), so they can't be used, while functions like cross-entropy, L2, and L1 have non-zero gradients, so they can be used. (Note that L1, the absolute difference, is piece-wise and not smooth at x = 0, but it can still be used.)
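As a quick check that the piece-wise L1 loss still produces useful gradients, here is a small sketch with `tf.GradientTape`; the gradient is zero only where the error is exactly zero:

```python
import tensorflow as tf

y_true = tf.constant([1.0, 2.0, 3.0])
y_pred = tf.Variable([1.5, 2.0, 2.0])

with tf.GradientTape() as tape:
    l1 = tf.reduce_mean(tf.abs(y_true - y_pred))  # piece-wise, kink at error = 0
# Non-zero wherever the error is non-zero; zero only where the prediction
# already matches the target exactly.
print(tape.gradient(l1, y_pred))  # roughly [0.333, 0., -0.333]
```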
If you must use a function that doesn't meet the above criteria, try reinforcement learning methods instead (e.g. policy gradients).