I'm writing some basic neural network methods - specifically the activation functions - and have hit the limits of my rubbish knowledge of math. I understand the respective ran
Generally, the most important differences are: (a) smooth and continuously differentiable (like tanh and logistic) vs. step or truncated; (b) competitive vs. transfer; (c) sigmoid vs. radial; (d) symmetric (-1,+1) vs. asymmetric (0,1).
Generally, differentiability is required for hidden layers, and tanh is often recommended as being more balanced. An output of 0 for tanh sits at its steepest point (highest gradient or gain) and is not a trap, while for logistic an output of 0 is the lowest point, where the gradient vanishes, and is a trap for anything pushing deeper into negative territory. Radial (basis) functions measure distance from a typical prototype and suit convex, roughly circular regions around a neuron, while sigmoid functions separate linearly and suit half-spaces. Many sigmoids are needed for a good approximation to a convex region, with circular/spherical regions being the worst case for sigmoids and the best for radials.
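To make the "trap" point concrete, here is a minimal sketch comparing the two sigmoids and their gradients. It uses only the standard library; the helper names (logistic, logistic_grad, tanh_grad) are my own, not from any particular framework.

```python
import math

def logistic(x):
    """Logistic sigmoid, output in (0, 1). Written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def logistic_grad(x):
    """Derivative of the logistic: s * (1 - s), largest at x = 0 (output 0.5)."""
    s = logistic(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, largest at x = 0 (output 0)."""
    t = math.tanh(x)
    return 1.0 - t * t

if __name__ == "__main__":
    # Print output and gradient side by side for a few inputs.
    for x in (-6.0, -2.0, 0.0, 2.0, 6.0):
        print(f"x={x:+.1f}  tanh={math.tanh(x):+.4f} grad={tanh_grad(x):.4f}  "
              f"logistic={logistic(x):.4f} grad={logistic_grad(x):.4f}")
```

For tanh the output 0 coincides with the maximum gradient, whereas the logistic only approaches output 0 for very negative inputs, where its gradient is also close to 0.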
Generally, the recommendation is to use tanh on the intermediate layers for +/- balance, and to suit the output layer to the task: a boolean/dichotomous class decision can use threshold, logistic or competitive outputs (e.g. softmax, a self-normalizing multiclass generalization of logistic), while regression tasks can even be linear. The output layer doesn't need to be continuously differentiable. The input layer should be normalized in some way, either to [0,1] or, better still, by standardization or by normalization with demeaning to [-1,+1]. If you include a dummy input of 1 and then normalize so that ||x||p = 1, you are dividing by a sum or length, and this magnitude information is retained in the dummy bias input rather than being lost. If you normalize over examples, this technically interferes with your test data if you look at them, or they may be out of range if you don't. But with ||x||2 normalization such variations or errors should approach the normal distribution if they are effects of natural distribution or error. This means that with high probability they won't exceed the original range (probably around 2 standard deviations) by more than a small factor (i.e. such over-range values are regarded as outliers and not significant).
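The two input treatments above can be sketched as follows. This is only an illustration under my own assumptions: the function names are hypothetical, the population standard deviation is used, the statistics are taken from the training examples only, and a constant column is left unscaled to avoid dividing by zero.

```python
import statistics

def normalize_with_bias(x, p=2):
    """Append a dummy input of 1, then scale so that the p-norm of the result is 1.

    The last component ends up as 1/norm, so the original magnitude is
    still recoverable from the dummy bias input.
    """
    augmented = list(x) + [1.0]
    norm = sum(abs(v) ** p for v in augmented) ** (1.0 / p)
    return [v / norm for v in augmented]

def fit_standardizer(train_rows):
    """Per-input mean and standard deviation, computed over training examples only."""
    columns = list(zip(*train_rows))
    means = [statistics.mean(col) for col in columns]
    # Assumption: a constant column gets scale 1.0 instead of 0.0.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(row, means, stds):
    """Demean and scale one example using previously fitted statistics."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]

if __name__ == "__main__":
    print(normalize_with_bias([3.0, 4.0]))
    means, stds = fit_standardizer([[1.0, 10.0], [3.0, 30.0]])
    # A test example may land outside the training range; it is not clipped here.
    print(standardize([5.0, 20.0], means, stds))
```

Dividing any component of the normalized vector by its last component gives back the original value, which is the sense in which the magnitude is retained rather than lost.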
So I recommend unbiased instance normalization or biased pattern standardization or both on the input layer (possibly with data reduction via SVD), tanh on the hidden layers, and a threshold function, logistic function or competitive function on the output for classification; for regression, use a linear output with unnormalized targets or perhaps logsig with normalized targets.
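Putting the recommendation together, a forward pass might look roughly like the sketch below: tanh on a single hidden layer, then either softmax (one competitive choice) for classification or a plain linear output for regression. The function names, the single hidden layer and the list-of-rows weight layout are assumptions made for the example.

```python
import math

def softmax(z):
    """Competitive output: non-negative values that sum to 1."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dense(weights, biases, inputs):
    """Weighted sum per neuron; weights holds one row of input weights per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, w_hidden, b_hidden, w_out, b_out, task="classification"):
    """tanh hidden layer, then softmax (classification) or linear (regression) output."""
    hidden = [math.tanh(a) for a in dense(w_hidden, b_hidden, x)]
    out = dense(w_out, b_out, hidden)
    return softmax(out) if task == "classification" else out

if __name__ == "__main__":
    # Hypothetical weights: 2 inputs, 2 hidden neurons, 2 outputs.
    w_h = [[0.5, -0.5], [1.0, 1.0]]
    b_h = [0.0, 0.0]
    w_o = [[1.0, -1.0], [-1.0, 1.0]]
    b_o = [0.0, 0.0]
    print(forward([0.2, -0.4], w_h, b_h, w_o, b_o))
    print(forward([0.2, -0.4], w_h, b_h, w_o, b_o, task="regression"))
```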