First of all, give up any notion that artificial neural networks have anything to do with the brain beyond a passing similarity to networks of biological neurons. Learning biology won't help you apply neural networks effectively; learning linear algebra, calculus, and probability theory will. At the very least, you should familiarize yourself with basic differentiation of functions, the chain rule, partial derivatives (the gradient, the Jacobian, and the Hessian), and matrix multiplication and diagonalization.
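To make the chain rule concrete, here is a minimal sketch of the derivative you would compute for a single sigmoid unit with a squared-error loss, checked against a finite-difference estimate. The function names (`loss`, `loss_grad`, `numeric_grad`) and the example values are my own choices for illustration, not from any particular library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, t):
    """Squared error of one sigmoid unit: E = 0.5 * (sigmoid(w * x) - t)^2."""
    y = sigmoid(w * x)
    return 0.5 * (y - t) ** 2

def loss_grad(w, x, t):
    """dE/dw via the chain rule: dE/dy * dy/dz * dz/dw, with z = w * x."""
    y = sigmoid(w * x)
    dE_dy = y - t            # derivative of 0.5 * (y - t)^2 with respect to y
    dy_dz = y * (1.0 - y)    # derivative of the sigmoid with respect to z
    dz_dw = x                # derivative of w * x with respect to w
    return dE_dy * dy_dz * dz_dw

def numeric_grad(w, x, t, eps=1e-6):
    """Central finite difference, used only to sanity-check the analytic gradient."""
    return (loss(w + eps, x, t) - loss(w - eps, x, t)) / (2 * eps)

# Arbitrary example values: one weight, one input, one target.
w, x, t = 0.5, 2.0, 1.0
analytic = loss_grad(w, x, t)
numeric = numeric_grad(w, x, t)
print(analytic, numeric)  # the two values should agree to several decimal places
```

Backpropagation is this same calculation applied layer by layer, and a finite-difference check like this is a cheap way to catch mistakes in hand-derived gradients.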
Really, what you are doing when you train a network is optimizing a large, multidimensional function (minimizing your error measure with respect to each of the weights in the network), so an investigation of techniques for nonlinear numerical optimization may prove instructive. This is a widely studied problem with a large body of literature outside of neural networks, and there are plenty of lecture notes on numerical optimization available on the web. To start, most people use simple gradient descent, but this can be much slower and less effective than more nuanced methods.
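For reference, simple gradient descent looks roughly like the sketch below. To keep it short it fits a two-weight linear model rather than a real network, and the data, learning rate, and step count are arbitrary assumptions; the update rule (step each weight against its partial derivative of the error) is the same one you would apply to network weights.

```python
def error(weights, data):
    """Half the mean squared error of the linear model y = w0 + w1 * x."""
    w0, w1 = weights
    return sum((w0 + w1 * x - t) ** 2 for x, t in data) / (2 * len(data))

def gradient(weights, data):
    """Partial derivatives of the error with respect to w0 and w1."""
    w0, w1 = weights
    n = len(data)
    g0 = sum((w0 + w1 * x - t) for x, t in data) / n
    g1 = sum((w0 + w1 * x - t) * x for x, t in data) / n
    return [g0, g1]

def gradient_descent(weights, data, learning_rate=0.1, steps=2000):
    """Repeatedly move each weight a small step against its gradient."""
    history = [error(weights, data)]
    for _ in range(steps):
        grad = gradient(weights, data)
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
        history.append(error(weights, data))
    return weights, history

# Toy data generated from y = 1 + 2x, so the ideal weights are [1, 2].
data = [(x / 4.0, 1.0 + 2.0 * (x / 4.0)) for x in range(-4, 5)]
weights, history = gradient_descent([0.0, 0.0], data)
print(weights, history[0], history[-1])
```

On a problem this small, plain gradient descent converges without trouble; the slowness shows up on badly conditioned error surfaces, which is where the more sophisticated optimizers earn their keep.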
Once you've got the basic ideas down you can start to experiment with different "squashing" functions in your hidden layer, adding various kinds of regularization, and various tweaks to make learning go faster. See this paper for a comprehensive list of "best practices".
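As a rough illustration of what swapping these pieces looks like, here is a sketch with two common squashing functions (the logistic sigmoid and tanh) alongside their derivatives, plus L2 weight decay as one example of regularization. The names `SQUASHERS`, `weight_decay_penalty`, and `weight_decay_gradient` are hypothetical, and the choice of these particular functions is mine rather than a recommendation from the paper mentioned above.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_prime(z):
    y = logistic(z)
    return y * (1.0 - y)

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2

# Each entry pairs a squashing function with its derivative, so the
# hidden-layer nonlinearity can be swapped without touching the rest
# of the training code.
SQUASHERS = {
    "logistic": (logistic, logistic_prime),
    "tanh": (math.tanh, tanh_prime),
}

def weight_decay_penalty(weights, lam):
    """L2 penalty added to the error: 0.5 * lam * sum of squared weights."""
    return 0.5 * lam * sum(w * w for w in weights)

def weight_decay_gradient(weights, lam):
    """Gradient of the L2 penalty, added to the error gradient for each weight."""
    return [lam * w for w in weights]

for name, (f, f_prime) in SQUASHERS.items():
    print(name, f(0.0), f_prime(0.0))
```

With weight decay, each gradient descent step also shrinks every weight slightly toward zero, which discourages the network from fitting noise with very large weights.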
One of the best books on the subject is Chris Bishop's Neural Networks for Pattern Recognition. It's fairly old by this stage but is still an excellent resource, and you can often find used copies online for about $30. The neural network chapter in his newer book, Pattern Recognition and Machine Learning, is also quite comprehensive. For a particularly good implementation-centric tutorial, see this one on CodeProject.com, which implements a clever sort of network called a convolutional network. It constrains connectivity in such a way as to make the network very good at learning to classify visual patterns.
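To give a feel for what "constraining connectivity" means, here is a one-dimensional sketch of my own (it is not taken from the CodeProject tutorial, which works on 2D images). Each output sees only a small window of the input, and every window reuses the same few weights, so the layer has far fewer parameters than a fully connected one.

```python
def conv1d(signal, kernel):
    """Slide one shared kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

signal = [0, 0, 1, 1, 1, 0, 0]
edge_kernel = [-1, 1]   # responds to rising (+1) and falling (-1) edges

response = conv1d(signal, edge_kernel)
print(response)  # [0, 1, 0, 0, -1, 0]

# Parameter count: the shared kernel has 2 weights, whereas a fully
# connected layer from 7 inputs to 6 outputs would have 7 * 6 = 42.
shared_params = len(edge_kernel)
dense_params = len(signal) * len(response)
```

The same detector fires wherever the pattern appears in the input, which is a large part of why this structure suits visual data.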
Support vector machines and other kernel methods have become quite popular because you can apply them without knowing what the hell you're doing and often get acceptable results. Neural networks, on the other hand, are huge optimization problems which require careful tuning, although they're still preferable for lots of problems, particularly large scale problems in domains like computer vision.