Going through this book, I am familiar with the following:
For each training instance the backpropagation algorithm first makes a prediction (forward pa
Thanks to the answer by David Parks for the valid contribution and useful links, however I have found the answer to this question by the author of the book himself, which may provide a more concise answer:
Bakpropagation refers to the whole process of training an artificial neural network using multiple backpropagation steps, each of which computes gradients and uses them to perform a Gradient Descent step. In contrast, reverse-mode auto diff is simply a technique used to compute gradients efficiently and it happens to be used by backpropagation.
The most important distinction between backpropagation and reverse-mode AD is that reverse-mode AD computes the vector-Jacobian product of a vector valued function from R^n -> R^m, while backpropagation computes the gradient of a scalar valued function from R^n -> R. Backpropagation is therefore a subset of reverse-mode AD.
When we train neural networks, we always have a scalar-valued loss function, so we are always using backpropagation. Since backprop is a subset of reverse-mode AD, then we are also using reverse-mode AD when we train a neural network.
Whether or not backpropagation takes the more general definition of reverse-mode AD as applied to a scalar loss function, or the more specific definition of reverse-mode AD as applied to a scalar loss function for training neural networks is a matter of personal taste. It's a word that has slightly different meaning in different contexts, but is most commonly used in the machine learning community to talk about computing gradients of neural network parameters using a scalar loss function.
For completeness: Sometimes reverse-mode AD can compute the full Jacobian on a single reverse pass, not just the vector-Jacobian product. Also, the vector Jacobian product for a scalar function where the vector is the vector [1.0] is the same as the gradient.
Automatic differentiation differs from the method taught in standard calculus classes on how gradients are computed, and in some features such as its native ability to take the gradient of a data structure and not just a well defined mathematical function. I'm not expert enough to go into further detail, but this is a great reference that explains it in much more depth:
https://alexey.radul.name/ideas/2013/introduction-to-automatic-differentiation/
Here's another guide that looks quite nice that I just found now.
https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation
I believe backprop may formally refer to the by-hand calculus algorithm for computing gradients, at least that's how it was originally derived and is how it's taught in classes on the subject. But in practice, backprop is used quite interchangeably with the automatic differentiation approach described in the above guides. So splitting those two terms is probably as much an effort in linguistics as it is mathematics.
I also noted this nice article on the backpropagation algorithm to compare against the above guides on automatic differentiation.
https://brilliant.org/wiki/backpropagation/