After profiling my backpropagation algorithm, I have learned that it is responsible for 60% of my computation time. Before I start looking at parallel alternatives...
You want to eliminate the conditional from inside your loop here:
```cpp
const double lower_layer_output = i > 0 ? outputs[lower_layer][k] : input[k]; // input layer semantics
```
You can eliminate this condition by calculating the zeroth iteration (the special case of i == 0) earlier, outside the main loop.
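A minimal sketch of that idea, assuming the usual layer/neuron/input loop nest (the `update` function below is just a stand-in for your real delta computation, which isn't shown):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical update step, only here to keep the sketch self-contained;
// your real delta computation goes in its place.
inline double update(double lower_layer_output) { return 0.1 * lower_layer_output; }

void update_layer(std::vector<std::vector<double>>& weights_i,
                  std::vector<std::vector<double>>& deltas_i,
                  const std::vector<double>& lower_outputs)
{
    // One branch-free inner loop: the caller decides whether lower_outputs is
    // `input` (for i == 0) or `outputs[i - 1]` (for i > 0), so the ternary
    // never executes inside the hot loop.
    for (std::size_t j = 0; j < weights_i.size(); ++j) {
        for (std::size_t k = 0; k < lower_outputs.size(); ++k) {
            const double delta = update(lower_outputs[k]);
            deltas_i[j][k] = delta;
            weights_i[j][k] += delta;
        }
    }
}

// Usage: peel the input layer out of the layer loop.
//   update_layer(weights[0], deltas[0], input);              // special case, i == 0
//   for (std::size_t i = 1; i < weights.size(); ++i)
//       update_layer(weights[i], deltas[i], outputs[i - 1]); // i > 0
```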
Then there are these two stores:

```cpp
deltas[i][j][k] = delta;
weights[i][j][k] += delta;
```
You mention using std::vector, so this is a vector of vectors of vectors? Your data is not going to be contiguous (except in the sense that each individual vector is contiguous). Consider using C-style arrays.
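If you'd rather not manage raw arrays by hand, a single flat std::vector with manual index arithmetic gives you the same contiguous layout; a rough sketch (the struct and its member names are my own, not from your code):

```cpp
#include <cstddef>
#include <vector>

// Flat storage for what was weights[i][j][k]: one contiguous block per layer.
struct LayerWeights {
    std::size_t rows;          // j dimension: neurons in this layer
    std::size_t cols;          // k dimension: inputs to each neuron
    std::vector<double> data;  // rows * cols doubles, contiguous

    LayerWeights(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}

    double&       at(std::size_t j, std::size_t k)       { return data[j * cols + k]; }
    const double& at(std::size_t j, std::size_t k) const { return data[j * cols + k]; }
};
```

Indexing then becomes `at(j, k)` instead of `[j][k]`, and the whole layer's weights sit in one allocation that the hardware prefetcher can stream through.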
How big are those dimensions? If they are very large there may be caching considerations; e.g. you don't want that last subscript [k] to sweep over so much memory that it flushes the L1 cache. Sometimes breaking the loop up to process a smaller range of k indices at a time can help (strip mining).
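A sketch of what strip mining that k loop could look like, assuming a per-row update of the form `weights[j][k] += neuron_delta[j] * lower_outputs[k]` (both the names and the block size are placeholders you would have to adapt and tune):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Process k in small blocks and run the j loop inside each block, so the
// lower_outputs slice stays resident in L1 while every neuron j reuses it.
void blocked_update(std::vector<std::vector<double>>& weights_i,
                    const std::vector<double>& lower_outputs,
                    const std::vector<double>& neuron_delta)  // per-j factor, hypothetical
{
    const std::size_t num_j = weights_i.size();
    const std::size_t num_k = lower_outputs.size();
    const std::size_t block = 256;  // placeholder block size; tune for your L1

    for (std::size_t k0 = 0; k0 < num_k; k0 += block) {
        const std::size_t k_end = std::min(k0 + block, num_k);
        for (std::size_t j = 0; j < num_j; ++j)
            for (std::size_t k = k0; k < k_end; ++k)
                weights_i[j][k] += neuron_delta[j] * lower_outputs[k];
    }
}
```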
You can also experiment with unrolling your inner loops a little, e.g. doing four or eight operations per iteration, incrementing by 4 or 8 respectively, and handling any remainder in a separate loop. The compiler may already be doing this for you.
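For example, a 4x unroll with a scalar cleanup loop might look like this (the function shape is assumed, not taken from your code):

```cpp
#include <cstddef>

// Manual 4x unroll of a simple weight update, with a cleanup loop for the
// remainder. Check the generated assembly first; at -O2/-O3 the compiler may
// already be doing this.
void unrolled_update(double* weights_row, const double* lower_outputs,
                     std::size_t n, double scale)
{
    std::size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        weights_row[k + 0] += scale * lower_outputs[k + 0];
        weights_row[k + 1] += scale * lower_outputs[k + 1];
        weights_row[k + 2] += scale * lower_outputs[k + 2];
        weights_row[k + 3] += scale * lower_outputs[k + 3];
    }
    for (; k < n; ++k)  // remainder
        weights_row[k] += scale * lower_outputs[k];
}
```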
As others have mentioned, using SIMD (SSE/AVX) is probably where you can find the most gain. You can either use compiler intrinsics (the link is to the Visual Studio documentation, but gcc supports the same syntax) or write in assembly (inline or otherwise). As you mentioned, scaling across multiple cores is another direction, and OpenMP can help you do that without a lot of pain.
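To give a feel for it, here is what an AVX version of the same row update might look like, with a note on where an OpenMP pragma could go (the function shape and the update itself are assumptions based on your snippet):

```cpp
#include <cstddef>
#include <immintrin.h>  // AVX intrinsics; compile with -mavx (gcc/clang) or /arch:AVX (MSVC)

// AVX version of the weight update: 4 doubles per iteration, scalar tail.
void simd_update_row(double* weights_row, const double* lower_outputs,
                     std::size_t n, double scale)
{
    const __m256d vscale = _mm256_set1_pd(scale);
    std::size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        __m256d w = _mm256_loadu_pd(weights_row + k);
        __m256d o = _mm256_loadu_pd(lower_outputs + k);
        w = _mm256_add_pd(w, _mm256_mul_pd(vscale, o));
        _mm256_storeu_pd(weights_row + k, w);
    }
    for (; k < n; ++k)  // remainder
        weights_row[k] += scale * lower_outputs[k];
}

// Spreading the outer (per-neuron) loop across cores with OpenMP
// (compile with -fopenmp) could look like:
//   #pragma omp parallel for
//   for (std::ptrdiff_t j = 0; j < num_neurons; ++j)
//       simd_update_row(&weights[j * num_k], lower_outputs, num_k, neuron_delta[j]);
```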
Sometimes it is useful to generate an annotated assembly listing from your code to see where the compiler isn't doing such a great job.
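With gcc, for example, something like the following produces a listing you can read alongside the source (the file name is just a placeholder); Visual Studio's /FAs switch does roughly the same:

```
g++ -O3 -S -fverbose-asm -masm=intel backprop.cpp -o backprop.s
cl /O2 /FAs backprop.cpp
```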
This is an excellent general resource about optimization.