After profiling my backpropagation algorithm, I have learned that it is responsible for 60% of my computation time. Before I start looking at parallel alternatives
I'm not fond of valarray, but I have a hunch there is quite an opportunity for it here.
Blitz++ seems to have a better reputation around the web, but I don't know it :)
I was starting to work on a PoC myself, but there are too many missing bits of code:
void activate(const double input[]) { /* ??? */ }
const unsigned int n_layers_ns;
const unsigned int n_layers;
const unsigned int output_layer_s;
const unsigned int output_layer;
T/*double?*/ bias = 1/*.0f?*/;
const unsigned int config[];
double outputs[][];
double errors [][];
double weights[][][];
double deltas [][][];
Now it follows logically from the code that at least the first (rank-0) indices into the arrays are defined by the four missing constants. If these constants can be known at compile time, they would make great value template parameters:
template <unsigned int n_layers_ns, unsigned int n_layers>
struct Backprop {
void train(const double input[], const double desired[], const double learn_rate, const double momentum);
void activate(const double input[]) { }
enum _statically_known
{
output_layer = n_layers_ns - 1,
output_layer_s = n_layers - 1, // output_layer with input layer semantics (for config use only)
n_hidden_layers = output_layer - 1,
};
static constexpr double bias = 1.0; // constexpr needed for in-class init of a double
const unsigned int config[];
double outputs[3][50]; // if these dimensions could be statically known,
double errors[3][50]; // slap them in valarrays and
double weights[3][50][50]; // see what the compiler does with that!
double deltas[3][50][50]; //
};
template <unsigned int n_layers_ns, unsigned int n_layers>
void Backprop<n_layers_ns, n_layers>::train(const double input[], const double desired[], const double learn_rate, const double momentum) {
activate(input);
// calculated constants
const double inverse_momentum = 1.0 - momentum;
const unsigned int n_outputs = config[output_layer_s];
// calculate error for output layer
const double *output_layer_input = output_layer > 0 ? outputs[output_layer] : input; // input layer semantics
for (unsigned int j = 0; j < n_outputs; ++j) {
//errors[output_layer][j] = f'(outputs[output_layer][j]) * (desired[j] - outputs[output_layer][j]);
errors[output_layer][j] = gradient(output_layer_input[j]) * (desired[j] - output_layer_input[j]);
}
[... snip ...]
Notice how I reordered the statements in the first loop a bit to make the loop body trivial. Now I can imagine those last lines becoming:
// calculate error for output layer
const std::valarray<double> output_layer_input = output_layer > 0 ? outputs[output_layer] : input; // input layer semantics
errors[output_layer] = output_layer_input.apply(&gradient) * (desired - output_layer_input);
This will require the proper (g)slices to be set up for the inputs. I cannot work out how these would have to be dimensioned. The crux of the matter is that as long as these slice dimensions can be statically determined by the compiler, you have the potential for significant time savings, since the compiler can optimize these into vectorized operations, either on the FPU stack or using the SSE4 instruction set. I suppose you would declare your output something like this:
std::valarray<double> rawoutput(/*capacity?*/);
std::valarray<double> outputs = rawoutput[std::slice(0, n_outputs, n_layers)]; // guesswork
(I expect weights and deltas would have to become gslices because AFAICT they are 3-dimensional)
I realized that there probably won't be much gain if the array ranks (dimensions) aren't optimally ordered (e.g. if the first rank in a valarray is relatively small, say 8). This could hamper vectorization because the participating elements could be scattered in memory, whereas I suppose optimization requires them to be adjacent.
In this light it is important to realize that the 'optimal' ordering of the ranks is ultimately dependent on the access patterns alone (so, profile and inspect again).
Also, the opportunity for optimization might be hampered by unfortunate memory alignment [1]. With that in mind, you might want to switch the order of (val)array ranks and round rank dimensions to the nearest power of 2 (or, more practically, multiples of 32 bytes).
If all this actually makes a big impact (profile/inspect generated code first!), I would imagine support for it would be worth the effort.
If the order of execution is not crucial (i.e. the relative orders of magnitude of factors are very similar), instead of
inverse_momentum * (learn_rate * ???)
you could take
(inverse_momentum * learn_rate) * ???
and precalculate the first subproduct. However, from the fact that it is explicitly ordered this way, I'm guessing that this would introduce more numerical noise.
[1] disclaimer: I haven't actually done any analysis on that, I'm just throwing it out there so you don't miss the 'though joint' (how's that for Engrish)