How to implement the Softmax derivative independently from any loss function?


夕颜 2021-02-05 16:04

For a neural network library, I implemented some activation functions and loss functions, along with their derivatives. They can be combined arbitrarily, and the derivative at the output …
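For reference, the derivative being asked about can be written down without mentioning any particular loss. The following is only a sketch: the symbols are assumptions that do not appear in the post ($z$ for the pre-softmax inputs, $s$ for the softmax outputs, $g$ for whatever gradient arrives from the layer or loss that follows).

```latex
s_i = \frac{e^{z_i}}{\sum_k e^{z_k}},
\qquad
\frac{\partial s_j}{\partial z_i} = s_j\,(\delta_{ij} - s_i)

\frac{\partial L}{\partial z_i}
  = \sum_j g_j \,\frac{\partial s_j}{\partial z_i}
  = s_i \Bigl( g_i - \sum_j g_j\, s_j \Bigr),
\qquad g_j = \frac{\partial L}{\partial s_j}
```

The loss enters only through $g$, so the softmax backward step needs just its own output $s$ and the incoming gradient.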

4 answers

佛祖请我去吃肉 (OP)

2021-02-05 16:54

Here is a vectorized C++ version using intrinsics (22 times (!) faster than the non-SSE version):

#include <cassert>
#include <cstddef>
#include <immintrin.h>

// How many floats fit into one __m256 "group" (256 bits / 32 bits per float).
// Used by vectors and matrices, to ensure their dimensions are appropriate for
// intrinsics.
// Otherwise, consecutive rows of matrices will not be 32-byte aligned, and
// aligned loads and stores on them will be incorrect.
#define F_MULTIPLE_OF_M256 8


// Check quickly that your row length is a multiple of the __m256 width.
// You can undefine ASSERT_THE_M256_MULTIPLES to save performance, after everything was verified to be correct.
#define ASSERT_THE_M256_MULTIPLES
#ifdef ASSERT_THE_M256_MULTIPLES
    #define assert_is_m256_multiple(x)  assert( ((x) % F_MULTIPLE_OF_M256) == 0 )
#else
    // No space before the parameter list, otherwise this becomes an object-like macro.
    #define assert_is_m256_multiple(x)  ((void)0)
#endif


// usually used at the end of our Reduce functions,
// where the final __m256 mSum needs to be collapsed into 1 scalar.
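static inline float slow_hAdd_ps(__m256 x){
    const float *sumStart = reinterpret_cast<const float*>(&x);
    float sum = 0.0f;

    // The source listing is cut off after "i"; summing the
    // F_MULTIPLE_OF_M256 lanes is an assumed reconstruction.
    for(size_t i=0; i<F_MULTIPLE_OF_M256; ++i){
        sum += sumStart[i];
    }
    return sum;
}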





If for some reason somebody wants a simple (non-SSE) version, here it is:

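inline static void SoftmaxGrad_fromResult_nonSSE(const float* softmaxResult,  
                                                 const float *gradFromAbove,  //<--gradient vector, flowing into us from the above layer
                                                 float *gradOutput,  
                                                 size_t count ){
    // Every pre-softmax element in a layer contributed to the softmax of every other element
    // (it went into the denominator). So the gradient is distributed from every post-softmax element to every pre-softmax element.
    // The source is cut off after "i"; the loop bound and body below are an assumed reconstruction.
    for(size_t i=0; i<count; ++i){
        float sum = 0.0f;
        for(size_t j=0; j<count; ++j){
            sum += gradFromAbove[j] * softmaxResult[j] * ((i==j ? 1.0f : 0.0f) - softmaxResult[i]);
        }
        gradOutput[i] = sum;
    }
}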
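The comments above describe a double loop in which every post-softmax element sends gradient to every pre-softmax element. That sum can be collapsed to one dot product plus one pass, which is what the minimal sketch below does. It is not the answerer's code: `softmaxSketch` and `softmaxGradFromResultLinear` are hypothetical names introduced here for illustration, and the argument order only mirrors `SoftmaxGrad_fromResult_nonSSE`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original answer): numerically stable softmax.
// Assumes z is non-empty.
static std::vector<float> softmaxSketch(const std::vector<float>& z) {
    float maxZ = z[0];
    for (float v : z) maxZ = std::fmax(maxZ, v);  // subtract the max so exp() cannot overflow

    std::vector<float> s(z.size());
    float denom = 0.0f;
    for (std::size_t i = 0; i < z.size(); ++i) {
        s[i] = std::exp(z[i] - maxZ);
        denom += s[i];
    }
    for (float& v : s) v /= denom;
    return s;
}

// Hypothetical name. Computes, for every i,
//   gradOutput[i] = sum_j gradFromAbove[j] * softmaxResult[j] * ((i == j) - softmaxResult[i])
// in the algebraically equivalent form
//   gradOutput[i] = softmaxResult[i] * (gradFromAbove[i] - dot(gradFromAbove, softmaxResult)).
static void softmaxGradFromResultLinear(const float* softmaxResult,
                                        const float* gradFromAbove,
                                        float* gradOutput,
                                        std::size_t count) {
    float dot = 0.0f;
    for (std::size_t j = 0; j < count; ++j) {
        dot += gradFromAbove[j] * softmaxResult[j];
    }
    for (std::size_t i = 0; i < count; ++i) {
        gradOutput[i] = softmaxResult[i] * (gradFromAbove[i] - dot);
    }
}
```

Because the function takes only the softmax output and the incoming gradient, it can be placed in front of any loss. The collapsed form does O(count) work instead of O(count²), and both loops are simple enough that a compiler may vectorize them on its own.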