ARM NEON: How to implement a 256-byte lookup table

Question

I am porting some code I wrote to NEON using inline assembly.

One of the things I need is to map byte values in the range [0..127] to other byte values, stored in a table, which span the full range [0..255].

The table is short, but the math behind it is expensive, so I don't think it is worth computing the values on the fly each time. I want to try lookup tables instead.

I have used VTBL for a 32-byte table, and it works as expected.

For the full range, one idea would be to first determine which sub-range each source value falls in and then do separate lookups (i.e., have four 32-byte lookup tables).

My question is: is there a more efficient way to do it?

EDIT

After some trials, I have done it with four lookups and, although the instructions are not scheduled yet, I am happy with the results. I leave part of the inline assembly here in case someone finds it useful or thinks it can be improved.

// The original data is in d0
// d1 holds the value #32 in every byte lane
// d6,d7,d8,d9 hold the images for the values [0..31]

    // First look up the images for [0..31]. Out-of-range indices yield 0
    "vtbl.u8 d2,{d6,d7,d8,d9},d0    \n\t"

    // Now subtract #32 (d1) from d0 and look up the images for [32..63], previously loaded in d10,d11,d12,d13
    "vsub.u8 d0,d0,d1\n\t"
    "vtbl.u8 d3,{d10,d11,d12,d13},d0    \n\t"

    // Do the same to get the images for [64..95]
    "vsub.u8 d0,d0,d1\n\t"
    "vtbl.u8 d4,{d14,d15,d16,d17},d0    \n\t"

    // Last step: images for [96..127]
    "vsub.u8 d0,d0,d1\n\t"
    "vtbl.u8 d5,{d18,d19,d20,d21},d0    \n\t"

    // Now add everything. No need to saturate, since at most one partial result is nonzero in each lane
    "vadd.u8 d2,d2,d3\n\t"
    "vadd.u8 d4,d4,d5\n\t"
    "vadd.u8 d2,d2,d4\n\t"   // Leave the result in d2

Source: https://stackoverflow.com/questions/22158186/arm-neon-how-to-implement-a-256bytes-look-up-table

Tags

optimization

assembly

arm

neon