Optimizing RGBA8888 to RGB565 conversion with NEON

蹲街弑〆低调 提交于 2019-12-03 06:54:56

Here you are, hand-optimized NEON implementation ready for XCode :

/* IT DOESN'T WORK!!! USE THE NEXT VERSION BELOW.
 * BGRA2RGB565.s
 *
 * Created by Jake "Alquimista" Lee on 11. 11. 1..
 * Copyright 2011 Jake Lee. All rights reserved.
 */


    .align 2
    .globl _bgra2rgb565_neon
    .private_extern _bgra2rgb565_neon

// unsigned int * bgra2rgb565_neon(unsigned int * pDst, unsigned int * pSrc, unsigned int count);


//ARM
pDst        .req    r0
pSrc        .req    r1
count       .req    r2

//NEON
blu         .req    d16
grn         .req    d17
red         .req    d18
alp         .req    d19
rg          .req    red
gb          .req    blu

_bgra2rgb565_neon:
    pld     [pSrc]
    tst     count, #0x7
    movne   r0, #0
    bxne    lr

loop:
    pld     [pSrc, #32]
    vld4.8  {blu, grn, red, alp}, [pSrc]!
    subs    count, count, #8
    vshr.u8 red, red, #3
    vext.8  rg, grn, red, #5
    vshr.u8 grn, grn, #2
    vext.8  gb, blu, grn, #3
    vst2.8  {gb, rg}, [pDst]!
    bgt     loop

    bx      lr

This version will be many times faster than what you suggested :

  • increased cache hit rate via PLD

  • conversion to "long" not necessary

  • fewer instructions within the loop

There is still some room for optimizations though, you could modify the loop so that it converts 16 pixels per iteration instead of 8. Then you can schedule the instructions to avoid the two stalls completely (which is simply not possible in this 8/iteration version above) and benefit from NEON's dual-issue capability in addition.

I didn't do this because it would make the code hard to understand.

It's important to know what VEXT is supposed to do.

Now it's up to you. :)

I verified this code to be properly compiled under Xcode. Although I'm pretty sure it works correctly as well, I cannot guarantee this since I don't have the test environment. In case of malfunctioning, please let me know. I'll correct it accordingly then.

cya

==============================================================================

Well, here is the improved version.

Due to the nature of the VSRI instruction not allowing two operands other than the target, it was not possible to create a more robust one regarding the register assignment.

Please check the image format of your source image. (exact byte order of the elements)

If it's not B, G, R, A, which is the default and native one on iOS, your application will suffer heavily from internal conversions by iOS.

If it's absolutely not possible to change this for whatever the reason, let me know. I'll write a new version matching it.

PS : I forgot to remove the underscore at the start of the function prototype. Now it's gone.

/*
 * BGRA2RGB565.s
 *
 * Created by Jake "Alquimista" Lee on 11. 11. 1..
 * Copyright 2011 Jake Lee. All rights reserved.
 *
 * Version 1.1
 * - bug fix
 *
 * Version 1.0
 * - initial release
 */


    .align 2
    .globl _bgra2rgb565_neon
    .private_extern _bgra2rgb565_neon

// unsigned int * bgra2rgb565_neon(unsigned int * pDst, unsigned int * pSrc, unsigned int count);


//ARM
pDst        .req    r0
pSrc        .req    r1
count       .req    r2

//NEON
blu         .req    d16
grn         .req    d17
red         .req    d18
alp         .req    d19

gb          .req    grn
rg          .req    red

_bgra2rgb565_neon:
    pld     [pSrc]
    tst     count, #0x7
    movne   r0, #0
    bxne    lr

.loop:
    pld     [pSrc, #32]
    vld4.8  {blu, grn, red, alp}, [pSrc]!
    subs    count, count, #8

    vsri.8  red, grn, #5
    vshl.u8 gb, grn, #3
    vsri.8  gb, blu, #3

    vst2.8  {gb, rg}, [pDst]!
    bgt     .loop

    bx      lr

If you are on iOS or OS X, then you may be delighted to discover vImageConvert_RGBA8888toRGB565() and friends, in Accelerate.framework. This function rounds the 8-bit values to nearest 565 value.

For even better dithering, the quality of which is nearly indistinguishable from 8-bit color, try vImageConvert_AnyToAny():

vImage_CGImageFormat RGBA8888Format = 
{
    .bitsPerComponent = 8,
    .bitsPerPixel = 32,
    .bitmapInfo = kCGBitmapByteOrderDefault | kCGImageAlphaNoneSkipLast,
    .colorSpace = NULL,  // sRGB or substitute your own in
};

vImage_CGImageFormat RGB565Format = 
{
    .bitsPerComponent = 5,
    .bitsPerPixel = 16,
    .bitmapInfo = kCGBitmapByteOrder16Little | kCGImageAlphaNone,
    .colorSpace = RGBA8888Format.colorSpace,  
};


err = vImageConverterRef converter = vImageConverter_CreateWithCGImageFormat(
         &RGBA8888Format, &RGB565Format, NULL, kvImageNoFlags, &err );

err = vImageConvert_AnyToAny( converter, &src, &dest, NULL, kvImageNoFlags );

Either of these approaches will be vectorized and multithreaded for best performance.

You might want to use vld4q_u8() instead of vld4_u8() and adjust the rest of your code accordingly. It's hard to tell where the problem might be, but the assembler doesn't look too bad otherwise.

(I'm not familiar with NEON, nor deeply with the memory system of the Ipad2, but this is what we used to do with 88110 pixel-ops, which were an early precursor to today's SIMD extensions)

How big is the memory latency?

Could you hide it by unrolling the inner loop and running the NEON instructions on the "previous" values while the ARM pulls the "next" values from memory? A brief scan of the NEON manual implies you can run ARM and NEON instructions in parallel.

I don't think converting vld4_u8 to vld4q_u8 would lead to any bettering of the performance.

The code seems simple enough. I am not good at ASM and so it would take some time to look into it deeply.

The neon seems simple enough. But I am not quiet sure about r5_g6_b5 |= g16 being used instead of vorrq_u16

Please have a look at the optimization level too. As far as what I heard neon code optimization level goes to a maximum of 1. So the performance may differ when default optimization is being taken into account for both the reference code and neon code, as the level of optimization of reference by DEFAULT may be different.

I doesnt find any area in neon that can better the current code.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!