Question
I'm working on an embedded device that does not support unaligned memory accesses.
For a video decoder I have to process pixels (one byte per pixel) in 8x8 pixel blocks. The device has some SIMD processing capabilities that allow me to work on 4 bytes in parallel.
The problem is that the 8x8 pixel blocks aren't guaranteed to start on an aligned address, and the functions need to read/write up to three of these 8x8 blocks.
How would you approach this if you want very good performance? After a bit of thinking I came up with the following three ideas:
1. Do all memory accesses as bytes. This is the easiest way to do it, but it is slow and does not work well with the SIMD capabilities (it's what I currently do in my reference C code).
2. Write four copy functions (one for each alignment case) that load the pixel data via two 32-bit reads, shift the bits into the correct position and write the data to some aligned chunk of scratch memory. The video processing functions can then use 32-bit accesses and SIMD. Drawback: the CPU will have no chance to hide the memory latency behind the processing.
3. Same idea as above, but instead of writing the pixels to scratch memory, do the video processing in place. This may be the fastest way, but the number of functions that I have to write for this approach is high (around 60, I guess).
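To make the second idea concrete, here is a minimal C sketch of the "two aligned 32-bit reads plus shifts" load. It rests on several assumptions that are not in the original post: a little-endian CPU, a pixel buffer that is really backed by 32-bit words (so the word reads are legal C), and one readable word after the last pixel touched. The names load_u32_unaligned and fetch_block_8x8 are made up for illustration. A hand-written version would use one specialized routine per alignment instead of computing the shift at run time.

```c
#include <stddef.h>
#include <stdint.h>

/* Load 4 pixels from an arbitrary byte address using only aligned 32-bit
 * reads. ASSUMPTION: little-endian CPU (swap the shift directions for
 * big-endian). May read the aligned word following the requested bytes. */
static uint32_t load_u32_unaligned(const uint8_t *p)
{
    uintptr_t addr = (uintptr_t)p;
    unsigned shift = (unsigned)(addr & 3u) * 8u;   /* 0, 8, 16 or 24 bits */
    const uint32_t *w = (const uint32_t *)(addr & ~(uintptr_t)3);

    if (shift == 0)
        return w[0];   /* already aligned; also avoids a shift by 32 */
    return (w[0] >> shift) | (w[1] << (32u - shift));
}

/* Copy one 8x8 block into aligned scratch memory (16 words).
 * `stride` is the distance in bytes between two image rows. */
static void fetch_block_8x8(uint32_t dst[16], const uint8_t *src, size_t stride)
{
    for (int row = 0; row < 8; ++row) {
        dst[2 * row]     = load_u32_unaligned(src);
        dst[2 * row + 1] = load_u32_unaligned(src + 4);
        src += stride;
    }
}
```

After fetch_block_8x8 returns, the processing code can treat dst as sixteen aligned words, two per pixel row.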
Btw: I will have to write all functions in assembler because the compiler generates horrible code when it comes to the SIMD extension.
Which road would you take, or do you have another idea how to approach this?
Answer 1:
You should first break your code into fetch/processing sections.
The fetch code should copy into a working buffer and handle two cases: aligned memory, where you should be able to copy using the SIMD registers, and non-aligned memory, where you need to copy byte by byte (if your platform can't do unaligned access and your source/dest have different alignments, then this is the best you can do).
Your processing code can then be SIMD with the guarantee of working on aligned data. For any real degree of processing doing a copy+process will definitely be faster than non-SIMD operations on unaligned data.
Assuming your source & dest are the same, a further optimization would be to only use the working buffer if the source is unaligned, and do the processing in-place if the memory's aligned. The benefits of this will depend upon the characteristics of your data.
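A rough C sketch of this fetch/process split follows. The kernel process_aligned is a hypothetical stand-in (it just inverts the pixels, four per word) for whatever the real SIMD routine does, and process_block is an invented name. The in-place path assumes the pixel buffer is backed by 32-bit words, so that accessing it through a uint32_t pointer is valid.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real SIMD kernel: inverts 4 pixels per word.
 * `blk` must be 32-bit aligned; `row_words` is the row distance in words. */
static void process_aligned(uint32_t *blk, size_t row_words)
{
    for (int row = 0; row < 8; ++row) {
        blk[0] = ~blk[0];
        blk[1] = ~blk[1];
        blk += row_words;
    }
}

/* Process one 8x8 block whose source and destination are the same memory. */
static void process_block(uint8_t *pixels, size_t stride)
{
    /* Fast path: block start and stride are both multiples of 4. */
    if ((((uintptr_t)pixels | (uintptr_t)stride) & 3u) == 0) {
        process_aligned((uint32_t *)pixels, stride / 4);   /* in place */
        return;
    }

    /* Slow path: fetch byte by byte into an aligned working buffer. */
    uint32_t scratch[16];
    uint8_t *s = (uint8_t *)scratch;
    for (int row = 0; row < 8; ++row)
        for (int col = 0; col < 8; ++col)
            s[8 * row + col] = pixels[(size_t)row * stride + col];

    process_aligned(scratch, 2);   /* 2 words per row in the scratch buffer */

    /* Write the result back, again byte by byte. */
    for (int row = 0; row < 8; ++row)
        for (int col = 0; col < 8; ++col)
            pixels[(size_t)row * stride + col] = s[8 * row + col];
}
```

The processing routine is written once and only ever sees aligned data; only the fetch and store steps know about alignment.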
Depending on your architecture you may get further benefits by prefetching data before processing. This is where you can issue instructions to fetch areas of memory into the cache before they're needed, so you would issue a fetch for the next block before processing the current.
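As an illustration only, the loop below issues a prefetch for the next block before working on the current one. It uses __builtin_prefetch, which is a GCC/Clang builtin; whether the target device has a data cache or a prefetch instruction at all is an assumption, and in hand-written assembler the device's own prefetch instruction would take its place. sum_block is a made-up placeholder kernel.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p), 0, 3)   /* read access, keep in cache */
#else
#define PREFETCH(p) ((void)(p))                     /* no-op on other compilers */
#endif

/* Hypothetical per-block kernel: sums the 64 pixels of one aligned block. */
static uint32_t sum_block(const uint32_t blk[16])
{
    uint32_t sum = 0;
    for (int i = 0; i < 16; ++i) {
        uint32_t w = blk[i];
        sum += (w & 0xFFu) + ((w >> 8) & 0xFFu) + ((w >> 16) & 0xFFu) + (w >> 24);
    }
    return sum;
}

/* Process `count` consecutive aligned blocks of 16 words each. */
static uint32_t sum_blocks(const uint32_t *blocks, size_t count)
{
    uint32_t total = 0;
    for (size_t i = 0; i < count; ++i) {
        if (i + 1 < count)
            PREFETCH(blocks + 16 * (i + 1));   /* start fetching the next block */
        total += sum_block(blocks + 16 * i);   /* work on the current block */
    }
    return total;
}
```

Whether this helps depends on how much work is done per block; it is worth measuring with and without the prefetch.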
Answer 2:
You can use memcpy (which, if I recall, can be optimized to perform word copies where possible) to copy to an aligned data structure (e.g. something allocated on the stack or from malloc). Then perform your processing on that aligned data structure.
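A small sketch of the memcpy approach, assuming an 8x8 block of one-byte pixels and a row stride given in bytes. The type block8x8 and the helpers load_block and store_block are invented names; the union gives the processing code a 32-bit-aligned view of the same 64 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Aligned working copy of one 8x8 block: 64 bytes, also viewable as 16 words. */
typedef union {
    uint32_t words[16];
    uint8_t  bytes[64];
} block8x8;

/* Copy 8 rows of 8 pixels from a possibly unaligned source into the block. */
static void load_block(block8x8 *dst, const uint8_t *src, size_t stride)
{
    for (int row = 0; row < 8; ++row)
        memcpy(&dst->bytes[8 * row], src + (size_t)row * stride, 8);
}

/* Copy the block back to a possibly unaligned destination. */
static void store_block(uint8_t *dst, size_t stride, const block8x8 *src)
{
    for (int row = 0; row < 8; ++row)
        memcpy(dst + (size_t)row * stride, &src->bytes[8 * row], 8);
}
```

Between load_block and store_block, the code can work on words[] with 32-bit accesses regardless of where the original pixels live.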
Most likely, though, you'd want to handle things in your processor's registers and not in memory. How you'd approach your task depends on the capabilities of the hardware (e.g. can a 32-bit register be split into four 8-bit ones? What registers do the SIMD operations operate on?). If you're going the simple route, you can call a small loader function that performs the unaligned read(s) for you.
Answer 3:
Align the data first, and then take the aligned-SIMD approach.
This is less work than option 3, and with luck your code will be top-speed 25% of the time (i.e. the already-aligned case). You can happily re-use the code in future in situations where you know the input will be properly aligned.
Only if this doesn't work to your satisfaction should you consider hardcoding all four alignment possibilities into your functions.
Answer 4:
I'd go with option 1 until you know that it's too slow (slow is OK, too slow is bad).
Answer 5:
General advice: why don't you go with something that sounds reasonable (like #2) and then measure the performance? If it's not acceptable, you can go back to the drawing board.
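For the measuring step, a very simple timing harness could look like the sketch below. It uses the standard clock() function, which may be missing or too coarse on an embedded target; a hardware timer or cycle counter would be the usual substitute there. The names time_block_fn and invert_bytewise are hypothetical, and the byte-wise kernel stands in for option 1 from the question as a baseline.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Signature shared by all candidate implementations of one block operation. */
typedef void (*block_fn)(uint8_t *pixels, size_t stride);

/* Returns the average CPU time in seconds per call of `fn`. */
static double time_block_fn(block_fn fn, uint8_t *pixels, size_t stride,
                            unsigned iterations)
{
    clock_t start = clock();
    for (unsigned i = 0; i < iterations; ++i)
        fn(pixels, stride);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / iterations;
}

/* Baseline in the spirit of option 1: byte accesses only, works for any
 * alignment. Inverts every pixel of one 8x8 block. */
static void invert_bytewise(uint8_t *pixels, size_t stride)
{
    for (int row = 0; row < 8; ++row)
        for (int col = 0; col < 8; ++col)
            pixels[(size_t)row * stride + col] ^= 0xFF;
}
```

Timing each candidate through the same block_fn signature, with both aligned and unaligned start addresses, gives numbers to compare before committing to a large set of assembler routines.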
Surely handcrafting 60-odd functions in assembler before measuring would count as "premature optimization". :)
Source: https://stackoverflow.com/questions/375259/unaligned-memory-accesses