问题
I have performance critical code written for multiple CPUs. I detect CPU at run-time and based on that I use appropriate function for the detected CPU. So, now I have to use function pointers and call functions using these function pointers:
void do_something_neon(void);
void do_something_armv6(void);
void (*do_something)(void);
if(cpu == NEON) {
do_something = do_something_neon;
}else{
do_something = do_something_armv6;
}
//Use function pointer:
do_something();
...
Not that it matters, but I'll mention that I have optimized functions for different cpu's: armv6 and armv7 with NEON support. The problem is that by using function pointers in many places the code become slower and I'd like to avoid that problem.
Basically, at load time linker resolves relocs and patches code with function addresses. Is there a way to control better that behavior?
Personally, I'd propose two different ways to avoid function pointers: create two separate .so (or .dll) for cpu dependent functions, place them in different folders and based on detected CPU add one of these folders to the search path (or LD_LIB_PATH). The, load main code and dynamic linker will pick up required dll from the search path. The other way is to compile two separate copies of library :) The drawback of the first method is that it forces me to have at least 3 shared objects (dll's): two for the cpu dependent functions and one for the main code that uses them. I need 3 because I have to be able to do CPU detection before loading code that uses these cpu dependent functions. The good part about the first method is that the app won't need to load multiple copies of the same code for multiple CPUs, it will load only the copy that will be used. The drawback of the second method is quite obvious, no need to talk about it.
I'd like to know if there is a way to do that without using shared objects and manually loading them at runtime. One of the ways would be some hackery that involves patching code at run-time, it's probably too complicated to get it done properly). Is there a better way to control relocations at load time? Maybe place cpu dependent functions in different sections and then somehow specify what section has priority? I think MAC's macho format has something like that.
ELF-only (for arm target) solution is enough for me, I don't really care for PE (dll's).
thanks
回答1:
You may want to lookup the GNU dynamic linker extension STT_GNU_IFUNC
. From Drepper's blog when it was added:
Therefore I’ve designed an ELF extension which allows to make the decision about which implementation to use once per process run. It is implemented using a new ELF symbol type (STT_GNU_IFUNC). Whenever the a symbol lookup resolves to a symbol with this type the dynamic linker does not immediately return the found value. Instead it is interpreting the value as a function pointer to a function that takes no argument and returns the real function pointer to use. The code called can be under control of the implementer and can choose, based on whatever information the implementer wants to use, which of the two or more implementations to use.
Source: http://udrepper.livejournal.com/20948.html
Nonetheless, as others have said, I think you're mistaken about the performance impact of indirect calls. All code in shared libraries will be called via a (hidden) function pointer in the GOT and a PLT entry that loads/calls that function pointer.
回答2:
For the best performance you need to minimize the number of indirect calls (through pointers) per second and allow the compiler to optimize your code better (DLLs hamper this because there must be a clear boundary between a DLL and the main executable and there's no optimization across this boundary).
I'd suggest doing these:
- moving as much of the main executable's code that frequently calls DLL functions into the DLL. That'll minimize the number of indirect calls per second and allow for better optimization at compile time too.
- moving almost all your code into separate CPU-specific DLLs and leaving to main() only the job of loading the proper DLL OR making CPU-specific executables w/o DLLs.
回答3:
Here's the exact answer that I was looking for.
GCC's __attribute__((ifunc("resolver")))
It requires fairly recent binutils.
There's a good article that describes this extension: Gnu support for CPU dispatching - sort of...
回答4:
Lazy loading ELF symbols from shared libraries is described in section 1.5.5 of Ulrich Drepper's DSO How To (updated 2011-12-10). For ARM it is described in section 3.1.3 of ELF for ARM.
EDIT: With the STT_GNU_IFUNC extension mentioned by R. I forgot that was an extension. GNU Binutils supports that for ARM, apparently since March 2011, according to changelog.
If you want to call functions without the indirection of the PLT, I suggest function pointers or per-arch shared libraries inside which function calls don't go through PLTs (beware: calling an exported function is through the PLT).
I wouldn't patch the code at runtime. I mean, you can. You can add a build step: after compilation disassemble your binaries, find all offsets of calls to functions that have multi-arch alternatives, build table of patch locations, link that into your code. In main, remap the text segment writeable, patch the offsets according to the table you prepared, map it back to read-only, flush the instruction cache, and proceed. I'm sure it will work. How much performance do you expect to gain by this approach? I think loading different shared libraries at runtime is easier. And function pointers are easier still.
来源:https://stackoverflow.com/questions/9651104/cpu-dependent-code-how-to-avoid-function-pointers