I was wondering if anyone had any hard numbers on ARM vs Thumb code performance on iPhone 3GS. Specifically for non-floating-point (VFP or NEON) code - I'm aware of the issues with floating-point performance in Thumb mode.
I don't know about the iPhone, but a blanket statement that Thumb is slower than ARM is not correct at all. Given 32-bit-wide, zero-wait-state memory, Thumb will be a little slower, numbers like 5% or 10%. Now if it is Thumb-2, that is a different story: it is said that Thumb-2 can run faster. I don't know what the iPhone has; my guess is that it is not Thumb-2.
If you are not running out of zero-wait-state, 32-bit memory, then your results will vary. One big factor is 32-bit-wide memory. If you are running on a 16-bit-wide bus, as on the Game Boy Advance family, with some wait states on that memory or ROM, then Thumb can easily outrun ARM even though it takes more Thumb instructions to perform the same task.
Test your code! It is not hard to invent a test that produces whatever result you are interested in. It is as easy to show ARM blowing away Thumb as it is to show Thumb blowing away ARM. Who cares what the Dhrystone numbers are; what matters is how fast YOUR code runs TODAY.
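For example, here is a minimal timing sketch (just an illustration; work_under_test() is a placeholder for the routine you actually care about). Build it once as ARM and once as Thumb with otherwise identical options and compare the numbers:

    /* bench.c - minimal timing sketch; assumes a hosted C environment.
       Replace work_under_test() with your real code, then build it
       both ways (e.g. -marm vs -mthumb) and compare. */
    #include <stdio.h>
    #include <time.h>

    static void work_under_test(void)
    {
        /* placeholder workload */
        volatile int x = 0;
        int i;
        for (i = 0; i < 100000; i++)
            x += i;
    }

    int main(void)
    {
        int i;
        clock_t start = clock();
        for (i = 0; i < 1000; i++)
            work_under_test();
        clock_t end = clock();
        printf("elapsed: %.3f s\n", (double)(end - start) / CLOCKS_PER_SEC);
        return 0;
    }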
What I have found over the years of testing ARM code performance is that your code and your compiler are the big factors. In theory, Thumb is a few percent slower because it takes a few percent more instructions to perform the same task. But did you know that your favorite compiler could be horrible, and that by simply switching compilers you could run several times faster (gcc falls into that category)? Or you can use the same compiler and mix up the optimization options. Either way, you can dwarf the ARM/Thumb difference by being smart about using the tools. You probably know this, but you would be surprised how many people think that the one way they know how to compile code is the only way, and that the only way to get better performance is to throw more memory or other hardware at the problem.
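As a rough illustration, here is the same source built a few different ways (the toolchain names are placeholders for whatever ARM cross compilers you actually have; the flags are standard GCC/Clang ARM options):

    # hypothetical examples: one source file, two compilers, several option sets
    arm-none-eabi-gcc -O2 -mthumb bench.c -o bench_gcc_thumb
    arm-none-eabi-gcc -O2 -marm   bench.c -o bench_gcc_arm
    arm-none-eabi-gcc -O3 -funroll-loops -marm bench.c -o bench_gcc_arm_o3
    clang --target=arm-none-eabi -O2 -marm bench.c -o bench_clang_arm

Time each binary on the real hardware; the ranking often moves more with the compiler and options than with ARM vs Thumb.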
If you are on the iPhone, I hear those folks are using LLVM? I like the LLVM concept in many ways and am eager to use it as my daily driver when it matures, but I found it produced code that was 10-20% (or much more) slower for the particular task I was doing. I was in ARM mode; I did not try Thumb mode, and I had the L1 and L2 caches on. Had I tested without the caches, to truly compare Thumb to ARM, I would probably have seen Thumb a few percent slower. But if you think about it (which I wasn't interested in at the time), you can cache twice as much Thumb code as ARM code, which MIGHT imply that even though there is a few percent more code overall for the task, by caching significantly more of it and reducing the average fetch time, Thumb can be noticeably faster. I may have to go try that.
If you are using LLVM, you have the additional problem of multiple places to perform optimizations. Going from C to bytecode you can optimize; you can then optimize the bytecode itself; you can then merge all of your bytecode and optimize that as a whole; and then, going from bytecode to assembler, you can optimize again. If you had only 3 source files, and assumed there were only two optimization levels per opportunity (don't optimize or do optimize), with gcc you would have 8 combinations to test; with LLVM the number of experiments is almost an order of magnitude higher, more than you can really run, hundreds to thousands. For the one test I was running, NOT optimizing on the C-to-bytecode step, NOT optimizing the bytecode files while separate, but optimizing after merging the bytecode files into one big(ger) one, and then having llc optimize on the way to ARM, produced the best results.
Bottom line... test, test, test.
EDIT:
I have been using the word bytecode; I think the correct term is bitcode in the LLVM world. The code in the .bc files is what I mean...
If you are going from C to ARM using LLVM, there is bitcode (bc) in the middle. There are command-line options for optimizing on the C-to-bc step. Once you have bc, you can optimize per file, bc to bc. If you choose, you can merge two or more bc files into bigger bc files, or just turn all the files into one big bc file. Then each of these combined files can also be optimized.
My theory, which only has a couple of test cases behind it so far, is that if you do not do any optimization until you have the entire program/project in one big bc file, the optimizer has the maximum amount of information with which to do its job. So that means: go from C to bc with no optimization; then merge all the bc files into one big bc file; once you have the whole thing as one big bc file, let the optimizer perform its optimization step, maximizing the information and hopefully the quality of the optimization; then go from the optimized bc file to ARM assembler. The default setting for llc is with optimization on, and you do want to allow that optimization, as it is the only step that knows how to optimize for the target. The bc-to-bc optimizations are generic and not target specific (AFAIK).
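As a concrete sketch of that flow with the standalone LLVM tools (exact option spellings and the target triple vary between LLVM versions and setups, so treat this as illustrative rather than exact):

    # 1. C to bitcode, no optimization yet
    clang -O0 -emit-llvm -c a.c -o a.bc
    clang -O0 -emit-llvm -c b.c -o b.bc
    clang -O0 -emit-llvm -c c.c -o c.bc

    # 2. merge all the bitcode into one big bc file
    llvm-link a.bc b.bc c.bc -o whole.bc

    # 3. optimize the whole program in one pass (bc to bc)
    opt -O2 whole.bc -o whole-opt.bc

    # 4. bitcode to ARM assembler; llc's target-aware optimization stays on
    llc -O2 -mtriple=armv7-apple-darwin whole-opt.bc -o whole.s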
You still have to test, test, test. Go ahead and experiment with optimizations between the steps and see if it makes your program run faster or slower.
Thumb code will essentially always be slower than equivalent ARM. The one case where Thumb code can be a big performance win is if it makes the difference between your code fitting into on-chip memory or cache.
It's hard to give exact numbers on performance differences, because it's entirely dependent on what your code actually does.
You can set per-architecture compiler flags in Xcode, which avoids breaking the simulator build. See the Xcode build setting documentation.
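For instance, a hypothetical .xcconfig fragment (the setting name and the conditional-setting syntax may differ between Xcode versions) that compiles the device slices as ARM while leaving the i386 simulator build untouched:

    // build ARM (not Thumb) code for device architectures only
    GCC_THUMB_SUPPORT[arch=armv6] = NO
    GCC_THUMB_SUPPORT[arch=armv7] = NO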
See this PDF for ARM/Thumb performance, code size, and power consumption trade-offs:
Profile Guided Selection of ARM and Thumb Instructions - Department of Computer Science, The University of Arizona, by Rajiv Gupta