Here is a primary source link to Ooura's numerical software:
http://www.kurims.kyoto-u.ac.jp/~ooura/
I have been using many of Ooura's FFTs over the years, I should send him a "domo" at the very least, and I use his real radix-4 in several iPad and iPhone applications under development. I did translate the code to operate with 32-bit single precision for performance on ARM. Looking at the assembly produced with XCode 3.2.2, it vectorizes with NEON SIMD instructions very nicely. I was half disappointed actually, as I was willing to vectorize the code a bit myself for even more performance. These optimizations cannot be had without first translating the FFT to single precision obviously.
While I have used Objective-C for many years, I actively develop using it, and even taught an object oriented programming course using it, I did not prepare such a wrapper (though I had done the same back in 1992 with a different FFT) for performance reasons.
I haven't tested FFTW against Ooura's FFT in at least 10 years, but when I did Ooura's library was faster for 1024 point real FFTs. However, it is quite possible that FFTW may do much better now -- but licensing it and cross-compiling it for ARM is inconvenient and I have always found FFTW to be far too bulky and obtrusive for my DSP needs. Apple's VecLib is very nice but unfortunately they have not ported it to iPhoneOS. I opened a feature request in BugReporter and you can too: https://bugreport.apple.com/