I'm calculating fixed-point reciprocals in Q22.10 with Goldschmidt division for use in my software rasterizer on ARM.
This is done by just setting the numerator to 1, i.
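For reference, here is a minimal C sketch of what a Goldschmidt reciprocal in Q22.10 could look like. It is not the code from the question: the names `recip_q22_10` and `clz32`, the internal Q2.30 working format, the fixed count of five iterations, and the round-to-nearest final shift are all assumptions made for illustration. The input is assumed to be a positive, nonzero raw Q22.10 value.

```c
#include <stdint.h>

/* Portable count-leading-zeros for a nonzero 32-bit value
 * (hypothetical helper; on ARM this would be a single CLZ). */
static int clz32(uint32_t v)
{
    int n = 0;
    while ((v & 0x80000000u) == 0) {
        v <<= 1;
        ++n;
    }
    return n;
}

/* Reciprocal of a positive Q22.10 value, result also in Q22.10.
 * Hypothetical sketch: numerator fixed at 1.0, denominator normalized
 * to [0.5, 1), both kept in Q2.30 during the iterations. */
static uint32_t recip_q22_10(uint32_t x)
{
    int shift = clz32(x);                 /* x != 0 is assumed */
    uint64_t d = (x << shift) >> 2;       /* d in [0.5, 1.0) as Q2.30 */
    uint64_t n = (uint64_t)1 << 30;       /* n = 1.0 as Q2.30 */

    /* Each pass multiplies n and d by f = 2 - d, driving d toward 1
     * and n toward 1/d0. The error squares each pass, so five passes
     * are enough when starting from d >= 0.5. */
    for (int i = 0; i < 5; ++i) {
        uint64_t f = ((uint64_t)2 << 30) - d;
        n = (n * f) >> 30;
        d = (d * f) >> 30;
    }

    /* Undo the normalization: raw result = n * 2^(shift - 42).
     * Adding half an output ulp rounds to nearest, which hides the
     * truncation error of the multiplies above. */
    int s = 42 - shift;                   /* always in 11..42 */
    return (uint32_t)((n + ((uint64_t)1 << (s - 1))) >> s);
}
```

As a quick check, `recip_q22_10(2048)` (the raw encoding of 2.0) should give 512, the raw encoding of 0.5. Without the rounding step the truncating multiplies can leave exact cases like this one ulp low.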
A couple of ideas for you, though none that solve your problem directly as stated.
adcs hi, den, hi, lsl #1
subcc hi, hi, den
adcs lo, lo, lo
repeated once per quotient bit, with a binary search off of the clz to determine where to start. That's pretty dang fast.
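The same shift-and-subtract idea can be sketched in C. This is plain restoring division rather than a translation of the carry-flag trick above, and it uses a loop where the assembly version would jump into an unrolled sequence. The names `udiv_shift_sub` and `count_leading_zeros` are made up for the example, and a nonzero denominator is assumed.

```c
#include <stdint.h>

/* Portable stand-in for the ARM CLZ instruction (hypothetical helper).
 * Only called with nonzero values here. */
static int count_leading_zeros(uint32_t v)
{
    int n = 0;
    while ((v & 0x80000000u) == 0) {
        v <<= 1;
        ++n;
    }
    return n;
}

/* Unsigned division producing one quotient bit per iteration.
 * The leading-zero counts tell us where the first possible quotient
 * bit is, so the loop runs only as many times as needed. */
static uint32_t udiv_shift_sub(uint32_t num, uint32_t den)
{
    if (num < den)                        /* den != 0 is assumed */
        return 0;

    /* Align the top bit of den with the top bit of num. Because
     * num >= den, this is never negative and den << shift cannot
     * overflow 32 bits. */
    int shift = count_leading_zeros(den) - count_leading_zeros(num);

    uint32_t quo = 0;
    for (int i = shift; i >= 0; --i) {
        quo <<= 1;
        if (num >= (den << i)) {          /* trial subtraction fits */
            num -= den << i;
            quo |= 1;
        }
    }
    return quo;
}
```

For a Q22.10 reciprocal the numerator would be `1 << 20` (1.0 pre-shifted by another 10 fractional bits), so `udiv_shift_sub(1u << 20, x)` gives the truncated raw Q22.10 value of 1/x.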
Again, not direct answers for you, but possibly a few ideas to go forward with. Seeing the actual ARM code would probably help me a bit as well.