I'm looking for an extremely fast atof() implementation on IA32 optimized for the en-US locale, ASCII, and non-scientific notation. The Windows multithreaded CRT falls down misera
Have you considered having the GPU do this work? If you can load the strings into GPU memory and process them all there, you may find a good algorithm that runs significantly faster than it would on your CPU.
Alternatively, do it in an FPGA. There are FPGA PCI-E boards that you can use to make arbitrary coprocessors. Use DMA to point the FPGA at the part of memory containing the array of strings you want to convert and let it whizz through them, leaving the converted values behind.
Have you looked at a quad-core processor? The real bottleneck in most of these cases is memory access anyway...
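Whichever device ends up doing the work, the per-string job for this restricted format is small. Here is a minimal sketch in C, assuming plain ASCII input of the form [+-]digits[.digits] with no exponent, no inf/nan, and no error reporting. The name `fast_atof` is made up for illustration and is not a CRT function. It touches no locale or other shared state, so each core could run it over its own slice of the array.

```c
/* Hypothetical helper (not part of any CRT): parses [+-]digits[.digits]
   from a NUL-terminated ASCII string. Assumes no exponent, no locale
   handling, and no error reporting; malformed input yields whatever was
   parsed before the first unexpected character. */
static double fast_atof(const char *s)
{
    double sign = 1.0;
    double value = 0.0;

    /* Skip leading spaces and tabs. */
    while (*s == ' ' || *s == '\t')
        s++;

    /* Optional sign. */
    if (*s == '-') {
        sign = -1.0;
        s++;
    } else if (*s == '+') {
        s++;
    }

    /* Integer part: accumulate one decimal digit at a time. */
    while (*s >= '0' && *s <= '9') {
        value = value * 10.0 + (double)(*s - '0');
        s++;
    }

    /* Optional fractional part: accumulate the digits as a whole number
       and divide once by the matching power of ten. */
    if (*s == '.') {
        double frac = 0.0;
        double scale = 1.0;
        s++;
        while (*s >= '0' && *s <= '9') {
            frac = frac * 10.0 + (double)(*s - '0');
            scale *= 10.0;
            s++;
        }
        value += frac / scale;
    }

    return sign * value;
}
```

Calling `fast_atof("-123.456")` gives a value very close to -123.456, but since the digits are accumulated in floating point the last bit can differ from what the CRT returns, and inputs with more than about 15 significant digits lose accuracy. Whether that trade is acceptable depends on your data.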
-Adam