The problem seems well suited to a GPU, FPGA, etc. (because it's highly parallel), but I'm looking for a CPU-based and somewhat architecture-independent solution right now.