Question:
I would like to know how many FLOPS a Fast Fourier Transform (FFT) performs. So, if I have a one-dimensional array of N float numbers and I would like to calculate the FFT of this set of numbers, how many floating-point operations need to be performed? I know that this depends on the algorithm used, but what about the fastest available? I also know that the scaling of an FFT is of the order of N*log(N), but this would not answer my question.
Answer 1:
That depends on the implementation. Fastest does not necessarily mean lowest FLOP count nor highest FLOPS. Speed is often achieved by exploiting the hardware architecture rather than by lowering the FLOP count. There are too many implementations out there, so without actual code and a target architecture your question is unanswerable.
I like precomputed W-matrix implementations, as I usually use the FFT on single-resolution matrices many times, so there is no need to compute W more than once per resolution. That can cut down the FLOP count per recursion layer significantly.
For example, this DFFTcc has 14 FLOP per iteration using only +, -, * operations. Assuming the 1D FFT case with N = 8 and using the basic data type, if I did not make any silly mistake:

FLOP = 8*14 + (4+4)*14 + (2+2+2+2)*14 + (1+1+1+1+1+1+1+1)*2 = 14*N*log2(N) + 2*N = 352
If you use real input/output you can lower that even further for the first/last recursion layer. But a simple FLOP count is not enough, as some operations are more complicated than others, and FLOP count is not the only thing that affects speed.
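As a quick sanity check, the per-layer breakdown above can be reproduced programmatically (a minimal sketch; the 14-FLOP-per-iteration figure and the 2-FLOP final layer are taken from the answer's description of DFFTcc, and the function name is mine):

```python
from math import log2

def dfft_flop_estimate(n, flop_per_iter=14, last_layer_flop=2):
    """Estimate FLOP for the recursion described above: every full
    recursion layer processes n elements at 14 FLOP each, and the
    final single-element layer costs 2 FLOP per element."""
    layers = int(log2(n))  # number of full recursion layers
    return flop_per_iter * n * layers + last_layer_flop * n

# For N = 8: 14*8*3 + 2*8 = 352, matching the hand count above.
print(dfft_flop_estimate(8))  # 352
```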
Now to get the FLOPS, just measure the time [s] the FFT takes:

FLOPS = FLOP/time
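A minimal sketch of that measurement in Python, using NumPy's FFT and the common 5*N*log2(N) FLOP model for a complex FFT (both the library choice and the FLOP model are assumptions, not part of the original answer):

```python
import time
from math import log2

import numpy as np

N = 1 << 20  # 2^20-point complex FFT
x = np.random.rand(N) + 1j * np.random.rand(N)

reps = 10
start = time.perf_counter()
for _ in range(reps):
    np.fft.fft(x)
elapsed = (time.perf_counter() - start) / reps  # time [s] per FFT

flop = 5 * N * log2(N)  # rough FLOP model for a complex FFT
flops = flop / elapsed
print(f"~{flops / 1e9:.2f} GFLOPS")
```

The absolute number depends heavily on the machine, cache behavior, and whether the library uses SIMD, which is exactly the answer's point.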
Answer 2:
You can estimate FLOPS performance on the FFTW benchmark page. It is slightly outdated, but it contains results for the most effective FFT implementations. A rough estimate is about 5000 MFLOPS for a 3.0 GHz Intel Xeon Core Duo.
Answer 3:
The "fastest available" is not only very processor dependent, but likely to use a completely different algorithm than the one I tested. But I counted the flops for a bog-simple non-recursive in-place decimation-in-time radix-2 FFT, taken right out of an old ACM algorithms textbook, for an FFT of length 1024, and got 20480 fmuls and 30720 fadds (this was using a pre-computed twiddle-factor table, so the transcendental function computations were not included in the flop counts). But note that this code additionally used a ton of integer array-index calculations, sine-table lookups, and data moves that probably took a lot more of the CPU's cycles than did the FPU. Much larger FFTs would likely also incur a large number of additional data-cache misses and other memory-latency penalties. It is possible in that situation to speed up the code by adding more FLOPs in exchange for reduced memory-hierarchy latency penalties. So, YMMV.
Answer 4:
As underlined by Spektre, the actual FLOPS (Floating Point OPerations per Second) depend on the particular hardware and implementation, and higher-FLOP (Floating Point OPeration) algorithms may correspond to lower-FLOPS implementations, just because such implementations can exploit the hardware more effectively.
If you want to compute the number of floating-point operations for a Decimation In Time radix-2 approach, then you can refer to the following figure:
Let N be the length of the sequence to be transformed. There is an overall number of log2(N) stages, and each stage contains N/2 butterflies. Let us then consider the generic butterfly:
Let us rewrite the output of the generic butterfly as
E(i + 1) = E(i) + W * O(i)
O(i + 1) = E(i) - W * O(i)
A butterfly thus involves one complex multiplication and two complex additions. On rewriting the above equations in terms of real and imaginary parts, we have
real(E(i + 1)) = real(E(i)) + (real(W) * real(O(i)) - imag(W) * imag(O(i)))
imag(E(i + 1)) = imag(E(i)) + (real(W) * imag(O(i)) + imag(W) * real(O(i)))
real(O(i + 1)) = real(O(i)) - (real(W) * real(O(i)) - imag(W) * imag(O(i)))
imag(O(i + 1)) = imag(O(i)) - (real(W) * imag(O(i)) + imag(W) * real(O(i)))
Accordingly, we have
4 multiplications
real(W) * real(O(i)),
imag(W) * imag(O(i)),
real(W) * imag(O(i)),
imag(W) * real(O(i)).
6 additions
real(W) * real(O(i)) - imag(W) * imag(O(i))   (1)
real(W) * imag(O(i)) + imag(W) * real(O(i))   (2)
real(E(i)) + eqn. (1)
imag(E(i)) + eqn. (2)
real(E(i)) - eqn. (1)
imag(E(i)) - eqn. (2)
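The four expanded equations can be wired up directly; counting the operations in code confirms the 4-multiplication / 6-addition tally per butterfly (an illustrative sketch, not code from the original answer):

```python
def butterfly(E, O, W):
    """Radix-2 DIT butterfly on (real, imag) pairs, written exactly as
    the expanded real-arithmetic equations above."""
    Er, Ei = E
    Or_, Oi = O
    Wr, Wi = W
    # 4 real multiplications
    t1 = Wr * Or_
    t2 = Wi * Oi
    t3 = Wr * Oi
    t4 = Wi * Or_
    # 6 real additions/subtractions
    re = t1 - t2                      # eqn. (1)
    im = t3 + t4                      # eqn. (2)
    E_next = (Er + re, Ei + im)
    O_next = (Er - re, Ei - im)
    return E_next, O_next

# Cross-check against Python's built-in complex arithmetic:
E, O, W = (1.0, 2.0), (3.0, -1.0), (0.6, 0.8)
(e_r, e_i), (o_r, o_i) = butterfly(E, O, W)
ref_e = complex(*E) + complex(*W) * complex(*O)
ref_o = complex(*E) - complex(*W) * complex(*O)
assert abs(complex(e_r, e_i) - ref_e) < 1e-12
assert abs(complex(o_r, o_i) - ref_o) < 1e-12
```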
Therefore, the operation counts for the Decimation In Time radix-2 approach are

2N * log2(N) multiplications
3N * log2(N) additions
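These totals follow from log2(N) stages of N/2 butterflies at 4 real multiplications and 6 real additions each; a quick check (a sketch, with a function name of my choosing) reproduces the 20480 fmuls / 30720 fadds counted for N = 1024 in Answer 3:

```python
from math import log2

def radix2_dit_op_counts(n):
    """Real-arithmetic operation counts for a radix-2 DIT FFT of length n
    (n a power of two), assuming precomputed twiddle factors."""
    stages = int(log2(n))
    butterflies = (n // 2) * stages
    mults = 4 * butterflies  # = 2 * n * log2(n)
    adds = 6 * butterflies   # = 3 * n * log2(n)
    return mults, adds

print(radix2_dit_op_counts(1024))  # (20480, 30720)
```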
These operation counts may change if the multiplications are differently arranged, see Complex numbers product using only three multiplications.
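The rearrangement referenced above trades one multiplication for extra additions. A sketch of the classic three-multiplication complex product (the identity is standard; the function name is mine):

```python
def complex_mul_3m(a, b, c, d):
    """Compute (a + bi)(c + di) with 3 real multiplications and
    5 additions, instead of the usual 4 multiplications and 2 additions."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2  # (real, imag)

# Matches the direct formula (ac - bd, ad + bc):
assert complex_mul_3m(1.0, 2.0, 3.0, 4.0) == (1*3 - 2*4, 1*4 + 2*3)
```

Whether this is actually faster depends on the relative cost of multiplications and additions on the target hardware, which echoes the FLOP-vs-FLOPS point made in Answer 1.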
The same results apply to the case of Decimation in Frequency radix-2; see the figure.
Source: https://stackoverflow.com/questions/40036629/how-many-flops-for-fft