Question
I came from this thread: FLOPS Intel core and testing it with C (innerproduct)
As I began writing simple test scripts, a few questions came into my mind.
Why floating point? What is so significant about floating point that we have to consider? Why not a simple int?
If I want to measure FLOPS, let's say I am doing the inner product of two vectors. Must the two vectors be float[]? How will the measurement be different if I use int[]?
I am not familiar with Intel architectures. Let's say I have the following operations:
float a = 3.14159;
float b = 3.14158;
for(int i = 0; i < 100; ++i) {
    a + b;
}
How many "floating point operations" is this?
I am a bit confused because I studied a simplified 32-bit MIPS architecture. Every instruction is 32 bits wide, with, say, 5 bits for operand 1 and 5 bits for operand 2, etc. For Intel architectures (specifically the same architecture from the previous thread), I was told that a register can hold 128 bits. For SINGLE PRECISION floating point, at 32 bits per floating point number, does that mean each instruction fed to the processor can take 4 floating point numbers? Don't we also have to account for the bits involved in operands and other parts of the instruction? How can we just feed 4 floating point numbers to a CPU without any specific meaning attached to them?
I don't know whether my approach of thinking about everything in bits and pieces makes sense. If not, what "height" of perspective should I be looking at?
Answer 1:
1.) Floating point operations simply represent a wider range of math than fixed-width integers. Additionally, heavily numerical or scientific applications (which are typically the ones that actually test a CPU's pure computational power) probably rely on floating point ops more than anything.
2.) They would both have to be float. The CPU won't add an integer and a float directly; one or the other would implicitly be converted (most likely the integer would be converted to a float), so it would still just be floating point operations.
3.) That would be 100 floating point operations, as well as 100 integer operations, as well as some (100?) control-flow/branch/comparison operations. There'd generally also be loads and stores but you don't seem to be storing the value :)
4.) I'm not sure how to begin with this one, you seem to have a general perspective on the material, but you have confused some of the details. Yes an individual instruction may be partitioned into sections similar to:
|OP CODE | Operand 1 | Operand 2 | (among many, many others)
However, operand 1 and operand 2 don't have to contain the actual values to be added. They could just contain the registers to be added. For example take this SSE instruction:
mulps %%xmm3, %%xmm1
It's telling the execution unit to multiply the contents of register xmm3 and the contents of xmm1 and store the result in xmm1 (in this AT&T-style syntax the destination is the last operand). Since the registers hold 128-bit values, I'm doing the operation on 128-bit values, that is, on four packed single-precision floats at once; this is independent of the size of the instruction. Unfortunately x86 does not have an instruction breakdown similar to MIPS, due to it being a CISC architecture. An x86 instruction can be anywhere between 1 and 15(!) bytes long.
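To make the "one instruction, four floats" idea concrete, here is a minimal sketch in C using the SSE intrinsics from `<xmmintrin.h>`. The helper name `mul4` is hypothetical and not from the thread; on targets without SSE the sketch falls back to a plain loop, so the packed path is an assumption about the build target rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_mul_ps, ... */
#define HAVE_SSE 1
#endif

/* Hypothetical helper: multiply four floats element-wise.
 * With SSE available, the multiply is a single packed operation on
 * 128-bit values (the kind of work mulps does); otherwise a scalar
 * loop performs the same four multiplications one at a time. */
void mul4(const float *a, const float *b, float *out) {
    assert(a != NULL && b != NULL && out != NULL);
#ifdef HAVE_SSE
    __m128 va = _mm_loadu_ps(a);             /* load 4 floats (unaligned ok) */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));  /* 4 multiplies in one operation */
#else
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] * b[i];
#endif
}
```

Either path still counts as four floating point multiplications; the packed version just issues them together.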
As for your question, I think this is all very fun stuff to know, and it helps you build intuition about the speed of math-intensive programs, as well as giving you a sense of upper limits to be achieved when optimizing. I'd never try and directly correlate this to the actual run time of a program though, as too many other factors contribute to the actual end performance.
Answer 2:
Floating point and integer operations use different pipelines on the chip, so they run at different speeds (on simple or old enough architectures there may be no native floating point support at all, making floating point operations very slow). So if you are trying to estimate real world performance for problems that use floating point math, you need to know how fast these operations are.
Yes, you must use floating point data. See #1.
A FLOP is typically defined as an average over a particular mixture of operations that is intended to be representative of the real world problem you want to model. For your loop, you would just count each addition as 1 operation making a total of 100 operations. BUT: this is not representative of most real world jobs and you may have to take steps to prevent the compiler from optimizing all the work out.
Vectorized or SIMD (Single Instruction Multiple Data) units can do exactly that. Examples of SIMD systems in use right now include AltiVec (on PowerPC series chips) and MMX/SSE/... on Intel x86 and compatibles. Such improvements in chips should get credit for doing more work, so your trivial loop above would still be counted as 100 operations even if there are only 25 fetch and work cycles. Compilers either need to be very smart, or receive hints from the programmer, to make use of SIMD units (but most front-line compilers are very smart these days).
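The counting rule in this answer can be sketched in C for the inner product from the question. This is only an illustration under stated assumptions: the function names are hypothetical, and counting one multiply plus one add per element (2 FLOPs) is a common convention, not something the thread fixes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: inner product of two float vectors. */
float dot(const float *x, const float *y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];   /* one multiply and one add per element */
    return sum;
}

/* Assumed convention: 2 floating point operations per element,
 * however many SIMD instructions the compiler actually emits. */
double dot_flop_count(size_t n) {
    return 2.0 * (double)n;
}

/* FLOPS = operations performed / elapsed seconds. */
double flops_rate(double flop_count, double seconds) {
    assert(seconds > 0.0);
    return flop_count / seconds;
}
```

When timing this for real (for example with `clock()` from `<time.h>`), store the result of `dot` in a `volatile` variable or print it, so the compiler cannot optimize the work away.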
Answer 3:
Floating Point Operations per Second.
http://www.webopedia.com/TERM/F/FLOPS.html
Your example is 100 floating point operations (adding the two floating point numbers together is one floating point operation). Allocating floating point numbers may or may not count.
The term is apparently not an exact measurement, as it is clear that a double-precision floating-point operation is going to take longer than a single-precision one, and multiplication and division are going to take longer than addition and subtraction. As the Wikipedia article attests, there are ultimately better ways to measure performance.
Answer 4:
1) Because many real world applications crunch a lot of floating point numbers; for example, all vector-based apps (games, CAD, etc.) rely almost entirely on floating point operations.
2) FLOPS is for Floating Point operations.
3) 100. The flow control uses integer operations.
4) That architecture is best suited for ALU. Floating point representations can use 96-128 bits.
Answer 5:
Floating point operations are the limiting factor in certain computing problems. If your problem isn't one of them, you can safely ignore flops ratings.
Intel architecture started out with simple 80 bit floating point instructions, which can load or store to 64 bit memory locations with rounding. Later they added the SSE instructions, which use 128 bit registers and can do multiple floating point operations with a single instruction.
Answer 6:
Yuck, simplified MIPS. Typically, that's fine for intro courses. I'm going to assume a Hennessy/Patterson book?
Read up on the MMX instructions for the Pentium architecture (586) for the Intel approach. Or, more generally, study the SIMD architectures, which are also known as vector processor architectures. They were first popularized by the Cray supercomputers (although I think there were a few forerunners). For a modern SIMD approach, see the CUDA approach produced by NVIDIA or the different DSP processors on the market.
Answer 7:
- Floating point speed mattered a lot for scientific computing and computer graphics.
- By definition, no. You're testing integer performance at that point.
- 302 instructions (2 loads up front plus 3 per iteration over 100 iterations), see below.
- x86 and x64 are very different from MIPS. MIPS, being a RISC (reduced instruction set computer) architecture, has very few instructions in comparison to the CISC (complex instruction set computer) architecture of Intel and AMD's offerings. For instruction decoding, x86 uses variable-width instructions, so an instruction can be anywhere from 1 to 15 bytes in length, including prefixes.
The 128 bit thing is about the internal representation of floats in the processor. It uses really big floats internally to try to avoid rounding errors, and then truncates them when you put the numbers back into memory.
fld A //st=[A]
fld B //st=[B, A]
Loop:
fld st(1) //st=[A, B, A]
fadd st(1) //st=[A + B, B, A]
fstp memory //st=[B, A]
Answer 8:
There are lots of things floating point math does far better than integer math. Most university computer science curricula have a course on it called "numerical analysis".
The vector elements must be float, double, or long double. The inner product calculation will be slower than if the elements were ints.
That would be 100 floating point adds. (That is, unless the compiler realized nothing is ever done with the result and optimizes the whole thing away.)
Computers use a variety of internal formats to represent floating point numbers. In the example you mention, the CPU would convert the 32-bit float into its internal 128-bit format before doing operations on the number.
In addition to uses other answers have mentioned, people called "quants" use floating point math for finance these days. A guy named David E. Shaw started applying floating point math to modeling Wall Street in 1988, and as of Sept. 30, 2009, is worth $2.5 billion and ranks #123 on the Forbes list of the 400 richest Americans.
So it's worth learning a bit about floating point math!
Answer 9:
1) Floating point is important because sometimes we want to represent really big or really small numbers and integers aren't really so good with that. Read up on the IEEE-754 standard, but the mantissa is like the integer portion, and we trade some bits to work as an exponent, which allows a much more expanded range of numbers to be represented.
2) If the two vectors are ints, you won't measure FLOPS. If one vector is int and another is float, you'll be doing lots of int->float conversions, and we should probably consider such a conversion to be a FLOP.
3/4) Floating point operations on Intel architectures are really quite exotic. It's actually a stack-based, single operand instruction set (usually). For instance, in your example, you would use one instruction with an opcode that loads a memory operand onto the top of the FPU stack, and then you would use another instruction with an opcode that adds a memory operand to the top of the FPU stack, and then finally another instruction with an opcode that pops the top of the FPU stack to the memory operand.
This website lists a lot of the operations.
http://www.website.masmforum.com/tutorials/fptute/appen1.htm
I'm sure Intel publishes the actual opcodes somewhere, if you're really that interested.
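As a small illustration of the mantissa/exponent trade-off from point 1 of this answer, the following C sketch pulls a single-precision value apart into its three fields. It assumes `float` is the 32-bit IEEE-754 binary32 format (1 sign bit, 8 exponent bits, 23 mantissa bits); the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three binary32 fields. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, stored with a bias of 127 */
    uint32_t mantissa;  /* 23 bits, the fraction after the implicit leading 1 */
} float_fields;

/* Split a float into its raw fields.
 * Assumption: float is a 32-bit IEEE-754 binary32 value. */
float_fields decompose(float value) {
    uint32_t bits;
    float_fields f;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &value, sizeof bits);   /* well-defined way to read the bits */
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

For example, `decompose(1.0f)` gives sign 0, exponent 127 (that is, 2^0 after removing the bias) and mantissa 0.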
Source: https://stackoverflow.com/questions/1541725/flops-what-really-is-a-flop