Coding Practices which enable the compiler/optimizer to make a faster program

后端未结

关注

 30  1784

Many years ago, C compilers were not particularly smart. As a workaround K&R invented the register keyword, to hint to the compiler, that maybe it woul

相关标签:

30条回答

感情败类

2020-12-02 04:14

A dumb little tip, but one that will save you some microscopic amounts of speed and code.

Always pass function arguments in the same order.

If you have f_1(x, y, z) which calls f_2, declare f_2 as f_2(x, y, z). Do not declare it as f_2(x, z, y).

The reason for this is that C/C++ platform ABI (AKA calling convention) promises to pass arguments in particular registers and stack locations. When the arguments are already in the correct registers then it does not have to move them around.

While reading disassembled code I've seen some ridiculous register shuffling because people didn't follow this rule.

0 讨论(0)
发布评论:

提交评论
- 加载中...
情深已故

2020-12-02 04:15

The vast majority of code that people write will be I/O bound (I believe all the code I have written for money in the last 30 years has been so bound), so the activities of the optimiser for most folks will be academic.

However, I would remind people that for the code to be optimised you have to tell the compiler to to optimise it - lots of people (including me when I forget) post C++ benchmarks here that are meaningless without the optimiser being enabled.

0 讨论(0)
发布评论:

提交评论
- 加载中...
不知归路

2020-12-02 04:15

You're getting good answers here, but they assume your program is pretty close to optimal to begin with, and you say

Assume that the program has been written correctly, compiled with full optimization, tested and put into production.

In my experience, a program may be written correctly, but that does not mean it is near optimal. It takes extra work to get to that point.

If I can give an example, this answer shows how a perfectly reasonable-looking program was made over 40 times faster by macro-optimization. Big speedups can't be done in every program as first written, but in many (except for very small programs), it can, in my experience.

After that is done, micro-optimization (of the hot-spots) can give you a good payoff.

0 讨论(0)
发布评论:

提交评论
- 加载中...
时光说笑

2020-12-02 04:17
Generic Optimizations

Here as some of my favorite optimizations. I have actually increased execution times and reduced program sizes by using these.

Declare small functions as inline or macros

Each call to a function (or method) incurs overhead, such as pushing variables onto the stack. Some functions may incur an overhead on return as well. An inefficient function or method has fewer statements in its content than the combined overhead. These are good candidates for inlining, whether it be as #define macros or inline functions. (Yes, I know inline is only a suggestion, but in this case I consider it as a reminder to the compiler.)

Remove dead and redundant code

If the code isn't used or does not contribute to the program's result, get rid of it.

Simplify design of algorithms

I once removed a lot of assembly code and execution time from a program by writing down the algebraic equation it was calculating and then simplified the algebraic expression. The implementation of the simplified algebraic expression took up less room and time than the original function.

Loop Unrolling

Each loop has an overhead of incrementing and termination checking. To get an estimate of the performance factor, count the number of instructions in the overhead (minimum 3: increment, check, goto start of loop) and divide by the number of statements inside the loop. The lower the number the better.

Edit: provide an example of loop unrolling Before:
```
unsigned int sum = 0;
for (size_t i; i < BYTES_TO_CHECKSUM; ++i)
{
    sum += *buffer++;
}
```
After unrolling:
```
unsigned int sum = 0;
size_t i = 0;
**const size_t STATEMENTS_PER_LOOP = 8;**
for (i = 0; i < BYTES_TO_CHECKSUM; **i = i / STATEMENTS_PER_LOOP**)
{
    sum += *buffer++; // 1
    sum += *buffer++; // 2
    sum += *buffer++; // 3
    sum += *buffer++; // 4
    sum += *buffer++; // 5
    sum += *buffer++; // 6
    sum += *buffer++; // 7
    sum += *buffer++; // 8
}
// Handle the remainder:
for (; i < BYTES_TO_CHECKSUM; ++i)
{
    sum += *buffer++;
}
```
In this advantage, a secondary benefit is gained: more statements are executed before the processor has to reload the instruction cache.

I've had amazing results when I unrolled a loop to 32 statements. This was one of the bottlenecks since the program had to calculate a checksum on a 2GB file. This optimization combined with block reading improved performance from 1 hour to 5 minutes. Loop unrolling provided excellent performance in assembly language too, my memcpy was a lot faster than the compiler's memcpy. -- T.M.

Reduction of if statements

Processors hate branches, or jumps, since it forces the processor to reload its queue of instructions.

Boolean Arithmetic (Edited: applied code format to code fragment, added example)

Convert if statements into boolean assignments. Some processors can conditionally execute instructions without branching:
```
bool status = true;
status = status && /* first test */;
status = status && /* second test */;
```
The short circuiting of the Logical AND operator (&&) prevents execution of the tests if the status is false.

Example:
```
struct Reader_Interface
{
  virtual bool  write(unsigned int value) = 0;
};

struct Rectangle
{
  unsigned int origin_x;
  unsigned int origin_y;
  unsigned int height;
  unsigned int width;

  bool  write(Reader_Interface * p_reader)
  {
    bool status = false;
    if (p_reader)
    {
       status = p_reader->write(origin_x);
       status = status && p_reader->write(origin_y);
       status = status && p_reader->write(height);
       status = status && p_reader->write(width);
    }
    return status;
};
```
Factor Variable Allocation outside of loops

If a variable is created on the fly inside a loop, move the creation / allocation to before the loop. In most instances, the variable doesn't need to be allocated during each iteration.

Factor constant expressions outside of loops

If a calculation or variable value does not depend on the loop index, move it outside (before) the loop.

I/O in blocks

Read and write data in large chunks (blocks). The bigger the better. For example, reading one octect at a time is less efficient than reading 1024 octets with one read.
Example:
```
static const char  Menu_Text[] = "\n"
    "1) Print\n"
    "2) Insert new customer\n"
    "3) Destroy\n"
    "4) Launch Nasal Demons\n"
    "Enter selection:  ";
static const size_t Menu_Text_Length = sizeof(Menu_Text) - sizeof('\0');
//...
std::cout.write(Menu_Text, Menu_Text_Length);
```
The efficiency of this technique can be visually demonstrated. :-)

Don't use printf family for constant data

Constant data can be output using a block write. Formatted write will waste time scanning the text for formatting characters or processing formatting commands. See above code example.

Format to memory, then write

Format to a char array using multiple sprintf, then use fwrite. This also allows the data layout to be broken up into "constant sections" and variable sections. Think of mail-merge.

Declare constant text (string literals) as static const

When variables are declared without the static, some compilers may allocate space on the stack and copy the data from ROM. These are two unnecessary operations. This can be fixed by using the static prefix.

Lastly, Code like the compiler would

Sometimes, the compiler can optimize several small statements better than one complicated version. Also, writing code to help the compiler optimize helps too. If I want the compiler to use special block transfer instructions, I will write code that looks like it should use the special instructions.
0 讨论(0)
发布评论:

提交评论
- 加载中...
轮回少年

2020-12-02 04:18

Align your data to native/natural boundaries.

0 讨论(0)
发布评论:

提交评论
- 加载中...
暗喜

2020-12-02 04:20

i use intel compiler. on both Windows and Linux.

when more or less done i profile the code. then hang on the hotspots and trying to change the code to allow compiler make a better job.

if a code is a computational one and contain a lot of loops - vectorization report in intel compiler is very helpful - look for 'vec-report' in help.

so the main idea - polish the performance critical code. as for the rest - priority to be correct and maintainable - short functions, clear code that could be understood 1 year later.

0 讨论(0)
发布评论:

提交评论
- 加载中...

Coding Practices which enable the compiler/optimizer to make a faster program

Generic Optimizations

Declare small functions as `inline` or macros

Remove dead and redundant code

Simplify design of algorithms

Loop Unrolling

Reduction of `if` statements

Boolean Arithmetic (Edited: applied code format to code fragment, added example)

Factor Variable Allocation outside of loops

Factor constant expressions outside of loops

I/O in blocks

Don't use `printf` family for constant data

Format to memory, then write

Declare constant text (string literals) as `static const`

Lastly, Code like the compiler would

Coding Practices which enable the compiler/optimizer to make a faster program

Generic Optimizations

Declare small functions as inline or macros

Remove dead and redundant code

Simplify design of algorithms

Loop Unrolling

Reduction of if statements

Boolean Arithmetic (Edited: applied code format to code fragment, added example)

Factor Variable Allocation outside of loops

Factor constant expressions outside of loops

I/O in blocks

Don't use printf family for constant data

Format to memory, then write

Declare constant text (string literals) as static const

Lastly, Code like the compiler would

Declare small functions as `inline` or macros

Reduction of `if` statements

Don't use `printf` family for constant data

Declare constant text (string literals) as `static const`