General strategies for memory/speed problems

Asked by 春和景丽 · 2021-01-21 23:05

I have a C++ program which runs through about 200 ASCII files, does some basic data processing, and outputs a single ASCII file with (basically) all of the data.

The progra

4 Answers
  • 2021-01-21 23:52

    Analyze your code with callgrind, part of the valgrind suite. You can graphically browse the results with kcachegrind. (Despite its name, it works on callgrind output too.) It's free and will give you awesome detail.

    You can also turn data collection off and on externally. So start with it off, wait until your program gets slow, turn collection on during the problem period, then turn it off again. You'll see where the CPU time was going. If necessary, do the same thing in reverse, collecting only while it's fast, and compare.
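
    For example (a sketch of the usual command sequence; substitute your own program for ./myprog), you can start callgrind with instrumentation disabled and toggle it from a second terminal once the slowdown appears:

    $ valgrind --tool=callgrind --instr-atstart=no ./myprog
    # in another terminal, once the program starts to crawl:
    $ callgrind_control -i on
    # ...and once you've captured enough of the slow phase:
    $ callgrind_control -i off
    # then browse the result:
    $ kcachegrind callgrind.out.*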

    Usually, the problem will stick out like a sore thumb.

  • 2021-01-21 23:55

    Could you share your program?

    1. One thing to look for is whether you are using data structures that do not scale as the number of elements grows.

    e.g. traversing or searching a list holding a million elements is O(n) per lookup, as opposed to, say, O(log n) for a binary search tree or O(1) on average for a hash table (see the sketch after this list).

    2. You should also check whether you are holding on to data at the end of each cycle (burn/run). Ideally you should release all of the resources at the end of each run.

    3. Sounds like there may be a handle leak?
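
    To make point 1 concrete, here is a minimal, self-contained sketch (the function names are made up for illustration, not taken from the asker's program) contrasting a linear scan of a std::list with an average-constant-time lookup in a std::unordered_set:

    #include <algorithm>
    #include <list>
    #include <string>
    #include <unordered_set>

    // O(n) per query: std::find has to walk the list element by element.
    bool seen_in_list(const std::list<std::string>& items, const std::string& key)
    {
        return std::find(items.begin(), items.end(), key) != items.end();
    }

    // O(1) on average per query: a single hash lookup.
    bool seen_in_set(const std::unordered_set<std::string>& items, const std::string& key)
    {
        return items.count(key) != 0;
    }

    With a million elements and one lookup per element, that's the difference between on the order of 10^12 comparisons and 10^6 hash probes.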

  • 2021-01-21 23:56

    This is a total shot in the dark. You've got:

    bool getDirectoryContents(const string dirName, vector<string> *conts) {
        ...
        copy(directory_iterator(p), directory_iterator(), back_inserter(v));
        ...
    }

    How does the performance change if you instead make that:

    bool getDirectoryContents(const string dirName, vector<string> *conts) {
        ...
        // note: preincrementing the iterator
        for (directory_iterator it((p)); it != directory_iterator(); ++it) {
           v.push_back(*it);
        }
        ...
    }

    My thought is that std::copy is typically implemented in terms of postincrement (*result++ = *first++), and boost::filesystem::directory_iterator is an InputIterator: it isn't really meant to be postincremented, and may not be happy when it is.

  • 2021-01-22 00:04

    Without more information to go on, I'd guess that what you're dealing with is a Schlemiel the Painter's algorithm (the original essay and the Wikipedia article on it are worth a read). It's an incredibly easy trap to fall into when doing string processing. Let me give you an example.

    I want to read every line in a file, run each line through some intermediate processing, then gather up the results and maybe write them back to disk. Here's a way to do that. It contains a single huge mistake that is easy to miss:

    // proc.cpp
    #include <cstdio>   // FILE, fopen, fgets, fwrite, fclose
    #include <string>

    class Foo
    {
      public:
      std::string chew_on(std::string const& line_to_chew_on) {...}
      ...
    };
    
    Foo processor;
    std::string buffer;
    
    // Read/process
    FILE *input=fopen(..., "r");
    char linebuffer[1000+1];
    for (char *line=fgets(linebuffer, 1000, input); line; 
         line=fgets(linebuffer, 1000, input) ) 
    {
        buffer=buffer+processor.chew_on(line);  //(1)
    }
    fclose(input);
    
    // Write
    FILE *output=fopen(...,"w");
    fwrite(buffer.data(), 1, buffer.size(), output);
    fclose(output);
    

    The problem here, which is easy to miss at first glance, is that every time line (1) runs, the entire contents of buffer are copied. If there are 1,000 lines with 100 characters each, you end up copying 100+200+300+...+100,000 = 50,050,000 bytes over the course of the run. Increase that to 10,000 lines? Roughly 5,000,500,000. That paint can is getting further and further away.

    In this particular example, the fix is easy. Line (1) should read:

        buffer.append(processor.chew_on(line)); // (2)
    

    or, equivalently (thanks Matthieu M.):

        buffer += processor.chew_on(line);
    

    This helps because std::string::append (and operator+=) can usually grow buffer in place, without copying the data that's already there, whereas in (1) the expression buffer + ... insists on building a brand-new string, copying all of buffer each time around the loop.
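
    If you want to see the difference for yourself, here is a small, self-contained sketch (a toy benchmark, not the program above) that builds a string both ways and prints the elapsed time of each loop:

    // quadratic vs. linear string building -- a toy benchmark
    #include <chrono>
    #include <cstdio>
    #include <string>

    int main()
    {
        const std::string line(100, 'x');   // stand-in for one 100-character line
        const int n = 20000;                // kept modest so the slow loop finishes quickly

        auto t0 = std::chrono::steady_clock::now();
        std::string slow;
        for (int i = 0; i < n; ++i)
            slow = slow + line;             // like (1): copies all of `slow` every iteration
        auto t1 = std::chrono::steady_clock::now();

        std::string fast;
        for (int i = 0; i < n; ++i)
            fast += line;                   // like (2): appends in place, amortized O(1)
        auto t2 = std::chrono::steady_clock::now();

        std::chrono::duration<double, std::milli> d1 = t1 - t0, d2 = t2 - t1;
        std::printf("operator+  : %8.1f ms\n", d1.count());
        std::printf("operator+= : %8.1f ms\n", d2.count());
    }

    The first loop should come out dramatically slower, and the gap widens roughly quadratically as n grows.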

    More generally, suppose (a) the processing you're doing keeps state, (b) you touch all or most of that state often, and (c) the state grows over time. Then there's a fair-to-good chance you've written a Θ(n²)-time algorithm, which will exhibit exactly the kind of behavior you're describing.


    Edit

    Of course, the stock answer to "why is my code slow?" is "run a profiler." There are a number of tools and techniques for doing this. Some options include:

  • callgrind/kcachegrind (as suggested by David Schwartz)
  • Random Pausing (as suggested by Mike Dunlavey)
  • The GNU profiler, gprof
  • The GNU test coverage profiler, gcov
  • oprofile

    They've all got their strengths. "Random Pausing" is probably the simplest to implement, though it can be hard to interpret the results. 'gprof' and 'gcov' are basically useless on multithreaded programs. Callgrind is thorough but slow, and can sometimes play strange tricks on multithreaded programs. oprofile is fast, plays nicely with multithreaded programs, but can be difficult to use, and can miss things.
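
    As a concrete illustration of "Random Pausing" (this is just the generic gdb recipe, nothing specific to the program in this question): attach a debugger to the running process, interrupt it a few times while it feels slow, and note which stack frames keep showing up:

    $ gdb -p $(pidof proc)   # attach to the running program (any way of getting the pid works)
    (gdb) continue           # let it keep running
    ^C                       # interrupt it while it's being slow
    (gdb) bt                 # look at the backtrace
    (gdb) continue           # ...and repeat a handful of times

    If most of the pauses land in the same few frames, that's where the time is going.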

    However, if you're trying to profile a single threaded program, and are developing with the GNU toolchain, gprof can be a wonderful option. Take my proc.cpp, above. For purposes of demonstration, I'm going to profile an unoptimized run. First, I rebuild my program for profiling (adding -pg to the compile and link steps):

    $ g++ -O0 -g -pg -o proc.o -c proc.cpp
    $ g++ -pg -o proc proc.o
    

    I run the program once to create profiling information:

    ./proc
    

    In addition to doing whatever it would normally do, this run will create a file called 'gmon.out' in the current directory. Now, I run gprof to interpret the result:

    $ gprof ./proc
    Flat profile:
    
    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total           
     time   seconds   seconds    calls  ms/call  ms/call  name    
    100.50      0.01     0.01   234937     0.00     0.00  std::basic_string<...> std::operator+<...>(...)
      0.00      0.01     0.00   234937     0.00     0.00  Foo::chew_on(std::string const&)
      0.00      0.01     0.00        1     0.00    10.05  do_processing(std::string const&, std::string const&)
    ...
    

    Yes indeed, 100.5% of my program's time is spent in std::string operator+. Well, ok, up to some sampling error. (I'm running this in a VM ... it seems that the timing being captured by gprof is off. My program took much longer than 0.01 cumulative seconds to run...)

    For my very simple example, gcov is a little less instructive. But here's what it happens to show. First, compile and run for gcov:

    $ g++ -O0 -fprofile-arcs -ftest-coverage -o proc proc.cpp
    $ ./proc
    $ gcov ./proc
    ...
    

    This creates a number of files ending in .gcno, .gcda, and .gcov in the current directory. The .gcov files tell us how many times each line of code was executed during the run. So, in my example, my proc.cpp.gcov ends up looking like this:

            -:    0:Source:proc.cpp
            -:    0:Graph:proc.gcno
            -:    0:Data:proc.gcda
            -:    0:Runs:1
            -:    0:Programs:1
            -:    1:#include <cstdio>
            -:    2:#include <string>
            -:    3:
            -:    4:class Foo
            -:    5:{
            -:    6:  public:
       234937:    7:  std::string chew_on(std::string const& line_to_chew_on) {return line_to_chew_on;}
            -:    8:};
            -:    9:
            -:   10:
            -:   11:
            1:   12:int do_processing(std::string const& infile, std::string const& outfile)
            -:   13:{
            -:   14:  Foo processor;
            2:   15:  std::string buffer;
            -:   16:
            -:   17:  // Read/process
            1:   18:  FILE *input=fopen(infile.c_str(), "r");
            -:   19:  char linebuffer[1000+1];
       234938:   20:  for (char *line=fgets(linebuffer, 1000, input); line; 
            -:   21:       line=fgets(linebuffer, 1000, input) ) 
            -:   22:    {
       234937:   23:      buffer=buffer+processor.chew_on(line);  //(1)
            -:   24:    }
            1:   25:  fclose(input);
            -:   26:
            -:   27:  // Write
            1:   28:  FILE *output=fopen(outfile.c_str(),"w");
            1:   29:  fwrite(buffer.data(), 1, buffer.size(), output);
            1:   30:  fclose(output);
            1:   31:}
            -:   32:
            1:   33:int main()
            -:   34:{
            1:   35:  do_processing("/usr/share/dict/words","outfile");
            -:   36:}
    

    So from this, I'm going to have to conclude that the std::string::operator+ at line 23 (which is executed 234,937 times) is a potential cause of my program's slowness.

    As an aside, callgrind/kcachegrind work with multithreaded programs, and can provide much, much more information. For this program I run:

    g++ -O0 -o proc proc.cpp
    valgrind --tool=callgrind ./proc  # this takes forever to run
    kcachegrind callgrind.out.*
    

    And I find the following output, showing that what's really eating up my cycles is lots and lots of memory copies (99.4% of execution time spent in __memcpy_ssse3_back), all of which I can see happening somewhere below line 23 in my source: [kcachegrind screenshot]
