I have a C++ program which runs through about 200 ASCII files, does some basic data processing, and outputs a single ASCII file with (basically) all of the data. The program runs quickly at first, but slows down considerably as it works through the files.
Analyze your code with callgrind, part of the valgrind suite. You can graphically browse the results with kcachegrind. (Despite its name, it works on callgrind output too.) It's free and will give you awesome detail.
You can also externally turn data collection off and on. So start with it off, wait until your program gets slow, turn it on during the problem time, then turn it off. You'll see where the CPU was going. If necessary, do the same thing in reverse watching only when it's fast and compare.
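For example, something like this (a sketch; ./myprog stands in for your binary, and the options are described in the callgrind documentation):

$ valgrind --tool=callgrind --instr-atstart=no ./myprog
$ # ... in another shell, once the program starts getting slow:
$ callgrind_control -i on     # start collecting data
$ # ... after the slow stretch has been captured:
$ callgrind_control -i off    # stop collecting data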
Usually, the problem will stick out like a sore thumb.
Could you share your program?
1. Check the data structures and algorithms you're using: e.g. using a list to hold a million elements would be extremely slow to traverse/search (O(n)), as opposed to, say, a binary search tree (O(log n) per lookup) or a hash table (O(1) per lookup).
2. Look at whether you are holding on to the data at the end of each cycle (run). Ideally you should release all the resources at the end of each run (see the sketch after this list).
3. It sounds like there may be a file-handle leak?
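To make point 2 concrete, here's a minimal sketch of the shape to aim for (processOneFile and fileNames are made-up names for illustration): if each file's working data and handles live in a scope that ends with the iteration, nothing accumulates between files.

#include <fstream>
#include <string>
#include <vector>

// hypothetical per-file processing; the point is the scoping
void processOneFile(const std::string& name)
{
    std::ifstream in(name);          // file handle opened here
    std::vector<std::string> rows;   // per-file working data
    std::string line;
    while (std::getline(in, line))
        rows.push_back(line);
    // ... process rows, append results to the output ...
}   // rows is freed and the ifstream is closed here, every iteration

int main()
{
    std::vector<std::string> fileNames = {/* ~200 input files */};
    for (const auto& name : fileNames)
        processOneFile(name);
}

If the handles and containers are instead declared outside the loop (or leaked), memory use and the open-handle count grow with every file processed.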
This is a total shot in the dark. You've got:
bool getDirectoryContents(const string dirName, vector<string> *conts) {
...
copy(directory_iterator(p), directory_iterator(), back_inserter(v));
How does the performance change if you instead make that:
bool getDirectoryContents(const string dirName, vector<string> *conts) {
...
// note: preincrementing the iterator
for (directory_iterator it((p)); it!=directory_iterator(); ++it) {
v.push_back(*it);
}
My thought is that implementations of std::copy conventionally postincrement the source iterator (*result++ = *first++), and boost::filesystem::directory_iterator is an InputIterator: it shouldn't really support postincrement, and may not be happy being postincremented.
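For reference, here's the conventional shape of std::copy (a sketch of a possible implementation, not actual library source), which shows where that postincrement would come from:

// a possible implementation of std::copy;
// note the postincrement of the source iterator on every element
template <class InputIt, class OutputIt>
OutputIt copy_sketch(InputIt first, InputIt last, OutputIt d_first)
{
    while (first != last)
        *d_first++ = *first++;   // postincrement may copy the iterator
    return d_first;
}

Postincrement has to return the iterator's old value, which can mean copying its state before advancing, and for a directory iterator that state may be comparatively heavy.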
Without more information to go on, I'd guess that what you're dealing with is a Schlemiel the Painter's algorithm (Original) (Wikipedia). They're incredibly easy to fall into when doing string processing. Let me give you an example.
I want to read every line in a file, process each line somehow, and run it through some intermediate processing. Then I want to gather up the results and maybe write them back to disk. Here's a way to do that. I make a single huge mistake that can be easy to miss:
// proc.cpp
#include <cstdio>
#include <string>

class Foo
{
public:
std::string chew_on(std::string const& line_to_chew_on) {...}
...
};
Foo processor;
std::string buffer;
// Read/process
FILE *input=fopen(..., "r");
char linebuffer[1000+1];
for (char *line=fgets(linebuffer, 1000, input); line;
line=fgets(linebuffer, 1000, input) )
{
buffer=buffer+processor.chew_on(line); //(1)
}
fclose(input);
// Write
FILE *output=fopen(...,"w");
fwrite(buffer.data(), 1, buffer.size(), output);
fclose(output);
The problem here, which can be easy to miss at first glance, is that each time line (1) is run, the entire contents of buffer is copied. If there are 1000 lines with 100 characters each, you end up copying 100+200+300+400+...+100,000 = 50,050,000 bytes to run this. Increase to 10,000 lines? 5,000,500,000. That paint can is getting further and further away.
In this particular example, the fix is easy. Line (1) should read:
buffer.append(processor.chew_on(line)); // (2)
or equivalently (thanks Matthieu M.):
buffer += processor.chew_on(line);
This manages to help because (usually) std::string won't need to make a full copy of buffer to perform the append, whereas in (1) we're insisting that a copy be made.
More generally, suppose (a) the processing you're doing keeps state, (b) you reference all or most of it often, and (c) that state grows over time. Then there's a fair to good chance that you've written a Θ(n²)-time algorithm, which will exhibit exactly the type of behavior you're talking about.
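As a quick sanity check, here's a minimal, self-contained sketch of the difference (the line count and chunk size are made up for illustration); the rebuilding loop should be dramatically slower, and get relatively worse as the line count grows:

#include <chrono>
#include <cstdio>
#include <string>

int main()
{
    const int lines = 20000;
    const std::string chunk(100, 'x');   // stands in for one processed line
    using clock = std::chrono::steady_clock;

    // (1) rebuild the whole buffer every iteration: Θ(n²) bytes copied
    auto t0 = clock::now();
    std::string slow;
    for (int i = 0; i < lines; ++i)
        slow = slow + chunk;
    auto t1 = clock::now();

    // (2) append in place: amortized Θ(n) bytes copied
    std::string fast;
    for (int i = 0; i < lines; ++i)
        fast += chunk;
    auto t2 = clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("rebuild: %.1f ms\n", ms(t1 - t0).count());
    std::printf("append:  %.1f ms\n", ms(t2 - t1).count());
}

If you also know roughly how much output to expect, calling buffer.reserve(...) up front avoids even the occasional reallocate-and-copy that append incurs.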
Of course, the stock answer to "why is my code slow?" is "run a profiler." There are a number of tools and techniques for doing this. Some options include:
- Random Pausing in a debugger
- gprof
- gcov
- callgrind/kcachegrind
- oprofile
They've all got their strengths. "Random Pausing" is probably the simplest to implement, though it can be hard to interpret the results. 'gprof' and 'gcov' are basically useless on multithreaded programs. Callgrind is thorough but slow, and can sometimes play strange tricks on multithreaded programs. oprofile is fast, plays nicely with multithreaded programs, but can be difficult to use, and can miss things.
However, if you're trying to profile a single-threaded program, and are developing with the GNU toolchain, gprof can be a wonderful option. Take my proc.cpp, above. For purposes of demonstration, I'm going to profile an unoptimized run. First, I rebuild my program for profiling (adding -pg to the compile and link steps):
$ g++ -O0 -g -pg -o proc.o -c proc.cpp
$ g++ -pg -o proc proc.o
I run the program once to create profiling information:
$ ./proc
In addition to doing whatever it would normally do, this run will create a file called 'gmon.out' in the current directory. Now, I run gprof to interpret the result:
$ gprof ./proc
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
100.50      0.01     0.01   234937     0.00     0.00  std::basic_string<...> std::operator+<...>(...)
  0.00      0.01     0.00   234937     0.00     0.00  Foo::chew_on(std::string const&)
  0.00      0.01     0.00        1     0.00    10.05  do_processing(std::string const&, std::string const&)
...
Yes indeed, 100.5% of my program's time is spent in std::string operator+. Well, ok, up to some sampling error. (I'm running this in a VM... it seems the timing captured by gprof is off; my program took much longer than 0.01 cumulative seconds to run.)
For my very simple example, gcov is a little less instructive. But here's what it happens to show. First, compile and run for gcov:
$ g++ -O0 -fprofile-arcs -ftest-coverage -o proc proc.cpp
$ ./proc
$ gcov ./proc
...
This creates a bunch of files ending in .gcno, .gcda, and .gcov in the current directory. The .gcov files tell us how many times each line of code was executed during the run. So, in my example, my proc.cpp.gcov ends up looking like this:
        -:    0:Source:proc.cpp
        -:    0:Graph:proc.gcno
        -:    0:Data:proc.gcda
        -:    0:Runs:1
        -:    0:Programs:1
        -:    1:#include <cstdio>
        -:    2:#include <string>
        -:    3:
        -:    4:class Foo
        -:    5:{
        -:    6:  public:
   234937:    7:    std::string chew_on(std::string const& line_to_chew_on) {return line_to_chew_on;}
        -:    8:};
        -:    9:
        -:   10:
        -:   11:
        1:   12:int do_processing(std::string const& infile, std::string const& outfile)
        -:   13:{
        -:   14:  Foo processor;
        2:   15:  std::string buffer;
        -:   16:
        -:   17:  // Read/process
        1:   18:  FILE *input=fopen(infile.c_str(), "r");
        -:   19:  char linebuffer[1000+1];
   234938:   20:  for (char *line=fgets(linebuffer, 1000, input); line;
        -:   21:       line=fgets(linebuffer, 1000, input) )
        -:   22:  {
   234937:   23:    buffer=buffer+processor.chew_on(line); //(1)
        -:   24:  }
        1:   25:  fclose(input);
        -:   26:
        -:   27:  // Write
        1:   28:  FILE *output=fopen(outfile.c_str(),"w");
        1:   29:  fwrite(buffer.data(), 1, buffer.size(), output);
        1:   30:  fclose(output);
        1:   31:}
        -:   32:
        1:   33:int main()
        -:   34:{
        1:   35:  do_processing("/usr/share/dict/words","outfile");
        -:   36:}
So from this, I'm going to have to conclude that the std::string::operator+ at line 23 (which is executed 234,937 times) is a potential cause of my program's slowness.
As an aside, callgrind/kcachegrind work with multithreaded programs, and can provide much, much more information. For this program I run:
$ g++ -O0 -o proc proc.cpp
$ valgrind --tool=callgrind ./proc   # this takes forever to run
$ kcachegrind callgrind.out.*
And I find, in kcachegrind's output, that what's really eating up my cycles is lots and lots of memory copies (99.4% of execution time spent in __memcpy_ssse3_back), all of which happen somewhere below line 23 in my source.