I am investigating a crash due to heap corruption. As this issue is non-trivial and involves analyzing the stack and dump results, I have decided to do a code review of file
Welcome to hell. There is no easy solution so I will only provide some pointers.
Try to reproduce the bug in a debug environment. Debuggers can pad your heap allocations with bounds checks and will tell you if you wrote into that padding. They also allocate memory consistently at the same virtual addresses, which makes the bug easier to reproduce.
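To make the padding idea concrete, here is a minimal sketch of how guard bytes around an allocation can reveal an overrun. It is not how any particular debug runtime is implemented; the names `guarded_alloc`, `guards_intact` and `guarded_free`, the guard size and the fill byte are all made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helpers, not part of any real debug heap.
constexpr std::size_t kGuardSize = 8;      // assumed guard width
constexpr unsigned char kGuardByte = 0xFD; // assumed fill pattern

// Block layout: [size_t user_size][front guard][user bytes][back guard]
void* guarded_alloc(std::size_t size) {
    const std::size_t total = sizeof(std::size_t) + 2 * kGuardSize + size;
    unsigned char* raw = static_cast<unsigned char*>(std::malloc(total));
    if (raw == nullptr) return nullptr;
    std::memcpy(raw, &size, sizeof size);                         // remember user size
    std::memset(raw + sizeof(std::size_t), kGuardByte, kGuardSize); // front guard
    std::memset(raw + sizeof(std::size_t) + kGuardSize + size,
                kGuardByte, kGuardSize);                          // back guard
    return raw + sizeof(std::size_t) + kGuardSize;
}

// Returns false if any guard byte before or after the user area was overwritten.
bool guards_intact(const void* user) {
    assert(user != nullptr);
    const unsigned char* p = static_cast<const unsigned char*>(user);
    const unsigned char* front = p - kGuardSize;
    std::size_t size = 0;
    std::memcpy(&size, front - sizeof(std::size_t), sizeof size);
    const unsigned char* back = p + size;
    for (std::size_t i = 0; i < kGuardSize; ++i) {
        if (front[i] != kGuardByte || back[i] != kGuardByte) return false;
    }
    return true;
}

void guarded_free(void* user) {
    if (user == nullptr) return;
    std::free(static_cast<unsigned char*>(user) - kGuardSize - sizeof(std::size_t));
}
```

With this sketch, writing one byte past a 10-byte allocation lands in the back guard, so a later call to `guards_intact` returns false instead of the write silently damaging a neighbouring block.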
In that case, you can try an analysis tool such as Purify. Such tools detect pretty much anything nasty that your code is doing, but they also run VERY slowly. They detect out-of-bounds memory accesses, accesses to freed memory, freeing the same block twice, mismatched allocators/deallocators, etc. These are all conditions that can stay latent for a very long time and only crash at the most inopportune moment.
Check out the answers to this related question.
The answer I suggested provides a technique which may be able to get you back to the code that is actually causing the heap corruption. My answer describes the technique using gdb, but I'm sure you can do something similar on Windows. The principle at least should be the same.
Have you thought about isolating the source of the corruption using gflags? Once you have a dump (or a debugger break in WinDbg), you can see more precisely where the corruption is caused.
Here are some gflags examples: http://blogs.msdn.com/b/webdav_101/archive/2010/06/22/detecting-heap-corruption-using-gflags-and-dumps.aspx
Cheers, Seb
Common scenarios include:
char *stuff = new char[10]; stuff[10] = 3;
[EDIT] From the comments, a few more:
The most difficult memory corruption bug I've run into involved (1) calling a function in a DLL that returned a std::vector and then (2) letting that std::vector fall out of scope (which is basically the whole point of std::vector). Unfortunately it turned out that the DLL was linked to one version of the C++ runtime, and the program was linked to another; which meant that the library was calling one version of new[] and I was calling a completely different version of delete[].
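One common way to avoid this kind of cross-runtime mismatch is to make sure memory is always released by the module that allocated it. The sketch below simulates that in a single file: the "library" exposes both a create and a destroy function, and the caller wraps the result in a `std::unique_ptr` with a custom deleter. All names (`lib_create_values`, `lib_destroy_values`, `LibValues`) are hypothetical, and the export declarations a real DLL would need are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// --- Would live inside the DLL (hypothetical API) ---
static int g_live_buffers = 0; // bookkeeping for the demo only

int* lib_create_values(std::size_t count) {
    assert(count > 0);
    ++g_live_buffers;
    return new int[count](); // allocated with the library's runtime
}

void lib_destroy_values(int* values) {
    if (values != nullptr) {
        --g_live_buffers;
        delete[] values;     // freed with the same runtime that allocated it
    }
}

// --- Caller side ---
// The deleter forwards to the library instead of calling delete[] locally.
struct LibDeleter {
    void operator()(int* p) const { lib_destroy_values(p); }
};
using LibValues = std::unique_ptr<int[], LibDeleter>;

LibValues make_values(std::size_t count) {
    return LibValues(lib_create_values(count));
}
```

When a `LibValues` goes out of scope, the buffer goes back through `lib_destroy_values`, so the caller's own `delete[]` is never involved regardless of which runtime the caller links against.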
That is not what's happening here, because that failed every time and according to one of your comments "the bug manifests itself by a crash one in a millionth time." I would guess that there's an if statement that gets taken once in a million times and it causes a double delete bug.
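As an illustration of that guess, here is a sketch of a rarely taken branch that cleans up early, followed by the normal cleanup path. With a raw pointer and `delete` in both places, this would be a double delete; with a `std::unique_ptr`, the second release is a no-op. `Session`, `handle` and the counting deleter are invented for the example and are not taken from the code in question.

```cpp
#include <cassert>
#include <memory>

struct Session { int id; }; // hypothetical resource

static int g_destroyed = 0; // counts real deletions, for the demo only

struct CountingDeleter {
    void operator()(Session* s) const {
        ++g_destroyed;
        delete s;
    }
};
using SessionPtr = std::unique_ptr<Session, CountingDeleter>;

// rare_error stands in for the branch taken "once in a million times".
void handle(SessionPtr& session, bool rare_error) {
    if (rare_error) {
        session.reset(); // early cleanup on the rare path; pointer becomes null
    }
    // Normal cleanup path. With a raw pointer, a second `delete` here
    // would corrupt the heap whenever the rare branch had already run.
    session.reset();     // no-op if the session was already released
    assert(!session);
}
```

Calling `handle` with `rare_error` set to true still destroys the `Session` exactly once, which is the property a raw-pointer version with two `delete` statements would lack.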
I recently used evaluation versions of two products that may help you: IBM's Rational Purify and Intel Parallel Inspector. I'm sure there are others (Insure++ is mentioned a lot). On Linux you would use Valgrind.
If you have access to a *nix machine, you can use Valgrind.