We have a bug in our application that does not occur every time and therefore we don\'t know its \"logic\". I don\'t even get it reproduced in 100 times today.
Disclaime
It's quite common for programmers not to be able to reiterate a user-experienced crash simply because you have developed a certain workflow and habits in using the application that obviously goes around the bug.
At this frequency of 1/100, I'd say that the first thing to do is to handle exceptions and log anything anywhere or you could be spending another week hunting this bug. Also make a priority list of potentially sensitive articulations and features in your project. For example : 1 - Multithreading 2 - Wild pointers/ loose arrays 3 - Reliance on input devices etc. This will help you segment areas that you can brute-force-until-break-again as suggested by other posters.
This has worked for really weird heisenbugs. (I'd also recommend getting a copy of "Debugging" by Dave Argans, these ideas are partly derived form using his ideas!)
(0) Check the ram of the system using something like Memtest86!
The whole system exhibits the problem, so make a test jig that exercises the whole thing. Say it's a server side thing with a GUI, you run the whole thing with a GUI test framework doing the necessary input to provoke the problem.
It doesn't fail 100% of the time, so you have to make it fail more often.
Start by cutting the system in half ( binary chop) worse case, you have to remove sub-systems one at a time. stub them out if they can't be commented out.
See if it still fails. Does it fail more often ?
Keep proper test records, and only change one variable at a time!
Worst case you use the jig and you test for weeks to get meaningful statistics. This is HARD; but remember, the jig is doing the work.
I've got No threads and only one process, and I don't talk to hardware
If the system has no threads, no communicating processes and contacts no hardware; it's tricky; heisenbugs are generally synchronization, but in the no-thread no processes case it's more likely to be uninitialized data, or data used after being released, either on the heap or the stack. Try to use a checker like valgrind.
For threaded/multi-process problems:
Try running it on a different number of CPU's. If it's running on 1, try on 4! Try forcing a 4-computer system onto 1. It'll mostly ensure things happen one at a time.
If there are threads or communicating processes this can shake out bugs.
If this is not helping but you suspect it's synchronization or threading, try changing the OS time-slice size. Make it as fine as your OS vendor allows! Sometimes this has made race conditions happen almost every time!
Obversely, try going slower on the timeslices.
Then you set the test jig running with debugger(s) attached all over the place and wait for the test jig to stop on a fault.
If all else fails, put the hardware in the freezer and run it there. The timing of everything will be shifted.
Along with a lot of patience, a quiet prayer & cursing you would need:
HTH.
Often these kind of bugs are related to corrupted memory and for that reason they might not appear very often. You should try to run your software with some kind of memory profiler e.g., valgrind, to see if something goes wrong.
The team that I work with has enlisted the users in recording their time they spend in our app with CamStudio when we've got a pesky bug to track down. It's easy to install and for them to use, and makes reproducing those nagging bugs much easier, since you can watch what the users are doing. It also has no relationship to the language you're working in, since it's just recording the windows desktop.
However, this route seems to be viable only if you're developing corporate apps and have good relationships with your users.
Debugging is hard and time consuming especially if you are unable to deterministically reproduce the problem. My advice to you is to find out the steps to reproduce it deterministically (not just sometimes).
There has been a lot of research in the field of failure reproduction in the past years and is still very active. Record&Replay techniques have been (so far) the research direction of most researchers. This is what you need to do:
1) Analyze the source code and determine what are the sources of non-determinism in the application, that is, what are the aspects that may take your application through different execution paths (e.g. user input, OS signals)
2) Log them in the next time you execute the application
3) When your application fails again, you have the steps-to-reproduce the failure in your log.
If your log still does not reproduce the failure, then you are dealing with a concurrency bug. In that case, you should take a look at how your application accesses shared variables. Do not attempt to record the accesses to shared variables, because you would be logging too much data, thereby causing severe slowdowns and large logs. Unfortunately, there is not much I can say that would help you to reproduce concurrency bugs, because research still has a long way to go in this subject. The best I can do is to provide a reference to the most recent advance (so far) in the topic of deterministic replay of concurrency bugs:
http://www.gsd.inesc-id.pt/~nmachado/software/Symbiosis_Tutorial.html
Best regards