Let's say you have a bug that was found in functional testing of a fairly complex part of the software. It could stem from bad/unexpected data in the database, middle-tier code, and so on.
In general, I start with a subset of hypotheses that I consider the most likely culprits, sort that subset by how easy each hypothesis is to disprove, and start with the easiest.
Regardless of the order, the important thing is what you do with your hypothesis. Start by trying to disprove each hypothesis rather than verify it, and you'll cover more ground (see Psychology of Intelligence Analysis by Richards J. Heuer, Jr., available as a free PDF).
I'm with @moonshadow, but I'll add that to some degree it depends on what the failure is. That is, some sorts of failure have fairly well-known causes, and I'd start with the known causes first.
For example, on Windows systems "Access Violation" errors are almost always due to the attempt to use or look at (access) unallocated memory. To find the source of such an error, it's a good idea to look at all the places where memory is (or isn't) allocated.
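By way of illustration, here's a contrived C++ snippet (the struct and values are invented) showing the classic use-after-free that tends to produce exactly that kind of access violation:

    #include <cstdlib>

    struct Packet { int id; };

    int main() {
        Packet* p = static_cast<Packet*>(std::malloc(sizeof(Packet)));
        p->id = 42;
        std::free(p);   // the memory is handed back here...
        return p->id;   // ...but accessed afterwards: a textbook access violation
    }

In practice a read like that may only fault under a debug heap or a page-protection tool, which is part of what makes these bugs slippery.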
If it's known that the "problem" is due to bad data, then the fix may require changes to data validation or data acquisition, even once the error has been traced and analyzed.
One more point: while thinking through the bug, it's often well worth the effort to try to create a small program that reproduces it.
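For instance, a distilled repro might be only a few lines. Everything below is hypothetical, but it shows the shape of the exercise:

    #include <iostream>
    #include <string>

    // Hypothetical distilled repro: suppose the full application crashed
    // whenever a form field came back empty; these few lines trigger the
    // same failure in isolation.
    char firstChar(const std::string& s) {
        return s.at(0);   // throws std::out_of_range when s is empty
    }

    int main() {
        std::cout << firstChar("") << '\n';   // reproduces the crash deterministically
    }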
In my experience, it's probably best to go with gut feel (1) for 30 minutes or so.
If nothing comes out of that, talk to someone else about it.
It's quite amazing how much talking to someone else (even if they're non-technical) can help.
Rinse, lather, repeat until the initial cause of the problem is found. Tedious and mechanical, but it will get you there.
Except... occasionally the tests in an iteration of step 3 don't fail. Most commonly that's because some unrelated system is corrupting memory and producing the invalid state, or because the problem depends on threading, timing, or uninitialised data, and introducing the tests alters timings and/or data layout enough to change or hide the symptoms. At this point, for this poster at least, the process becomes a more intuitive one, alternating between replacing sanity tests with less intrusive forms and selectively disabling modules to rule out sources of corruption.
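To make "less intrusive forms" concrete, here is a hedged C++ sketch (the list invariant is invented): a full-structure sanity check that perturbs timing, next to a spot-check that barely does:

    #include <cassert>

    struct Node { Node* next; int value; };

    // Intrusive sanity test: walks the entire list, noticeably altering
    // timings (and possibly cache behaviour) every time it runs.
    bool listIsValid(const Node* head) {
        for (const Node* n = head; n != nullptr; n = n->next)
            if (n->value < 0) return false;   // invented invariant: values stay non-negative
        return true;
    }

    // Less intrusive form: spot-check only the node being touched, so the
    // check barely perturbs timing or data layout.
    inline void checkNode(const Node* n) {
        assert(n == nullptr || n->value >= 0);
    }

    int main() {
        Node tail{nullptr, 7}, head{&tail, 3};
        assert(listIsValid(&head));   // heavyweight check, used sparingly
        checkNode(&tail);             // lightweight check, safe to leave in hot paths
    }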
My first step in a situation like that is usually to check things in the order that will most quickly reduce the number of things left to check. You could almost think of it as doing a binary search for the bug: "Well, the POST parameters look right, so I can rule out everything before the form submission," etc.
That said, if I have a strong feeling that the problem might be in a particular place, I'll check that first.
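As a toy illustration of that binary search, here's a self-contained C++ sketch (the pipeline and values are made up): probing the middle of the pipeline rules out half of it in one check:

    #include <cassert>
    #include <string>

    // Hypothetical three-stage pipeline, invented for illustration.
    std::string readForm()                  { return "qty=3"; }
    int parse(const std::string& s)         { return std::stoi(s.substr(s.find('=') + 1)); }
    int compute(int qty)                    { return qty * 10; }

    int main() {
        std::string raw = readForm();
        int qty = parse(raw);
        // Probe the midpoint first: if qty is already wrong, everything after
        // parse() is ruled out; if it's right, everything before it is ruled out.
        assert(qty == 3);
        return compute(qty) == 30 ? 0 : 1;
    }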
I normally do this:
1) Add new functional test case(s) to the automated regression test system. I normally start a software project with its own regression test system (there's a minimal sketch of the idea after this list).
The goal of all this work is to make sure that once a bug is found, it never shows up in checked-in code or the production system again. It also makes it easier to reproduce random and long-term problems.
Don't check in any code unless it has gone through an overnight automated regression test.
I typically write a 1:1 ratio of product code to testing code: 20K lines of TCL/Expect for 20K lines of C++ code (as of 5 years ago).
I don't want the QA team to do automated testing with my test system, since all my checked-in code already has to pass those tests. I usually run a two-week long-term regression test before I give the code to the QA team.
Having the QA team run manual test cases also makes sure my program has enough built-in diagnostic info to capture any future bugs. The goal is to have enough diagnostic info to solve 95% of bugs in under 2 hours. I was able to do that in my last project (video network equipment at RBG Networks).
2) Add diagnostic routines (web-based nowadays) to get at all the internal information (current state, logs, etc.). More than 50% of my code (C/C++ especially) is diagnostic code; the sketch after this list shows it wired into the test harness.
3) Add more detailed logging for the trouble areas I don't understand.
4) Analyze the info.
5) Try to fix the bug.
6) Run overnight / over-the-weekend regression tests. When I was in R&D, I typically asked for at least 5-10 test systems to run regression tests continuously, 24x7. That normally helps identify and solve memory, resource, and long-term performance problems before the code hits SQA.
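None of this is tied to a particular framework. A miniature C++ sketch of the regression-harness-plus-diagnostics idea from steps 1 and 2 might look like this (all names are invented; the real systems described above were TCL/Expect driven):

    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    struct TestCase {
        std::string name;              // e.g. the bug ticket it guards against
        std::function<bool()> run;
    };

    static std::vector<TestCase>& suite() {
        static std::vector<TestCase> s;
        return s;
    }

    // Step 1: every fixed bug leaves behind a test case that reproduces it.
    bool bug1234_emptyInputNoLongerCrashes() {
        // ... call into the product code with the once-fatal input ...
        return true;   // placeholder: a real test asserts on product behaviour
    }

    // Step 2: a diagnostic routine that dumps internal state on demand
    // (served over HTTP in a real system; printed here for brevity).
    void dumpDiagnostics() {
        std::cout << "uptime=...; queueDepth=...; lastError=...\n";
    }

    int main() {
        suite().push_back({"bug-1234 empty input", bug1234_emptyInputNoLongerCrashes});
        int failures = 0;
        for (const auto& t : suite()) {
            bool ok = t.run();
            std::cout << (ok ? "PASS " : "FAIL ") << t.name << '\n';
            if (!ok) { ++failures; dumpDiagnostics(); }   // capture state at failure
        }
        return failures;   // nonzero exit blocks the checkin
    }

Wired into the build, the nonzero exit from the harness is what enforces the "don't check in unless it passes" rule.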
Once, an embedded system failed to boot to the Linux prompt from time to time. I added a test case that power-cycled the system with a programmable outlet over and over again, checking that it could "see" the command prompt, and ran the test overnight. We were able to quickly identify the FPGA code problem and make sure the system stayed up through 5000 power cycles. The test case was kept, and every time new Verilog code was checked in and the FPGA image was built, it was run. It was never an issue again.
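A sketch of that power-cycle soak test, with the hardware hooks stubbed out (in the real setup they drove the programmable outlet and read the unit's serial console):

    #include <iostream>
    #include <string>

    // Hypothetical stand-ins for the real hardware hooks.
    void powerCycle() { /* toggle the outlet off, wait, toggle it back on */ }
    std::string readConsole() { return "... login: "; /* captured boot output */ }

    int main() {
        const int kCycles = 5000;
        for (int i = 1; i <= kCycles; ++i) {
            powerCycle();
            // The pass condition is simply "did the Linux prompt appear?"
            if (readConsole().find("login:") == std::string::npos) {
                std::cerr << "FAIL: no prompt after power cycle " << i << '\n';
                return 1;
            }
        }
        std::cout << "PASS: prompt seen after all " << kCycles << " cycles\n";
    }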