Bug Hunting Strategies?

北恋 2021-02-08 12:00

Let's say you have a bug that was found in functional testing of a fairly complex part of the software. It could stem from bad/unexpected data in the database, middle-tier cod

14 Answers
  • 2021-02-08 12:49

    In general, I start with the subset of hypotheses that I consider the most likely culprits, sort that subset by how easy each hypothesis is to disprove, and start with the easiest.

    Regardless of the order, the important thing is what you do with your hypothesis. Start trying to disprove each hypothesis rather than to verify it and you'll cover more ground (see Psychology of Intelligence Analysis by Richards J. Heuer, Jr., free PDF).
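
    As a rough illustration of that ordering, here is a minimal Python sketch. The `Hypothesis` class, the `triage` function, the likelihood scale, the cost unit, and the sample hypotheses are all made up for the example; in practice these are gut estimates, not measured values.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    likelihood: float       # rough guess on a 0..1 scale (assumed)
    cost_to_disprove: int   # rough effort in minutes (assumed)


def triage(hypotheses, top_n=3):
    """Keep the top_n most likely hypotheses, cheapest to disprove first."""
    most_likely = sorted(hypotheses, key=lambda h: h.likelihood, reverse=True)[:top_n]
    return sorted(most_likely, key=lambda h: h.cost_to_disprove)


# Hypothetical candidates for a bug that could live in any tier.
candidates = [
    Hypothesis("bad row in the database", 0.5, 10),
    Hypothesis("middle-tier validation bug", 0.3, 45),
    Hypothesis("front end sends the wrong field", 0.15, 5),
    Hypothesis("compiler bug", 0.01, 240),
]

plan = triage(candidates)
for h in plan:
    print(h.description)
```

    The unlikely "compiler bug" never makes the short list, and the quickest check among the likely ones is attempted first.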

  • 2021-02-08 12:52

    I'm with @moonshadow, but I'll add that to some degree it depends on what the failure is. That is, some sorts of failure have fairly well-known causes, and I'd start with the known cause.

    For example, on Windows systems "Access Violation" errors are almost always due to the attempt to use or look at (access) unallocated memory. To find the source of such an error, it's a good idea to look at all the places where memory is (or isn't) allocated.

    If it's known that the "problem" is due to bad data, then the fix may require changes to data validation or acquisition, even once the error is traced to analysis.

    One more point: while thinking through the bug, it's often well worth the effort to try to write a small program that reproduces it.

  • 2021-02-08 12:52

    In my experience, it's probably best to go with gut feel (1) for 30 minutes or so.

    If nothing comes out of that, talk to someone else about it.

    It's quite amazing how much talking to someone else (even if they're non-technical) can help.

  • 2021-02-08 12:53
    1. Reproduce the bug in a debug environment.
    2. Examine system state at the point the bug occurs to find the inconsistent / incorrect / unexpected elements of state that directly, visibly led to the bug occurring. Often, just eyeballing the code and call stack will immediately tell you what the problem is.
    3. Add tests to all points where this state can be created / mutated within the normal flow of control.
    4. Treating failures of these tests as a new bug, return to step two.

    Rinse, lather, repeat until the initial cause of the problem is found. Tedious and mechanical, but it will get you there.

    Except... occasionally the tests in an iteration of step 3 don't fail; most commonly, because some unrelated system corrupting memory is what is leading to the invalid state, or because the problem is threading/timing/uninitialised data dependent and introducing the tests alters timings and/or data layout sufficiently to alter or hide the symptoms. At this point, for this poster at least, the process becomes a more intuitive one, alternating between replacing sanity tests with less intrusive forms and selectively disabling modules to rule out sources of corruption.
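
    Step 3 above might look something like the following Python sketch. The `Account` class and its "balance must not go negative" invariant are invented for illustration; the point is only that the same sanity check runs at every place the suspect state can be created or mutated, so the first failure points at the mutation that broke it.

```python
class Account:
    """Hypothetical piece of state that was found invalid when the bug fired."""

    def __init__(self, balance=0):
        self.balance = balance
        self._check("__init__")

    def _check(self, where):
        # The invariant that was seen to be violated at the point of failure.
        if self.balance < 0:
            raise AssertionError(f"negative balance {self.balance} after {where}")

    def deposit(self, amount):
        self.balance += amount
        self._check("deposit")

    def withdraw(self, amount):
        self.balance -= amount
        self._check("withdraw")


acct = Account(10)
acct.deposit(5)
try:
    acct.withdraw(20)
    failure = None
except AssertionError as exc:
    failure = str(exc)   # names the mutation that produced the bad state
```

    Treating that failure as the new bug (step 4) means asking why `withdraw` was called with more than the balance, and repeating the process one level up.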

  • 2021-02-08 12:55

    My first step in a situation like that is usually to check things in the order that will most quickly reduce the number of things left to check. You could almost think of it as doing a binary search for the bug: "Well, the POST parameters look right, so I can rule out everything before the form submission," etc.

    That said, if I have a strong feeling that the problem might be in a particular place, I'll check that first.
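
    The binary-search idea can be sketched in Python as below. The stage names and the `looks_right` check are placeholders for whatever you can actually inspect (POST parameters, a log line, a query result). The sketch assumes the last stage is known to be wrong and that once the data goes wrong it stays wrong downstream.

```python
def first_bad_stage(stages, looks_right):
    """Return the index of the first stage whose output looks wrong.

    Assumes the final stage is known to be bad and that a fault,
    once introduced, is visible at every later stage.
    """
    lo, hi = 0, len(stages) - 1      # the answer is always within [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if looks_right(stages[mid]):
            lo = mid + 1             # everything up to mid is ruled out
        else:
            hi = mid                 # the fault is at mid or earlier
    return lo


# Hypothetical request path, in the order the data flows through it.
stages = [
    "form submission",
    "POST parameters",
    "controller",
    "service layer",
    "SQL query",
    "rendered page",
]
known_good = {"form submission", "POST parameters", "controller"}
checked = []


def looks_right(stage):
    checked.append(stage)            # record which stages we had to inspect
    return stage in known_good


culprit = stages[first_bad_stage(stages, looks_right)]
```

    Here three inspections are enough to narrow six stages down to one; a strong hunch simply replaces the first midpoint with the place you suspect.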

  • 2021-02-08 12:57

    I normally do this:

    1) Add new functional test case(s) to the automated regression test system. I normally start a software project with my own regression test system. Over the years that has been:

    • Excel VBA + a C library to control a SCSI/IDE interface/device (13 years ago). The test report was an Excel spreadsheet.
    • TCL Expect for complex network router system testing (6 years ago). The test report was a web page.
    • Today I use Python/Expect. The test report is XML plus a Python-based XML analyzer.

    The goal of all this work is to make sure that once a bug is found, it never shows up in checked-in code or the production system again. It also makes random and long-term problems easier to reproduce.

    Don't check in any code unless it has gone through an overnight automated regression test.

    I typically keep a 1:1 ratio between product code and test code: 20K lines of TCL Expect for 20K lines of C++ code (5 years ago). For example:

    • The C code would implement a proxy that sets up tunnels and forwards TCP connections.
    • TCL test cases: (a) Set up the connections and make sure the data is passed through. (b) Set up the connections with different network elements. (c) Do that 10, 100, 1000 times and check for memory leaks, system resource issues, etc.
    • Do this for every feature in the system, and one can see why the ratio of test program to code is 1:1.
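
    Test case (c) can be approximated in Python as below. This is not the TCL Expect harness described above: it swaps in the standard-library `tracemalloc` module, which only sees allocations made by Python code, and `leaky_connect` / `clean_connect` are stand-ins for "set up a connection" against a real proxy.

```python
import tracemalloc


def memory_growth(operation, iterations):
    """Run operation repeatedly; return net traced-memory growth in bytes."""
    tracemalloc.start()
    try:
        operation()                               # warm-up: ignore one-time setup cost
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            operation()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


_never_released = []


def leaky_connect():
    # Stand-in for a connection whose 1 KiB buffer is never freed.
    _never_released.append(bytearray(1024))


def clean_connect():
    # Stand-in for a connection that releases its buffer on return.
    buf = bytearray(1024)
    buf[0] = 1


for n in (10, 100, 1000):
    print(n, memory_growth(clean_connect, n), memory_growth(leaky_connect, n))
```

    Growth that scales with the iteration count is the signal of a leak; growth that stays flat as the count goes from 10 to 1000 is not.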

    I don't want the QA team to do automated testing with my test system, since all my checked-in code already has to pass those tests. I usually run a two-week long-term regression test before I give the code to the QA team.

    The QA team running manual test cases also makes sure my program has enough built-in diagnostic info to capture any future bugs. The goal is to have enough diagnostic info to solve 95% of bugs in < 2 hours. I was able to do that in my last project. (Video network equipment at RBG Networks.)

    2) Add a diagnostic routine (web-based nowadays) to get at all the internal information (current state, logs, etc.). More than 50% of my code (C/C++ especially) is diagnostic code.

    3) Add more detailed logging for trouble areas that I don't understand.

    4) Analyze the info.

    5) Try to fix the bug.

    6) Run overnight / over-the-weekend regression tests. When I was in R&D, I typically asked for at least 5-10 test systems to run regression tests continuously, 24x7. That normally helps identify and solve memory, resource, and long-term performance problems before the code hits SQA.

    Once, an embedded system intermittently failed to boot to the Linux prompt. I added a test case that power-cycles the system with a programmable outlet over and over, checks that it can "see" the command prompt, and then ran that test overnight. We were able to quickly identify the FPGA code problem and make sure the system always came up after 5000 power cycles. The test case was added to the suite and run every time new Verilog code was checked in or new FPGA code was built. It was never an issue again.
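
    The shape of such a power-cycle test is roughly the loop below. This is a sketch only: `power_cycle` and `saw_prompt` are hypothetical callables standing in for the programmable outlet and the console check, and `FakeDevice` exists purely so the loop can be run without hardware.

```python
def power_cycle_test(power_cycle, saw_prompt, cycles=5000):
    """Power-cycle the device `cycles` times; return the cycles that failed to boot."""
    failures = []
    for n in range(1, cycles + 1):
        power_cycle()            # hypothetical: toggles the programmable outlet
        if not saw_prompt():     # hypothetical: waits for the prompt on the console
            failures.append(n)
    return failures


class FakeDevice:
    """Simulated board that fails to reach the prompt on every Nth boot."""

    def __init__(self, fail_every):
        self.boots = 0
        self.fail_every = fail_every

    def power_cycle(self):
        self.boots += 1

    def saw_prompt(self):
        return self.boots % self.fail_every != 0


device = FakeDevice(fail_every=1000)
failures = power_cycle_test(device.power_cycle, device.saw_prompt, cycles=5000)
```

    Recording which cycles failed, rather than stopping at the first one, gives a failure rate to compare before and after a fix.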
