I work with safety critical real-time systems and logging is often the only way to catch rare bugs that only turn up every 53rd tuesday when it's a full moon, if you catch my drift. This kind of makes you obsessive about the subject, so I'll apologise now if I start to froth at the mouth.
I design systems which are capable of logging pretty much everything, but I don't turn everything on by default. The debug information is sent to a hidden debug dialog which timestamps it and outputs it to a listbox (limited to around 500 lines before deletion), and the dialog allows me to stop it, save it to a log file automatically, or divert it to an attached debugger such as DBWin32. That diversion allows me to see the debug output from multiple applications all neatly serialized, which can be a life saver sometimes. The log files are automatically purged every N days. I used to use numeric logging levels (the higher you set the level, the more you capture):
- off
- errors only
- basic
- detailed
- everything
but this is too inflexible - as you work your way towards a bug it's much more efficient to be able to focus logging in on exactly what you need without having to wade through tons of detritus, and it may be one particular kind of transaction or operation that causes the error. If that requires you to turn everything on, you're just making your own job harder. You need something finer-grained.
So now I'm in the process of switching to logging based on a flag system. Everything that gets logged has a flag detailing what kind of operation it is, and there's a set of checkboxes allowing me to define what gets logged. Typically that list looks like this:
#define DEBUG_ERROR 1
#define DEBUG_BASIC 2
#define DEBUG_DETAIL 4
#define DEBUG_MSG_BASIC 8
#define DEBUG_MSG_POLL 16
#define DEBUG_MSG_STATUS 32
#define DEBUG_METRICS 64
#define DEBUG_EXCEPTION 128
#define DEBUG_STATE_CHANGE 256
#define DEBUG_DB_READ 512
#define DEBUG_DB_WRITE 1024
#define DEBUG_SQL_TEXT 2048
#define DEBUG_MSG_CONTENTS 4096
This logging system ships with the release build, turned on and saving to file by default. It's too late to find out you should have been logging AFTER the bug has occurred, if that bug only occurs once every six months on average and you have no way of reproducing it.
The software typically ships with ERROR, BASIC, STATE_CHANGE and EXCEPTION turned on, but this can be changed in the field via the debug dialog (or a registry/ini/cfg setting, where these things get saved).
Oh and one thing - my debug system generates one file per day. Your requirements may be different. But make sure your debug code starts every file with the date, version of the code you're running, and if possible some marker for the customer ID, location of the system or whatever. You can get a mish-mash of log files coming in from the field, and you need some record of what came from where and what version of the system they were running that's actually in the data itself, and you can't trust the customer/field engineer to tell you what version they've got - they may just tell you what version they THINK they've got. Worse, they may report the exe version that's on the disk, but the old version is still running because they forgot to reboot after replacing. Have your code tell you itself.
That's my brain dumped...