Most advice concerning error handling boils down to a handful of tips and tricks (see this post for example). These hints are helpful, but I think they don't answer all questions.
I'm changing my design and coding philosophy so that:
Hopefully, with this technique, the issues that get propagated to the User will be very important; otherwise the program tries to resolve them.
I'm currently experiencing issues that get lost in the return codes, or that lead to new return codes being created.
The book "Framework Design Guidelines: Conventions, Idioms, and Patterns for Reusable .NET Libraries" by Krzysztof Cwalina and Brad Abrams has some good suggestions on this. See chapter 7 on Exceptions. For example, it favours throwing exceptions over returning error codes.
-Krip
Error handling is not accompanied by a formal theory. It is too implementation-specific a topic to be considered a field of science (to be fair, there is a great debate about whether programming is a science in its own right).
Nonetheless, it is a good part of a developer's work (and thus of his or her life), so practical approaches and technical guidelines have been developed on the topic.
A good view on the topic is presented by A. Alexandrescu in his talk "Systematic Error Handling in C++".
I have a repository on GitHub where the techniques presented are implemented.
Basically, what A.A. does is implement a class
    template<class T>
    class Expected { /* Implementation in the GitHub link */ };
that is meant to be used as a return value. This class holds either a return value of type T or an exception (pointer). The exception can be thrown either explicitly or upon request, yet the rich error information is always available. An example usage would be like this:
    Expected<int> foo();  // returns either an int or the captured exception
    // ....
    Expected<int> ret = foo();
    if (ret.valid()) {
        // do the work
    }
    else {
        // either use the info of the exception
        // or throw the exception (e.g. in an exception "friendly" codebase)
    }
While building this framework for error handling, A.A. walks us through techniques and designs that produce successful or poor error handling, and shows what works and what does not. He also gives his definitions of 'error' and 'error handling'.
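To make the idea concrete, here is a minimal sketch of such a class. It is not the implementation from the talk or from the linked repository: it assumes C++17 and leans on std::variant, and the names fromException, get, parsePort and portOrDefault are chosen purely for illustration. Only valid() comes from the usage snippet above.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

// Simplified sketch: holds either a T or a captured exception.
template <class T>
class Expected {
public:
    // Success case: store the value.
    Expected(T value) : state_(std::move(value)) {}

    // Failure case: capture an exception without throwing it yet.
    template <class E>
    static Expected fromException(const E& e) {
        return Expected(std::make_exception_ptr(e));
    }

    bool valid() const { return std::holds_alternative<T>(state_); }

    // Returns the value, or rethrows the stored exception on request.
    const T& get() const {
        if (!valid()) std::rethrow_exception(std::get<std::exception_ptr>(state_));
        return std::get<T>(state_);
    }

private:
    explicit Expected(std::exception_ptr e) : state_(std::move(e)) {}
    std::variant<T, std::exception_ptr> state_;
};

// Hypothetical function: returns the port, or the reason it could not be parsed.
Expected<int> parsePort(const std::string& text) {
    try {
        int port = std::stoi(text);
        if (port < 1 || port > 65535)
            return Expected<int>::fromException(std::out_of_range("port out of range: " + text));
        return port;
    } catch (const std::exception&) {
        return Expected<int>::fromException(std::invalid_argument("not a number: " + text));
    }
}

// Caller that handles the error locally instead of letting it throw.
int portOrDefault(const std::string& text, int fallback) {
    Expected<int> ret = parsePort(text);
    if (ret.valid()) return ret.get();
    return fallback;  // exception-friendly code could call ret.get() here to rethrow
}
```

The point of the design is that the caller decides: check valid() and carry on, or call get() and let the original exception, with all its information, propagate.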
How to decide if an error should be handled locally or propagated to higher level code?
Error handling should be done at the highest affected level. If it only impacts the lower level code, then it should be handled there. If the error affects higher level code, then the error needs to be handled at the higher level. This is to prevent some higher level code from going on its merry way after an error has caused its actions to be incorrect. It should know what is going on, provided it is impacted.
How to decide between logging an error, or showing it as an error message to the user?
You should always log the error. You should show the error to the user when they are affected by it. If it is something they will never notice and that has no direct impact on them, they should not be notified (e.g. two sockets failing to open before a third finally opens, resulting in a very short delay for the user, should not be reported).
Is logging something that should only be done in application code? Or is it OK to do some logging from library code?
Too much logging is rarely a bad thing. You will regret not logging things when you have to hunt down a library bug more than you will be frustrated by extra logs when hunting down other bugs.
In case of exceptions, where should you generally catch them? In low-level or higher level code?
Similar to error handling above, it should be caught where the impact is, and where the error can be corrected/handled effectively. This will vary from case to case.
Should you strive for a unified error handling strategy through all layers of code, or try to develop a system that can adapt itself to a variety of error handling strategies (in order to be able to deal with errors from 3rd party libraries)?
This is largely a personal decision. My internal error handling is very different from the error handling I use for anything that touches a third party library. I have a general idea of what to expect from my code, but the third party stuff could have anything happen to it.
Does it make sense to create a list of error codes? Or is that old fashioned these days?
It depends on how many errors you expect to be thrown. You might love your list of error codes if you spend a lot of time bug hunting, as they can help point you in the right direction. However, any time spent building these is less time spent on coding/bug fixing, so it's a mixed bag. This largely comes down to personal preference.
To understand what needs to be done for error handling, I think one needs a clear understanding of the types of errors one encounters, and of the contexts in which one encounters them.
To me, it has been extremely useful to consider the two major types of errors as:
Errors that should never happen, and are typically due to a bug in the code.
Errors which are expected and cannot be prevented in normal operation, such as a database connection going down because of a database issue over which the application has no control.
The way an error should be handled depends heavily on which type of error it is.
The differing contexts which affect how errors should be handled are:
Application code
Library code
The handling of errors in library code differs somewhat from the handling in application code.
A philosophy for handling of the two major types of errors is discussed below. The special considerations for library code are also addressed. Finally, the specific practical questions in the original post are addressed in the context of the philosophy presented.
Many errors are the result of programming mistakes. These errors typically cannot be corrected, since the specific programming mistake cannot be anticipated. That means we can't know in advance what condition the mistake leaves the application in, so we can't recover from that condition and shouldn't try.
Ultimately, the fix to this kind of error is to fix the programming mistake. To facilitate that, the error should be surfaced as quickly as possible. Ideally, the program should exit immediately after identifying such an error and providing the relevant information. A quick and obvious exit reduces the time required to complete the debug and retest cycle, permitting more bugs to be fixed in the same amount of testing time; that in turn results in having a more robust application with fewer bugs when it comes time to deploy.
The other major objective in handling this type of error should be to provide sufficient information to make it easy to identify the bug. In Java, for example, throwing a RuntimeException often provides sufficient information in the stack trace to identify the bug immediately; in clean code, immediate fixes can often be identified just from examining the stack trace. In other languages, one might log the call stack or otherwise preserve the necessary information. It is critical not to suppress information in the interests of brevity; don't worry about how much log space you are taking up when this type of error occurs. The more information that is provided, the quicker the bugs can be fixed, and the fewer bugs will remain to pollute the logs when the application makes it to production.
Now, in some server applications, it's important that the server be sufficiently fault tolerant to continue operation even in the face of occasional programming errors. In this case, the best approach is to have a very clear separation between the server code that must continue operation and the task processing code that can be allowed to fail. For example, tasks can be relegated to threads or subprocesses, as is done in many web servers.
In such a server architecture, the thread or subprocess handling the task can then be treated like an application which can fail. All the considerations above apply to such a task: the error should be surfaced as quickly as possible by a clean exit from the task, and sufficient information should be logged to permit the bug to be easily found and fixed. When such a task exits in Java, for example, the entire stack trace of any RuntimeException causing the exit should normally be logged.
As much of the code as possible should be executed within the threads or processes handling the task, rather than in the main server thread or process. This is because any bug in the main server thread or process will still cause the entire server to go down. It's better to push the code - with the bugs it contains - into the task handling code where it won't cause a server crash when the bug manifests itself.
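As a rough C++ illustration of that separation, the sketch below puts a catch-all boundary around each task so that a bug ends the task but not the serving loop. It is a simplification under stated assumptions: tasks run one after another in the same thread, and TaskReport, runIsolated and serve are hypothetical names. A real server would run the tasks in worker threads or subprocesses, and only a subprocess protects against crashes that are not exceptions. The boundary matters even with threads, since an exception escaping the function of a std::thread calls std::terminate.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Outcome of one task, kept so the server can log or report it.
struct TaskReport {
    std::string name;
    bool ok;
    std::string detail;
};

// Task boundary: whatever the task throws ends the task, never the caller.
TaskReport runIsolated(const std::string& name, const std::function<void()>& task) {
    try {
        task();
        return {name, true, ""};
    } catch (const std::exception& e) {
        return {name, false, e.what()};
    } catch (...) {
        return {name, false, "unknown exception"};
    }
}

using NamedTask = std::pair<std::string, std::function<void()>>;

// Minimal "server loop": keeps this part small so most bugs live inside tasks.
std::vector<TaskReport> serve(const std::vector<NamedTask>& tasks) {
    std::vector<TaskReport> reports;
    for (const auto& entry : tasks) {
        TaskReport report = runIsolated(entry.first, entry.second);
        if (!report.ok)
            std::cerr << "task " << report.name << " failed: " << report.detail << '\n';
        reports.push_back(std::move(report));
    }
    return reports;
}
```

Note how little code sits outside runIsolated; that is the property the paragraph above argues for.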
Errors that are expected and cannot be prevented in normal operation, such as an exception from a database or other service separate from the application, require very different treatment. In these cases, the objective is not to fix the code, but rather to have the code handle the error when that makes sense, and inform users or operators who can fix the problem otherwise.
In these cases, for example, the application may wish to throw away any results that have accumulated thus far, and retry the operation. In database access, use of transactions can help ensure that accumulated data is discarded. In other cases, it can be useful to write one's code with such retries in mind. The concept of idempotency can also be useful here.
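A retry wrapper along these lines might look like the following C++ sketch. TransientError and retry are hypothetical names, and the sketch assumes the operation is idempotent or wrapped in a transaction, since it may run more than once. It deliberately has no delay or backoff between attempts, which real code would usually want.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical marker type for failures that are expected in normal operation.
struct TransientError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Runs `operation` up to `maxAttempts` times.
// Only TransientError triggers a retry; any other exception is treated as a
// bug and propagates immediately, in line with the first error type above.
template <class F>
auto retry(int maxAttempts, F&& operation) -> decltype(operation()) {
    assert(maxAttempts > 0);  // a non-positive count is a programming error
    for (int attempt = 1; ; ++attempt) {
        try {
            return operation();
        } catch (const TransientError&) {
            // Out of attempts: rethrow so the caller can inform the user.
            if (attempt == maxAttempts) throw;
        }
    }
}
```

When the last attempt fails, the original exception reaches the caller unchanged, which is the point where a human being gets informed.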
When automated retries won't sufficiently solve the problem, human beings should be informed. The user should be informed that the operation failed; often the user can be given the option of retrying. The user can then judge whether a retry is desirable, and can also make alterations in input that might help things go better on a retry.
For this type of error, logging and perhaps email notices can be used to inform system operators. Unlike logging of programming errors, logging of errors that are expected in normal operation should be more succinct, since the error may happen many times and appear many times in the logs; operators will often be analyzing the pattern of many errors, rather than focusing on one individual error.
The above discussion of types of errors is directly applicable to application code. The other major context for error handling is library code. Library code still has the same two basic types of errors, but it typically cannot or should not communicate directly with the user, and it has less knowledge about the application context, including whether an immediate exit is acceptable, than does the application code.
As a result, there are differences in how libraries should handle logging, how they should handle errors that may be expected in normal operation, and how they should handle programming errors and other errors that should never happen.
With respect to logging, the library should if possible support logging in the format desired by the client application code. One valid approach is to do no logging at all, and allow the application code to do all logging based on error information provided to the application code by the library. Another approach is to use a configurable logging interface, allowing the client application to provide the implementation for the logging, for example when the library is first loaded. In Java, for example, the library might log through the SLF4J interface, and let the application decide which logging implementation (such as Logback) to bind to it.
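The configurable-interface approach could be sketched in C++ as follows. Everything here is hypothetical (the mylib namespace, Logger, NullLogger, Client and CollectingLogger); the sketch only shows the shape: the library logs to an abstract interface, defaults to logging nothing, and the application injects its own implementation.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace mylib {  // hypothetical library

// Interface the library logs to; the application supplies the implementation.
class Logger {
public:
    virtual ~Logger() = default;
    virtual void error(const std::string& message) = 0;
};

// Default: log nothing and leave all reporting to the application.
class NullLogger : public Logger {
public:
    void error(const std::string&) override {}
};

class Client {
public:
    explicit Client(std::shared_ptr<Logger> logger = std::make_shared<NullLogger>())
        : logger_(std::move(logger)) {}

    // Hypothetical operation: the failure is reported to the caller through
    // the return value and, if a logger was supplied, through that logger.
    bool connect(const std::string& host) {
        if (host.empty()) {
            logger_->error("connect: empty host name");
            return false;
        }
        return true;
    }

private:
    std::shared_ptr<Logger> logger_;
};

}  // namespace mylib

// Application side: route library messages wherever the application wants.
class CollectingLogger : public mylib::Logger {
public:
    void error(const std::string& message) override { messages.push_back(message); }
    std::vector<std::string> messages;
};
```

An application would construct the client with `mylib::Client client(std::make_shared<CollectingLogger>());`, or with no argument to keep the library silent.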
For bugs and other errors that should never happen, libraries still cannot simply exit the application, since that may not be acceptable to the application. Rather, libraries should exit the library call, providing the caller with sufficient information to help diagnose the problem. The information may be provided in the form of an exception with a stack trace, or the library may log the information if the configurable logging approach is being used. The application can then treat this as it would any other error of this type, typically by exiting, or in a server, by allowing the task process or thread to exit, with the same logging or error reporting that would be done for programming errors in the application code.
Errors that are expected in normal operation should also be reported to the client code. In this case, as with this type of error when encountered in the client code, the information associated with the error can be more succinct. Typically libraries should do less local handling of this type of error, relying more on the client code to decide things like whether to retry and how many times. The client code can then pass along the retry decision to the user if desired.
Now that we have the philosophy, let's apply it to the practical questions you mention.
If it is an error that is expected in normal operation, retry or possibly consult the user locally. Otherwise, propagate it to higher level code.
If it is an error that is expected in normal operation, and user input would be useful to determine what action to take, get user input and log a succinct message; if it seems to be a programming error, provide the user with a brief notification and log more extensive information.
Logging from the library code should be under the control of the client code. At most, the library should log to an interface for which the client provides the implementation.
Exceptions that are expected in normal operation can be caught locally and the operation retried or otherwise handled. In all other cases, exceptions should be allowed to propagate.
The types of errors in third party libraries are the same types of errors that occur in application code. Errors should be handled primarily according to which error type they represent, with relevant adjustments for library code.
Application code should provide a complete description of the error in the case of programming errors, and a succinct description in the case of errors that can occur in normal operation; in either case, a description is normally more appropriate than an error code. Libraries may provide an error code as a way of describing whether an error is a programming or other internal error, or whether the error is one which can occur in normal operation, with the latter type perhaps subdivided more finely; however, an exception hierarchy can be more useful than an error code in languages where such is possible. Note that applications run from the command line may act as libraries for shell scripts, however.
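To illustrate the exception hierarchy mentioned in the last point, here is a small C++ sketch. All names (LibraryError, InternalError, OperationalError, ConnectionLost, Action, decide) are made up for the example; the only idea taken from the text is that the type of the exception tells the caller which of the two kinds of error it is facing, with the "normal operation" kind subdivided more finely.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical root of everything the library reports.
struct LibraryError : std::runtime_error {
    explicit LibraryError(const std::string& what) : std::runtime_error(what) {}
};

// Bug or broken invariant inside the library: should never happen.
struct InternalError : LibraryError {
    explicit InternalError(const std::string& what) : LibraryError(what) {}
};

// Expected in normal operation: the caller may retry or consult the user.
struct OperationalError : LibraryError {
    explicit OperationalError(const std::string& what) : LibraryError(what) {}
};

// A finer subdivision of the operational kind.
struct ConnectionLost : OperationalError {
    explicit ConnectionLost(const std::string& what) : OperationalError(what) {}
};

enum class Action { None, Retry, AskUser, Abort };

// Application-side policy chosen from the exception type alone.
// Handlers go from most specific to most general.
template <class Call>
Action decide(Call&& call) {
    try {
        call();
        return Action::None;
    } catch (const ConnectionLost&) {
        return Action::Retry;
    } catch (const OperationalError&) {
        return Action::AskUser;
    } catch (const LibraryError&) {
        return Action::Abort;  // InternalError and anything unclassified
    }
}
```

With an error code the application would need a lookup table to reach the same decision; here the catch clauses are the table.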
1. Always handle an error as soon as possible. The closer you are to its occurrence, the more chance you have to do something meaningful, or at the least to figure out where and why it happened. In C++, it is not just a matter of lost context: in many cases the cause becomes impossible to determine later.
2. In general you should always halt the app if something buggy occurs that is a real error (not something like not finding a file, which is not really something that should count as an error but is labeled as such). It's not going to just sort itself out, and once the app is broken it will cause errors that are impossible to debug because they have nothing to do with the area where they occur.
3. Why not?
4. See 1.
5. See 1.
6. You need to keep things simple, or you will regret it. More important than handling bugs at runtime is testing to avoid them.
7. It's like asking whether it is better to centralize or not. It might make a lot of sense in some cases but be a waste of time in others. For something that is a loadable lib/module of some kind that can have errors that are data related (garbage in, garbage out), it makes tons of sense. For more general error handling or catastrophic errors, less.