helping a client out with an issue that they are having. I\'m more of a sysadmin/DBA guy so I\'m struggling with helping them out. They are saying it is a bug in the kerne
The problem they are experiencing is that the main thread starts a separate thread to do a potentially long-running TCP connect() [client connecting to server]. If the 'long-running' aspect takes too long, they cancel the thread and start another one.
Trivial fix -- don't cancel the thread. Is it doing any harm? If necessary, have the thread check (when the connect
finally does complete) whether the connection is still needed and, if not, close it, release the mutex, and terminate. You can do this with a boolean variable protected by a mutex.
Also, a thread should not hold a mutex while waiting for network I/O. Mutexes should be used only for things that are fast and primarily CPU-limited or perhaps limited by local disk.
Finally, if you feel you need to reach in from the outside and force a thread to do something, step back. You wrote the code for that thread. If you feel that need, it means you didn't code that thread to do what you really wanted it to do. The fix is to modify the thread to do what, and only what, you actually want. Then you won't have to "push it around" from the outside.
It's correct that cancelled threads do not unlock mutexes they hold, you need to arrange for that to happen manually, which can be tricky as you need to be very careful to use the right cleanup handlers around every possible cancellation point. Assuming you're using pthread_cancel
to cancel the thread and setting cleanup handlers with pthread_cleanup_push
to unlock the mutexes, there are a couple of alternatives you could try which might be simpler to get right and so may be more reliable.
Using RAII to unlock the mutex will be more reliable. On GNU/Linux pthread_cancel
is implemented with a special exception of type __cxxabi::__forced_unwind
, so when a thread is cancelled an exception is thrown and the stack is unwound. If a mutex is locked by an RAII type then its destructor will be guaranteed to run if the stack is unwound by a __forced_unwind
exception. Boost Thread provides a portable C++ library that wraps Pthreads and is much easier to use. It provides an RAII type boost::mutex
and other useful abstractions. Boost Thread also provides its own "thread interruption" mechanism which is similar to Pthread cancellation but not the same, and Pthread cancellation points (such as connect
) are not Boost Thread interruption points, which can be helpful for some applications. However in your client's case since the point of cancellation is to interrupt the connect
call they probably do want to stick with Pthread cancellation. The (non-portable) way GNU/Linux implements cancellation as an exception means it will work well with boost::mutex
.
There is really no excuse for explicitly locking and unlocking mutexes when you're writing in C++, IMHO the most important and most useful feature of C++ is destructors which are ideal for automatically releasing resources such as mutex locks.
Another option would be to use a robust mutex, which is created by calling pthread_mutexattr_setrobust on a pthread_mutexattr_t
before initializing the mutex. If a thread dies while holding a robust mutex the kernel will make a note of it so that the next thread which tries to lock the mutex gets the special error code EOWNERDEAD
. If possible, the new thread can make the data protected by the thread consistent again and take ownership of the mutex. This is much harder to use correctly than simply using an RAII type to lock and unlock the mutex.
A completely different approach would be to decide if you really need to hold the mutex lock while calling connect
. Holding mutexes during slow operations is not a good idea. Can't you call connect
then if successful lock the mutex and update whatever shared data is being protected by the mutex?
My preference would be to both use Boost Thread and avoid holding the mutex for long periods.