I have 3 processes which need to be synchronized. Process one does something then wakes process two and sleeps, which does something then wakes process three and sleeps, which d
(Sorry to give a second answer but this one would be too messy to clean up just with editing)
The answer is, I think, already in the original post for the question.
So, my question is why does sem3 timeout, even though the semaphore has been triggered and the value is clearly 1? I would never expect to see line 08 in the output. If it times out (because, say thread 2 has crashed or is taking too long), the value should be 0. And why does it work fine for 3 or 4 hours first before getting into this state?
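So the scenario is:

1. The sem_timedwait on sem3 times out in thread 3.
2. Thread 3 is delayed and has not yet reached its sem_getvalue.
3. Thread 2 does its sem_post on sem3.
4. Thread 3 calls sem_getvalue and sees a 1.
5. Thread 3 takes the wrong branch and skips its sem_post on sem1.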
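The interleaving above can be replayed deterministically in a single thread, which makes it easier to see why a timeout followed by a value of 1 is not a contradiction. This is only a sketch: the function name replay_stale_getvalue is made up for illustration, and it assumes a Linux-like system where unnamed semaphores (sem_init) are supported and the program is linked with -pthread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <time.h>

/* Hypothetical helper: replays "timeout, late post, getvalue" in one thread.
 * Returns the value reported by sem_getvalue, or -1 if a call fails
 * in an unexpected way. */
int replay_stale_getvalue(void)
{
    sem_t sem3;
    struct timespec deadline;
    int value = -1;

    if (sem_init(&sem3, 0, 0) != 0)
        return -1;

    /* A deadline that has already passed, so the wait times out at once. */
    if (clock_gettime(CLOCK_REALTIME, &deadline) != 0) {
        sem_destroy(&sem3);
        return -1;
    }
    deadline.tv_sec -= 1;

    /* The waiter times out because nobody has posted yet. */
    if (sem_timedwait(&sem3, &deadline) == 0 || errno != ETIMEDOUT) {
        sem_destroy(&sem3);
        return -1;
    }

    /* The other thread's post arrives a moment too late. */
    if (sem_post(&sem3) != 0) {
        sem_destroy(&sem3);
        return -1;
    }

    /* The waiter now inspects the semaphore and finds the late post. */
    if (sem_getvalue(&sem3, &value) != 0)
        value = -1;

    sem_destroy(&sem3);
    return value;
}
```

In the real program the post comes from another thread, so the window between the timeout and the sem_getvalue is tiny, but the sequence of calls on the semaphore is the same.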
This race condition is hard to trigger: you have to hit the tiny time window in which one thread has failed to acquire the semaphore and only afterwards reads it with sem_getvalue. Whether that happens depends heavily on the environment (type of system, number of cores, load, IO interrupts), which explains why it only occurs after hours, if at all.
Having the control flow depend on sem_getvalue is generally a bad idea. The only atomic non-blocking access to a sem_t is through sem_post and sem_trywait.
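One way to avoid the sem_getvalue is to branch only on what the wait call itself reports. The following is a minimal sketch, not the code from the question: the enum and the functions classify_timedwait and poll_token are hypothetical names, and the caller is assumed to pass an absolute CLOCK_REALTIME deadline, as sem_timedwait expects.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <time.h>

/* Hypothetical outcome codes for a timed wait. */
enum wait_result { WAIT_OK = 1, WAIT_TIMEOUT = 0, WAIT_INTERRUPTED = -1, WAIT_ERROR = -2 };

/* Decide from the return value and errno of sem_timedwait alone.
 * No second look at the semaphore is needed, so there is no window
 * in which a late sem_post can change the decision. */
enum wait_result classify_timedwait(sem_t *sem, const struct timespec *deadline)
{
    assert(sem != NULL && deadline != NULL);

    if (sem_timedwait(sem, deadline) == 0)
        return WAIT_OK;             /* the semaphore was decremented */
    if (errno == ETIMEDOUT)
        return WAIT_TIMEOUT;        /* nothing was taken */
    if (errno == EINTR)
        return WAIT_INTERRUPTED;    /* a signal handler ran; caller may retry */
    return WAIT_ERROR;
}

/* Non-blocking variant: test and take in one atomic step.
 * Returns 1 if the semaphore was decremented, 0 if it was already zero,
 * and -1 on any other error. */
int poll_token(sem_t *sem)
{
    assert(sem != NULL);

    if (sem_trywait(sem) == 0)
        return 1;
    return (errno == EAGAIN) ? 0 : -1;
}
```

If a post arrives right after a timeout, it simply stays in the semaphore and is picked up by the next wait, instead of being observed through sem_getvalue and misread.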
So the example code from the question has that race condition. This doesn't mean that the original problem code that gillez had necessarily has the same race condition. Perhaps the example is just too simplistic and happens to show the same phenomenon for him.
My guess is that in his original problem there was an unprotected sem_wait, that is, a sem_wait that is only checked for its return value and not for errno in the event that it fails. EINTRs do occur on sem_wait quite naturally if the process has some IO. You just have to wrap the call in a do-while loop that checks and resets errno whenever you encounter an EINTR.