ubuntu: sem_timedwait not waking (C)

前端未结

关注

 6  1242

I have 3 processes which need to be synchronized. Process one does something then wakes process two and sleeps, which does something then wakes process three and sleeps, which d

相关标签:

6条回答

一个人的身影

2021-02-06 17:36

This is very interesting. While I have not located the source of the error, (still looking) I have verified this on Ubuntu 9.04 running Linux 2.6.34.

0 讨论(0)
发布评论:

提交评论
- 加载中...
眼角桃花

2021-02-06 17:37
The problem seems to come from passing an invalid timeout argument.

At least on my machine, the first failure is not ETIMEDOUT but:

!!!!!! sem2 sem_timedwait failed: Invalid argument, val now 0

Now, if I write:
```
  if (ts.tv_nsec >= 1000000)
```
(note the addition of =) then it works fine. It's another question why the state of semaphore gets (presumably) effed up so that it times out on subsequent attempts or simply blocks forever on straight sem_wait. Looks like a bug in libc or the kernel.
0 讨论(0)
发布评论:

提交评论
- 加载中...
隐瞒了意图╮

2021-02-06 17:42
(Sorry to give a second answer but this one would be too messy to clean up just with editing)

The answer is, I think, already in the original post for the question.

So, my question is why does sem3 timeout, even though the semaphore has been triggered and the value is clearly 1? I would never expect to see line 08 in the output. If it times out (because, say thread 2 has crashed or is taking too long), the value should be 0. And why does it work fine for 3 or 4 hours first before getting into this state?

So the scenario is:
1. thread 2 takes too long
2. sem3 times out in sem_timedwait
3. thread 3 is descheduled or whatever it takes it to reach the sem_getvalue
4. thread 2 wakes up and does its sem_post on sem3
5. thread 3 issues its sem_getvalue and sees a 1
6. thread 3 branches into the wrong branch and doesn't do its sem_post on sem1
This race condition is hard to trigger, basically you have to hit the tiny time window where one thread has had a problem in waiting for the semaphore and then reads the semaphore with the sem_getvalue. The occurrence of that condition is much dependent of the environment (type of system, number of cores, load, IO interrupts) so this explains why it only occurs after hours, if not at all.

Having the control flow depend of a sem_getvalue is generally a bad idea. The only atomic non-blocking access to a sem_t is through sem_post and sem_trywait.

So this example code from the question has that race condition. This doesn't mean that the original problem code that gillez had, does indeed have the same race condition. Perhaps the example is just too simplistic, and still shows the same phenomenon for him.

My guess is, in his original problem there was an unprotected sem_wait. That is a sem_wait that is only checked for its return value and not for errno in the event that it fails. EINTRs do occur on sem_wait quite naturally if the process has some IO. You have just do a do - while with check and reset of errno if you encounter a EINTR.
0 讨论(0)
发布评论:

提交评论
- 加载中...
臣服心动

2021-02-06 17:44
Don't blame ubuntu or any other distro on it :-) What is certainly more important here is the version of gcc you are using, 32 or 64 bit etc, how many cores your system has. So please give a bit more information. But looking through your code I found several places that could just bring you unexpected behavior:
- it starts with the start, casting int in void* back and forth, you are looking for trouble. Use uintptr_t for that if you must, but here you have no excuse to just pass real pointers to the values. &(int){ 1 } and some saner casting would do the trick for C99.
- ts.tv_nsec = (tv.tv_usec + 500000) is another trouble spot. The right hand side might be of a different width then the left hand side. Do
  
  ts.tv_nsec = tv.tv_usec;
  
  ts.tv_nsec += 500000;
- The sem family of functions are not interrupt safe. Such interrupts may e.g be triggered by IO, since you are doing printf etc. Checking the return value for -1 or so is not sufficient but in such a case you should check errno and decide if you want to retry. Then you'd have to do the recalculation of the remaining time and stuff like that if you want to be precise. Then man page for sem_timedwait has a list of the different error codes that might occur and their reasons.
- You also conclude things from values that you get via sem_getvalue. In a multi-threading / multi-process / multi-processor environment your thread might have been unscheduled between the return from sem_timedwait and sem_getvalue. Basically you can't deduce anything from that, the variable is just incidentally at the value that you observe. Don't draw conclusions from that.
0 讨论(0)
发布评论:

提交评论
- 加载中...
既然无缘

2021-02-06 17:57
I have no clue on what is going wrong and the code looks fine too me. Here are some things you could do do somehow get more information.
- use a different timeout, both shorter and longer, and see if your problem still occurs.
- use a non timed version, and see if the programm hangs.
- try to modify the behaviour of your kernel scheduler, for example using kernel command line parameters, or using procfs or sysfs.
As pointed by Jens, there are two races :

The first is when evaluationg the value of the semaphore, after the call to sem_timedwait. This is not changing the control flow whit respects to semaphore. Wether the thread timedout or not, it still goes through the "should I trigger the next thread" block.

The second is in the "Should I wakeup the next thread" part. We could have the following events :
1. Threads n calls sem_getvalue(trigger) and gets a 1
2. Thread n+1 returns from sem_timedwait and the semaphore goes to 0
3. Thread n decides not to post and the semaphore stays to 0
Now, I can't see how this could trigger the observed behaviour. After all, since Thread n+1 is waked up anyway, it will in turn wake up thread n+2 which will wake up thread n etc...

While it is possible to get glitches, I can't see how this could lead to systematic timeout from a thread.
0 讨论(0)
发布评论:

提交评论
- 加载中...
情话喂你

2021-02-06 18:01
I gave the program a shot on my Ubuntu 10.04 x86_64 Core i7 machine.

When running with usleep(40000), the program ran fine for half an hour or something boring.

When running with usleep(40), the program ran fine for another half hour, maybe more, before my machine froze. X died. Control+alt+F1-7 died. I couldn't ssh in. (Sadly, this goofy Apple keyboard doesn't have a sysrq key. I like typing on it, but I sure don't need f13, f14, or f15. I'd do horrible things to get a proper sysrq key.)

And the absolute best: NOTHING in my logs tells me what happened.
```
$ uname -a
Linux haig 2.6.32-22-generic #36-Ubuntu SMP Thu Jun 3 19:31:57 UTC 2010 x86_64 GNU/Linux
```
At the same time, I was also playing a Java game in browser (posted by a fellow stackoverflow user looking for feedback, fun diversion :) -- so it's possible that the jvm is responsible for tickling something to freeze my machine solid.
0 讨论(0)
发布评论:

提交评论
- 加载中...