I created 2 Linux kernel threads in my loadable module and I bind them to separate CPU cores running on a dual core Android device. After I run this few times, I noticed that the device reboots with a HW watchdog timer reset. I hit the issue consistently. What could be causing the deadlock?
Basically, what i need to do is, make sure both the threads run do_something() at the same time on different cores without anybody stealing the cpu cycles(i.e. interrupts are disabled). I am using a spinlock and a volatile variable for this. I also have a semaphore for parent thread to wait on child thread.
#define CPU_COUNT 2
/* Globals */
spinlock_t lock;
struct semaphore sem;
volatile unsigned long count;
/* Thread util function for binding the thread to CPU*/
struct task_struct* thread_init(kthread_fn fn, void* data, int cpu)
{
struct task_struct *ts;
ts=kthread_create(fn, data, "per_cpu_thread");
kthread_bind(ts, cpu);
if (!IS_ERR(ts)) {
wake_up_process(ts);
}
else {
ERR("Failed to bind thread to CPU %d\n", cpu);
}
return ts;
}
/* Sync both threads */
void thread_sync()
{
spin_lock(&lock);
++count;
spin_unlock(&lock);
while (count != CPU_COUNT);
}
void do_something()
{
}
/* Child thread */
int per_cpu_thread_fn(void* data)
{
int i = 0;
unsigned long flags = 0;
int cpu = smp_processor_id();
DBG("per_cpu_thread entering (cpu:%d)...\n", cpu);
/* Disable local interrupts */
local_irq_save(flags);
/* sync threads */
thread_sync();
/* Do something */
do_something();
/* Enable interrupts */
local_irq_restore(flags);
/* Notify parent about exit */
up(&sem);
DBG("per_cpu_thread exiting (cpu:%d)...\n", cpu);
return value;
}
/* Main thread */
int main_thread()
{
int cpuB;
int cpu = smp_processor_id();
unsigned long flags = 0;
DBG("main thread running (cpu:%d)...\n", cpu);
/* Init globals*/
sema_init(&sem, 0);
spin_lock_init(&lock);
count = 0;
/* Launch child thread and bind to the other CPU core */
if (cpu == 0) cpuB = 1; else cpuB = 0;
thread_init(per_cpu_thread_fn, NULL, cpuB);
/* Disable local interrupts */
local_irq_save(flags);
/* thread sync */
thread_sync();
/* Do something here */
do_something();
/* Enable interrupts */
local_irq_restore(flags);
/* Wait for child to join */
DBG("main thread waiting for all child threads to finish ...\n");
down_interruptible(&sem);
}
I'm not sure, this is a real reason, but your code contains some serious errors.
First in while (count != CPU_COUNT);
. You must not read shared variable without holding a lock, unless read is atomic. With count
it isn't guaranteed to be.
You must protect read of count
with lock. You can replace your while-loop with following:
unsigned long local_count;
do {
spin_lock(&lock);
local_count = count;
spin_unlock(&lock);
} while (local_count != CPU_COUNT);
Alternatively, you could use atomic types. Notice absence of locking
atomic_t count = ATOMIC_INIT(0);
...
void thread_sync() {
atomic_inc(&count);
while (atomic_read(&count) != CPU_COUNT);
}
Second problem with interrupts. I think, you don't understand what you are doing.
local_irq_save()
saves and disables interrupts. Then, you disable interrupts again with local_irq_disable()
. After some work, you restore previous state with local_irq_restore()
, and enable interrupts with local_irq_enable()
. This enabling is totally wrong. You enable interrupts, regardless of theirs previous state.
Third problem. If main thread isn't binded to a cpu, you should not use smp_processor_id()
unless you are sure that kernel will not reschedule right after you get a cpu number. It's better to use get_cpu()
, which disables kernel preemption and then returns cpu id. When done, call put_cpu()
.
But, when you call get_cpu()
, this is a bug to create and run other threads. That's why you should set affinity of main thread.
Fourth. local_irq_save()
and local_irq_restore()
macros that takes a variable, not a pointer to unsigned long
. (I've got an error and some warnings passing pointers. I wonder how did you compile your code). Remove referencing
The final code is available here: http://pastebin.com/Ven6wqWf
来源:https://stackoverflow.com/questions/18022958/why-does-this-code-deadlock