skynet Source Code Analysis: skynet_monitor

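When working with the skynet framework, you will occasionally see an error such as "A message from [ :0000000b ] to [ :0000000c ] maybe in an endless loop (version = 13187)". It means that service 0000000c may be stuck in an endless loop while handling a message sent by service 0000000b. The cause is application code that either really loops forever or simply takes too long (more than about 5 s) to handle one message. Detecting this situation is the job of skynet_monitor.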

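1. When skynet starts, it launches a monitor thread that watches the worker threads. The thread loops once every 5 s and calls skynet_monitor_check() for each worker; that function is explained further down.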

 1 // skynet-src/skynet_start.c
 2 static void *
 3 thread_monitor(void *p) {
 4         struct monitor * m = p;
 5         int i;
 6         int n = m->count;
 7         skynet_initthread(THREAD_MONITOR);
 8         for (;;) {
 9                 CHECK_ABORT
10                 for (i=0;i<n;i++) {
11                         skynet_monitor_check(m->m[i]);
12                 }
13                 for (i=0;i<5;i++) {
14                         CHECK_ABORT
15                         sleep(1);
16                 }
17         }
18 
19         return NULL;
20 }

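Each worker thread is assigned its own struct skynet_monitor. Before handling a message, the worker records the message's source and destination addresses (line 5); once the message has been handled, it clears the record (line 7). Every such call also increments version (line 15).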

 1 // skynet-src/skynet_server.c
 2 struct message_queue * 
 3 skynet_context_message_dispatch(struct skynet_monitor *sm, struct message_queue *q, int weight){
 4     ...
 5     skynet_monitor_trigger(sm, msg.source , handle);
 6     dispatch_message(ctx, &msg); // handle the message
 7     skynet_monitor_trigger(sm, 0,0);
 8 }
 9 
10 // skynet-src/skynet_monitor.c
11 void 
12 skynet_monitor_trigger(struct skynet_monitor *sm, uint32_t source, uint32_t destination) {
13         sm->source = source;
14         sm->destination = destination;
15         ATOM_INC(&sm->version);
16 }

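Every 5 s the monitor thread calls skynet_monitor_check() on each worker's monitor:

If there is no endless loop or long-running handler, version keeps increasing, and the check simply sets check_version to the current version (line 10).

If version equals check_version, there are two possibilities:

1. The worker had no messages to handle during those 5 s. In that case destination is 0 and nothing is reported.

2. The worker has spent more than 5 s on a single message, so it may be stuck in an endless loop. The destination service is passed to skynet_context_endless() (line 6) and the error log quoted at the beginning is written (line 7).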

 1 // skynet-src/skynet_monitor.c
 2 void 
 3 skynet_monitor_check(struct skynet_monitor *sm) {
 4         if (sm->version == sm->check_version) {
 5                 if (sm->destination) {
 6                         skynet_context_endless(sm->destination);
 7                         skynet_error(NULL, "A message from [ :%08x ] to [ :%08x ] maybe in an endless loop (version = %d)", sm->source , sm->destination, sm->version);
 8                 }
 9         } else {
10                 sm->check_version = sm->version;
11         }
12 }
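The interaction between skynet_monitor_trigger() and skynet_monitor_check() can be replayed in a small single-threaded model. This is only a sketch under stated assumptions: `monitor_model`, `model_trigger`, `model_check` and `model_demo` are hypothetical names that are not part of skynet, the atomic increment is replaced by a plain `++`, and each call to `model_check` stands in for one 5 s tick of the monitor thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical single-threaded stand-in for struct skynet_monitor.
 * The real code bumps version with ATOM_INC; a plain ++ is enough here. */
struct monitor_model {
    int version;
    int check_version;
    uint32_t source;
    uint32_t destination;
};

/* Same bookkeeping as skynet_monitor_trigger(). */
static void model_trigger(struct monitor_model *sm, uint32_t source, uint32_t destination) {
    sm->source = source;
    sm->destination = destination;
    sm->version++;
}

/* Same decision as skynet_monitor_check(); returns 1 when a message is flagged. */
static int model_check(struct monitor_model *sm) {
    if (sm->version == sm->check_version) {
        if (sm->destination) {
            fprintf(stderr, "message from :%08x to :%08x may be stuck (version = %d)\n",
                    (unsigned)sm->source, (unsigned)sm->destination, sm->version);
            return 1;
        }
    } else {
        sm->check_version = sm->version;
    }
    return 0;
}

/* Replays one stuck message and returns how many checks raised a flag. */
static int model_demo(void) {
    struct monitor_model sm = {0};
    int flagged = 0;
    flagged += model_check(&sm);     /* idle worker: destination is 0 */
    model_trigger(&sm, 0x0b, 0x0c);  /* worker starts a message and never finishes */
    flagged += model_check(&sm);     /* version changed: only check_version is synced */
    flagged += model_check(&sm);     /* version unchanged, destination set: flagged */
    model_trigger(&sm, 0, 0);        /* message finally completes */
    flagged += model_check(&sm);     /* version changed again: synced */
    flagged += model_check(&sm);     /* idle again: nothing reported */
    assert(flagged == 1);
    return flagged;
}
```

Calling `model_demo()` from a `main` function walks through the sequence. In this model a message is flagged only when it is still in flight on the check after the one that synced check_version, so with a 5 s interval the report would appear somewhere between roughly 5 s and 10 s after the handler started.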

2. skynet also defines a monitor struct that is shared by all threads:

// skynet-src/skynet_start.c
struct monitor {
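        int count; // total number of worker threads
        struct skynet_monitor ** m; // one skynet_monitor per worker thread
        pthread_cond_t cond; // condition variable
        pthread_mutex_t mutex; // mutex protecting sleep and cond
        int sleep; // number of worker threads currently sleeping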
        int quit;
};

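When there is nothing to handle (the global message queue is empty), a worker thread increments m->sleep (line 10) and goes to sleep on the condition variable (line 14), waiting to be woken by the socket thread or the timer thread.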

 1 // skynet-src/skynet_start.c
 2 static void *
 3 thread_worker(void *p) {
 4         ...
 5         struct message_queue * q = NULL;
 6         while (!m->quit) {
 7                 q = skynet_context_message_dispatch(sm, q, weight);
 8                 if (q == NULL) {
 9                         if (pthread_mutex_lock(&m->mutex) == 0) {
10                                 ++ m->sleep;
11                                 // "spurious wakeup" is harmless,
12                                 // because skynet_context_message_dispatch() can be call at any time.
13                                 if (!m->quit)
14                                         pthread_cond_wait(&m->cond, &m->mutex);
15                                 -- m->sleep;
16                                 if (pthread_mutex_unlock(&m->mutex)) {
17                                         fprintf(stderr, "unlock mutex error");
18                                         exit(1);
19                                 }
20                         }
21                 }
22         }
23         return NULL;
24 }

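In the wakeup() function listed below, line 6 calls pthread_cond_signal to wake one blocked worker thread.

When the socket thread receives data, it calls wakeup(m,0), which signals only if all worker threads are asleep.

When the timer thread ticks, it calls wakeup(m,m->count-1), which signals as long as at least one worker thread is asleep, because timer events need to be handled promptly.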

 1 // skynet-src/skynet_start.c
 2 static void
 3 wakeup(struct monitor *m, int busy) {
 4         if (m->sleep >= m->count - busy) {
 5                 // signal sleep worker, "spurious wakeup" is harmless
 6                 pthread_cond_signal(&m->cond);
 7         }
 8 }
 9 
10 static void *
11 thread_socket(void *p) {
12         struct monitor * m = p;
13         ...
14         wakeup(m,0);
15         return NULL;
16 }
17 
18 static void *
19 thread_timer(void *p) {
20         ...
21         wakeup(m,m->count-1);
22         ...
23         return NULL;
24 }
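The effect of the busy argument is easiest to see by evaluating the condition from wakeup() on its own. The sketch below is illustrative only: `should_signal` and `wakeup_demo` are hypothetical helpers, not skynet functions, and the worker count of 4 is an assumed example value.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the test in wakeup(): signal only when at least
 * (count - busy) workers are asleep. */
static bool should_signal(int sleeping, int count, int busy) {
    return sleeping >= count - busy;
}

/* Evaluates the predicate for an assumed pool of 4 workers. */
static void wakeup_demo(void) {
    const int count = 4;

    /* Socket thread, busy = 0: every worker must be asleep. */
    assert(!should_signal(3, count, 0));
    assert(should_signal(4, count, 0));

    /* Timer thread, busy = count - 1: one sleeping worker is enough. */
    assert(!should_signal(0, count, count - 1));
    assert(should_signal(1, count, count - 1));
}
```

With busy = 0 the threshold equals the worker count, while busy = count - 1 lowers it to 1, which matches the two behaviours described above for the socket and timer threads.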

Source: oschina

Link: https://my.oschina.net/u/4402150/blog/3905781
