RDMA program randomly hangs

Posted by 微笑、不失礼 on 2019-12-04 16:59:48

Curses! I was bitten by a bug in librdmacm-1.0.15-1 (from 2012), the version that came with SUSE 11. I knew there was nothing wrong with my send/recv sequencing.

I first tried comparing my code with other examples. In one example I saw

while (!ibv_poll_cq(id->send_cq, 1, &wc));

instead of rdma_get_send_comp(), and likewise for rdma_get_recv_comp(). I replaced those calls in my example and, miraculously, the hang was gone!

Hmm, maybe rdma_get_send_comp() isn't doing what I'm expecting. I'd better take a look at the code. I got the code for both 1.0.15 and 1.0.18 and what do I see in rdma_verbs.h?

Two very different IB verbs sequences:

// 1.0.15
static inline int
rdma_get_send_comp(struct rdma_cm_id *id, struct ibv_wc *wc)
{
        struct ibv_cq *cq;
        void *context;
        int ret;

        ret = ibv_poll_cq(id->send_cq, 1, wc);
        if (ret)
                goto out;

        ret = ibv_req_notify_cq(id->send_cq, 0);
        if (ret)
                return rdma_seterrno(ret);

        while (!(ret = ibv_poll_cq(id->send_cq, 1, wc))) {
                ret = ibv_get_cq_event(id->send_cq_channel, &cq, &context);
                if (ret)
                        return rdma_seterrno(ret);

                assert(cq == id->send_cq && context == id);
                ibv_ack_cq_events(id->send_cq, 1);
        }
out:
        return (ret < 0) ? rdma_seterrno(ret) : ret;
}

vs
// 1.0.18
static inline int
rdma_get_send_comp(struct rdma_cm_id *id, struct ibv_wc *wc)
{
        struct ibv_cq *cq;
        void *context;
        int ret;

        do {
                ret = ibv_poll_cq(id->send_cq, 1, wc);
                if (ret)
                        break;

                ret = ibv_req_notify_cq(id->send_cq, 0);
                if (ret)
                        return rdma_seterrno(ret);

                ret = ibv_poll_cq(id->send_cq, 1, wc);
                if (ret)
                        break;

                ret = ibv_get_cq_event(id->send_cq_channel, &cq, &context);
                if (ret)
                        return ret;

                assert(cq == id->send_cq && context == id);
                ibv_ack_cq_events(id->send_cq, 1);
        } while (1);

        return (ret < 0) ? rdma_seterrno(ret) : ret;
}

Can anyone explain why 1.0.18 works while 1.0.15 randomly hangs?
