How can I make JGroups reconnect even after a long period of time?

Submitted by 时光毁灭记忆、已成空白 on 2019-12-12 02:15:51

Question


We have a problem where a penetration checker that runs for something like 12 hours causes JGroups to disconnect. The slave doesn't rejoin the cluster, we get a split brain and other symptoms of missing replication, and the cluster doesn't recover. This is our JGroups configuration:

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.6.xsd">
   <TCP bind_addr="NON_LOOPBACK"
        bind_port="${infinispan.jgroups.bindPort}"
        enable_diagnostics="false"
        thread_naming_pattern="pl"
        send_buf_size="640k"
        sock_conn_timeout="300"

        thread_pool.min_threads="${jgroups.thread_pool.min_threads:2}"
        thread_pool.max_threads="${jgroups.thread_pool.max_threads:30}"
        thread_pool.keep_alive_time="60000"
        thread_pool.queue_enabled="false"

        internal_thread_pool.min_threads="${jgroups.internal_thread_pool.min_threads:5}"
        internal_thread_pool.max_threads="${jgroups.internal_thread_pool.max_threads:20}"
        internal_thread_pool.keep_alive_time="60000"
        internal_thread_pool.queue_enabled="true"
        internal_thread_pool.queue_max_size="500"

        oob_thread_pool.min_threads="${jgroups.oob_thread_pool.min_threads:20}"
        oob_thread_pool.max_threads="${jgroups.oob_thread_pool.max_threads:200}"
        oob_thread_pool.keep_alive_time="60000"
        oob_thread_pool.queue_enabled="false"
   />
   <TCPPING async_discovery="true"
            initial_hosts="${infinispan.jgroups.tcpping.initialhosts}"
            port_range="1"/>
   <MERGE3 min_interval="10000" 
           max_interval="30000" 
   />
   <FD_SOCK />
   <FD />
   <VERIFY_SUSPECT />
   <pbcast.NAKACK2 use_mcast_xmit="false"
                   xmit_interval="1000"
                   xmit_table_num_rows="50"
                   xmit_table_msgs_per_row="1024"
                   xmit_table_max_compaction_time="30000"
                   max_msg_batch_size="100"
                   resend_last_seqno="true"
   />
   <UNICAST3 xmit_interval="500"
             xmit_table_num_rows="50"
             xmit_table_msgs_per_row="1024"
             xmit_table_max_compaction_time="30000"
             max_msg_batch_size="100"
             conn_expiry_timeout="0"
   />
   <pbcast.STABLE stability_delay="500"
                  desired_avg_gossip="5000"
                  max_bytes="1M"
   />
   <pbcast.GMS print_local_addr="true"  join_timeout="15000"/>
   <pbcast.FLUSH />
   <FRAG2 />
</config>

Versions:

JGroups 3.6.13
Infinispan 8.1.0
Hibernate Search 5.3

I'm wondering whether we can change our JGroups configuration so that the cluster node can eventually rejoin, even after 12 hours of "attack", so that we don't have to restart the servers.


Answer 1:


Define disconnect for me first, please!

Regarding your stack, I have a few suggestions / questions:

  • I suggest in general to use tcp.xml from the version you use and then modify it according to your needs
  • TCPPING: does initial_hosts contain all cluster members?
  • Replace FD with FD_ALL
  • STABLE: desired_avg_gossip of 5s is a bit small; this generates more traffic than needed
  • GMS.join_timeout of 15s is quite high; this is the startup time of the first member, and it also influences discovery time
  • What do you need FLUSH for?
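
As a rough illustration of the suggestions above, the failure-detection and membership part of the stack might look something like the fragment below. This is only a sketch: the FD_ALL timeout and interval, the larger desired_avg_gossip and the smaller join_timeout are assumed values chosen for illustration, not figures given in this thread, and they would need tuning for the actual cluster. FLUSH is not shown because whether it is needed is still an open question.

```xml
<!-- Sketch only: protocols not shown here stay as in the original stack. -->
<FD_SOCK />
<!-- FD replaced with FD_ALL; timeout and interval (in ms) are illustrative values -->
<FD_ALL timeout="40000" interval="8000" />
<VERIFY_SUSPECT />

<!-- pbcast.NAKACK2 and UNICAST3 unchanged -->

<!-- desired_avg_gossip raised from 5000 ms to reduce traffic; 50000 is an assumed value -->
<pbcast.STABLE stability_delay="500"
               desired_avg_gossip="50000"
               max_bytes="1M"
/>
<!-- join_timeout lowered from 15000 ms; 3000 is an assumed value -->
<pbcast.GMS print_local_addr="true" join_timeout="3000"/>
```

For TCPPING, initial_hosts takes a comma-separated list of host[port] entries, so listing all members would mean setting infinispan.jgroups.tcpping.initialhosts to something like nodeA[7800],nodeB[7800] (hypothetical host names and port).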


Source: https://stackoverflow.com/questions/42656580/how-can-i-make-jgroups-reconnect-even-after-a-long-period-of-time
