Question
We have a problem: a penetration checker that runs for around 12 hours causes JGroups to disconnect. The slave doesn't rejoin the cluster, we get a split brain and other issues that indicate replication has stopped, and the cluster doesn't recover.
<config xmlns="urn:org:jgroups"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.6.xsd">
<TCP bind_addr="NON_LOOPBACK"
bind_port="${infinispan.jgroups.bindPort}"
enable_diagnostics="false"
thread_naming_pattern="pl"
send_buf_size="640k"
sock_conn_timeout="300"
thread_pool.min_threads="${jgroups.thread_pool.min_threads:2}"
thread_pool.max_threads="${jgroups.thread_pool.max_threads:30}"
thread_pool.keep_alive_time="60000"
thread_pool.queue_enabled="false"
internal_thread_pool.min_threads="${jgroups.internal_thread_pool.min_threads:5}"
internal_thread_pool.max_threads="${jgroups.internal_thread_pool.max_threads:20}"
internal_thread_pool.keep_alive_time="60000"
internal_thread_pool.queue_enabled="true"
internal_thread_pool.queue_max_size="500"
oob_thread_pool.min_threads="${jgroups.oob_thread_pool.min_threads:20}"
oob_thread_pool.max_threads="${jgroups.oob_thread_pool.max_threads:200}"
oob_thread_pool.keep_alive_time="60000"
oob_thread_pool.queue_enabled="false"
/>
<TCPPING async_discovery="true"
initial_hosts="${infinispan.jgroups.tcpping.initialhosts}"
port_range="1"
/>
<MERGE3 min_interval="10000"
max_interval="30000"
/>
<FD_SOCK />
<FD />
<VERIFY_SUSPECT />
<pbcast.NAKACK2 use_mcast_xmit="false"
xmit_interval="1000"
xmit_table_num_rows="50"
xmit_table_msgs_per_row="1024"
xmit_table_max_compaction_time="30000"
max_msg_batch_size="100"
resend_last_seqno="true"
/>
<UNICAST3 xmit_interval="500"
xmit_table_num_rows="50"
xmit_table_msgs_per_row="1024"
xmit_table_max_compaction_time="30000"
max_msg_batch_size="100"
conn_expiry_timeout="0"
/>
<pbcast.STABLE stability_delay="500"
desired_avg_gossip="5000"
max_bytes="1M"
/>
<pbcast.GMS print_local_addr="true" join_timeout="15000"/>
<pbcast.FLUSH />
<FRAG2 />
</config>
Versions:
- JGroups 3.6.13
- Infinispan 8.1.0
- Hibernate Search 5.3
I'm wondering whether we can change our JGroups configuration so that the cluster node is eventually able to rejoin, even after 12 hours of "attack", so that we don't have to restart the servers.
Answer 1:
Define disconnect for me first, please!
Regarding your stack, I have a few suggestions / questions:
- I suggest in general to use tcp.xml from the version you use and then modify it according to your needs
- TCPPING: does initial_hosts contain all cluster members?
- Replace FD with FD_ALL
- STABLE: desired_avg_gossip of 5s is a bit small; this generates more traffic than needed
- GMS.join_timeout of 15s is quite high; this is the startup time of the first member, and it also influences discovery time
- What do you need FLUSH for?
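As a rough sketch, the suggestions above might translate into the following changes to the affected protocols. The host names, port, and all numeric values here are illustrative assumptions, not tested settings for this cluster; the tcp.xml shipped with JGroups 3.6.13 is the better reference for sensible defaults.

```xml
<!-- initial_hosts should list every cluster member;
     node1, node2 and port 7800 are hypothetical placeholders -->
<TCPPING async_discovery="true"
         initial_hosts="node1[7800],node2[7800]"
         port_range="1"
/>

<!-- FD_ALL in place of FD; timeout and interval are assumed values in ms -->
<FD_ALL timeout="40000"
        interval="8000"
/>

<!-- a larger desired_avg_gossip means less STABLE traffic; 50000 is an assumed value -->
<pbcast.STABLE stability_delay="500"
               desired_avg_gossip="50000"
               max_bytes="1M"
/>

<!-- a shorter join_timeout speeds up first-member startup and discovery; 3000 is an assumed value -->
<pbcast.GMS print_local_addr="true" join_timeout="3000"/>
```

Only the protocols named in the list are shown; the rest of the stack would stay as in the question unless the shipped tcp.xml suggests otherwise.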
Source: https://stackoverflow.com/questions/42656580/how-can-i-make-jgroups-reconnect-even-after-a-long-period-of-time