Question
I am running nodetool repair -pr -full my_ks my_tbl
on our Cassandra cluster, which has two DCs.
The repair sometimes hangs, and the debug logs below appear. It works again after I restart the Cassandra process. Any hints on the root cause of this issue?
DEBUG [GossipStage:1] 2020-07-13 14:04:22,818 FailureDetector.java:456 - Ignoring interval time of 2571566434 for /10.22.38.223
DEBUG [GossipStage:1] 2020-07-13 14:04:22,818 FailureDetector.java:456 - Ignoring interval time of 2495429260 for /10.22.38.26
DEBUG [GossipStage:1] 2020-07-13 14:04:22,818 FailureDetector.java:456 - Ignoring interval time of 2571592685 for /10.32.146.85
INFO [Thread-181] 2020-07-13 14:04:22,900 RepairRunnable.java:125 - Starting repair command #2, repairing keyspace my_ks with repair options (parallelism: parallel, primary range: true, incremental: false, job threads: 1, ColumnFamilies: [my_tbl], dataCenters: [], hosts: [], # of ranges: 256)
INFO [HANDSHAKE-/10.32.146.85] 2020-07-13 14:04:23,460 OutboundTcpConnection.java:515 - Handshaking version with /10.32.146.85
DEBUG [GossipStage:1] 2020-07-13 14:04:23,716 FailureDetector.java:456 - Ignoring interval time of 2000838464 for /10.22.38.27
DEBUG [GossipStage:1] 2020-07-13 14:04:23,716 FailureDetector.java:456 - Ignoring interval time of 2000923736 for /10.22.38.68
DEBUG [GossipStage:1] 2020-07-13 14:04:23,815 FailureDetector.java:456 - Ignoring interval time of 2100571952 for /10.32.253.232
DEBUG [GossipStage:1] 2020-07-13 14:04:25,247 FailureDetector.java:456 - Ignoring interval time of 2429005356 for /10.32.144.198
I am using Cassandra 3.9.
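To make the FailureDetector lines above easier to read, here is a minimal sketch that pulls the endpoint and interval out of each line. It assumes the logged interval is in nanoseconds; the function name parse_ignored_intervals and the sample text are only for illustration and are not part of Cassandra.

```python
import re

# Matches the FailureDetector debug lines quoted above.
LINE_RE = re.compile(r"Ignoring interval time of (\d+) for /(\S+)")


def parse_ignored_intervals(log_text):
    """Return (endpoint, seconds) pairs for each matching log line.

    Assumption: the logged interval is expressed in nanoseconds.
    """
    return [
        (match.group(2), int(match.group(1)) / 1e9)
        for match in LINE_RE.finditer(log_text)
    ]


# Two lines copied from the logs in this question.
sample = """\
DEBUG [GossipStage:1] 2020-07-13 14:04:22,818 FailureDetector.java:456 - Ignoring interval time of 2571566434 for /10.22.38.223
DEBUG [GossipStage:1] 2020-07-13 14:04:23,716 FailureDetector.java:456 - Ignoring interval time of 2000838464 for /10.22.38.27
"""

for endpoint, seconds in parse_ignored_intervals(sample):
    print(f"{endpoint}: {seconds:.3f} s")
```

Under that assumption, every ignored interval in the logs above is between roughly 2.0 and 2.6 seconds.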
Edit: With trace enabled, I see the following logs:
INFO [HANDSHAKE-/10.32.168.76] 2020-07-18 02:12:41,253 OutboundTcpConnection.java:515 - Handshaking version with /10.32.168.76
INFO [HANDSHAKE-/10.32.142.195] 2020-07-18 02:12:42,260 OutboundTcpConnection.java:515 - Handshaking version with /10.32.142.195
INFO [HANDSHAKE-/10.32.144.198] 2020-07-18 02:12:42,260 OutboundTcpConnection.java:515 - Handshaking version with /10.32.144.198
ERROR [RepairTracePolling] 2020-07-18 02:12:45,836 CassandraDaemon.java:226 - Exception in thread Thread[RepairTracePolling,5,RMI Runtime]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.repair.RepairRunnable$4.runMayThrow(RepairRunnable.java:412) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.9.jar:3.9]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_72]
Source: https://stackoverflow.com/questions/62886719/cassandra-nodetool-repair-is-getting-stuck-sometimes