I am getting Cassandra timeouts using Phantom-DSL with the Datastax Cassandra driver, but Cassandra does not seem to be overloaded. Below is the exception I get:
com.datastax.driver.core.exceptions.OperationTimedOutException: [node-0.cassandra.dev/10.0.1.137:9042] Timed out waiting for server response
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:766)
at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1267)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:588)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:662)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:385)
at java.lang.Thread.run(Thread.java:745)
And here are the statistics I get from the Cassandra Datadog connector over this time period:
You can see our read rate (per second) on the top-center graph. Our CPU and memory usage are very low.
Here is how we are configuring the Datastax driver:
val points = ContactPoints(config.cassandraHosts)
  .withClusterBuilder(_.withSocketOptions(
    new SocketOptions()
      .setReadTimeoutMillis(config.cassandraNodeTimeout)
  ))
  .withClusterBuilder(_.withPoolingOptions(
    new PoolingOptions()
      .setConnectionsPerHost(HostDistance.LOCAL, 2, 2)
      .setConnectionsPerHost(HostDistance.REMOTE, 2, 2)
      .setMaxRequestsPerConnection(HostDistance.LOCAL, 2048)
      .setMaxRequestsPerConnection(HostDistance.REMOTE, 2048)
      .setPoolTimeoutMillis(10000)
      .setNewConnectionThreshold(HostDistance.LOCAL, 1500)
      .setNewConnectionThreshold(HostDistance.REMOTE, 1500)
  ))
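For reference, the builder above is then bound to a keyspace before phantom uses it. The following is an illustrative sketch rather than our exact wiring (the keyspace name is taken from the cfstats output below, and the connector API differs slightly between phantom versions):

// Illustrative only: bind the configured contact points to a keyspace so that
// phantom tables defined against this connection pick up the socket and
// pooling options configured above.
val connection = points.keySpace("alexandria_dev")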
Our nodetool cfstats output looks like this:
$ nodetool cfstats alexandria_dev.match_sums
Keyspace : alexandria_dev
Read Count: 101892
Read Latency: 0.007479115141522397 ms.
Write Count: 18721
Write Latency: 0.012341060840767052 ms.
Pending Flushes: 0
Table: match_sums
SSTable count: 0
Space used (live): 0
Space used (total): 0
Space used by snapshots (total): 0
Off heap memory used (total): 0
SSTable Compression Ratio: 0.0
Number of keys (estimate): 15328
Memtable cell count: 15332
Memtable data size: 21477107
Memtable off heap memory used: 0
Memtable switch count: 0
Local read count: 17959
Local read latency: 0.015 ms
Local write count: 15332
Local write latency: 0.013 ms
Pending flushes: 0
Percent repaired: 100.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 0
Bloom filter off heap memory used: 0
Index summary off heap memory used: 0
Compression metadata off heap memory used: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): 1.0
Maximum live cells per slice (last five minutes): 1
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 0
When we ran cassandra-stress, we didn't experience any issues: we were getting a steady 50k reads per second, as expected.
Cassandra logs this error whenever I run my queries:
INFO [Native-Transport-Requests-2] 2017-03-10 23:59:38,003 Message.java:611 - Unexpected exception during request; channel = [id: 0x65d7a0cd, L:/10.0.1.98:9042 ! R:/10.0.1.126:35536]
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
Why are we getting timeouts?
EDIT: I had the wrong dashboard uploaded. Please see the new image.
Two questions whose answers will be helpful:
- What is your timeout set to?
- What is the query?
Now some clarification on where I think you're going wrong here:
- The resolution of those charts is too coarse to diagnose a single query. A server could be doing nothing, run one expensive query that pegs some bottleneck for its entire duration, and at that scale it would still look as if nothing was bottlenecked. Run iostat -x 1 on the servers while the queries execute and you may see something drastically different from what the charts show at that resolution.
- If I'm reading your CPU usage chart correctly, it shows roughly 50% usage. On modern servers that is effectively fully busy, because of hyperthreading and how aggregate CPU usage is reported; see https://www.percona.com/blog/2015/01/15/hyper-threading-double-cpu-throughput/
I suggest tracing the problematic query to see what Cassandra is doing:
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/tracing_r.html
Open a cqlsh shell, type TRACING ON, and execute your query. If everything seems fine, there is a chance the problem only happens occasionally, in which case I'd suggest tracing queries with nodetool settraceprobability for some time, until you manage to catch the problem.
You enable it on each node separately with nodetool settraceprobability <param>, where param is the probability (between 0 and 1) that a query will get traced. Careful: this WILL cause increased load, so start with a very low number and work up.
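If it's easier to capture a trace from the application side, the Java driver can also request one per statement. A rough sketch (the helper name and the example query are illustrative, not from your code):

import com.datastax.driver.core.{Session, SimpleStatement}
import scala.collection.JavaConverters._

// Illustrative helper: run one CQL statement with server-side tracing enabled
// and print the events the coordinator recorded for it.
def traceQuery(session: Session, cql: String): Unit = {
  val stmt = new SimpleStatement(cql)
  stmt.enableTracing() // ask Cassandra to trace this request

  val rs    = session.execute(stmt)
  val trace = rs.getExecutionInfo.getQueryTrace // fetching the trace blocks briefly

  println(s"coordinator=${trace.getCoordinator} duration=${trace.getDurationMicros}us")
  trace.getEvents.asScala.foreach { e =>
    println(f"${e.getSourceElapsedMicros}%8d us  ${e.getSource}  ${e.getDescription}")
  }
}

// e.g. traceQuery(session, "SELECT * FROM alexandria_dev.match_sums WHERE <partition key> = <value>")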
If the problem is occasional, there is a chance it is caused by long garbage collections, in which case you need to analyse the GC logs and check how long your GC pauses are.
Edit: just to be clear, if the problem is caused by GC you will NOT see it in the traces. So check your GC first, and if that's not the problem, move on to tracing.
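One more generic thing worth ruling out (not specific to your setup): a long stop-the-world pause in the client JVM can blow the driver's read timeout just as easily as a pause on the server, and it won't show up in Cassandra's logs at all. A quick way to watch the client's own GC from inside the application:

import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

// Generic JVM check: print cumulative GC counts and time for the current
// process. Diffing two samples taken a few seconds apart shows whether the
// client itself is pausing while requests are in flight.
def logGcStats(): Unit =
  ManagementFactory.getGarbageCollectorMXBeans.asScala.foreach { gc =>
    println(s"${gc.getName}: collections=${gc.getCollectionCount}, time=${gc.getCollectionTime} ms")
  }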
Source: https://stackoverflow.com/questions/42708258/cassandra-timeouts-with-no-cpu-usage