I have to process data streams from Kafka using Flink as the streaming engine. To do the analysis on the data, I need to query some tables in Cassandra. What is the best way to do this?
I currently read from Cassandra using Async I/O in Flink 1.3. Here is the documentation on it:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html (where it has DatabaseClient, you would use com.datastax.driver.core.Cluster instead)
Let me know if you need a more in-depth example of using it to read from Cassandra specifically, though unfortunately I can only provide an example in Java.
EDIT 1
Here is an example of the code I am using to read from Cassandra with Flink's Async I/O. I am still tracking down an issue where, when a single query returns a large amount of data, the async data stream's timeout is triggered even though Cassandra appears to return the result fine and well before the timeout. Assuming that is a bug in something else I am doing and not in this code, this should work fine for you (it has worked fine for me for months):
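import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;

import java.util.Collections;

// Properties and CustomInputObject are my own classes (not java.util.Properties);
// Properties must be Serializable because Flink ships the function to the workers.
public class GenericCassandraReader extends RichAsyncFunction<CustomInputObject, ResultSet> {

    private final Properties props;
    // Created in open(), so it is transient and never serialized with the function.
    private transient Session client;

    public GenericCassandraReader(Properties props) {
        super();
        this.props = props;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // One Cluster/Session per parallel instance of the operator.
        client = Cluster.builder()
                .addContactPoint(props.cassandraUrl)
                .withPort(props.cassandraPort)
                .build()
                .connect(props.cassandraKeyspace);
    }

    @Override
    public void close() throws Exception {
        if (client != null) {
            // Closing the Cluster also closes the Session created from it.
            client.getCluster().close();
        }
    }

    @Override
    public void asyncInvoke(final CustomInputObject customInputObject, final AsyncCollector<ResultSet> asyncCollector) throws Exception {
        // The id is concatenated into the query, so it must come from a trusted source.
        String queryString = "select * from table where fieldToFilterBy='" + customInputObject.id() + "';";
        ListenableFuture<ResultSet> resultSetFuture = client.executeAsync(queryString);

        Futures.addCallback(resultSetFuture, new FutureCallback<ResultSet>() {

            @Override
            public void onSuccess(ResultSet resultSet) {
                // Emit the whole ResultSet as a single output element.
                asyncCollector.collect(Collections.singleton(resultSet));
            }

            @Override
            public void onFailure(Throwable t) {
                // Hand the failure to Flink so it surfaces as an error.
                asyncCollector.collect(t);
            }
        });
    }
}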
Again, sorry for the delay. Was hoping to have the bug resolved so I could be certain, but figured at this point just having some reference would be better than nothing.
EDIT 2
So we finally determined that the issue isn't with the code but with network throughput. A lot of bytes were trying to come through a pipe that isn't large enough to handle them, so responses started backing up. Some still trickled in, but the time it took to receive the result of each query started climbing to 4 seconds, then 6, then 8 and so on (we could see this thanks to the DataStax Cassandra driver's QueryLogger).
TL;DR: the code is fine, just be aware that if you see TimeoutExceptions from Flink's AsyncWaitOperator, it could be a network issue.
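To make the backlog effect concrete, here is a rough back-of-the-envelope sketch in plain Java (no Flink or Cassandra involved). It assumes a single link that delivers responses strictly in order and ignores Cassandra's own processing time. The class name and all the numbers are made up for illustration: one query every 2 seconds, 40 MB per response, a 10 MB/s link. With those assumed values the per-query latency comes out as 4 s, 6 s, 8 s and keeps growing, so any fixed async timeout is eventually exceeded even though every query succeeds.

```java
public class BacklogSketch {

    /**
     * Latency in seconds of each response when one query is issued every
     * intervalSec seconds, each response is responseBytes long, and the link
     * drains linkBytesPerSec. Responses share one link and arrive in order.
     */
    public static double[] latencies(int queries, double intervalSec,
                                     double responseBytes, double linkBytesPerSec) {
        double transferSec = responseBytes / linkBytesPerSec;
        double[] result = new double[queries];
        double linkFreeAt = 0.0; // time at which the link finishes the previous response
        for (int i = 0; i < queries; i++) {
            double issuedAt = i * intervalSec;
            // A response cannot start transferring until the link is free.
            double start = Math.max(issuedAt, linkFreeAt);
            linkFreeAt = start + transferSec;
            result[i] = linkFreeAt - issuedAt;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical numbers, chosen only to illustrate the pattern.
        double[] latencies = latencies(5, 2.0, 40e6, 10e6);
        for (int i = 0; i < latencies.length; i++) {
            System.out.println("query " + i + ": " + latencies[i] + " s");
        }
    }
}
```

Whenever the transfer time per response is larger than the interval between queries, the latency grows by the difference on every query, which matches the kind of steady climb the QueryLogger showed us.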
EDIT 2.5
I also realized it might be worth mentioning that, because of the network latency issue, we ended up moving to a RichMapFunction that holds the data we were reading from Cassandra in state. The job now keeps track of all the records that come through it, instead of reading from the table every time a new record arrives to get everything that is in there.
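The bookkeeping behind that change can be sketched in plain Java. This is not our actual Flink code: the class and method names are hypothetical, records are simplified to strings, and a HashMap stands in for what would be Flink-managed state inside the RichMapFunction (so this sketch has none of Flink's checkpointing or fault tolerance). It only shows the idea that each incoming record is remembered under its key and the full set for that key is returned without a round trip to Cassandra.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SeenRecordsSketch {

    // Stand-in for Flink state: every record seen so far, grouped by key.
    private final Map<String, List<String>> recordsByKey = new HashMap<>();

    /**
     * Remembers the new record and returns all records seen so far for its key,
     * including the new one, in arrival order.
     */
    public List<String> addAndGet(String key, String record) {
        List<String> records = recordsByKey.computeIfAbsent(key, k -> new ArrayList<>());
        records.add(record);
        // Return a copy so callers cannot modify the stored list.
        return Collections.unmodifiableList(new ArrayList<>(records));
    }

    public static void main(String[] args) {
        SeenRecordsSketch seen = new SeenRecordsSketch();
        seen.addAndGet("sensor-1", "reading-a");
        seen.addAndGet("sensor-2", "reading-b");
        System.out.println(seen.addAndGet("sensor-1", "reading-c"));
    }
}
```

The trade-off is that the job only knows about records it has seen itself, so anything already in the table before the job started would have to be loaded some other way.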