What is the best approach to join data in a Spark streaming application?


Question


Question: Essentially, rather than running a join against the C* table for each streaming record, is there any way to run the join once per micro-batch of records in Spark Streaming?

We have almost finalized on spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x.

But we have one fundamental question about efficiency in the scenario below.

For the streaming data records (i.e. streamingDataSet), I need to look up existing records (i.e. cassandraDataset) from a Cassandra (C*) table.

i.e.

Dataset<Row> streamingDataSet = ...; // dataset read from Kafka

Dataset<Row> cassandraDataset = ...; // records persisted earlier, loaded from the C* table

To look up the data I need to join the above datasets

i.e.

Dataset<Row> joinDataSet = streamingDataSet.join(cassandraDataset).where(/* some logic */);

Then joinDataSet is processed further to implement the business logic...

In the above scenario, my understanding is that for each record received from the Kafka stream it would query the C* table, i.e. one database call per record.

Wouldn't that take a huge amount of time and network bandwidth if the C* table contains billions of records? What approach should be followed to make the C* table lookup efficient?

What is the best solution in this scenario? I CANNOT load from the C* table once and look up against that, because data keeps being added to the C* table... i.e. new lookups might need newly persisted data.

How should this kind of scenario be handled? Any advice, please.


Answer 1:


If you're using Apache Cassandra, then you have only one option for an efficient join with data in Cassandra: the RDD API's joinWithCassandraTable. The open source version of the Spark Cassandra Connector (SCC) supports only that, while the DSE version contains code that allows an efficient join against Cassandra from Spark SQL as well - the so-called DSE Direct Join. If you use a Spark SQL join against a Cassandra table, Spark has to read all the data from Cassandra and only then perform the join - that's very slow.
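For a rough idea of that slow path, a Spark SQL join might look like the following sketch (keyspace test, table jtest, join column id and the spark session variable are assumed names, not taken from the question):

// Reading the whole C* table as a DataFrame: Spark must scan all of it
// before performing the join. spark is the active SparkSession.
Dataset<Row> cassandraDataset = spark.read()
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "test")   // hypothetical keyspace
    .option("table", "jtest")     // hypothetical table
    .load();

Dataset<Row> joined = streamingDataSet.join(cassandraDataset, "id"); // full scan on the C* side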

I don't have an example of the OSS SCC doing this join from Spark Structured Streaming, but I do have some examples of a "normal" join, like this:

// someColumns, mapRowToTuple and mapTupleToRow are static helpers from
// com.datastax.spark.connector.japi.CassandraJavaUtil; trdd is presumably
// the RDD of Tuple1<Integer> keys wrapped via CassandraJavaUtil.javaFunctions(...).
CassandraJavaPairRDD<Tuple1<Integer>, Tuple2<Integer, String>> joinedRDD =
    trdd.joinWithCassandraTable("test", "jtest",
        someColumns("id", "v"), someColumns("id"),
        mapRowToTuple(Integer.class, String.class), mapTupleToRow(Integer.class));
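A minimal sketch of wiring this into Structured Streaming via foreachBatch could look as follows. This is illustrative only: it assumes the streaming dataset carries an integer id key column and the same test.jtest table as above.

import static com.datastax.spark.connector.japi.CassandraJavaUtil.*;

import com.datastax.spark.connector.japi.CassandraJavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.Tuple1;
import scala.Tuple2;

streamingDataSet.writeStream()
    .foreachBatch((Dataset<Row> batch, Long batchId) -> {
        // Collect the lookup keys of this micro-batch only.
        JavaRDD<Tuple1<Integer>> keys = batch
            .select("id")                       // hypothetical key column
            .javaRDD()
            .map(row -> new Tuple1<>(row.getInt(0)));

        // One set of point lookups against C* per micro-batch,
        // instead of a query per record or a full table scan.
        CassandraJavaPairRDD<Tuple1<Integer>, Tuple2<Integer, String>> joined =
            javaFunctions(keys).joinWithCassandraTable("test", "jtest",
                someColumns("id", "v"), someColumns("id"),
                mapRowToTuple(Integer.class, String.class),
                mapTupleToRow(Integer.class));

        // ... apply the business logic to `joined` and write out the results ...
    })
    .start();

With this shape, each micro-batch issues only targeted partition-key reads, so the lookup cost scales with the batch size rather than with the size of the C* table.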


Source: https://stackoverflow.com/questions/59491295/what-is-best-approach-to-join-data-in-spark-streaming-application
