Joining streaming data with table data and updating the table as the stream is received: is it possible?

Submitted by 我与影子孤独终老i on 2019-12-11 19:46:33

Question


I am using spark-sql 2.4.1, spark-cassandra-connector_2.11-2.4.1.jar and Java 8. I have a scenario where I need to join streaming data with C*/Cassandra table data.

If a matching record is found by the join, I need to copy the existing C* table record to another table (table_bkp) and update the actual C* table record with the latest data.

I need to perform this as the streaming data comes in. Can this be done with spark-sql streaming? If so, how? Are there any caveats to take care of?

How do I get fresh C* table data for each batch?

What am I doing wrong here?

I have two tables, "master_table" and "backup_table", as below:

CREATE TABLE kspace.master_table(
    statement_id int,
    statement_flag text,
    statement_date date,
    x_val double,
    y_val double,
    z_val double,
    PRIMARY KEY ((statement_id), statement_date)
) WITH CLUSTERING ORDER BY (statement_date DESC);

CREATE TABLE kspace.backup_table(
    statement_id int,
    statement_flag text,
    statement_date date,
    x_val double,
    y_val double,
    z_val double,
    backup_timestamp timestamp,
    PRIMARY KEY ((statement_id), statement_date, backup_timestamp)
) WITH CLUSTERING ORDER BY (statement_date DESC, backup_timestamp DESC);


Each streaming record has a "statement_flag", which is either "I" or "U".
If a record with "I" comes in, we insert it directly into "master_table".
If a record with "U" comes in, we need to check whether there is already a record for the given (statement_id, statement_date) in "master_table":
     If there is such a record in "master_table", copy it to "backup_table" with the current timestamp as backup_timestamp.
     Then update the record in "master_table" with the latest data.

To achieve the above I am doing a PoC with code like the following:

Dataset<Row> baseDs = //streaming data from topic
Dataset<Row> i_records = baseDs.filter(col("statement_flag").equalTo("I"));
Dataset<Row> u_records = baseDs.filter(col("statement_flag").equalTo("U"));

String keyspace="kspace";
String master_table = "master_table";
String backup_table = "backup_table";


Dataset<Row> cassandraMasterTableDs = getCassandraTableData(sparkSession, keyspace , master_table);

writeDfToCassandra( baseDs.toDF(), keyspace, master_table);


u_records.createOrReplaceTempView("u_records");
cassandraMasterTableDs.createOrReplaceTempView("persisted_records");

Dataset<Row> joinUpdatedRecordsDs =  sparkSession.sql(
            " select p.statement_id, p.statement_flag, p.statement_date,"
            + "p.x_val,p.y_val,p.z_val "
            + " from persisted_records as p "
            + "join u_records as u "
            + "on p.statement_id = u.statement_id  and p.statement_date = u.statement_date");



Dataset<Row> updated_records =   joinUpdatedRecordsDs
                            .withColumn("backup_timestamp",current_timestamp());

updated_records.show(); //Showing correct results 


writeDfToCassandra( updated_records.toDF(), keyspace, backup_table);  // But here backup_table ends up with the latest "master_table" records instead of the previous ones
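(getCassandraTableData and writeDfToCassandra are helper methods not shown here; presumably they are thin wrappers over the spark-cassandra-connector DataFrame API, roughly like the sketch below. The bodies are an assumption, not the actual helpers from the question.)

Dataset<Row> getCassandraTableData(SparkSession sparkSession, String keyspace, String table) {
    // Assumed implementation: read the Cassandra table as a DataFrame
    return sparkSession.read()
            .format("org.apache.spark.sql.cassandra")
            .option("keyspace", keyspace)
            .option("table", table)
            .load();
}

void writeDfToCassandra(Dataset<Row> df, String keyspace, String table) {
    // Assumed implementation: append (upsert) the DataFrame into the Cassandra table
    df.write()
            .format("org.apache.spark.sql.cassandra")
            .option("keyspace", keyspace)
            .option("table", table)
            .mode("append")
            .save();
}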

Sample data

For the first record, with the "I" flag:

master_table: (screenshot in the original post)

backup_table: (screenshot in the original post)

For the second record, with the "U" flag (the same as the first except for the "y_val" column):

master_table: (screenshot in the original post)

backup_table, expected: (screenshot in the original post)

backup_table, actual: (screenshot in the original post)

Question:

Up to the show() call, the dataframe (updated_records) shows the correct data. But when I insert the same dataframe (updated_records) into the table, the C* backup_table data ends up exactly the same as the latest record of master_table, whereas it is supposed to contain the earlier record of master_table.

  updated_records.show(); //Showing correct results


    writeDfToCassandra( updated_records.toDF(), keyspace, backup_table);  // But here backup_table ends up with the latest "master_table" records instead of the previous ones

So what am I doing wrong in the above program code?


Answer 1:


There are several ways to do this, with various levels of performance, depending on how much data you need to check.

For example, if you are only looking up data by partition key, the most efficient thing to do is to use joinWithCassandraTable on the DStream. For every batch this will extract only the records matching the incoming partition keys. In structured streaming this would happen automatically with a correctly written SQL join and DSE; if DSE is not in use, the table is fully scanned on each batch.
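As a rough illustration of the "correctly written SQL join" above, a stream-static join between the streaming DataFrame and a Cassandra-backed DataFrame could look like the sketch below (streamingDf is a placeholder corresponding to the question's baseDs; with DSE, or a connector version that supports the direct join, the lookup is pushed down to the partition keys, otherwise the whole table is read):

import static org.apache.spark.sql.functions.expr;

// Cassandra table exposed as a (static) DataFrame
Dataset<Row> masterDf = sparkSession.read()
        .format("org.apache.spark.sql.cassandra")
        .option("keyspace", "kspace")
        .option("table", "master_table")
        .load();

// Stream-static join on the primary key of master_table
Dataset<Row> joined = streamingDf.alias("s").join(
        masterDf.alias("p"),
        expr("s.statement_id = p.statement_id AND s.statement_date = p.statement_date"));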

If instead you require the whole table for each batch, joining the DStream batch with a CassandraRDD will cause the RDD to be re-read completely on every batch. This is much more expensive if the entire table is not being re-written.
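If a fresh view of the table is needed for every micro-batch, one way to make that explicit in structured streaming is foreachBatch (available from Spark 2.4): re-read the Cassandra table inside the batch function, do the join, write the backup rows first, and only then upsert the incoming data into master_table. A minimal sketch under those assumptions (table and column names as in the question, streamingDf standing in for the question's baseDs, writeDfToCassandra as the question's helper):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.current_timestamp;
import static org.apache.spark.sql.functions.expr;

streamingDf.writeStream()
    .foreachBatch((batchDf, batchId) -> {
        // Re-read master_table so this batch joins against its current state
        Dataset<Row> master = sparkSession.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "kspace")
                .option("table", "master_table")
                .load();

        Dataset<Row> uRecords = batchDf.filter(col("statement_flag").equalTo("U"));

        // Existing rows that are about to be overwritten -> copy them to backup_table
        Dataset<Row> toBackup = master.alias("p")
                .join(uRecords.alias("u"),
                      expr("p.statement_id = u.statement_id AND p.statement_date = u.statement_date"))
                .select("p.statement_id", "p.statement_flag", "p.statement_date",
                        "p.x_val", "p.y_val", "p.z_val")
                .withColumn("backup_timestamp", current_timestamp());

        // Write the backup first, then upsert the new data into master_table
        writeDfToCassandra(toBackup, "kspace", "backup_table");
        writeDfToCassandra(batchDf, "kspace", "master_table");
    })
    .start();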

If you are only updating records without checking their previous values, it is sufficient to write the incoming data directly to the C* table. C* uses upserts and last-write-wins behavior, and will simply overwrite the previous values if they existed.
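For that last case, a minimal sketch of the plain write with the connector's DataFrame writer (incomingDf is a placeholder for a non-streaming DataFrame, e.g. the batchDf from the foreachBatch sketch above; rows with the same primary key simply overwrite the previous values):

incomingDf.write()
        .format("org.apache.spark.sql.cassandra")
        .option("keyspace", "kspace")
        .option("table", "master_table")
        .mode("append")
        .save();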



Source: https://stackoverflow.com/questions/58098268/joining-streaming-data-on-table-data-and-update-the-table-as-the-stream-receives
