Writing Spark Structured Streaming data into Cassandra


Question


I want to write Structured Streaming data into Cassandra using the PySpark API.

My data flow looks like this:

NiFi -> Kafka -> Spark Structured Streaming -> Cassandra
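
Here df comes from the Kafka source, roughly like this (a minimal sketch; the broker address and topic name are placeholders):

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "my_topic") \
    .load()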

I have tried the following:

query = df.writeStream\
  .format("org.apache.spark.sql.cassandra")\
  .option("keyspace", "demo")\
  .option("table", "test")\
  .start()

But I get the following error message: "org.apache.spark.sql.cassandra" does not support streaming write.

I have also tried another approach, taken from the DSE 6.0 Administrator Guide:

query = df.writeStream\
   .cassandraFormat("test", "demo")\
   .start()

But I got an exception: AttributeError: 'DataStreamWriter' object has no attribute 'cassandraFormat'

Can anyone give me some idea of how to proceed?

Thanks in advance.


Answer 1:


After upgrading to DSE 6.0 (the latest version), I am able to write Structured Streaming data into Cassandra. [Spark 2.2 & Cassandra 3.11]

Reference Code:

query = fileStreamDf.writeStream\
 .option("checkpointLocation", '/tmp/check_point/')\
 .format("org.apache.spark.sql.cassandra")\
 .option("keyspace", "analytics")\
 .option("table", "test")\
 .start()
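
Note that, besides the DSE upgrade, this working query also sets the checkpointLocation option; Structured Streaming generally requires a checkpoint directory for a fault-tolerant sink like this, so the query will not start without it (unless a default is set via spark.sql.streaming.checkpointLocation).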

DSE documentation URL: https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/structuredStreaming.html




Answer 2:


This answer is for writing data to plain Cassandra, not DSE (which supports Structured Streaming for storing data out of the box).

For Spark 2.4.0 and higher, you can use the foreachBatch method, which allows you to use the Cassandra batch data writer provided by the Spark Cassandra Connector to write the output of every micro-batch of the streaming query to Cassandra:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.cassandra._  // brings in the cassandraFormat helper

df.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Each micro-batch is written with the connector's ordinary batch writer
    batchDF
      .write
      .cassandraFormat("tableName", "keyspace")
      .mode("append")
      .save()
  }
  .start()
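
Since the question uses PySpark: foreachBatch is also available in the Python API from Spark 2.4.0, so a minimal equivalent sketch (the keyspace and table names are placeholders) would be:

def write_to_cassandra(batch_df, batch_id):
    # Reuse the connector's batch writer for each micro-batch
    batch_df.write \
        .format("org.apache.spark.sql.cassandra") \
        .option("keyspace", "keyspace") \
        .option("table", "tableName") \
        .mode("append") \
        .save()

df.writeStream \
    .foreachBatch(write_to_cassandra) \
    .start()

The cassandraFormat helper from the Scala snippet is a Scala implicit and does not exist in PySpark, hence the plain format/option calls (which is also why the question's second attempt raised AttributeError).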

For Spark versions lower than 2.4.0, you need to implement a foreach sink.

import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.driver.core.querybuilder.QueryBuilder
import com.datastax.driver.core.Statement
import org.apache.spark.SparkConf
import org.apache.spark.sql.{ForeachWriter, Row}

class CassandraSink(sparkConf: SparkConf) extends ForeachWriter[Row] {
    // No per-partition setup needed; always accept the partition
    def open(partitionId: Long, version: Long): Boolean = true

    def process(row: Row): Unit = {
      // Build an INSERT statement from the row's "value" column
      def buildStatement: Statement =
        QueryBuilder.insertInto("keyspace", "tableName")
          .value("key", row.getAs[String]("value"))
      // CassandraConnector caches sessions internally, so this does not
      // open a new connection for every row
      CassandraConnector(sparkConf).withSessionDo { session =>
        session.execute(buildStatement)
      }
    }

    def close(errorOrNull: Throwable): Unit = {}
}

And then you can use the foreach sink as follows:

df.writeStream
  .foreach(new CassandraSink(spark.sparkContext.getConf))
  .start()
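
On the Python side, writeStream.foreach also exists from Spark 2.4.0 and accepts a plain object with open/process/close methods, so a rough PySpark equivalent of the sink above, using the DataStax cassandra-driver directly, might look like this (the contact point, keyspace, table, and column names are placeholders):

class CassandraSink:
    def open(self, partition_id, epoch_id):
        # One connection per partition; import on the worker, not the driver
        from cassandra.cluster import Cluster
        self.cluster = Cluster(["127.0.0.1"])  # assumed contact point
        self.session = self.cluster.connect()
        return True

    def process(self, row):
        # Insert the row's "value" column into the "key" column
        self.session.execute(
            "INSERT INTO keyspace.tableName (key) VALUES (%s)",
            (row["value"],)
        )

    def close(self, error):
        self.cluster.shutdown()

df.writeStream \
    .foreach(CassandraSink()) \
    .start()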



Answer 3:


Not much you can do here other than:

  • Following (and voting for) the corresponding JIRA ticket.
  • Implementing the required functionality and opening a PR yourself.

Other than that, you can just use a foreach sink and write directly, as shown in the previous answer.



Source: https://stackoverflow.com/questions/50037285/writing-spark-structure-streaming-data-into-cassandra
