Question

I'm using

Cassandra v2.1.12
Spark v1.4.1
Scala 2.10

and Cassandra is listening on

rpc_address:127.0.1.1
rpc_port:9160

For example, to connect Kafka and Spark Streaming and poll Kafka every 4 seconds, I have the following Spark job:

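from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(conf=conf)
stream = StreamingContext(sc, 4)  # 4-second batch interval
map1 = {'topic_name': 1}  # topic name -> number of partitions to consume
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", map1)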

Spark Streaming then keeps polling the Kafka broker every 4 seconds and outputs the contents.

In the same way, I want Spark Streaming to listen to Cassandra and output the contents of a specified table every, say, 4 seconds.

How do I convert the streaming code above so that it works with Cassandra instead of Kafka?

The non-streaming solution

I can obviously keep running the query in an infinite loop, but that's not true streaming, right?

Spark job:

from __future__ import print_function
import time

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="sparkcassandra")
# Create the SQLContext once and reuse it on every iteration.
sqlContext = SQLContext(sc)

while True:
    time.sleep(5)
    # Re-read the whole table through the Cassandra connector and print it.
    sqlContext.read.format("org.apache.spark.sql.cassandra")\
        .options(table="users", keyspace="keyspace2")\
        .load()\
        .show()

I run it like this:

sudo ./bin/spark-submit --packages \
datastax:spark-cassandra-connector:1.4.1-s_2.10 \
examples/src/main/python/sparkstreaming-cassandra2.py

and I get the table values, which roughly look like

lastname|age|city|email|firstname

So what's the correct way of "streaming" the data from Cassandra?

Answer 1:

Currently the "Right Way" to stream data from C* is not to stream data from C* :) Instead, it usually makes much more sense to put a message queue (like Kafka) in front of C* and stream off of that. C* doesn't easily support incremental table reads, although this can be done if the clustering key is based on insert time.

If you are interested in using C* as a streaming source, be sure to check out and comment on CASSANDRA-8844 (Change Data Capture), https://issues.apache.org/jira/browse/CASSANDRA-8844, which is most likely what you are looking for.

If you are actually just trying to read the full table periodically and do something with it, you may be best off with a cron job that launches a batch operation, since you have no way of recovering state anyway.

Answer 2:

Cassandra is currently not natively supported as a streaming source in Spark 1.6. You must implement a custom receiver for your own case (listen to Cassandra and output the contents of the specified table every, say, 4 seconds).

Please refer to the implementation guide, "Spark Streaming Custom Receivers".
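
As far as I know, the custom receiver API is meant to be implemented in Scala or Java, so the following is not Spark code. It is only a plain-Python sketch of the loop such a receiver would typically run on its own thread: query, hand the batch over, wait, repeat until stopped. The names `run_polling_receiver`, `fetch_batch`, and `store` are hypothetical stand-ins for the Cassandra query and for the receiver's hand-off of data to Spark.

```python
import threading


def run_polling_receiver(fetch_batch, store, stop_event, interval_seconds=4.0):
    """Poll fetch_batch() until stop_event is set, passing each batch to store().

    fetch_batch stands in for the Cassandra query, store for the receiver's
    hand-off of data to Spark, and stop_event for its stopped flag.
    """
    while not stop_event.is_set():
        batch = fetch_batch()
        if batch:
            store(batch)
        # wait() returns early once stop_event is set, so shutdown is prompt.
        stop_event.wait(interval_seconds)


# Tiny demonstration with in-memory stand-ins instead of Cassandra and Spark.
stop = threading.Event()
pending = [["row-1", "row-2"], [], ["row-3"]]
received = []


def fake_fetch_batch():
    # Ask the loop to stop once the canned batches are exhausted.
    if not pending:
        stop.set()
        return []
    return pending.pop(0)


run_polling_receiver(fake_fetch_batch, received.append, stop, interval_seconds=0.01)
```

In a real receiver the loop would run in a background thread started when the receiver starts, and the fetch step would need some way to return only new rows, otherwise every poll re-emits the whole table.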

Source: https://stackoverflow.com/questions/34993290/how-to-connect-spark-streaming-with-cassandra

Tags

apache-spark