spark-streaming

How to use foreachRDD in legacy Spark Streaming

Submitted by 无人久伴 on 2019-12-24 18:50:14
Question: I am getting an exception while using foreachRDD for my CSV data processing. Here is my code:

    case class Person(name: String, age: Long)

    val conf = new SparkConf()
    conf.setMaster("local[*]")
    conf.setAppName("CassandraExample").set("spark.driver.allowMultipleContexts", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    val smDstream = ssc.textFileStream("file:///home/sa/testFiles")
    smDstream.foreachRDD((rdd, time) => {
      val peopleDF = rdd.map(_.split(",")).map(attributes => Person(attributes(0) …
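The usual way to finish this pattern is to obtain a SparkSession inside foreachRDD and convert each batch RDD to a DataFrame there. A minimal sketch under that assumption (the age parsing and the SQL query are illustrative, not the asker's original code):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Person(name: String, age: Long)

    object CsvStreamSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[*]").setAppName("CsvStreamSketch")
        val ssc = new StreamingContext(conf, Seconds(10))
        val lines = ssc.textFileStream("file:///home/sa/testFiles")

        lines.foreachRDD { rdd =>
          // Reuse (or lazily create) a SparkSession bound to the RDD's SparkContext.
          val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
          import spark.implicits._

          // Parse each CSV line into a Person and expose the batch as a temp view.
          val peopleDF = rdd.map(_.split(","))
            .map(attrs => Person(attrs(0), attrs(1).trim.toLong))
            .toDF()
          peopleDF.createOrReplaceTempView("people")
          spark.sql("SELECT name, age FROM people").show()
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }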

Spark Streaming reduceByKeyAndWindow for moving average calculation

Submitted by 不羁岁月 on 2019-12-24 17:58:28
Question: I need to calculate a moving average from a Kinesis stream of data. I will have a sliding window size and slide interval as inputs, and need to calculate the moving average and plot it. I understand how to use reduceByKeyAndWindow from the docs to get a rolling sum, and how to get the counts per window. I am not clear on how to combine these to get the average, nor am I sure how to define an average calculator function in reduceByKeyAndWindow. Any help would be appreciated. Sample code …
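One common approach is to window (sum, count) pairs and divide at the end, rather than trying to average inside the reduce function itself. A sketch under the assumption that the Kinesis records have already been mapped to (key, value) pairs; the window and slide durations are illustrative:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // Moving average per key over a 60-second window, recomputed every 10 seconds.
    def movingAverage(pairs: DStream[(String, Double)]): DStream[(String, Double)] = {
      val summedAndCounted = pairs
        .mapValues(v => (v, 1L))                 // carry (runningSum, runningCount) per record
        .reduceByKeyAndWindow(
          (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2),
          Seconds(60),                           // window length
          Seconds(10)                            // slide interval
        )
      summedAndCounted.mapValues { case (sum, count) => sum / count }
    }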

Python Spark Streaming example with textFileStream does not work. Why?

Submitted by 点点圈 on 2019-12-24 17:57:06
Question: I use Spark 1.3.1 and Python 2.7. This is my first experience with Spark Streaming. I am trying an example that reads data from a file using Spark Streaming. This is the link to the example: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py My code is the following:

    conf = (SparkConf()
            .setMaster("local")
            .setAppName("My app")
            .set("spark.executor.memory", "1g"))
    sc = SparkContext(conf = conf)
    ssc = StreamingContext(sc, 1)
    lines = ssc.textFileStream('..
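The most common reason this example prints nothing is that textFileStream only picks up files created in (or atomically moved into) the monitored directory after the streaming context has started; pre-existing files and files still being written are ignored. A minimal word-count sketch showing that shape, written in Scala like most snippets on this page (the directory path is illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TextFileStreamWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("TextFileStreamWordCount")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Only files that appear in this directory AFTER ssc.start() are processed.
        val lines = ssc.textFileStream("file:///tmp/streaming-input")
        val counts = lines.flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }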

How to process a subset of input records in a batch, i.e. the first second in 3-sec batch time?

Submitted by 落花浮王杯 on 2019-12-24 16:53:27
Question: If I set Seconds(1) as the batch time in StreamingContext, like this: val ssc = new StreamingContext(sc, Seconds(1)), then over 3 seconds the job receives 3 seconds of data, but I only need the first second of data and can discard the next 2 seconds. So can I spend 3 seconds processing only the first second of data? Answer 1: You can do this via updateStateByKey if you keep track of a counter, for example like below: import org.apache.spark.SparkContext import org.apache.spark.streaming.dstream …
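The answer's updateStateByKey code is cut off here; as an alternative illustration of the same idea (keep a counter and only process the first 1-second batch of every three), a driver-side batch counter inside transform can drop the unwanted batches. This is a sketch, not the original answer:

    import scala.reflect.ClassTag
    import org.apache.spark.streaming.dstream.DStream

    // With a 1-second batch interval, keep only the first batch of every group of three.
    // The counter lives on the driver; transform's function runs on the driver once per batch.
    def keepFirstOfEveryThree[T: ClassTag](stream: DStream[T]): DStream[T] = {
      var batchIndex = -1L
      stream.transform { rdd =>
        batchIndex += 1
        if (batchIndex % 3 == 0) rdd else rdd.sparkContext.emptyRDD[T]
      }
    }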

Databricks - Structured Streaming: Console Format Displaying Nothing

Submitted by 廉价感情. on 2019-12-24 15:42:59
Question: I am learning Structured Streaming with Databricks and I'm struggling with the DataStreamWriter console mode. My program: simulates the streaming arrival of files to the folder "monitoring_dir" (one new file is transferred from "source_dir" every 10 seconds); uses a DataStreamReader to populate the unbounded DataFrame "inputUDF" with the content of each new file; uses a DataStreamWriter to output the new rows of "inputUDF" to a valid sink. Whereas the program works when choosing to use a File …
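A point worth knowing when debugging this: the console sink prints to the driver's stdout, which on Databricks ends up in the cluster's driver logs rather than in the notebook cell output. A minimal sketch of a file-source-to-console query, with an illustrative path and schema (not the asker's code):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

    val spark = SparkSession.builder.appName("ConsoleSinkSketch").getOrCreate()

    // File sources need an explicit schema for streaming reads.
    val schema = new StructType()
      .add("id", IntegerType)
      .add("value", StringType)

    val inputDF = spark.readStream
      .schema(schema)
      .option("maxFilesPerTrigger", 1)      // pick up one new file per micro-batch
      .csv("dbfs:/tmp/monitoring_dir")      // illustrative monitored folder

    // Console output goes to the driver's stdout (the driver logs on Databricks).
    val query = inputDF.writeStream
      .format("console")
      .outputMode("append")
      .option("truncate", false)
      .start()

    query.awaitTermination()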

Spark: How to speed up foreachRDD?

Submitted by 十年热恋 on 2019-12-24 10:44:36
Question: We have a Spark Streaming application which ingests data at ~10,000 records/sec. We use the foreachRDD operation on our DStream (since Spark doesn't execute anything unless it finds an output operation on the DStream), so we have to use a foreachRDD output operation like this, and it takes up to 3 hours to write a single batch of data (10,000 records), which is slow. Code snippet 1:

    requestsWithState.foreachRDD { rdd =>
      rdd.foreach {
        case (topicsTableName, hashKeyTemp, attributeValueUpdate) => {
          val client = new …
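The snippet constructs a new client inside rdd.foreach, i.e. once per record, which is the usual cause of this kind of slowdown. A common fix (a sketch with a stub client type standing in for whatever sink the job really uses) is foreachPartition, creating one client per partition and reusing it:

    import org.apache.spark.streaming.dstream.DStream

    // `SinkClient` is a stand-in for the real client; the point is where it gets created.
    class SinkClient {
      def update(table: String, hashKey: String, attributeValueUpdate: String): Unit = ()  // stub
      def close(): Unit = ()
    }

    def writeBatches(requestsWithState: DStream[(String, String, String)]): Unit = {
      requestsWithState.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          val client = new SinkClient()     // built once per partition, not once per record
          partition.foreach { case (topicsTableName, hashKeyTemp, attributeValueUpdate) =>
            client.update(topicsTableName, hashKeyTemp, attributeValueUpdate)
          }
          client.close()
        }
      }
    }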

How to update a ML model during a spark streaming job without restarting the application?

Submitted by 走远了吗. on 2019-12-24 10:27:37
Question: I've got a Spark Streaming job whose goal is to: read a batch of messages, then predict a variable Y for these messages using a pre-trained ML pipeline. The problem is, I'd like to be able to update the model used by the executors without restarting the application. Simply put, here's what it looks like:

    model = # model initialization

    def preprocess(keyValueList):
        # do some preprocessing

    def predict(preprocessedRDD):
        if not preprocessedRDD.isEmpty():
            df = # create df from rdd
            df = model.transform …
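One common workaround is to have the driver check for a newer model inside foreachRDD (which runs on the driver each batch) and reload it when it changes; model.transform then ships the refreshed model to the executors with the job. A sketch of that idea, shown in Scala like most snippets on this page; the model path and the version check are assumptions, not the asker's code:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.dstream.DStream

    object ModelRefresh {
      @volatile private var model: PipelineModel = _
      @volatile private var loadedVersion: Long = -1L

      def scoreStream(messages: DStream[String], modelPath: String): Unit = {
        messages.foreachRDD { rdd =>
          if (!rdd.isEmpty()) {
            val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
            import spark.implicits._

            // Placeholder version check: however the job learns that a new model was
            // published (a timestamp file, a marker row in a database, etc.).
            val currentVersion = publishedModelVersion(modelPath)
            if (currentVersion != loadedVersion) {
              model = PipelineModel.load(modelPath)   // pick up the newly trained pipeline
              loadedVersion = currentVersion
            }

            val scored = model.transform(rdd.toDF("message"))
            scored.show()
          }
        }
      }

      // Stub: should return a number that changes when a new model is published.
      private def publishedModelVersion(path: String): Long = 0L
    }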

Spark CosmosDB Sink: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame

Submitted by こ雲淡風輕ζ on 2019-12-24 10:11:38
Question: I am reading a data stream from Event Hub in Spark (using Databricks). My goal is to be able to write the streamed data to a CosmosDB. However I get the following error: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame. Is this scenario not supported? Spark versions: 2.2.0 and 2.3.0. Libraries used: json-20140107, rxnetty-0.4.20, azure-documentdb-1.14.0, azure-documentdb-rx-0.9.0-rc2, azure-cosmosdb-spark_2.2.0_2.11-1.0.0, rxjava-1.3.0, azure-eventhubs …
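The exception itself only says that .write, the batch API, was called on a streaming DataFrame; a streaming DataFrame has to go through .writeStream with a streaming-capable sink and a checkpoint location. A minimal sketch of that shape (the sink format and paths below are placeholders, not the CosmosDB connector's actual streaming API):

    import org.apache.spark.sql.DataFrame

    // `eventHubStream` stands for the streaming DataFrame read from Event Hubs.
    def writeStreamedData(eventHubStream: DataFrame): Unit = {
      // This is what triggers the AnalysisException on a streaming DataFrame:
      //   eventHubStream.write.format("...").save()

      // Streaming DataFrames are written with writeStream instead.
      val query = eventHubStream.writeStream
        .format("parquet")                                  // or another streaming-capable sink
        .option("path", "/tmp/streamed-output")             // illustrative output location
        .option("checkpointLocation", "/tmp/checkpoints")   // required for streaming queries
        .outputMode("append")
        .start()

      query.awaitTermination()
    }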

Caching DStream in Spark Streaming

Submitted by 家住魔仙堡 on 2019-12-24 07:40:04
Question: I have a Spark Streaming process which reads data from Kafka into a DStream. In my pipeline I call DStream.foreachRDD twice, one after another (each time I do different processing and insert the data into a different destination). I was wondering how DStream.cache, placed right after I read the data from Kafka, would work. Is it possible to do it? Is the process now actually reading the data from Kafka twice? Please keep in mind that it is not …
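DStream does expose cache()/persist(), and calling it once on the Kafka stream before the two output operations is the usual way to keep the second foreachRDD from recomputing the input. A minimal sketch, with stub sink methods standing in for the two real destinations:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.dstream.DStream

    object DStreamCacheSketch extends Serializable {
      // Stubs for the two real destinations.
      def writeToDestinationA(key: String, value: String): Unit = ()
      def writeToDestinationB(key: String, value: String): Unit = ()

      def processTwice(kafkaStream: DStream[(String, String)]): Unit = {
        // Persist each batch's RDDs so the second output operation reuses the cached
        // blocks instead of recomputing them from the source.
        val cached = kafkaStream.persist(StorageLevel.MEMORY_AND_DISK)

        cached.foreachRDD { rdd =>
          rdd.foreach { case (k, v) => writeToDestinationA(k, v) }  // first pipeline + sink
        }
        cached.foreachRDD { rdd =>
          rdd.foreach { case (k, v) => writeToDestinationB(k, v) }  // second pipeline + sink
        }
      }
    }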

Kafka streaming behaviour for more than one partition

Submitted by 我与影子孤独终老i on 2019-12-24 07:25:48
Question: I am consuming from a Kafka topic. This topic has 3 partitions. I am using foreachRDD to process each batch RDD (using a processData method to process each RDD, and ultimately creating a DataSet from that). Now, you can see that I have a count variable, and I am incrementing this count variable in the "processData" method to check how many actual records I have processed. (I understand that each RDD is a collection of Kafka topic records, and that the number depends on the batch interval size.) Now, the output is …
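The excerpt is cut off before the output, but the usual surprise with such counters is that a plain var incremented inside executor-side code never makes it back to the driver, and with 3 partitions the work is split across tasks. A sketch of two reliable ways to count a batch, using rdd.count() on the driver or a Spark accumulator:

    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.util.LongAccumulator

    // `records` stands for the DStream built from the 3-partition Kafka topic.
    def countProcessedRecords(records: DStream[String]): Unit = {
      records.foreachRDD { rdd =>
        // Driver-side count of this batch, independent of how many partitions it spans.
        val batchCount = rdd.count()

        // An accumulator also survives executor-side increments, unlike a plain `var`
        // captured in the closure.
        val acc: LongAccumulator = rdd.sparkContext.longAccumulator("processedRecords")
        rdd.foreach { record =>
          acc.add(1)
          // ... per-record processing would go here ...
        }
        println(s"batch count = $batchCount, accumulated = ${acc.value}")
      }
    }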