dstream

Programmatically creating DStreams in Apache Spark

Submitted on 2020-01-17 04:33:29
Question: I am writing some self-contained integration tests around Apache Spark Streaming. I want to test that my code can ingest all kinds of edge cases in my simulated test data. When I was doing this with regular RDDs (not streaming), I could use my inline data and call "parallelize" on it to turn it into a Spark RDD. However, I can find no such method for creating DStreams. Ideally I would like to call some "push" function once in a while and have the tuple magically appear in my DStream. ATM I
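
A minimal sketch of one common way to do this in tests: StreamingContext.queueStream takes a mutable queue of RDDs and turns each queued RDD into one micro-batch, so "pushing" data is just appending an RDD built with parallelize. Names, element types and timings below are illustrative assumptions, not taken from the original question.

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("QueueStreamTest")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Each RDD placed in this queue becomes one micro-batch of the DStream.
    val rddQueue = new mutable.Queue[RDD[(String, Int)]]()
    val stream = ssc.queueStream(rddQueue)
    stream.print()

    ssc.start()

    // "Push" simulated test data whenever needed.
    rddQueue += ssc.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
    rddQueue += ssc.sparkContext.parallelize(Seq(("edge-case", 0)))

    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}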

Pyspark filter operation on Dstream

Submitted on 2019-12-25 09:17:29
Question: I have been trying to extend the network word count example to filter lines based on a certain keyword. I am using Spark 1.6.2.

from __future__ import print_function
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 5)
    lines = ssc
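
The filtering step itself is just DStream.filter applied to each line before the word count. A hedged sketch of that shape, shown in Scala here (the question is PySpark, whose DStream exposes an equivalent filter method); the keyword, host and port are made up:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FilteredNetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FilteredNetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)

    // Keep only lines containing the keyword, then count words as usual.
    val matching = lines.filter(line => line.contains("ERROR"))
    val counts = matching.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}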

Pyspark - Transfer control out of Spark Session (sc)

Submitted on 2019-12-13 07:01:20
Question: This is a follow-up question to "Pyspark filter operation on Dstream". To keep a count of how many error/warning messages have come through for, say, a day or an hour, how does one design the job? What I have tried:

from __future__ import print_function
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def counts():
    counter += 1
    print(counter.value)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>",
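
Driver-side counters like the one above do not accumulate across batches on their own; one common design is to let Spark Streaming do the counting with window operations. A rough sketch in Scala (the severity tags, window length, slide interval and checkpoint path are all illustrative assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object ErrorCountsByWindow {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ErrorCountsByWindow")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/streaming-checkpoint") // required for windowed/stateful operations

    val lines = ssc.socketTextStream("localhost", 9999)

    // Tag each matching line by severity, then count occurrences over a sliding one-hour window.
    val bySeverity = lines
      .filter(l => l.contains("ERROR") || l.contains("WARN"))
      .map(l => (if (l.contains("ERROR")) "ERROR" else "WARN", 1L))

    val hourlyCounts = bySeverity.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, // add counts entering the window
      (a: Long, b: Long) => a - b, // subtract counts leaving the window
      Minutes(60),                 // window length
      Minutes(5))                  // slide interval

    hourlyCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}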

How to Combine two Dstreams using Pyspark (similar to .zip on normal RDD)

Submitted on 2019-12-13 02:28:16
Question: I know that we can combine (like cbind in R) two RDDs as below in PySpark:

rdd3 = rdd1.zip(rdd2)

I want to perform the same for two DStreams in PySpark. Is it possible, or are there any alternatives? In fact, I am using an MLlib random forest model to predict with Spark Streaming. In the end, I want to combine the feature DStream and the prediction DStream together for further downstream processing. Thanks in advance. -Obaid

Answer 1: In the end, I am using the below. The trick is using "native python map" along with
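
Another option is transformWith, which exposes the two per-batch RDDs so they can be zipped directly (PySpark's DStream also has a transformWith method). A sketch in Scala, assuming the two streams line up batch-for-batch, e.g. because the prediction stream was derived from the feature stream; the Vector/Double element types are illustrative:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object StreamZip {
  // Pair each feature vector with its prediction, batch by batch.
  // RDD.zip requires both sides of every batch to have the same partitioning
  // and the same number of elements per partition.
  def zipStreams(features: DStream[Vector],
                 predictions: DStream[Double]): DStream[(Vector, Double)] =
    features.transformWith(predictions,
      (f: RDD[Vector], p: RDD[Double]) => f.zip(p))
}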

Cartesian of DStream

Submitted on 2019-12-12 05:37:42
Question: I use Spark's cartesian function to generate a list of N pairs of values. I then map over these values to generate a distance metric between each pair of users:

val cartesianUsers: org.apache.spark.rdd.RDD[(distance.classes.User, distance.classes.User)] = users.cartesian(users)
cartesianUsers.map(m => manDistance(m._1, m._2))

This works as expected. Using the Spark Streaming library I create a DStream and then map over it:

val customReceiverStream: ReceiverInputDStream[String] = ssc.receiverStream
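
Within a single DStream, the same per-batch pairing can be done through transform, which hands you each batch's RDD so cartesian and the distance map work exactly as in the non-streaming code. A sketch, with User and manDistance as stand-ins for the question's own types:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object StreamingDistances {
  case class User(id: Long, features: Array[Double]) // stand-in for distance.classes.User

  // Illustrative Manhattan-style distance between two users.
  def manDistance(a: User, b: User): Double =
    a.features.zip(b.features).map { case (x, y) => math.abs(x - y) }.sum

  // For every micro-batch, pair each user with every other user in that same batch.
  // Users arriving in different batches are never paired with each other.
  def pairwiseDistances(users: DStream[User]): DStream[((User, User), Double)] =
    users.transform { rdd: RDD[User] =>
      rdd.cartesian(rdd).map(pair => (pair, manDistance(pair._1, pair._2)))
    }
}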

Spark : get Multiple DStream out of a single DStream

Submitted on 2019-12-12 02:57:12
Question: Is it possible to get multiple DStreams out of a single DStream in Spark? My use case is as follows: I am getting a stream of log data from an HDFS file. Each log line contains an id (id=xyz), and I need to process log lines differently based on the id. So I was trying to get a different DStream for each id from the input DStream. I couldn't find anything related in the documentation. Does anyone know how this can be achieved in Spark, or can you point to any link for this? Thanks

Answer 1: You cannot split multiple DStreams from
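
A common workaround is to keep the single input DStream and derive one filtered view per id from it. A hedged sketch, where the set of ids and the id=... parsing are made-up examples:

import org.apache.spark.streaming.dstream.DStream

object SplitByIdExample {
  // Extract the id from a log line of the form "... id=xyz ...".
  def extractId(line: String): Option[String] =
    "id=(\\w+)".r.findFirstMatchIn(line).map(_.group(1))

  // Derive one filtered DStream per known id from the same input stream;
  // each filter selects only its own lines.
  def splitById(logs: DStream[String], ids: Seq[String]): Map[String, DStream[String]] =
    ids.map(id => id -> logs.filter(line => extractId(line).contains(id))).toMap
}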

How to get the cartesian product of two DStream in Spark Streaming with Scala?

Submitted on 2019-12-11 06:13:34
Question: I have two DStreams, say A: DStream[X] and B: DStream[Y]. I want to get their cartesian product, in other words a new C: DStream[(X, Y)] containing all the pairs of X and Y values. I know there is a cartesian function for RDDs. I was only able to find this similar question, but it's in Java and so does not answer my question.

Answer 1: The Scala equivalent of the linked question's answer (ignoring Time v3, which isn't used there) is

A.transformWith(B, (rddA: RDD[X], rddB: RDD[Y]) => rddA
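
The answer is cut off above; a self-contained sketch of the same transformWith-plus-cartesian pattern (not necessarily the original answer verbatim) looks like this:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object DStreamCartesian {
  // Per-batch cartesian product of two DStreams: for each batch interval,
  // every element of A's RDD is paired with every element of B's RDD.
  def cartesian[X: ClassTag, Y: ClassTag](a: DStream[X], b: DStream[Y]): DStream[(X, Y)] =
    a.transformWith(b, (rddA: RDD[X], rddB: RDD[Y]) => rddA.cartesian(rddB))
}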

How to solve Type mismatch issue (expected: Double, actual: Unit)

Submitted on 2019-12-11 05:38:58
Question: Here is my function that calculates root mean squared error. However, the last line does not compile because of the error "Type mismatch (expected: Double, actual: Unit)". I tried many different ways to solve this issue, but still without success. Any ideas?

def calculateRMSE(output: DStream[(Double, Double)]): Double = {
  val summse = output.foreachRDD { rdd =>
    rdd.map { case pair: (Double, Double) =>
      val err = math.abs(pair._1 - pair._2); err*err
    }.reduce(_ + _)
  }
  // math.sqrt(summse)
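
The mismatch comes from foreachRDD being an output action that returns Unit, so summse never holds the sum. One hedged way around it is to accumulate the squared error and element count outside the closure and take the square root from those; a sketch only (a driver-side var like this is fine for local experiments, while an accumulator or a stateful stream is more robust in production):

import org.apache.spark.streaming.dstream.DStream

object RunningRMSE {
  // foreachRDD returns Unit, so the running sum has to live somewhere else:
  // here in driver-side variables updated once per batch.
  def trackRMSE(output: DStream[(Double, Double)]): Unit = {
    var sumSquaredError = 0.0
    var count = 0L

    output.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        sumSquaredError += rdd.map { case (p, a) => val e = p - a; e * e }.reduce(_ + _)
        count += rdd.count()
        println(s"RMSE so far: ${math.sqrt(sumSquaredError / count)}")
      }
    }
  }
}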

For each RDD in a DStream how do I convert this to an array or some other typical Java data type?

Submitted on 2019-12-07 03:10:13
Question: I would like to convert a DStream into an array, list, etc. so I can then translate it to JSON and serve it on an endpoint. I'm using Apache Spark, ingesting Twitter data. How do I perform this operation on the DStream statuses? I can't seem to get anything to work other than print().

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
import
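
A common pattern is to collect each batch inside foreachRDD, which yields an ordinary local array that can be serialized to JSON or handed to a web endpoint. A sketch (the publish callback is a made-up placeholder, and collect pulls the whole batch to the driver, so this only suits small batches):

import org.apache.spark.streaming.dstream.DStream

object ServeBatches {
  // Bring each micro-batch back to the driver as a local Array[String].
  def serveBatches(statuses: DStream[String], publish: Array[String] => Unit): Unit =
    statuses.foreachRDD { rdd =>
      val batch: Array[String] = rdd.collect() // whole batch ends up on the driver
      publish(batch)
    }
}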

How to use feature extraction with DStream in Apache Spark

Submitted on 2019-12-03 21:34:39
I have data arriving from Kafka through a DStream. I want to perform feature extraction on it in order to obtain some keywords. I do not want to wait for all the data to arrive (as it is intended to be a continuous stream that potentially never ends), so I hope to perform the extraction in chunks; it doesn't matter to me if the accuracy suffers a bit. So far I have put together something like this:

def extractKeywords(stream: DStream[Data]): Unit = {
  val spark: SparkSession = SparkSession.builder.getOrCreate
  val streamWithWords: DStream[(Data, Seq[String])] = stream map extractWordsFromData
  val
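
One chunk-wise approach is to treat each micro-batch as its own small corpus and run TF-IDF on it inside transform; the IDF weights then reflect only that batch, which matches the "accuracy may suffer a bit" trade-off. A sketch, with Data and extractWordsFromData as stand-ins for the question's own definitions and the feature count chosen arbitrarily:

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object KeywordFeatures {
  case class Data(id: Long, text: String) // stand-in for the question's Data type

  // Stand-in tokenizer for extractWordsFromData.
  def extractWordsFromData(d: Data): (Data, Seq[String]) =
    (d, d.text.toLowerCase.split("\\s+").toSeq)

  // Per-batch TF-IDF: each micro-batch is its own corpus, so IDF weights are
  // computed only from that batch rather than from the whole (unbounded) stream.
  def extractFeatures(stream: DStream[Data]): DStream[(Data, Vector)] = {
    val hashingTF = new HashingTF(1 << 18)
    stream.map(extractWordsFromData).transform { rdd: RDD[(Data, Seq[String])] =>
      val tf = rdd.mapValues(words => hashingTF.transform(words))
      tf.cache()
      val idfModel = new IDF().fit(tf.values)
      tf.mapValues(v => idfModel.transform(v))
    }
  }
}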