spark-streaming

DStream all identical keys should be processed sequentially

夙愿已清 submitted on 2019-12-12 04:29:03
Question: I have a DStream of (Key, Value) pairs. mapped2.foreachRDD(rdd => { rdd.foreachPartition(p => { p.foreach(x => { }) }) }) I need to ensure that all items with identical keys are processed in one partition and by one core, so that they are effectively processed sequentially. How can I do this? Can I use groupByKey, even though it is inefficient?
Answer 1: You can use PairDStreamFunctions.combineByKey: import org.apache.spark.HashPartitioner import org.apache.spark.streaming.dstream.DStream /** * Created by Yuval …
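A minimal sketch of the combineByKey approach the answer refers to (the stream name mapped2, the partition count, and the value types are assumptions for illustration):

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

// Group values per key through a HashPartitioner so that every record with
// the same key lands in the same partition and is handled by a single task,
// i.e. sequentially.
def groupPerKey(stream: DStream[(String, String)],
                numPartitions: Int): DStream[(String, List[String])] =
  stream.combineByKey[List[String]](
    (v: String) => List(v),                        // create a combiner from the first value
    (acc: List[String], v: String) => v :: acc,    // merge a value into an existing combiner
    (a: List[String], b: List[String]) => a ::: b, // merge combiners across partitions
    new HashPartitioner(numPartitions))

// Usage: groupPerKey(mapped2, 8).foreachRDD(...)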

Iterate across columns in spark dataframe and calculate min max value

痴心易碎 submitted on 2019-12-12 04:12:15
Question: I want to iterate across the columns of a DataFrame in my Spark program and calculate the min and max value of each. I'm new to Spark and Scala, and I am not able to iterate over the columns once I fetch them into a DataFrame. I have tried running the code below, but it needs the column number to be passed to it; the question is how do I fetch the column names from the DataFrame, pass them dynamically, and store the results in a collection. val parquetRDD = spark.read.parquet("filename.parquet") parquetRDD.collect.foreach ({ i => parquetRDD …
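A sketch of computing the min and max of every column in a single pass, assuming the Spark 2.x DataFrame API; "filename.parquet" comes from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

val spark = SparkSession.builder().appName("ColumnMinMax").getOrCreate()
val df = spark.read.parquet("filename.parquet")

// Build min/max expressions for every column dynamically from df.columns,
// then run one aggregation instead of collecting the data to the driver.
val aggExprs = df.columns.flatMap(c => Seq(min(c).as(s"min_$c"), max(c).as(s"max_$c")))
val minMax = df.agg(aggExprs.head, aggExprs.tail: _*)
minMax.show()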

Spark Streaming custom receiver in “Python” (receive UDP over socket)

删除回忆录丶 submitted on 2019-12-12 03:55:49
Question: The programming guide mentions that in Spark Streaming, custom receivers can be developed in Java or Scala: http://spark.apache.org/docs/latest/streaming-custom-receivers.html However, I am wondering if a custom receiver can also be developed in Python. Specifically, I am looking to receive a UDP data stream over a socket in Python, i.e. a device streams data in UDP format to a given IP address and port number, and I want to receive it in Spark Streaming. If …
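Since the linked guide only documents Scala and Java receivers, below is a Scala sketch of a UDP receiver (the port parameter and buffer size are assumptions); it could be packaged in a jar and registered from the driver:

import java.net.{DatagramPacket, DatagramSocket}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class UdpReceiver(port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive datagrams on a background thread so onStart() returns quickly.
    new Thread("UDP Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* the receive loop exits once the receiver is stopped */ }

  private def receive(): Unit = {
    val socket = new DatagramSocket(port)
    val buffer = new Array[Byte](2048)
    try {
      while (!isStopped()) {
        val packet = new DatagramPacket(buffer, buffer.length)
        socket.receive(packet) // blocks until a datagram arrives
        store(new String(packet.getData, 0, packet.getLength, "UTF-8"))
      }
    } finally {
      socket.close()
    }
  }
}

// Usage from a StreamingContext: ssc.receiverStream(new UdpReceiver(9999))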

MessageHandler in KafkaUtils010 SparkStreaming

扶醉桌前 submitted on 2019-12-12 03:34:17
Question: I want to group per topic, or know which topic a message comes from, when applying: val stream = KafkaUtils.createDirectStream[String, String]( ssc, PreferConsistent, Subscribe[String, String]( Array(topicConfig.srcTopic), kafkaParameters(BOOTSTRAP_SERVERS, "kafka_test_group_id") ) ) However, the latest kafka010 API does not seem to support a message handler as previous versions did. Any idea how to get the topic? My goal is to consume from N topics and process them (in different ways …
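A sketch built on the stream above: in the 0.10 API the old messageHandler is gone, but each record is a ConsumerRecord that carries its own topic, so the metadata can be kept in a map step (the topic name "topicA" is an assumption):

import org.apache.kafka.clients.consumer.ConsumerRecord

// Keep the topic alongside the value; the per-record metadata replaces
// the messageHandler from the 0.8 API.
val withTopic = stream.map { record: ConsumerRecord[String, String] =>
  (record.topic, record.value)
}

// Route records differently per topic, e.g. by filtering:
val topicA = withTopic.filter { case (topic, _) => topic == "topicA" }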

Failed to get broadcast_1_piece0 of broadcast_1 in Spark Streaming job

有些话、适合烂在心里 submitted on 2019-12-12 03:28:02
Question: I am running Spark jobs on YARN in cluster mode. The job gets messages from a Kafka direct stream. I am using broadcast variables and checkpointing every 30 seconds. When I start the job the first time, it runs fine without any issue. If I kill the job and restart it, it throws the exception below in the executor upon receiving a message from Kafka: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1 at org.apache.spark.util.Utils$.tryOrIOException(Utils …
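A sketch of the commonly used workaround (the object, field, and loader names are assumptions): broadcast variables are not restored from a checkpoint, so instead of capturing them in the checkpointed DAG, they are re-created lazily through a singleton after a restart.

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Lazily instantiated singleton: after a restart from the checkpoint the
// field is null in the new JVM, so the broadcast is rebuilt on first use.
object LookupTable {
  @volatile private var instance: Broadcast[Map[String, String]] = _

  def getInstance(sc: SparkContext): Broadcast[Map[String, String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(loadTable()) // hypothetical loader
        }
      }
    }
    instance
  }

  private def loadTable(): Map[String, String] = Map("k" -> "v")
}

// Inside foreachRDD, resolve the broadcast from the RDD's SparkContext:
// stream.foreachRDD { rdd => val table = LookupTable.getInstance(rdd.sparkContext); ... }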

Null value in spark streaming from Kafka

≯℡__Kan透↙ submitted on 2019-12-12 03:28:01
Question: I have a simple program that tries to receive data from Kafka. When I start a Kafka producer and send data, for example "Hello", I get (null, Hello) when I print the message, and I don't know why this null appears. Is there any way to avoid this null? I think it's due to the Tuple2<String, String> first parameter, but I only want to print the second parameter. And another thing: when I print it using System.out.println("inside map " + message); it does not appear any …
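The null is the record key: the Kafka console producer sends messages without a key, so each (key, value) pair arrives as (null, "Hello"). A minimal sketch in Scala (the question itself uses the Java API; the stream name messages is an assumption) of keeping only the value:

// Each Kafka record arrives as a (key, value) pair; the key is null unless
// the producer explicitly sets one, so keep only the value.
val values = messages.map(_._2)
values.print()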

Add streaming Listener to Stop after first Iteration

我是研究僧i submitted on 2019-12-12 03:24:57
Question: I need to know how to implement a StreamingListener to stop a streaming application. I need a small example with dependencies, imports, and implementation code, similar to detecting connection lost in spark streaming. Can you help me?
Source: https://stackoverflow.com/questions/36874021/add-streaming-listener-to-stop-after-first-iteration
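A minimal sketch, assuming the goal is to stop after the first completed batch; the stop call runs on a separate thread because stopping the context from inside the listener callback can deadlock:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class StopAfterFirstBatchListener(ssc: StreamingContext) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // Stop asynchronously: blocking the listener thread would hang the scheduler.
    new Thread("stopper") {
      override def run(): Unit = ssc.stop(stopSparkContext = true, stopGracefully = true)
    }.start()
  }
}

// Register before ssc.start():
// ssc.addStreamingListener(new StopAfterFirstBatchListener(ssc))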

how to use spark streaming + kafka with streamingListener

我的梦境 submitted on 2019-12-12 03:24:23
Question: I have a situation here. I want my application to connect to Kafka once, read the offsets, perform an action, and then stop. I was reading about StreamingListener to detect when the first iteration occurs, but I don't know how to use a StreamingListener to stop my application. Can you help me? I am using Spark 1.4. Example code below: val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2) lines.foreachRDD( rdd => { rdd.saveAsTextFile("......") }) sys …
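A sketch wiring the listener pattern from the previous question into this Spark 1.4 job (zkQuorum, group, and topicMap come from the question; the output path is a hypothetical stand-in for the elided one):

import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Stop after the first batch completes; the stop runs on its own thread.
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    new Thread("stopper") {
      override def run(): Unit = ssc.stop(true, true)
    }.start()
  }
})

val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
lines.foreachRDD(rdd => rdd.saveAsTextFile("/tmp/first-batch")) // hypothetical path

ssc.start()
ssc.awaitTermination()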

Spark : get Multiple DStream out of a single DStream

房东的猫 submitted on 2019-12-12 02:57:12
Question: Is it possible to get multiple DStreams out of a single DStream in Spark? My use case is as follows: I am getting a stream of log data from an HDFS file. Each log line contains an id (id=xyz), and I need to process the lines differently based on the id. So I was trying to create a different DStream for each id from the input DStream. I couldn't find anything related in the documentation. Does anyone know how this can be achieved in Spark, or can point to any link for this? Thanks
Answer 1: You cannot split multiple DStreams from …
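A sketch of the usual alternative the answer hints at: derive one filtered DStream per id rather than splitting (logLines and the id list are assumptions):

import org.apache.spark.streaming.dstream.DStream

// One filtered view of the same input per id; each can get its own logic.
val ids = Seq("xyz", "abc")
val perId: Map[String, DStream[String]] =
  ids.map(id => id -> logLines.filter(_.contains(s"id=$id"))).toMap

perId("xyz").foreachRDD(rdd => rdd.foreach(line => println(line)))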

Are shared variables supported in Spark Streaming?

廉价感情. submitted on 2019-12-12 02:55:37
Question: Spark provides two limited types of shared variables for two common usage patterns: broadcast variables and accumulators. Are they supported in Spark Streaming?
Answer 1: Yes, you can use them as you normally would via SparkContext. The only difference is that you need to get it from the StreamingContext: val sparkConf = new SparkConf().setAppName("MyApp") val ssc = new StreamingContext(sparkConf, Seconds(1)) ssc.sparkContext.broadcast(myValue) With Spark Streaming you might want to update that …
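Expanding the answer into a short sketch (the stop-word set stands in for myValue; the accumulator and the commented usage are assumptions, and longAccumulator assumes Spark 2.x):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("MyApp")
val ssc = new StreamingContext(sparkConf, Seconds(1))

// Both shared-variable types are created through the underlying SparkContext.
val stopWords = ssc.sparkContext.broadcast(Set("the", "a"))
val dropped = ssc.sparkContext.longAccumulator("droppedWords")

// Given some words: DStream[String], count filtered-out words on the executors:
// words.foreachRDD { rdd =>
//   rdd.foreach(w => if (stopWords.value.contains(w)) dropped.add(1))
// }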