spark-streaming

Can Apache Spark merge several similar lines into one line?

不想你离开。 Submitted on 2019-12-11 08:04:01
Question: I am totally new to Apache Spark, so I am very sorry if my question seems naive, but I did not find a clear answer on the internet. Here is the context of my problem: I want to retrieve JSON input data from an Apache Kafka server. The format is as follows: {"deviceName":"device1", "counter":125} {"deviceName":"device1", "counter":125} {"deviceName":"device2", "counter":88} {"deviceName":"device1", "counter":125} {"deviceName":"device2", "counter":88} {"deviceName":"device1",
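A minimal sketch of one way to do this, assuming the goal is to collapse duplicate records within each micro-batch and keep a count per distinct line; the socket source, host, and port are placeholders standing in for the Kafka stream described above:

```scala
// Minimal sketch: collapse duplicate lines within each micro-batch and count them.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("merge-duplicate-lines")
val ssc  = new StreamingContext(conf, Seconds(5))

// Placeholder source; in the real job this would be the Kafka stream's values,
// e.g. KafkaUtils.createDirectStream(...).map(_.value()).
val lines = ssc.socketTextStream("localhost", 9999)

// Use each raw JSON line as the key and sum its occurrences,
// producing one record per distinct line together with its count.
val merged = lines
  .map(line => (line, 1L))
  .reduceByKey(_ + _)

merged.print()
ssc.start()
ssc.awaitTermination()
```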

ERROR Error cleaning broadcast Exception [duplicate]

痞子三分冷 Submitted on 2019-12-11 07:36:10
Question: This question already has answers here: What are possible reasons for receiving TimeoutException: Futures timed out after [n seconds] when working with Spark [duplicate] (4 answers). Closed 2 years ago. I get the following error while running my Spark Streaming application. We have a large application running multiple stateful (with mapWithState) and stateless operations. It is getting difficult to isolate the error since Spark itself hangs, and the only error we see is in the Spark log and
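The linked duplicate points at RPC and cleanup timeouts under driver or executor pressure; below is a hedged sketch of configuration knobs that are commonly raised for this symptom (illustrative values, not a confirmed fix for this application):

```scala
// Hedged sketch: settings commonly raised when ContextCleaner times out while
// removing broadcasts. Values are illustrative, not a confirmed fix.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("stateful-streaming-app")
  .set("spark.network.timeout", "600s")                     // default is 120s
  .set("spark.cleaner.referenceTracking.blocking", "false") // don't block the driver on cleanup RPCs
```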

Spark Streaming is not streaming, but waits showing consumer-config values

好久不见. Submitted on 2019-12-11 07:33:23
Question: I am trying to stream data using the spark-streaming-kafka-0-10_2.11 artifact, version 2.1.1, along with spark-streaming_2.11 version 2.1.1 and kafka_2.11 (version 0.10.2.1). When I start the program, Spark does not stream data, nor does it throw any error. Below are my code, dependencies, and response. Dependencies: "org.apache.kafka" % "kafka_2.11" % "0.10.2.1", "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.1", "org.apache.spark" % "spark-streaming_2.11" % "2.1.1" Code:
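A minimal, self-contained sketch of the spark-streaming-kafka-0-10 setup being described, with assumed broker, topic, and group names; the comments flag the pieces whose absence most often makes such a job sit idle while only printing consumer configuration:

```scala
// Minimal sketch (assumed broker/topic/group names) of a spark-streaming-kafka-0-10 consumer.
// If ssc.start()/awaitTermination() are missing, or auto.offset.reset is "latest" while no
// new messages arrive, the job will appear to "wait" after logging the consumer config.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "test-group",
  "auto.offset.reset"  -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-010-test"), Seconds(5))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("test-topic"), kafkaParams))

stream.map(_.value()).print()
ssc.start()
ssc.awaitTermination()
```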

KafkaConsumer is not safe for multi-threaded access from SparkStreaming

别说谁变了你拦得住时间么 Submitted on 2019-12-11 07:17:20
Question: I have set up multiple streams reading from different Kafka topics: for(topic <- topics) { val stream = KafkaUtils.createDirectStream[String, String]( ssc, PreferConsistent, Subscribe[String, String]( Array(topicConfig.srcTopic), kafkaParameters() ) ) stream.map(...).reduce(...)... } kafkaParameters basically has the needed config: "bootstrap.servers" -> bootstrapServers, "key.deserializer" -> classOf[StringDeserializer], "value.deserializer" -> classOf[StringDeserializer], "group.id" ->
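One commonly suggested workaround, sketched below with placeholder broker and topic names, is to give every direct stream in the application its own group.id so the cached KafkaConsumer behind each stream is never shared across threads (subscribing a single stream to all topics is the other usual option):

```scala
// Hedged sketch with placeholder names: one direct stream per topic, each with its
// own group.id, so no cached KafkaConsumer is accessed from more than one stream.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc    = new StreamingContext(new SparkConf().setAppName("multi-topic"), Seconds(5))
val topics = Seq("topicA", "topicB")   // placeholder topic names

def kafkaParams(groupId: String): Map[String, Object] = Map(
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> groupId)

topics.foreach { topic =>
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent,
    Subscribe[String, String](Seq(topic), kafkaParams(s"my-app-$topic")))
  stream.map(_.value()).print()   // stands in for the real map/reduce pipeline
}

ssc.start()
ssc.awaitTermination()
```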

How to extract timed-out sessions using mapWithState

让人想犯罪 __ Submitted on 2019-12-11 07:16:18
Question: I am updating my code to switch from updateStateByKey to mapWithState in order to get users' sessions based on a time-out of 2 minutes (2 is used for testing purposes only). Each session should aggregate all the streaming data (JSON strings) within a session before the time-out. This was my old code: val membersSessions = stream.map[(String, (Long, Long, List[String]))](eventRecord => { val parsed = Utils.parseJSON(eventRecord) val member_id = parsed.getOrElse("member_id", "") val timestamp =
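A hedged sketch of the mapWithState pattern for this use case, with assumed key and value types: the accumulated events for a member are emitted only when the 2-minute idle timeout fires, which is signalled by state.isTimingOut():

```scala
// Hedged sketch with assumed types: key = member_id, value = one JSON event,
// state = the events seen so far in that member's session.
import org.apache.spark.streaming.{Minutes, State, StateSpec}

def trackSession(key: String, value: Option[String], state: State[List[String]])
    : Option[(String, List[String])] = {
  if (state.isTimingOut()) {
    // No event arrived for 2 minutes: emit the accumulated events as a finished session.
    Some((key, state.get()))
  } else {
    state.update(value.get :: state.getOption().getOrElse(Nil))
    None                            // session still open, emit nothing yet
  }
}

val spec = StateSpec.function(trackSession _).timeout(Minutes(2))

// Assuming `events` is a DStream[(String, String)] of (member_id, jsonString):
// val timedOutSessions = events.mapWithState(spec).flatMap(opt => opt)
```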

Spark Standalone: TransportRequestHandler: Error while invoking RpcHandler - when starting workers on different machine/VMs

岁酱吖の Submitted on 2019-12-11 06:47:16
Question: I am totally new at this, so please pardon any obvious mistakes. Exact errors: At the slave: INFO TransportClientFactory: Successfully created connection to /10.2.10.128:7077 after 69 ms (0 ms spent in bootstraps) WARN Worker: Failed to connect to master 10.2.10.128:7077 At the master: INFO Master: I have been elected leader! New state: ALIVE ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 7626954048526157749 A little background and things I have tried/taken

2 Spark Streaming jobs with the same consumer group id

自作多情 Submitted on 2019-12-11 06:46:23
Question: I am trying to experiment with consumer groups. Here is my code snippet: public final class App { private static final int INTERVAL = 5000; public static void main(String[] args) throws Exception { Map<String, Object> kafkaParams = new HashMap<>(); kafkaParams.put("bootstrap.servers", "xxx:9092"); kafkaParams.put("key.deserializer", StringDeserializer.class); kafkaParams.put("value.deserializer", StringDeserializer.class); kafkaParams.put("auto.offset.reset", "earliest"); kafkaParams.put("enable
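A hedged sketch, in Scala rather than the question's Java and with assumed broker and topic names, of the parameter the experiment hinges on: as with any Kafka consumers, jobs that share a group.id are placed in one consumer group and split the topic's partitions between them, while distinct group.id values let each job read the complete topic independently:

```scala
// Hedged sketch (assumed broker/topic names): build identical streaming jobs that
// differ only in group.id, to compare shared vs. separate consumer groups.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

def buildContext(groupId: String): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName(s"consumer-$groupId"), Seconds(5))
  val params = Map[String, Object](
    "bootstrap.servers"  -> "xxx:9092",
    "key.deserializer"   -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "auto.offset.reset"  -> "earliest",
    "group.id"           -> groupId)
  KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Seq("mytopic"), params))
    .map(_.value()).print()
  ssc
}

// Run each in its own application/JVM to compare the behaviour:
// buildContext("shared-group").start()   // job 1
// buildContext("shared-group").start()   // job 2 (same group: partitions are split)
```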

Reading binaryFile with Spark Streaming

ⅰ亾dé卋堺 Submitted on 2019-12-11 06:39:50
Question: Does anyone know how to set up streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory) to actually consume binary files? Where can I find all the InputFormat classes? The documentation gives no links for that. I imagine that the ValueClass is related to the InputFormatClass somehow. In the non-streaming version, using the binaryFiles method, I can get byte arrays for each file. Is there a way I can get the same with Spark Streaming? If not, where can I find those
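fileStream expects a Hadoop "new API" InputFormat, and the key and value classes are whatever that InputFormat produces. Below is a hedged sketch using FixedLengthInputFormat, which ships with Hadoop, assuming each binary record has a known fixed size; getting whole files as byte arrays the way sc.binaryFiles does would need a custom InputFormat instead:

```scala
// Hedged sketch: FixedLengthInputFormat (org.apache.hadoop.mapreduce.lib.input) reads
// fixed-size binary records as (LongWritable offset, BytesWritable record).
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("binary-stream"), Seconds(10))

val hadoopConf = new Configuration()
FixedLengthInputFormat.setRecordLength(hadoopConf, 1024)   // assumed record size in bytes

val records = ssc.fileStream[LongWritable, BytesWritable, FixedLengthInputFormat](
  "/data/incoming",                                // assumed directory being watched
  (path: Path) => !path.getName.startsWith("."),   // skip hidden/temp files
  true,                                            // newFilesOnly
  hadoopConf)

// copyBytes() makes a trimmed copy, since Hadoop may reuse Writable instances.
records.map { case (_, bytes) => bytes.copyBytes() }.count().print()

ssc.start()
ssc.awaitTermination()
```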

How to get the cartesian product of two DStream in Spark Streaming with Scala?

穿精又带淫゛_ Submitted on 2019-12-11 06:13:34
Question: I have two DStreams. Let A: DStream[X] and B: DStream[Y]. I want to get their Cartesian product, in other words, a new C: DStream[(X, Y)] containing all the pairs of X and Y values. I know there is a cartesian function for RDDs. I was only able to find this similar question, but it's in Java and so does not answer my question. Answer 1: The Scala equivalent of the linked question's answer (ignoring Time v3, which isn't used there) is A.transformWith(B, (rddA: RDD[X], rddB: RDD[Y]) => rddA
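For completeness, a self-contained version of the transformWith plus RDD.cartesian approach the answer refers to; the queue streams exist only to make the sketch runnable, and A and B would be whatever DStreams the application already has:

```scala
// Per batch interval, pair every element of A's RDD with every element of B's RDD.
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val ssc = new StreamingContext(new SparkConf().setAppName("cartesian"), Seconds(5))

val a: DStream[Int]    = ssc.queueStream(mutable.Queue(ssc.sparkContext.makeRDD(Seq(1, 2))))
val b: DStream[String] = ssc.queueStream(mutable.Queue(ssc.sparkContext.makeRDD(Seq("x", "y"))))

val c: DStream[(Int, String)] =
  a.transformWith(b, (rddA: RDD[Int], rddB: RDD[String]) => rddA.cartesian(rddB))

c.print()
ssc.start()
ssc.awaitTermination()
```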

Offsets committed out of order with Spark DataSource API V2 Hive Streaming Sink

℡╲_俬逩灬. Submitted on 2019-12-11 05:49:25
Question: I am using a custom sink implementation to save a Spark (2.3) Structured Streaming DataFrame into a Hive table. The code is as follows. val df = spark.readStream.format("socket").option("host", "localhost").option("port", 19191).load().as[String] val query = df.map { s => val records = s.split(",") assert(records.length >= 4) (records(0).toInt, records(1), records(2), records(3)) } query.selectExpr("_1 as eid", "_2 as name", "_3 as salary", "_4 as designation"). writeStream. format("hive
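A hedged sketch of the write side of such a pipeline: the "hive" format here stands for the asker's custom DataSource V2 sink, so its sink-specific options are not shown, and the checkpoint location and trigger are assumptions. Structured Streaming replays batches from the checkpoint after failures, so a custom sink is generally expected to handle re-delivered batches idempotently rather than assume strictly increasing commits:

```scala
// Hedged sketch (assumed checkpoint path and trigger); format("hive") refers to the
// asker's custom DataSource V2 sink, not a built-in Spark format.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("hive-sink").getOrCreate()
import spark.implicits._

val df = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 19191).load().as[String]

val parsed = df.map { s =>
  val r = s.split(",")
  (r(0).toInt, r(1), r(2), r(3))
}.selectExpr("_1 as eid", "_2 as name", "_3 as salary", "_4 as designation")

val query = parsed.writeStream
  .format("hive")                                           // the custom sink
  .option("checkpointLocation", "/tmp/hive-sink-checkpoint") // assumed path
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()

query.awaitTermination()
```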