spark-streaming

How can I perform an operation on two windowed DStreams with an offset?

你说的曾经没有我的故事 submitted on 2019-12-11 04:15:19
Question: I'd like to compute the difference (by key) of two DStreams with different windows. This could be accomplished with a join. However, I want to have an offset between the DStreams. One way to do this would be to drop N windows of one of the DStreams, but I don't know how to do that either. Source: https://stackoverflow.com/questions/26445407/how-can-i-perform-an-operation-on-two-windowed-dstreams-with-an-offset
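The join-and-diff part is straightforward once both DStreams are keyed; the offset itself is the hard part, since the DStream API has no built-in "lag" operator. Below is a minimal sketch of joining two windowed DStreams by key and taking the per-key difference, assuming a socket source of "key value" lines and illustrative window lengths (all names and durations are made up, not taken from the question):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedDiff {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedDiff")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Lines of the form "key value" arriving on a socket (assumed input source).
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map(_.split(" "))
      .map(a => (a(0), a(1).toLong))

    // Sum per key over a 30s window and over a 60s window, both sliding every 5s.
    val shortWin = pairs.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(30), Seconds(5))
    val longWin  = pairs.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(60), Seconds(5))

    // Join by key and compute the difference between the two windowed aggregates.
    val diff = shortWin.join(longWin).mapValues { case (s, l) => l - s }
    diff.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```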

Why does Spark Streaming fail with ClassCastException with repartitioned dstream when accessing offsets?

喜你入骨 submitted on 2019-12-11 04:14:55
Question: In my Spark application I create a DStream from a Kafka topic in the following way: KafkaUtils .createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder, (String, Array[Byte])]( streamingContext, kafkaParams, offset.get, { message: MessageAndMetadata[String, Array[Byte]] => (message.key(), message.message()) } ) and later I commit the offsets back to the Kafka topic using an asInstanceOf cast: directKafkaStream.foreachRDD { rdd => val offsetRanges = rdd.asInstanceOf[HasOffsetRanges] //
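The cast fails because only the KafkaRDD produced directly by createDirectStream implements HasOffsetRanges; once the stream is repartitioned or otherwise transformed, the resulting RDDs do not. A minimal sketch of the pattern from the Spark Kafka integration guide, reusing the directKafkaStream from the snippet above: capture the offset ranges in the first transformation, before any repartition (the partition count and processing body are placeholders):

```scala
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Only the first RDD in the chain, produced by the direct stream itself,
// is a KafkaRDD and therefore a HasOffsetRanges. Grab the ranges there.
var offsetRanges = Array.empty[OffsetRange]

directKafkaStream
  .transform { rdd =>
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
  }
  .repartition(10)          // placeholder partition count
  .foreachRDD { rdd =>
    // process the repartitioned rdd ...
    // then write offsetRanges to your offset store (ZooKeeper, Kafka, a DB) here
  }
```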

Spark 2.2 Structured Streaming foreach writer JDBC sink lag

依然范特西╮ submitted on 2019-12-11 04:07:56
Question: I'm working on a project that uses Spark 2.2 Structured Streaming to read Kafka messages into an Oracle database. The message flow into Kafka is about 4,000-6,000 messages per second. When the sink destination is the HDFS file system, it works fine. When using the foreach JDBC writer, a huge delay builds up over time. I think the lag is caused by the foreach loop. The JDBC sink class (stand-alone class file): class JDBCSink(url: String, user: String, pwd: String) extends org.apache.spark.sql.ForeachWriter[org.apache
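A common cause of this kind of lag is one synchronous round trip to Oracle per row inside process(). Below is a minimal sketch, assuming a two-column key/value record and table (the table name, column names, and the (String, String) record type are assumptions, not from the question), that batches inserts inside the ForeachWriter instead:

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.ForeachWriter

// Sketch only: batch inserts per partition instead of one round trip per row.
class BatchedJdbcSink(url: String, user: String, pwd: String)
    extends ForeachWriter[(String, String)] {

  private var connection: Connection = _
  private var statement: PreparedStatement = _
  private var count = 0

  override def open(partitionId: Long, version: Long): Boolean = {
    connection = DriverManager.getConnection(url, user, pwd)
    connection.setAutoCommit(false)
    statement = connection.prepareStatement("INSERT INTO kafka_msg (k, v) VALUES (?, ?)")
    true
  }

  override def process(record: (String, String)): Unit = {
    statement.setString(1, record._1)
    statement.setString(2, record._2)
    statement.addBatch()
    count += 1
    if (count % 1000 == 0) statement.executeBatch()   // flush every 1000 rows
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (statement != null) { statement.executeBatch(); statement.close() }
    if (connection != null) { connection.commit(); connection.close() }
  }
}
```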

Why are there so many tasks in my Spark Streaming job?

南楼画角 submitted on 2019-12-11 03:05:24
Question: I wonder why there are so many tasks in my Spark Streaming job, and why the number keeps growing. After 3.2 hours of running it had grown to 120020, and after one day of running it will reach one million. Why? Answer 1: I would strongly recommend that you check the parameter spark.streaming.blockInterval, which is a very important one. It defaults to 200 ms, i.e. each receiver creates one block (and therefore one task per batch) every 200 ms. So maybe you can try to increase spark.streaming.blockInterval to 1min or
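A minimal sketch of the suggested change (the values are illustrative, not prescriptive): raise the block interval so each receiver cuts fewer blocks per batch, since every block becomes one task in the job that processes that batch:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("fewer-tasks")
  // Default is 200 ms; with a 10 s batch that means 50 blocks (tasks) per receiver per batch.
  .set("spark.streaming.blockInterval", "1s")
val ssc = new StreamingContext(conf, Seconds(10))
```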

Gradual Increase in old generation heap memory

*爱你&永不变心* submitted on 2019-12-11 03:03:24
Question: I am facing a very strange issue in Spark Streaming. I am using Spark 2.0.2, 3 nodes, 3 executors {1 receiver and 2 processors}, 2 GB of memory per executor, 1 core per executor. The batch interval is 10 seconds. My batch size is approx. 1000 records (approx. 150 KB). The processing time per batch gradually increases from 2 seconds initially to a few minutes, but for the first 40-50 hours it runs quite well. After that, scheduling delay and processing time start shooting up. I had
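No answer is quoted in this excerpt, but for long-running receiver-based jobs the usual first steps are GC and cleanup tuning before digging into application state. A sketch of settings commonly tried (the specific values, and the choice of CMS, are assumptions rather than a diagnosis of this particular job):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Drop RDDs of finished batches promptly (this is already the default).
  .set("spark.streaming.unpersist", "true")
  // Force periodic GC on driver and executors so stale shuffle/broadcast state is cleaned.
  .set("spark.cleaner.periodicGC.interval", "15min")
  // Verbose GC logging plus CMS, to see whether old-gen growth is real garbage or retained state.
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
```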

Reading files from Apache Spark textFileStream

冷暖自知 submitted on 2019-12-11 03:02:33
Question: I'm trying to read/monitor txt files from a Hadoop file system directory. But I've noticed that all the txt files inside this directory are themselves directories, as shown in the example below: /crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/_SUCCESS /crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/part-00000 /crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/part
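Those *.txt entries are saveAsTextFile-style output directories, and textFileStream only picks up files that appear directly under the monitored path. One option, assuming a reasonably recent Spark version that expands glob patterns against directories when scanning for new files, is to hand the stream a glob that matches the output directories (the path and the existing StreamingContext ssc are assumptions):

```scala
// Monitor every "*.txt" output directory under /crawlerOutput; new part-files
// appearing directly inside a matched directory are read as they arrive.
val lines = ssc.textFileStream("hdfs:///crawlerOutput/*.txt")
lines.print()
```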

Saving Twitter streams into a single file with Spark Streaming, Scala

浪子不回头ぞ submitted on 2019-12-11 02:46:01
Question: After help from this answer, Spark Streaming: Join Dstream batches into single output Folder, I was able to create a single file for my Twitter streams. However, now I don't see any tweets being saved in this file. Please find my code snippet for this below. What am I doing wrong? val ssc = new StreamingContext(sparkConf, Seconds(5)) val stream = TwitterUtils.createStream(ssc, None, filters) val tweets = stream.map(r => r.getText) tweets.foreachRDD{rdd => val sqlContext = SQLContextSingleton
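Before debugging the SQL path, it helps to confirm that tweets are actually arriving and being written on each batch. A minimal sketch, reusing the tweets DStream from the snippet above (the output path is an assumption), that writes each non-empty batch under one base directory; merging everything into literally one file would then be a separate step, e.g. with HDFS tooling:

```scala
tweets.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // One sub-directory per batch; coalesce(1) keeps a single part-file per batch.
    rdd.coalesce(1).saveAsTextFile(s"/tmp/tweets/batch-${time.milliseconds}")
  }
}
```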

Getting an empty set while reading data from Kafka with Spark Streaming

我的未来我决定 submitted on 2019-12-11 02:25:55
Question: Hi, I am new to Spark Streaming. I am trying to read an XML file and send it to a Kafka topic. Here is my Kafka code, which sends data to the Kafka console consumer. Code: package org.apache.kafka.Kafka_Producer; import java.io.BufferedReader; import java.io.FileNotFoundException; import java.io.FileReader; import java.io.IOException; import java.util.Properties; import java.util.concurrent.ExecutionException; import kafka
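The excerpt only shows the producer side. Whatever the producer does, an empty result on the Spark side usually comes down to a mismatch in topic name, ZooKeeper quorum, or consumer group. A minimal sketch of the receiver-based consumer for the old 0.8 API (the topic, quorum, group id, and the existing StreamingContext ssc are assumptions) that should print whatever reaches the topic:

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaStream = KafkaUtils.createStream(
  ssc,                     // existing StreamingContext
  "localhost:2181",        // ZooKeeper quorum (assumption)
  "spark-xml-consumer",    // consumer group id (assumption)
  Map("xml-topic" -> 1)    // topic -> number of receiver threads (assumption)
)
kafkaStream.map(_._2).print()   // print message values only
```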

Number of input rows in spark structured streaming with custom sink

雨燕双飞 submitted on 2019-12-11 02:24:57
Question: I'm using a custom sink in Structured Streaming (Spark 2.2.0) and noticed that Spark produces incorrect metrics for the number of input rows - it's always zero. My stream construction: StreamingQuery writeStream = session .readStream() .schema(RecordSchema.fromClass(TestRecord.class)) .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB) .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF) .csv(s3Path.toString()) .as(Encoders.bean(TestRecord.class)) .flatMap( ((FlatMapFunction<TestRecord,
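One way to see exactly what Spark reports per micro-batch, independent of any UI, is a StreamingQueryListener; if numInputRows is genuinely zero there, the problem lies in how the source metrics are attributed rather than in the custom sink itself. A minimal sketch (in Scala, although the question's snippet is Java, and assuming an existing SparkSession named spark):

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Log the built-in per-batch metrics as each micro-batch completes.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"batch=${event.progress.batchId} numInputRows=${event.progress.numInputRows}")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})
```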

Spark Streaming: NullPointerException inside foreachPartition

别说谁变了你拦得住时间么 submitted on 2019-12-11 02:16:57
Question: I have a Spark Streaming job which reads from Kafka and does some comparisons with an existing table in Postgres before writing to Postgres again. This is what it looks like: val message = KafkaUtils.createStream(...).map(_._2) message.foreachRDD( rdd => { if (!rdd.isEmpty){ val kafkaDF = sqlContext.read.json(rdd) println("First") kafkaDF.foreachPartition( i =>{ val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> "jdbc:postgresql://...", "dbtable" -> "table", "user" -> "user",
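The usual culprit here is that sqlContext (like SparkSession) only exists on the driver, so referencing it inside foreachPartition, which runs on executors, dereferences null. A minimal sketch of doing the per-partition work with a plain JDBC connection instead, reusing kafkaDF from the snippet above (the connection string keeps the question's "..." placeholder, and the comparison logic is only indicated by a comment):

```scala
import java.sql.DriverManager

kafkaDF.foreachPartition { rows =>
  // Open one plain JDBC connection per partition on the executor; do not touch
  // sqlContext / SparkSession here, since they are driver-side objects.
  val conn = DriverManager.getConnection("jdbc:postgresql://...", "user", "password")
  try {
    rows.foreach { row =>
      // look up / compare against the existing Postgres table and write back here
    }
  } finally {
    conn.close()
  }
}
```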