spark-streaming

Why do multiple print() methods in Spark Streaming affect the values in my list?

痴心易碎 submitted on 2019-12-14 04:18:58
Question: I'm trying to receive one JSON line every two seconds, store the lines in a List whose elements belong to a custom class I created, and print the resulting List after each batch of the context. So I'm doing something like this:

    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
    JavaReceiverInputDStream<String> streamData = ssc.socketTextStream(args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
    JavaDStream<LinkedList<StreamValue>>
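For comparison, below is a minimal Scala sketch of the same pattern using a single output action per batch; StreamValue is an invented placeholder for the asker's custom class, and the JSON parsing is left trivial. With only one output operation, each batch is evaluated once, which avoids re-running side-effecting transformations the way several print() calls can.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Hypothetical stand-in for the asker's custom class.
    case class StreamValue(rawJson: String)

    object SocketJsonList {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SocketJsonList")
        val ssc = new StreamingContext(conf, Seconds(2))

        val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)

        // Wrap each incoming JSON line in the custom type.
        val values = lines.map(line => StreamValue(line))

        // Single output action: collect the batch to the driver and print it once.
        values.foreachRDD { rdd =>
          val batch = rdd.collect().toList
          println(batch)
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }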

Refresh Dataframe in Spark real-time Streaming without stopping process

大兔子大兔子 submitted on 2019-12-14 03:53:23
Question: In my application I get a stream of accounts from a Kafka queue (using Spark Streaming with Kafka), and I need to fetch attributes related to these accounts from S3. I'm planning to cache the resulting S3 DataFrame, since the S3 data will not be updated for at least a day for now, though that may soon change to every hour or every 10 minutes. So the question is: how can I refresh the cached DataFrame periodically without stopping the process? **Update: I'm planning to publish an event into Kafka whenever there is an
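A common workaround, sketched below in Scala, is to hold the reference DataFrame in a driver-side variable and reload plus re-cache it when it is older than some interval, checking inside the per-batch loop. The S3 path, format, refresh interval, and object/variable names are all assumptions, not the asker's code:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object AccountAttributesCache {
      // Placeholder location and refresh cadence.
      private val s3Path = "s3a://my-bucket/account-attributes/"
      private val refreshIntervalMs = 60 * 60 * 1000L

      @volatile private var cached: DataFrame = _
      @volatile private var lastLoaded = 0L

      // Returns the cached DataFrame, reloading it from S3 when it is stale.
      def get(spark: SparkSession): DataFrame = synchronized {
        val now = System.currentTimeMillis()
        if (cached == null || now - lastLoaded > refreshIntervalMs) {
          if (cached != null) cached.unpersist()      // drop the stale cached copy
          cached = spark.read.parquet(s3Path).cache() // reload and re-cache
          lastLoaded = now
        }
        cached
      }
    }

    // Used inside the streaming loop, e.g.:
    //   accountsStream.foreachRDD { rdd =>
    //     val attrs = AccountAttributesCache.get(spark)
    //     // ... convert rdd to a DataFrame and join it with attrs ...
    //   }

The Kafka-event-driven refresh mentioned in the update could replace the time check with a flag that is flipped when a control message arrives.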

Python - Send Integer or String to Spark-Streaming

六月ゝ 毕业季﹏ submitted on 2019-12-14 02:45:39
Question: I can send my data through a CSV file: first I write my random numbers into the CSV file, then send it. But is it possible to send the data directly? My socket code:

    import socket
    host = 'localhost'
    port = 8080
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, port))
    s.listen(1)
    while True:
        print('\nListening for a client at', host, port)
        conn, addr = s.accept()
        print('\nConnected by', addr)
        try:
            print('\nReading file...\n')
            while 1:
                out = "test01"
                print('Sending line', line)
                conn.send
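For reference, whatever newline-terminated text the socket server writes can be consumed directly by Spark Streaming without going through a CSV file; a minimal Scala sketch of that receiving side (host, port, and the integer parsing are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("SocketNumbers")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Each newline-terminated string sent on the socket arrives as one record.
    val lines = ssc.socketTextStream("localhost", 8080)

    // If the sender writes plain integers, one per line, they can be parsed directly.
    val numbers = lines.map(_.trim.toInt)
    numbers.print()

    ssc.start()
    ssc.awaitTermination()

On the sending side, each value just needs to be encoded as text and terminated with a newline before it is written to the connection.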

kafka and Spark: Get first offset of a topic via API

喜夏-厌秋 submitted on 2019-12-14 02:30:56
Question: I am playing with Spark Streaming and Kafka (with the Scala API) and would like to read messages from a set of Kafka topics with Spark Streaming. The following method:

    val kafkaParams = Map("metadata.broker.list" -> configuration.getKafkaBrokersList(),
                          "auto.offset.reset" -> "smallest")
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

reads from Kafka up to the latest available offset, but doesn't give me the metadata that I need (since I am
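For reference, with a Kafka 0.10+ client on the classpath, the earliest offsets of a topic can be fetched directly through the consumer API before starting the stream; a Scala sketch (broker address and topic name are placeholders):

    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder broker list
    props.put("group.id", "offset-inspector")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    val topic = "my-topic" // placeholder topic

    // Discover the topic's partitions, then ask for their first offsets.
    val partitions = consumer.partitionsFor(topic).asScala
      .map(p => new TopicPartition(p.topic, p.partition))
      .asJava

    val earliest = consumer.beginningOffsets(partitions) // Map[TopicPartition, java.lang.Long]
    earliest.asScala.foreach { case (tp, offset) =>
      println(s"${tp.topic}-${tp.partition} starts at offset $offset")
    }
    consumer.close()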

Aggregate data based on timestamp in JavaDStream of spark streaming

旧城冷巷雨未停 submitted on 2019-12-14 01:56:43
Question: I am writing a Spark Streaming job in Java which takes input records from Kafka. The records are available in a JavaDStream as a custom Java object. A sample record is:

    TimeSeriesData: {tenant_id='581dd636b5e2ca009328b42b', asset_id='5820870be4b082f136653884', bucket='2016', parameter_id='58218d81e4b082f13665388b', timestamp=Mon Aug 22 14:50:01 IST 2016, window=null, value='11.30168'}

Now I want to aggregate this data based on the minute, hour, day, and week of the "timestamp" field. My question is,
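One way to frame the aggregation is to key each record by its timestamp truncated to the desired granularity and then reduce by key. A Scala sketch follows, with an invented TimeSeriesData case class and a simple sum; the asker's actual class and aggregation function may differ:

    import java.time.Instant
    import java.time.temporal.ChronoUnit
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical stand-in for the asker's custom record type.
    case class TimeSeriesData(parameterId: String, timestamp: Long, value: Double)

    // Truncate an epoch-millis timestamp to the start of its hour, in UTC.
    // Swap ChronoUnit.HOURS for MINUTES or DAYS to change the granularity.
    def hourBucket(epochMillis: Long): Long =
      Instant.ofEpochMilli(epochMillis).truncatedTo(ChronoUnit.HOURS).toEpochMilli

    // Key each record by (parameter, hour bucket) and sum the values per batch.
    def hourlySums(records: DStream[TimeSeriesData]): DStream[((String, Long), Double)] =
      records
        .map(r => ((r.parameterId, hourBucket(r.timestamp)), r.value))
        .reduceByKey(_ + _)

This only aggregates within each micro-batch; rolling aggregates across batches need a windowed or stateful operation such as reduceByKeyAndWindow or updateStateByKey.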

Why does Spark application fail with "Exception in thread "main" java.lang.NoClassDefFoundError: …StringDeserializer"?

寵の児 submitted on 2019-12-13 17:12:02
Question: I am developing a Spark application that listens to a Kafka stream using Spark and Java. I use kafka_2.10-0.10.2.1. I have set various Kafka properties: bootstrap.servers, key.deserializer, value.deserializer, etc. My application compiles fine, but when I submit it, it fails with the following error:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/StringDeserializer

I do use StringDeserializer for key.deserializer and value
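A NoClassDefFoundError for a Kafka class at submit time usually means the Kafka client jars are on the compile classpath but not on the runtime classpath. Below is a sketch of the dependency side in sbt notation (the Maven equivalents use the same coordinates; the versions are examples and must match the installed Spark and Kafka). The same jars also need to reach the executors, for example via an assembly/uber jar or spark-submit --packages:

    // build.sbt (versions are illustrative; align them with the cluster's Spark/Kafka)
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming"            % "2.2.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0",
      // Brings in org.apache.kafka.common.serialization.StringDeserializer
      "org.apache.kafka" %  "kafka-clients"              % "0.10.2.1"
    )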

Spark Streaming Kafka direct stream processing time performance spikes

两盒软妹~` submitted on 2019-12-13 15:51:27
Question: I have a Spark Streaming job that reads data from a Kafka cluster using the direct approach. There is a cyclical spike in processing times that I cannot explain and that is not reflected in the Spark UI metrics. The following image shows the pattern (batch time = 10s). The issue is reproducible every time the job is run. There is no data in the Kafka logs to be read, so there is no real processing of note to perform. I would expect the line to be flat, near the minimum value to serialize

Back pressure in Kafka

让人想犯罪 __ submitted on 2019-12-13 12:28:10
Question: I have a situation in Kafka where the producer publishes messages at a much higher rate than the consumer can consume them. I have to implement back pressure in Kafka for further consumption and processing. Please let me know how I can implement this in Spark and also with the plain Java API.

Answer 1: Kafka acts as the regulator here. You produce into Kafka at whatever rate you want, scaling the brokers out to accommodate the ingest rate. You then consume as you want to; Kafka
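On the Spark Streaming side, back pressure is normally turned on through configuration rather than code; a minimal Scala sketch (the per-partition rate cap is optional and the value shown is only an example):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("BackpressureExample")
      // Let Spark Streaming adapt the ingestion rate to the observed processing rate.
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional hard ceiling per Kafka partition, in records per second.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")

    val ssc = new StreamingContext(conf, Seconds(10))

With the plain Java consumer API, back pressure amounts to polling only as fast as downstream processing allows, since the consumer pulls records rather than having them pushed.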

Spark Streaming Window Operation

人盡茶涼 submitted on 2019-12-13 11:45:50
Question: The following is simple code to get the word count over a window size of 30 seconds and a slide size of 10 seconds.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming._
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.api.java.function._
    import org.apache.spark.streaming.api._
    import org.apache.spark.storage.StorageLevel

    val ssc = new StreamingContext(sc, Seconds(5))
    // read from text file
    val lines0 = ssc.textFileStream("test")
    val words0 = lines0
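The snippet is cut off, but a standard windowed word count along those lines (a self-contained sketch, not necessarily the asker's exact code) looks roughly like this:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("WindowedWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Monitor a directory for new text files (placeholder path).
    val lines = ssc.textFileStream("test")

    // Split into words, pair with 1, and count over a 30s window sliding every 10s.
    val words = lines.flatMap(_.split("\\s+"))
    val counts = words.map(w => (w, 1)).reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()

Both the window (30s) and the slide (10s) are multiples of the 5-second batch interval, which Spark Streaming requires.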

How to read logs from file in kafka?

孤者浪人 submitted on 2019-12-13 09:02:53
Question: I want to read Apache logs into Kafka and then process them further in Spark Streaming. I am new to Kafka. As far as I understand, I have to write a producer class to read the log file.

Answer 1: You can do so by creating a connector which sources each line of the log file into a Kafka topic. Check out the example here: https://docs.confluent.io/current/connect/devguide.html#connect-developing-simple-connector

Source: https://stackoverflow.com/questions/46508901/how-to-read-logs-from-file-in-kafka
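If running a Connect worker is more than needed, the asker's original idea of a small producer that reads the log file also works; a rough Scala sketch (broker address, file path, and topic name are placeholders, and a real implementation would also tail newly appended lines):

    import java.util.Properties
    import scala.io.Source
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object LogFileProducer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        // Send each existing line of the Apache access log to a topic.
        for (line <- Source.fromFile("/var/log/apache2/access.log").getLines()) {
          producer.send(new ProducerRecord[String, String]("apache-logs", line))
        }
        producer.close()
      }
    }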