spark-streaming

Joining streaming data with table data and updating the table as the stream arrives, is it possible?

我与影子孤独终老i submitted on 2019-12-11 19:46:33
Question: I am using spark-sql 2.4.1, spark-cassandra-connector_2.11-2.4.1.jar and Java 8. I have a scenario where I need to join streaming data with C*/Cassandra table data. If a matching record is found on the join, I need to copy the existing C* table record to another table, table_bkp, and update the actual C* table record with the latest data. I need to perform this as the streaming data comes in. Can this be done with spark-sql streaming? If so, how do I do it, and are there any caveats to take care of? For each batch, how do I get the C* table data?
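
One possible approach (a sketch, not from the original question): Structured Streaming's foreachBatch sink combined with the DataFrame read/write support of the Cassandra connector. The stream (streamingDF), the SparkSession (spark), the keyspace/table names and the join key id are all assumed names for illustration, and the connector's append mode is relied on for Cassandra's upsert-by-primary-key semantics.

    import org.apache.spark.sql.DataFrame

    // Sketch: for each micro-batch, join against the current C* table,
    // back up the matched rows, then write the latest data back.
    streamingDF.writeStream
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        val cassandraDF = spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "ks", "table" -> "my_table"))   // hypothetical names
          .load()

        // Existing C* rows whose key appears in the incoming batch -> backup table.
        cassandraDF.join(batchDF.select("id"), Seq("id"))
          .write
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "ks", "table" -> "my_table_bkp"))
          .mode("append")
          .save()

        // Upsert the latest streaming records into the actual table.
        batchDF.write
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "ks", "table" -> "my_table"))
          .mode("append")
          .save()
      }
      .start()

Reading the whole Cassandra table every batch can be expensive; the connector also offers an RDD-level joinWithCassandraTable that fetches only the matching keys, which may be worth considering.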

Import statements taking time on Spark executors (PySpark executors)

十年热恋 submitted on 2019-12-11 17:31:36
Question: I am developing a Python prediction script using Spark (PySpark) Streaming and Keras. The prediction happens on the executor, where I call model.predict(). The modules I have imported are

    from keras.layers.core import Dense, Activation, Dropout
    from keras.layers.recurrent import LSTM
    from keras.models import Sequential

I have checked that these imports take 2.5 seconds to load on the Spark driver (2 cores + 2 GB). What is surprising to me is that each time the executor gets the job, it …

Why is the RDD always empty during real-time Kafka data ingestion into HBase via PySpark?

眉间皱痕 submitted on 2019-12-11 17:19:02
Question: I am trying to do real-time Kafka data ingestion into HBase via PySpark, following this tutorial. Everything seems to be working fine. I start Kafka with

    sudo /usr/local/kafka/bin/kafka-server-start.sh /usr/local/kafka/config/server.properties

and then I run the producer:

    /usr/local/kafka/bin/kafka-console-producer.sh --broker-list=myserver:9092 --topic test

Then I run the source code shown below. I send messages in the producer, however rdd.isEmpty() is always true, so I never reach the line with print("=some …

How to add the timestamp from Kafka to Spark Streaming data when converting to a DataFrame

别说谁变了你拦得住时间么 submitted on 2019-12-11 17:17:36
Question: I am doing Spark Streaming from Kafka and I want to convert the RDDs coming from Kafka into a DataFrame. I am using the following approach:

    val ssc = new StreamingContext("local[*]", "KafkaExample", Seconds(4))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "dofff2.dl.uk.feefr.com:8002",
      "security.protocol" -> "SASL_PLAINTEXT",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "1",
      "auto.offset.reset" -> "latest",
      "enable.auto …
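
One way to carry the Kafka timestamp into the DataFrame (a sketch, not taken from the original post): every ConsumerRecord exposes timestamp() in epoch milliseconds, so it can be mapped out alongside the key and value before calling toDF. The stream variable and the SparkSession spark below are assumed to exist.

    // Sketch: assumes `stream` is an InputDStream[ConsumerRecord[String, String]]
    // from KafkaUtils.createDirectStream and `spark` is a SparkSession.
    stream.foreachRDD { rdd =>
      import spark.implicits._
      val df = rdd
        .map(record => (record.timestamp(), record.key(), record.value()))
        .toDF("kafka_timestamp_ms", "key", "value")
      df.show(5, truncate = false)   // replace with the real downstream processing
    }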

Avoiding data loss when slow consumers force backpressure in stream processing (Spark, AWS)

为君一笑 submitted on 2019-12-11 17:13:13
Question: I'm new to distributed stream processing (Spark). I've read some tutorials/examples that cover how backpressure results in the producer(s) slowing down in response to overloaded consumers. The classic example given is ingesting and analyzing tweets: when there is an unexpected spike in traffic such that the consumers are unable to handle the load, they apply backpressure and the producer responds by adjusting its rate downwards. What I don't really see covered is what approaches are used in …
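
For context only (this is not part of the question): in the DStream-based Kafka integration, rate control bounds how much each batch pulls while the unread data remains buffered durably in Kafka, so slowing down does not by itself lose records. A minimal configuration sketch:

    import org.apache.spark.SparkConf

    // Sketch: enable dynamic rate control and cap per-partition intake so a slow
    // consumer falls behind in Kafka (which retains the data) instead of dropping records.
    val conf = new SparkConf()
      .setAppName("BackpressureExample")
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000") // records/sec/partition; tune for your workload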

How to check the Spark config for an application in the Ambari UI, posted with Livy

五迷三道 submitted on 2019-12-11 17:00:23
Question: I am posting jobs to a Spark cluster using the Livy APIs. I want to increase the spark.network.timeout value, and I am passing that value (600s) via the conf field in the Livy POST call. How can I verify that it is correctly honoured and applied to the posted jobs? Source: https://stackoverflow.com/questions/55690915/how-to-check-spark-config-for-an-application-in-ambari-ui-posted-with-livy
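
One way to confirm the setting took effect (a suggestion, not from the original post) is to read the effective value back from inside the running application; it also shows up under the Environment tab of that application's Spark UI, typically reachable from the YARN ResourceManager UI. A minimal Scala sketch, assuming the job has a SparkSession named spark:

    // Sketch: print the effective value of spark.network.timeout from inside the job.
    // If the Livy `conf` field was honoured, this should print "600s" (the default is "120s").
    val timeout = spark.sparkContext.getConf.get("spark.network.timeout", "120s")
    println(s"Effective spark.network.timeout = $timeout")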

Stream data using Spark from a particular partition within Kafka topics

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-11 15:45:52
Question: I have already seen a similar question (click here), but I still want to know: is streaming data from a particular partition not possible? I have used Kafka consumer strategies in the Spark Streaming subscribe method:

    ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, offsets)

This is the code snippet I tried for subscribing to a topic and partition:

    val topics = Array("cdc-classic")
    val topic = "cdc-classic"
    val partition = 2
    val offsets = Map(new TopicPartition(topic, partition) -> …
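
As a hedged note (not part of the question): Subscribe attaches to every partition of the listed topics, whereas the same Kafka 0.10 integration also provides ConsumerStrategies.Assign, which takes an explicit list of TopicPartitions. A sketch reusing the question's names, with ssc and kafkaParams assumed to be defined as in a normal direct-stream setup:

    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    // Sketch: read only partition 2 of the "cdc-classic" topic.
    val partitionToRead = new TopicPartition("cdc-classic", 2)
    val fromOffsets = Map(partitionToRead -> 0L)   // starting offset; adjust as needed

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Assign[String, String](Seq(partitionToRead), kafkaParams, fromOffsets)
    )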

How to decrease the processing time for each batch using Spark Streaming?

♀尐吖头ヾ submitted on 2019-12-11 14:58:06
Question: My goal is to extract data from Kafka using Spark Streaming, transform the data, and store it into an S3 bucket as Parquet files, in folders based on the date (partitioned data for faster queries in Athena). My main problem is that the number of active batches increases during the process and I want to have only one active batch. I have a delay problem; I have tried different configurations and cluster sizes to make each batch finish in less time than the batch interval. For example, if …
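
Not part of the original question, but as an illustration of the Kafka to S3 Parquet pattern it describes, here is a Structured Streaming sketch. The broker, topic, bucket paths and trigger interval are placeholders; keeping the trigger interval above the typical processing time of a batch is what prevents batches from piling up.

    import org.apache.spark.sql.functions.{col, to_date}
    import org.apache.spark.sql.streaming.Trigger

    // Sketch: Kafka source -> date-partitioned Parquet on S3 (placeholder names throughout).
    val kafkaDF = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .load()

    val withDate = kafkaDF
      .selectExpr("CAST(value AS STRING) AS value", "timestamp")
      .withColumn("dt", to_date(col("timestamp")))

    withDate.writeStream
      .format("parquet")
      .partitionBy("dt")
      .option("path", "s3a://my-bucket/events/")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
      .trigger(Trigger.ProcessingTime("5 minutes"))   // keep this above the typical batch processing time
      .start()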

Jobs are queued up in Spark Streaming

时光毁灭记忆、已成空白 submitted on 2019-12-11 14:41:12
Question: I have a three-node Spark cluster on Amazon EC2; two of the nodes are in the 1-a availability zone and one is in 1-b. I started Spark Streaming using 2 cores: one core is consumed by the receiver and the other core is left for task processing. When the executor for task processing runs in the 1-a zone it works perfectly fine, but if the executor for task processing starts in the 1-b zone, batch jobs get queued up heavily and the processing time keeps increasing. I am attaching the screenshot …

How to use SparkSession and StreamingContext together?

吃可爱长大的小学妹 submitted on 2019-12-11 12:48:56
Question: I'm trying to stream CSV files from a folder on my local machine (OSX). I'm using SparkSession and StreamingContext together like so:

    val sc: SparkContext = createSparkContext(sparkContextName)
    val sparkSess = SparkSession.builder().config(sc.getConf).getOrCreate()
    val ssc = new StreamingContext(sparkSess.sparkContext, Seconds(time))
    val csvSchema = new StructType().add("field_name", StringType)
    val inputDF = sparkSess.readStream.format("org.apache.spark.csv").schema(csvSchema).csv("file://…
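
For illustration only (the original snippet is cut off above): in Spark 2.x a SparkSession and a StreamingContext can share the same SparkContext, but the readStream line belongs to Structured Streaming and needs its own writeStream query to actually run; the DStream ssc has a separate lifecycle. A minimal sketch, assuming inputDF was built completely:

    // Sketch: start the Structured Streaming query defined via readStream.
    val query = inputDF.writeStream
      .format("console")      // placeholder sink for testing
      .outputMode("append")
      .start()

    // If DStream-based processing on `ssc` is also defined, it is started separately:
    // ssc.start()
    // ssc.awaitTermination()

    query.awaitTermination()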