spark-streaming

Why does the memory usage of a Spark worker increase with time?

Submitted by 大城市里の小女人 on 2020-01-03 05:17:07
Question: I have a Spark Streaming application running that uses the mapWithState function to track the state of an RDD. The application runs fine for a few minutes but then crashes with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 373. I observed that the memory usage of the Spark application increases linearly over time even though I have set the timeout for mapWithStateRDD. Please see the code snippet below and the memory usage: val completedSess = sessionLines.mapWithState
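
Below is a minimal sketch of the kind of mapWithState pipeline the question describes, with an idle-state timeout set through StateSpec.timeout; the session logic, input source, timeout value, and checkpoint path are assumptions for illustration, not taken from the question.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec, StreamingContext}

// Hypothetical state function: keeps a per-key line count as the "session" state.
def trackSession(key: String, value: Option[String], state: State[Long]): Option[(String, Long)] = {
  if (state.isTimingOut()) {
    None  // state for this key is being dropped because it sat idle past the timeout
  } else {
    val updated = state.getOption.getOrElse(0L) + 1L
    state.update(updated)
    Some((key, updated))
  }
}

val conf = new SparkConf().setAppName("session-tracking").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/checkpoint")  // mapWithState requires a checkpoint directory

// Placeholder source; keyed by the first comma-separated field of each line.
val sessionLines = ssc.socketTextStream("localhost", 9999).map(line => (line.split(",")(0), line))

val completedSess = sessionLines.mapWithState(
  StateSpec.function(trackSession _).timeout(Minutes(30)))
completedSess.print()

ssc.start()
ssc.awaitTermination()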

All masters are unresponsive!? Spark master is not responding with DataStax architecture

Submitted by 核能气质少年 on 2020-01-03 04:35:10
Question: I tried using both the Spark shell and spark-submit, and I am getting this exception:
Initializing SparkContext with MASTER: spark://1.2.3.4:7077
ERROR 2015-06-11 14:08:29 org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
WARN 2015-06-11 14:08:29 org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend: Application ID is not initialized yet.
ERROR 2015-06-11 14:08:30 org.apache.spark.scheduler.TaskSchedulerImpl
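
The "All masters are unresponsive" message generally means the driver could not reach a master at the configured spark:// URL (wrong host or port, or no master actually running there). As a point of comparison, here is a minimal sketch of setting the master explicitly when building the context; the URL is a placeholder and must match what the master itself reports in its UI and logs.

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder master URL: spark://<host>:<port>, port 7077 by default for a standalone master.
val conf = new SparkConf()
  .setAppName("connectivity-check")
  .setMaster("spark://1.2.3.4:7077")

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 10).count())  // trivial job to confirm the master accepted the application
sc.stop()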

How to access the statistics endpoint for a Spark Streaming application?

Submitted by 寵の児 on 2020-01-03 03:09:08
Question: As of Spark 2.2.0, there are new endpoints in the API for getting information about streaming jobs. I run Spark on EMR clusters, using Spark 2.2.0 in cluster mode. When I hit the endpoint for my streaming jobs, all it gives me is the error message: no streaming listener attached to <stream name>. I've dug through the Spark codebase a bit, but this feature is not very well documented. So I'm curious: is this a bug? Is there some configuration I need to do to get this endpoint working? This
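
For reference, the streaming statistics are served through Spark's monitoring REST API under /api/v1; below is a minimal sketch of polling that endpoint from Scala. The host, port, and application id are placeholders, and in cluster mode on EMR the request usually has to go through the YARN proxy or history server rather than straight to the driver.

import scala.io.Source

// Placeholder values: driver (or proxy) host, UI port, and the application id from the Spark UI.
val host = "localhost"
val port = 4040
val appId = "application_1234_0001"

val url = s"http://$host:$port/api/v1/applications/$appId/streaming/statistics"
val json = Source.fromURL(url).mkString
println(json)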

Spark Streaming on Kafka: print different cases for different values from Kafka

Submitted by 强颜欢笑 on 2020-01-03 02:24:05
Question: I am stating my scenario below: 10,000 servers are sending DF size data (every 5 seconds, 10,000 inputs come in). If for any server the DF size is more than 70%, print "increase the ROM size by 20%". If for any server the DF size used is less than 30%, print "decrease the ROM size by 25%". I am providing code that takes messages from Kafka, matches on "%" and does to.upper(); this code is just a reference for my Kafka details. Can anyone please help me with the scenario? package rnd
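
A minimal sketch of the threshold logic on top of a Kafka receiver stream is below; the message format (a hypothetical "serverId,dfUsedPercent" string), topic, ZooKeeper quorum, and consumer group are all assumptions, not the asker's actual setup.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("df-size-alerts").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Assumed message format: "server42,83" meaning server42 is at 83% DF usage.
val messages = KafkaUtils
  .createStream(ssc, "localhost:2181", "df-group", Map("df-topic" -> 1))
  .map(_._2)

messages.foreachRDD { rdd =>
  rdd.collect().foreach { msg =>                 // collected to the driver so the println is visible there
    val Array(serverId, pct) = msg.split(",")    // assumes well-formed two-field messages
    pct.trim.toInt match {
      case p if p > 70 => println(s"$serverId: increase the ROM size by 20%")
      case p if p < 30 => println(s"$serverId: decrease the ROM size by 25%")
      case _           => // within bounds, nothing to print
    }
  }
}

ssc.start()
ssc.awaitTermination()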

Rules engine for Stream Analytics on Azure

Submitted by 懵懂的女人 on 2020-01-03 01:34:45
Question: I'm new to Azure and to analytics. I'm trying to understand the streaming alert rules engine. I have used some sample data as input and have queries to filter the data. However, I'm not sure what a rules engine means: is it just queries, or is there more to it? And is there a way we can have all the rules in one, and if yes, how? Answer 1: The main way to define logic for ASA is to use SQL, which provides a way to define rules with SQL statements (e.g. SELECT DeviceID ... WHERE temperature>50). Multiple

Spark Structured Streaming: using different windows for different groupBy keys

Submitted by 空扰寡人 on 2020-01-02 12:02:53
Question: Currently I have the following table after reading from a Kafka topic via Spark Structured Streaming:
key,timestamp,value
-----------------------------------
key1,2017-11-14 07:50:00+0000,10
key1,2017-11-14 07:50:10+0000,10
key1,2017-11-14 07:51:00+0000,10
key1,2017-11-14 07:51:10+0000,10
key1,2017-11-14 07:52:00+0000,10
key1,2017-11-14 07:52:10+0000,10
key2,2017-11-14 07:50:00+0000,10
key2,2017-11-14 07:51:00+0000,10
key2,2017-11-14 07:52:10+0000,10
key2,2017-11-14 07:53:00+0000,10
I would like
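
One way to get per-key window sizes in Structured Streaming is to split the stream by key, aggregate each branch with its own window, and run the branches as separate queries; below is a minimal sketch under that approach. The Kafka options and the 1-minute/2-minute durations are illustrative placeholders, and the sketch uses the Kafka record key and timestamp directly in place of the parsed columns shown above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("per-key-windows").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
  .option("subscribe", "events")                        // placeholder topic
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "timestamp", "CAST(value AS STRING) AS value")

// key1 aggregated over 1-minute windows, key2 over 2-minute windows (illustrative choice).
val key1Agg = events.filter($"key" === "key1")
  .groupBy(window($"timestamp", "1 minute"), $"key")
  .count()

val key2Agg = events.filter($"key" === "key2")
  .groupBy(window($"timestamp", "2 minutes"), $"key")
  .count()

// Two independent queries, one per window size.
key1Agg.writeStream.outputMode("complete").format("console").queryName("key1-1min").start()
key2Agg.writeStream.outputMode("complete").format("console").queryName("key2-2min").start()
spark.streams.awaitAnyTermination()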

Spark Streaming: Input Rate and File stream [0] always show “Avg: 0.00 events/sec”

Submitted by 丶灬走出姿态 on 2020-01-02 07:12:13
Question: I am running Spark 1.5.2 with the code below. It prints the count correctly at regular intervals, but in the Spark Streaming UI the Input Rate and File stream [0] always show "Avg: 0.00 events/sec". Note: each file contains a single line containing a JSON string. I have also tried with each file containing multiple lines; still the same issue.
object main {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("test")
    val sc = new SparkContext(conf)
    val ssc = new
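
A self-contained version of that kind of job is sketched below; the input directory is a placeholder, and files have to land in it (atomically, e.g. via a move) after the stream starts for textFileStream to pick them up. The zero in the Input Rate graph is commonly explained by textFileStream not using a receiver, so in these older Spark versions its record counts are not reported to the streaming UI even though the batches do process data.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object main {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("test")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))

    // Placeholder directory; only files added after the stream starts are counted.
    val lines = ssc.textFileStream("/tmp/streaming-input")
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}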

Disabling _spark_metadata in Structured Streaming in Spark 2.3.0

Submitted by 纵然是瞬间 on 2020-01-02 05:46:12
Question: My Structured Streaming application is writing to Parquet and I want to get rid of the _spark_metadata folder it is creating. I used the property below and it seemed fine: --conf "spark.hadoop.parquet.enable.summary-metadata=false". When the application starts, no _spark_metadata folder is generated. But once it moves to RUNNING status and starts processing messages, it fails with the error below, saying the _spark_metadata folder doesn't exist. It seems structured streaming is relying on this folder without
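
For context, a minimal sketch of the kind of Parquet sink being described is below (broker, topic, and paths are placeholders). Note that _spark_metadata is written by Structured Streaming's file sink itself as its transaction log, not by the Parquet summary-metadata writer that the property above controls, which would explain why the sink later fails when the folder is missing.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("parquet-sink").getOrCreate()

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
  .option("subscribe", "events")                        // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

// The file sink keeps its own log under <output path>/_spark_metadata;
// it is part of the sink's exactly-once bookkeeping rather than optional metadata.
val query = input.writeStream
  .format("parquet")
  .option("path", "/data/out")                // placeholder output path
  .option("checkpointLocation", "/data/chk")  // placeholder checkpoint path
  .start()

query.awaitTermination()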

Kafka Spark Streaming data not getting written into Cassandra; zero rows inserted

Submitted by 帅比萌擦擦* on 2020-01-01 20:36:31
Question: While writing data to Cassandra from Spark, the data is not getting written. The background: I am doing a Kafka-Spark Streaming-Cassandra integration. I am reading Kafka messages and trying to put them into a Cassandra table: CREATE TABLE TEST_TABLE(key INT PRIMARY KEY, value TEXT). Kafka to Spark Streaming is running fine, but from Spark to Cassandra there is some issue: the data is not getting written to the table. I am able to create a connection with Cassandra, but the data is not getting inserted into the
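
A minimal sketch of the Cassandra write path with the Spark Cassandra Connector is below; the keyspace name, message format, and the socket source (standing in for the Kafka stream) are assumptions for illustration.

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-to-cassandra")
  .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder Cassandra host

val ssc = new StreamingContext(conf, Seconds(5))

// Assumed message format: "1,hello" -> (1, "hello"); a socket source stands in for Kafka here.
val pairs = ssc.socketTextStream("localhost", 9999).map { line =>
  val Array(k, v) = line.split(",", 2)
  (k.trim.toInt, v)
}

// Column names must line up with the table: TEST_TABLE(key INT PRIMARY KEY, value TEXT),
// which Cassandra stores as test_table since the identifier was not quoted.
pairs.saveToCassandra("test_ks", "test_table", SomeColumns("key", "value"))

ssc.start()
ssc.awaitTermination()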

How to connect Spark Streaming with Cassandra?

Submitted by 巧了我就是萌 on 2020-01-01 15:32:12
Question: I'm using Cassandra v2.1.12, Spark v1.4.1, and Scala 2.10, and Cassandra is listening on rpc_address: 127.0.1.1, rpc_port: 9160. For example, to connect Kafka and Spark Streaming, listening to Kafka every 4 seconds, I have the following Spark job:
sc = SparkContext(conf=conf)
stream = StreamingContext(sc, 4)
map1 = {'topic_name': 1}
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", map1)
Spark Streaming keeps listening to the Kafka broker every 4 seconds and outputs the contents.
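
For reference, a minimal Scala sketch (the question itself uses the Python API) of pointing the Spark Cassandra Connector at the cluster is below; the keyspace and table names are hypothetical, and note that the connector talks the native protocol (default port 9042) rather than the Thrift rpc_port 9160 listed above.

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-check")
  .set("spark.cassandra.connection.host", "127.0.1.1")  // the rpc_address from the question

val sc = new SparkContext(conf)

// Quick connectivity check against a hypothetical existing keyspace/table.
val rows = sc.cassandraTable("my_keyspace", "my_table")
println(rows.count())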