spark-streaming

How to run Spark application assembled with Spark 2.1 on cluster with Spark 1.6?

爱⌒轻易说出口 posted on 2019-12-12 15:39:53
Question: I've been told that I can build a Spark application with one version of Spark and, as long as I use sbt assembly to build it, run it with spark-submit on any Spark cluster. So I built my simple application with Spark 2.1.1; you can see my build.sbt file below. Then I start it on my cluster with: cd spark-1.6.0-bin-hadoop2.6/bin/ spark-submit --class App --master local[*] /home/oracle/spark_test/db-synchronizer.jar So, as you see, I'm executing it with Spark 1.6.0, and
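A minimal build.sbt sketch of the pattern usually suggested for this, marking Spark as "provided" so the assembly jar does not bundle its own Spark and whatever version the cluster ships is picked up at runtime; the project name and version numbers are assumptions, and whether a jar built against 2.1 actually runs on a 1.6 runtime still depends on binary compatibility, which those two versions do not guarantee:

    name := "db-synchronizer"
    version := "0.1"
    scalaVersion := "2.11.8"

    // "provided" keeps Spark out of the assembly jar, so the cluster's own
    // Spark classes are used at runtime instead of the ones compiled against.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "2.1.1" % "provided",
      "org.apache.spark" %% "spark-streaming" % "2.1.1" % "provided"
    )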

How to access static resources in jar (that correspond to src/main/resources folder)?

对着背影说爱祢 posted on 2019-12-12 13:27:18
Question: I have a Spark Streaming application built with Maven (as a jar) and deployed with the spark-submit script. The application project follows the standard directory layout:

    myApp
      src
        main
          scala
            com.mycompany.package
              MyApp.scala
              DoSomething.scala
              ...
          resources
            aPerlScript.pl
            ...
        test
          scala
            com.mycompany.package
              MyAppTest.scala
              ...
      target
        ...
      pom.xml

In the DoSomething.scala object I have a method (let's call it doSomething()) that tries to execute a Perl script -- aPerlScript.pl (from the
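A minimal Scala sketch of one way to reach a packaged resource, assuming the script sits at the root of the resources folder: because aPerlScript.pl lives inside the jar it has no plain filesystem path, so it is read as a classpath stream and copied to a temporary file before being executed (the perl invocation is hypothetical):

    import java.nio.file.{Files, StandardCopyOption}

    object DoSomething {
      def doSomething(): Unit = {
        // Resources under src/main/resources end up inside the jar, so open a
        // stream via the classloader instead of constructing a File path.
        val in = getClass.getResourceAsStream("/aPerlScript.pl")
        val tmp = Files.createTempFile("aPerlScript", ".pl")
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
        in.close()
        // Run the extracted copy with the system's perl interpreter (assumption).
        new ProcessBuilder("perl", tmp.toString).inheritIO().start().waitFor()
      }
    }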

Difference between RDDs and Batches in Spark?

大城市里の小女人 posted on 2019-12-12 11:32:00
Question: An RDD is a collection of elements partitioned across the nodes of the cluster; it is Spark's core component and abstraction. Batches: the Spark Streaming API simply divides the data into batches, and those batches are also collections of the same streaming objects/elements. Depending on the requirement, a set of batches is defined in the form of a time-based batch window or an intensive online activity-based batch window. What is the difference between RDDs and batches exactly? Answer 1: RDDs and batches are essentially different but related
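The relationship is easiest to see in code. In the hedged sketch below (host, port and batch interval are arbitrary), the DStream cuts the incoming stream into 5-second batches, and each batch is handed to user code as an ordinary RDD via foreachRDD:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BatchVsRdd {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("batch-vs-rdd").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))      // batch interval: 5 seconds
        val lines = ssc.socketTextStream("localhost", 9999)   // one DStream of text lines
        // Every 5 seconds the DStream produces one batch, exposed here as an RDD.
        lines.foreachRDD { rdd =>
          println(s"new batch arrived as an RDD with ${rdd.getNumPartitions} partitions")
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }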

Spark Streaming: Application health

北城以北 posted on 2019-12-12 11:01:14
Question: I have a Kafka-based Spark Streaming application that runs every 5 minutes. Looking at the statistics after 5 days of running, there are a few observations: the processing time gradually increases from 30 seconds to 50 seconds (the snapshot below highlights the processing-time chart), and a good number of garbage-collection logs are appearing, as shown below. Questions: Is there a good explanation why the processing time has increased substantially, even when the number of events is more or less
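When processing time creeps up alongside heavy GC activity, a common first step is to turn on detailed GC logging so the pause pattern can be inspected; a hedged spark-submit sketch of that, with placeholder class, jar and memory values:

    spark-submit \
      --class com.example.MyStreamingApp \
      --driver-memory 4g --executor-memory 4g \
      --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      --conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      my-streaming-app.jar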

Checkpoint RDD ReliableCheckpointRDD has different number of partitions from original RDD

和自甴很熟 posted on 2019-12-12 08:55:09
Question: I have a Spark cluster of two machines, and when I run a Spark Streaming application I get the following error: Exception in thread "main" org.apache.spark.SparkException: Checkpoint RDD ReliableCheckpointRDD[11] at print at StatefulNetworkWordCount.scala:78(1) has different number of partitions from original RDD MapPartitionsRDD[10] at updateStateByKey at StatefulNetworkWordCount.scala:76(2) at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:73)
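For reference, a hedged sketch of the updateStateByKey pattern the trace points at, with the checkpoint directory on storage that every node can reach (the HDFS path, host and port are assumptions); checkpointing to a node-local path on a multi-machine cluster is a commonly reported way to end up with checkpointed RDDs whose partitioning no longer matches the original:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StatefulWordCountSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StatefulWordCountSketch")
        val ssc = new StreamingContext(conf, Seconds(1))
        // Checkpoint to a directory visible from every node (e.g. HDFS),
        // not to a path that exists only on the driver's local disk.
        ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
        val updateFunc = (values: Seq[Int], state: Option[Int]) =>
          Some(values.sum + state.getOrElse(0))
        val counts = words.map((_, 1)).updateStateByKey[Int](updateFunc)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }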

Spark mapWithState shuffles all data to one node

隐身守侯 posted on 2019-12-12 08:48:07
Question: I am working on a Scala (2.11) / Spark (1.6.1) streaming project and using mapWithState() to keep track of data seen in previous batches. The state is split into 20 partitions, created with StateSpec.function(trackStateFunc _).numPartitions(20). I had hoped to distribute the state throughout the cluster, but it seems that each node holds the complete state and execution is always performed on exactly one node. Locality Level Summary: Node local: 50 is shown in the UI for each batch, and
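A hedged Scala sketch of the adjustment usually tried here (the function body and partition count are illustrative): give the StateSpec an explicit partitioner and repartition the keyed input the same way, so updates are not all scheduled onto one node. Whether this is sufficient also depends on how the source DStream is partitioned:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.{State, StateSpec}
    import org.apache.spark.streaming.dstream.DStream

    object DistributedStateSketch extends Serializable {
      // Keeps a running total per key; stands in for the real trackStateFunc.
      def trackStateFunc(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
        val sum = value.getOrElse(0).toLong + state.getOption.getOrElse(0L)
        state.update(sum)
        (key, sum)
      }

      def withDistributedState(pairs: DStream[(String, Int)]): DStream[(String, Long)] = {
        val partitioner = new HashPartitioner(20)
        val spec = StateSpec.function(trackStateFunc _).partitioner(partitioner)
        // Hash-partition the keyed input so each node only updates the state for
        // the keys that hash to it, instead of one node holding everything.
        pairs.transform(_.partitionBy(partitioner)).mapWithState(spec)
      }
    }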

Uncaught Exception Handling in Spark

笑着哭i posted on 2019-12-12 08:34:01
Question: I am working on a Java-based Spark Streaming application which responds to messages that come through a Kafka topic. For each message, the application does some processing and writes the results back to a different Kafka topic. Sometimes, due to unexpected data-related issues, the code that operates on RDDs might fail and throw an exception. When that happens, I would like to have a generic handler that could take the necessary action and drop a message onto an error topic. Right now, these
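The question is about Java, but the pattern is the same in either API; a hedged Scala sketch (topic names and the processing/producer stubs are hypothetical) of wrapping the per-record work in a try/catch inside foreachRDD and routing failures to an error topic:

    import org.apache.spark.streaming.dstream.DStream

    object ErrorRouting extends Serializable {
      // Stand-ins for the real processing logic and Kafka producer.
      def process(record: String): String = record.toUpperCase
      def sendToKafka(topic: String, payload: String): Unit = ()

      def handle(messages: DStream[String]): Unit = {
        messages.foreachRDD { rdd =>
          rdd.foreach { record =>
            try {
              sendToKafka("results-topic", process(record))
            } catch {
              case e: Exception =>
                // Generic handler: keep the batch alive and park the bad message aside.
                sendToKafka("error-topic", s"$record failed: ${e.getMessage}")
            }
          }
        }
      }
    }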

Spark Streaming on EC2: Exception in thread “main” java.lang.ExceptionInInitializerError

纵饮孤独 posted on 2019-12-12 08:05:39
Question: I am trying to run spark-submit on a jar file that I created. When I run it locally on my machine it works correctly, but when deployed onto Amazon EC2 it returns the following error. [root@ip-172-31-47-217 bin]$ ./spark-submit --master local[2] --class main.java.Streamer ~/streaming-project-1.0-jar-with-dependencies.jar Exception in thread "main" java.lang.ExceptionInInitializerError at org.apache.spark.streaming.StreamingContext$.<init>(StreamingContext.scala:728) at org.apache.spark

Exception in thread “main” org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243)

妖精的绣舞 posted on 2019-12-12 07:21:57
Question: I am getting an error when I am trying to run a Spark application with Cassandra. Exception in thread "main" org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). I am using Spark version 1.2.0 and it's clear that I am only using one Spark context in my application. But whenever I try to add the following code for streaming purposes I get this error. JavaStreamingContext activitySummaryScheduler = new JavaStreamingContext( sparkConf, new Duration
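A hedged sketch of the usual fix, shown in Scala for brevity (the Java API has the same constructor shape): build the streaming context on top of the SparkContext that already exists, instead of constructing a second context from SparkConf. The app name, host, port and batch interval are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SingleContextSketch {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("activity-summary")
        val sc = new SparkContext(sparkConf)            // the one SparkContext in this JVM
        // Passing sc reuses it; passing sparkConf here would try to create a second one.
        val ssc = new StreamingContext(sc, Seconds(1))
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.print()                                   // at least one output operation
        ssc.start()
        ssc.awaitTermination()
      }
    }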

Missing java system properties when running spark-streaming on Mesos cluster

和自甴很熟 posted on 2019-12-12 05:13:48
Question: I submit a Spark app to a Mesos cluster (running in cluster mode) and pass Java system properties through "--driver-java-options=-Dkey=value -Dkey=value", however these system properties are not available at runtime; it seems they are not set. --conf "spark.driver.extraJavaOptions=-Dkey=value" doesn't work either. More details: the command is bin/spark-submit --master mesos://10.3.101.119:7077 --deploy-mode cluster --class ${classname} --driver-java-options "-Dconfiguration.http=http://10.3.101.119
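One workaround often suggested for this (hedged; the key name below is made up for illustration) is to carry the value as a Spark configuration entry rather than a JVM system property, since --conf values with a spark. prefix are typically forwarded to the driver even in cluster mode, e.g. --conf spark.myapp.configuration.http=http://10.3.101.119, and then read it from SparkConf inside the app:

    import org.apache.spark.SparkConf

    object ConfigFromSparkConf {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()   // picks up every --conf spark.* passed to spark-submit
        // Read the value that would otherwise have been a -D system property.
        val httpEndpoint = conf.get("spark.myapp.configuration.http", "http://localhost")
        println(s"configuration.http = $httpEndpoint")
      }
    }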