spark-streaming

spark-submit failed with Spark Streaming wordcount Python code

Submitted by 旧街凉风 on 2019-12-21 23:53:36
Question: I just copied the Spark Streaming wordcount Python code and used spark-submit to run it on a Spark cluster, but it shows the following error: py4j.protocol.Py4JJavaError: An error occurred while calling o23.loadClass. : java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged
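This ClassNotFoundException for KafkaUtilsPythonHelper typically means the Spark Streaming Kafka integration jar was not on the classpath when the job was submitted. Below is a minimal, hedged sketch of the kind of Python consumer the question describes; the package coordinates, ZooKeeper address, group id, and topic name are assumptions and must match your own Spark and Kafka versions.

# Hypothetical submit command (coordinates and version are placeholders):
#   spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:<spark-version> wordcount.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 1)

# createStream needs the Kafka integration classes (including KafkaUtilsPythonHelper)
# on the JVM classpath, which is what the --packages/--jars flag provides
kvs = KafkaUtils.createStream(ssc, "zkhost:2181", "wordcount-group", {"test-topic": 1})
counts = kvs.map(lambda kv: kv[1]) \
    .flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()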

Spark Streaming DStream.reduceByKeyAndWindow doesn't work

Submitted by 会有一股神秘感。 on 2019-12-21 23:01:39
Question: I am using Apache Spark Streaming to do some real-time processing of my web service API logs. The source stream is just a series of API calls with return codes, and my Spark app mainly aggregates over the raw API call logs, counting how many API calls return a certain HTTP code. The batch interval on the source stream is 1 second. Then I do: inputStream.reduceByKey(_ + _) where inputStream is of type DStream[(String, Int)]. Now I get the result DStream level1. Then I do
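For windowed aggregation over such a keyed stream, reduceByKeyAndWindow is the usual tool; the inverse-reduce form requires checkpointing. This is a hedged PySpark sketch, not the asker's code: the socket source, line parsing, and the 60-second/10-second window sizes are assumptions.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="ApiCodeWindowCount")
ssc = StreamingContext(sc, 1)            # 1-second batch interval, as in the question
ssc.checkpoint("checkpoint-dir")         # required when using the inverse-reduce form

lines = ssc.socketTextStream("localhost", 9999)
codes = lines.map(lambda line: (line.split(" ")[-1], 1))   # (http_code, 1)

# 60-second window sliding every 10 seconds; the inverse function subtracts the
# counts of batches that fall out of the window instead of recomputing from scratch
windowed = codes.reduceByKeyAndWindow(lambda a, b: a + b,
                                      lambda a, b: a - b,
                                      60, 10)
windowed.pprint()

ssc.start()
ssc.awaitTermination()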

SparkContext.getOrCreate() purpose

Submitted by 落花浮王杯 on 2019-12-21 17:56:25
Question: What is the purpose of the getOrCreate method of the SparkContext class? I don't understand when we should use this method. If I have two Spark applications that are run with spark-submit, and in the main method I instantiate the Spark context with SparkContext.getOrCreate, will both apps have the same context? Or is the purpose simpler: when I create a Spark app and don't want to pass the Spark context as a parameter to a method, I can get it as a singleton
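Two applications launched with spark-submit run in separate driver JVMs, so they cannot share a context; getOrCreate is a singleton accessor within one application. A minimal sketch of that behaviour, with a hypothetical helper function word_lengths to illustrate the "no need to pass the context around" usage:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("getOrCreateDemo")
sc1 = SparkContext.getOrCreate(conf)   # creates the context on the first call
sc2 = SparkContext.getOrCreate()       # returns the existing context

assert sc1 is sc2   # same object inside this driver process

def word_lengths(path):
    # retrieve the singleton instead of threading the context through every call
    sc = SparkContext.getOrCreate()
    return sc.textFile(path).map(len)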

Kafka Spark directStream cannot get data

Submitted by 拟墨画扇 on 2019-12-21 17:44:01
Question: I'm using the Spark directStream API to read data from Kafka. My code is as follows: val sparkConf = new SparkConf().setAppName("testdirectStreaming") val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(2)) val kafkaParams = Map[String, String]( "auto.offset.reset" -> "smallest", "metadata.broker.list"->"10.0.0.11:9092", "spark.streaming.kafka.maxRatePerPartition"->"100" ) //I set the fromOffset of all 3 partitions to 0 var fromOffsets:Map[TopicAndPartition,
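For comparison, here is a hedged PySpark sketch of the same direct-stream setup; the topic name and partition count are assumptions. Note that spark.streaming.kafka.maxRatePerPartition is a Spark configuration (set it on SparkConf), not a Kafka consumer parameter, and that when fromOffsets is supplied, auto.offset.reset is not consulted.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

conf = SparkConf().setAppName("testdirectStreaming") \
    .set("spark.streaming.kafka.maxRatePerPartition", "100")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 2)

kafka_params = {"metadata.broker.list": "10.0.0.11:9092"}

# start all 3 partitions of the (assumed) topic at offset 0
from_offsets = {TopicAndPartition("mytopic", p): 0 for p in range(3)}

stream = KafkaUtils.createDirectStream(ssc, ["mytopic"], kafka_params,
                                       fromOffsets=from_offsets)
stream.count().pprint()

ssc.start()
ssc.awaitTermination()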

Spark Streaming Accumulated Word Count

Submitted by 懵懂的女人 on 2019-12-21 04:18:08
Question: This is a Spark Streaming program written in Scala. It counts the number of words from a socket every second. The result is the word count for each interval, for example the count from time 0 to 1, and then the count from time 1 to 2. But I wonder if there is some way we could alter this program so that we get an accumulated word count, that is, the word count from time 0 up until now. val sparkConf = new SparkConf().setAppName("NetworkWordCount") val ssc = new StreamingContext
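One common way to keep a running total across batches is updateStateByKey, which requires a checkpoint directory. This is a hedged PySpark sketch rather than a modification of the asker's Scala program; the socket source and checkpoint path are assumptions.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="AccumulatedNetworkWordCount")
ssc = StreamingContext(sc, 1)
ssc.checkpoint("checkpoint-dir")   # state is persisted and recovered from here

def update_total(new_counts, running_total):
    # new_counts: counts from the current batch; running_total: state from earlier batches
    return sum(new_counts) + (running_total or 0)

lines = ssc.socketTextStream("localhost", 9999)
totals = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .updateStateByKey(update_total)   # word count from time 0 up to the current batch
totals.pprint()

ssc.start()
ssc.awaitTermination()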

How to specify which java version to use in spark-submit command?

Submitted by 和自甴很熟 on 2019-12-21 03:47:16
Question: I want to run a Spark Streaming application on a YARN cluster on a remote server. The default Java version is 1.7, but I want to use 1.8 for my application, which is also installed on the server but is not the default. Is there a way to specify through spark-submit the location of Java 1.8 so that I do not get a major.minor version error? Answer 1: JAVA_HOME was not enough in our case; the driver was running in Java 8, but I discovered later that the Spark workers in YARN were launched using Java 7 (hadoop nodes
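On YARN, the application-master and executor JVMs can be pointed at a different JDK through the spark.yarn.appMasterEnv.JAVA_HOME and spark.executorEnv.JAVA_HOME properties. The sketch below shows these settings programmatically under an assumed Java 8 path; in practice they are normally passed as --conf flags to spark-submit, since setting them in code only takes effect in client mode before the SparkContext is created.

# Equivalent spark-submit flags (path is hypothetical):
#   --conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-1.8.0
#   --conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/lib/jvm/java-1.8.0
from pyspark import SparkConf, SparkContext

java8_home = "/usr/lib/jvm/java-1.8.0"   # hypothetical location of the Java 8 install

conf = (SparkConf()
        .setAppName("Java8OnYarn")
        .set("spark.executorEnv.JAVA_HOME", java8_home)
        .set("spark.yarn.appMasterEnv.JAVA_HOME", java8_home))

sc = SparkContext(conf=conf)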

How to convert Spark Streaming data into Spark DataFrame

Submitted by 两盒软妹~` on 2019-12-21 01:26:13
Question: So far Spark has not provided a DataFrame API for streaming data, but when I am doing anomaly detection it is more convenient and faster to use DataFrames for data analysis. I have done that part, but when I try to do real-time anomaly detection on streaming data the problems appear. I tried several ways and still could not convert the DStream to a DataFrame, and cannot convert the RDDs inside the DStream into DataFrames either. Here's part of my latest version of the code: import sys import re
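A common pattern is to convert each micro-batch RDD inside foreachRDD and run the DataFrame/SQL analysis per batch. This is a hedged sketch assuming each record is a single text line, which is not necessarily the asker's data format.

from pyspark import SparkContext
from pyspark.sql import SparkSession, Row
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamToDataFrame")
ssc = StreamingContext(sc, 1)
spark = SparkSession.builder.getOrCreate()

lines = ssc.socketTextStream("localhost", 9999)

def process(time, rdd):
    if rdd.isEmpty():
        return
    # build a DataFrame for this batch and analyse it with the DataFrame/SQL API
    rows = rdd.map(lambda line: Row(value=line))
    df = spark.createDataFrame(rows)
    df.createOrReplaceTempView("batch")
    spark.sql("SELECT count(*) AS n FROM batch").show()

lines.foreachRDD(process)

ssc.start()
ssc.awaitTermination()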

How to fix the "Connection reset by peer" message from apache-spark?

Submitted by 坚强是说给别人听的谎言 on 2019-12-20 18:03:26
Question: I keep getting the following exception very frequently and I wonder why this is happening. After researching I found I could do .set("spark.submit.deployMode", "nio"); but that did not work either, and I am using Spark 2.0.0. WARN TransportChannelHandler: Exception in connection from /172.31.3.245:46014 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil
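This warning usually surfaces when the other end of a connection goes away, for example an executor that was killed, paused by a long GC, or dropped after idling. A hedged sketch of common mitigations; whether they apply depends on the actual cause.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("ConnectionResetMitigation")
        .set("spark.network.timeout", "300s")             # default is 120s
        .set("spark.executor.heartbeatInterval", "30s"))  # must stay well below the timeout

sc = SparkContext(conf=conf)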

Spark streaming data sharing between batches

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-20 13:53:10
Question: Spark Streaming processes data in micro-batches. Each interval's data is processed in parallel using RDDs, without any data sharing between intervals. But my use case needs to share data between intervals. Consider the network WordCount example, which produces the count of all words received in that interval. How would I produce the following word counts? A relative count for the words "hadoop" and "spark" compared with the previous interval's count, and a normal word count for all other words. Note:
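One way to carry data across intervals is updateStateByKey. This hedged PySpark sketch keeps the previous interval's count for "hadoop" and "spark" in state and emits the change relative to it, while all other words get the plain per-interval count; the socket source, parsing, and checkpoint path are assumptions.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

TRACKED = {"hadoop", "spark"}

sc = SparkContext(appName="RelativeWordCount")
ssc = StreamingContext(sc, 1)
ssc.checkpoint("checkpoint-dir")   # required by updateStateByKey

def update(new_counts, prev):
    current = sum(new_counts)
    previous = prev[0] if prev else 0
    # state = (this interval's count, change relative to the previous interval)
    return (current, current - previous)

lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda l: l.split(" ")) \
    .map(lambda w: (w, 1)) \
    .reduceByKey(lambda a, b: a + b)

tracked = counts.filter(lambda kv: kv[0] in TRACKED) \
    .updateStateByKey(update) \
    .map(lambda kv: (kv[0], kv[1][1]))       # relative count vs previous interval
others = counts.filter(lambda kv: kv[0] not in TRACKED)

tracked.union(others).pprint()

ssc.start()
ssc.awaitTermination()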