spark-streaming

Spark worker cannot connect to Master

Submitted by 本小妞迷上赌 on 2019-12-19 10:14:38
Question: While starting the worker node I get the following error: Spark Command: /usr/lib/jvm/default-java/bin/java -cp /home/ubuntu/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/home/ubuntu/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/home/ubuntu/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/ubuntu/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/ubuntu/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms1g -Xmx1g -XX:MaxPermSize=256m

Continuously INFO JobScheduler:59 - Added jobs for time *** ms in my Spark Standalone Cluster

Submitted by a 夏天 on 2019-12-19 09:58:17
Question: We are working with a Spark Standalone cluster of 3 nodes with the same configuration, each with 8 cores and 32 GB of RAM. Sometimes a streaming batch completes in less than 1 second; sometimes it takes more than 10 seconds, and at those times the log below appears in the console. 2016-03-29 11:35:25,044 INFO TaskSchedulerImpl:59 - Removed TaskSet 18.0, whose tasks have all completed, from pool 2016-03-29 11:35:25,044 INFO DAGScheduler:59 - Job 18 finished: foreachRDD at EventProcessor.java:87, took 1.128755 s
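A minimal sketch (assuming ssc is the application's StreamingContext) of a StreamingListener that reports per-batch delays; a growing scheduling delay is what makes the "Added jobs for time ... ms" lines pile up, because batches are queued faster than they finish:

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Sketch only: ssc is assumed to be the application's StreamingContext.
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(completed: StreamingListenerBatchCompleted): Unit = {
    val info = completed.batchInfo
    // schedulingDelay: how long the batch waited in the queue before running;
    // processingDelay: how long the batch actually took to process.
    println(s"Batch ${info.batchTime}: " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processingDelay=${info.processingDelay.getOrElse(-1L)} ms")
  }
})
```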

Idiomatic way to use Spark DStream as Source for an Akka stream

Submitted by ﹥>﹥吖頭↗ on 2019-12-19 07:56:29
Question: I'm building a REST API that starts some calculation in a Spark cluster and responds with a chunked stream of the results. Given the Spark stream with calculation results, I can use dstream.foreachRDD() to send the data out of Spark. I'm sending the chunked HTTP response with akka-http: val requestHandler: HttpRequest => HttpResponse = { case HttpRequest(HttpMethods.GET, Uri.Path("/data"), _, _, _) => HttpResponse(entity = HttpEntity.Chunked(ContentTypes.`text/plain`, source)) } For
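The question is cut off before showing where `source` comes from. One common way to bridge a DStream into an Akka Streams Source is to materialize an actor-backed source and feed it from foreachRDD; this is a sketch, not the question's code, and the names `lines`, the buffer size, and the overflow strategy are assumptions:

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, OverflowStrategy}
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.util.ByteString

implicit val system = ActorSystem("spark-bridge")
implicit val materializer = ActorMaterializer()

// Materialize the actor-backed stream once: `ref` is the write side,
// `source` is what the chunked HttpEntity in the question consumes.
val (ref, publisher) =
  Source.actorRef[String](bufferSize = 1000, OverflowStrategy.dropHead)
    .map(line => ByteString(line + "\n"))
    .toMat(Sink.asPublisher(fanout = false))(Keep.both)
    .run()
val source: Source[ByteString, _] = Source.fromPublisher(publisher)

// lines: DStream[String] with the calculation results (assumed).
lines.foreachRDD { rdd =>
  // collect() runs on the driver; only reasonable for modest result sizes.
  rdd.collect().foreach(ref ! _)
}
```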

Could not parse Master URL: 'spark:http://localhost:18080'

Submitted by 眉间皱痕 on 2019-12-19 05:49:12
Question: When I try to run my code it throws this exception: Exception in thread "main" org.apache.spark.SparkException: Could not parse Master URL:spark:http://localhost:18080 This is my code: SparkConf conf = new SparkConf().setAppName("App_Name").setMaster("spark:http://localhost:18080").set("spark.ui.port","18080"); JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(1000)); String[] filet=new String[]{"Obama","ISI"}; JavaReceiverInputDStream<Status> reciverStream
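For reference, a standalone master URL must have the form spark://host:port, where the port is the cluster RPC port (7077 by default), not the web UI address; spark.ui.port only controls where the web UI listens. A minimal sketch in Scala (rather than the question's Java), assuming a local standalone master on the default port:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "spark://localhost:7077" assumes a standalone master running locally on
// the default port; adjust host and port to your cluster.
val conf = new SparkConf()
  .setAppName("App_Name")
  .setMaster("spark://localhost:7077")
  .set("spark.ui.port", "18080")   // the UI port is separate from the master URL

val ssc = new StreamingContext(conf, Seconds(1))
```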

Stop streaming context in Spark Streaming after a period of time

Submitted by 狂风中的少年 on 2019-12-19 05:23:20
Question: I am building an application that receives DStreams from Twitter, and the only way to stop the streaming context is by stopping the execution. I wonder if there is a way to set a time and terminate the streaming socket without stopping the entire application? Answer 1: You can use either awaitTerminationOrTimeout(long) as mentioned in the previous answer, or you can stop the streaming context manually from your other thread: // in the main thread awaitTermination(); // will wait forever or until the
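A sketch of the timeout variant mentioned in the answer; the 60-second value and the stop flags are assumptions, not from the question:

```scala
// ssc is the application's StreamingContext (assumed).
ssc.start()

// Block for at most 60 seconds; returns true if the context stopped on its own.
val stoppedOnItsOwn = ssc.awaitTerminationOrTimeout(60 * 1000L)

if (!stoppedOnItsOwn) {
  // stopSparkContext = false keeps the SparkContext alive for further work;
  // stopGracefully = true lets in-flight batches finish before shutting down.
  ssc.stop(stopSparkContext = false, stopGracefully = true)
}
```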

How to use Spark Streaming with Kafka with Kerberos?

Submitted by 拟墨画扇 on 2019-12-18 03:44:44
Question: I have run into some issues while trying to consume messages from Kafka with a Spark Streaming application in a Kerberized Hadoop cluster. I tried both of the approaches listed here: receiver-based approach: KafkaUtils.createStream direct approach (no receivers): KafkaUtils.createDirectStream The receiver-based approach (KafkaUtils.createStream) throws 2 types of exceptions (different exceptions depending on whether I am in local mode ( --master local[*] ) or in YARN mode ( --master yarn --deploy-mode
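The question is truncated, but for reference here is a hedged sketch of the direct approach using the Kafka 0.10 integration (spark-streaming-kafka-0-10), whose consumer accepts the SASL/Kerberos settings; the broker address, topic, and group id are placeholders, and the JAAS/keytab configuration that Kerberos also requires on the driver and executors is not shown:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Placeholder values; a real deployment also needs a JAAS config and keytab
// distributed to the driver and executors.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"          -> "broker1:9092",
  "key.deserializer"           -> classOf[StringDeserializer],
  "value.deserializer"         -> classOf[StringDeserializer],
  "group.id"                   -> "example-group",
  "security.protocol"          -> "SASL_PLAINTEXT",   // Kerberos over SASL
  "sasl.kerberos.service.name" -> "kafka"
)

// ssc is an existing StreamingContext (assumed).
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("example-topic"), kafkaParams)
)
```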

Limit Kafka batches size when using Spark Streaming

Submitted by 荒凉一梦 on 2019-12-17 10:46:34
Question: Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming? I am asking because the first batch I get has hundreds of millions of records and it takes ages to process and checkpoint them. Answer 1: I think your problem can be solved by Spark Streaming Backpressure. Check spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate. By default spark.streaming.backpressure.initialRate is not set and spark.streaming.backpressure.enabled is
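A configuration sketch for the settings named in the answer, plus spark.streaming.kafka.maxRatePerPartition as a hard per-partition cap for the direct Kafka stream; the numeric values are placeholders, not recommendations:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("rate-limited-streaming")
  // Let Spark adapt the ingestion rate to the observed processing speed.
  .set("spark.streaming.backpressure.enabled", "true")
  // Rate used for the very first batch, before backpressure has any feedback.
  .set("spark.streaming.backpressure.initialRate", "1000")
  // Hard upper bound in records/second/partition for the direct Kafka stream.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
```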

How to use both Scala and Python in the same Spark project?

Submitted by 那年仲夏 on 2019-12-17 10:30:51
Question: Is it possible to pipe a Spark RDD to Python? I need a Python library to do some calculations on my data, but my main Spark project is based on Scala. Is there a way to mix them both or let Python access the same Spark context? Answer 1: You can indeed pipe out to a Python script using Scala, Spark, and a regular Python script. test.py #!/usr/bin/python import sys for line in sys.stdin: print "hello " + line spark-shell (scala) val data = List("john","paul","george","ringo") val dataRDD =
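Completing the answer's spark-shell snippet with RDD.pipe (a sketch; the script path is an assumption, and the script must be executable and available on every worker):

```scala
// spark-shell, continuing the answer's snippet.
val data = List("john", "paul", "george", "ringo")
val dataRDD = sc.parallelize(data)

// Each element is written to the script's stdin as one line; each line the
// script prints becomes one element of the resulting RDD[String].
val piped = dataRDD.pipe("/path/to/test.py")

piped.collect().foreach(println)   // hello john, hello paul, ...
```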

Spark DataFrame: does groupBy after orderBy maintain that order?

Submitted by 不想你离开。 on 2019-12-17 07:31:10
Question: I have a Spark 2.0 dataframe example with the following structure: id, hour, count id1, 0, 12 id1, 1, 55 .. id1, 23, 44 id2, 0, 12 id2, 1, 89 .. id2, 23, 34 etc. It contains 24 entries for each id (one for each hour of the day) and is ordered by id, hour using the orderBy function. I have created an Aggregator groupConcat: def groupConcat(separator: String, columnToConcat: Int) = new Aggregator[Row, String, String] with Serializable { override def zero: String = "" override def reduce(b:
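The snippet is cut off, but as a hedged alternative to relying on orderBy-then-groupBy ordering (which is what the question asks about and is not guaranteed), the grouped values can be sorted explicitly inside the aggregation; column names follow the question's id/hour/count schema, and df is the assumed DataFrame:

```scala
import org.apache.spark.sql.functions.{col, collect_list, sort_array, struct}

// Collect (hour, count) pairs per id and sort them by hour inside each group,
// so the per-id ordering does not depend on how groupBy shuffles the rows.
val hourlyPerId = df
  .groupBy(col("id"))
  .agg(sort_array(collect_list(struct(col("hour"), col("count")))).as("hourly"))
```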