spark-streaming

How to stop a notebook streaming job gracefully?

泪湿孤枕 submitted on 2020-04-16 13:51:46
Question: I have a streaming application running as a Databricks notebook job (https://docs.databricks.com/jobs.html). I would like to be able to stop the streaming job gracefully using the stop() method of the StreamingQuery class, which is returned by the stream.start() method. That of course requires access either to the mentioned streaming query instance or to the context of the running job itself. In the second case the code could look like this: spark.sqlContext.streams.get(
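One approach that fits the truncated snippet above (a sketch, not the poster's full code): give the query an explicit name when starting it, then look it up later through the session's StreamingQueryManager and stop it from any cell or job step. The name "myQuery" is a placeholder.

```scala
// When starting the stream, name the query so it can be found later:
//   val query = df.writeStream.queryName("myQuery").format("console").start()

// From the running job's context, look the query up and stop it:
spark.streams.active
  .find(_.name == "myQuery")     // all active queries on this SparkSession
  .foreach { q =>
    q.processAllAvailable()      // optionally drain already-arrived data first
    q.stop()                     // stops this query without killing the whole job
  }
```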

Spark Streaming - DStream does not have distinct()

北城余情 submitted on 2020-04-13 17:50:07
Question: I want to count the distinct values of some type of IDs represented as an RDD. In the non-streaming case it's fairly straightforward. Say IDs is an RDD of IDs read from a flat file:

print("number of unique IDs %d" % (IDs.distinct().count()))

But I can't seem to do the same thing in the streaming case. Say streamIDs is a DStream of IDs read from the network:

print("number of unique IDs from stream %d" % (streamIDs.distinct().count()))

This gives me the error: AttributeError:
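A sketch of the usual fix (in Scala, for consistency with the other examples here; the question itself is in Python): DStream has no distinct(), but transform() applies an arbitrary RDD-to-RDD function to every micro-batch, so distinct() can be pushed down to the per-batch RDDs. The socket source and port are placeholders.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))            // assumes an existing SparkContext `sc`
val streamIDs = ssc.socketTextStream("localhost", 9999)   // hypothetical network source

// Apply distinct() to each micro-batch RDD via transform():
val distinctPerBatch = streamIDs.transform(rdd => rdd.distinct())
distinctPerBatch.foreachRDD { rdd =>
  println(s"number of unique IDs in this batch: ${rdd.count()}")
}

ssc.start()
ssc.awaitTermination()
```

Note this deduplicates within each batch only; uniqueness across the whole stream would need a stateful operation such as updateStateByKey.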

Unable to get any data when running a Spark Streaming program with textFileStream as the source

徘徊边缘 submitted on 2020-03-24 00:02:28
Question: I am running the following code in the Spark shell (spark-shell):

scala> import org.apache.spark.streaming._
import org.apache.spark.streaming._
scala> import org.apache.spark._
import org.apache.spark._
scala> object sparkClient {
     |   def main(args: Array[String]) {
     |     val ssc = new StreamingContext(sc, Seconds(1))
     |     val Dstreaminput = ssc.textFileStream("hdfs:///POC/SPARK/DATA/*")
     |     val transformed = Dstreaminput.flatMap(word => word.split(" "))
     |     val mapped = transformed.map(word => if (word.contains(
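Several things commonly cause an empty textFileStream here: defining object sparkClient in the shell never actually invokes main, so ssc.start() is never reached; a wildcard suffix like /* may prevent the directory from being matched in some versions (textFileStream monitors a directory); and only files that appear in the directory after the context starts are picked up. A minimal working sketch, assuming files are atomically moved into the directory while the context is running:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.textFileStream("hdfs:///POC/SPARK/DATA")   // the directory itself, no /* glob
val words = lines.flatMap(_.split(" "))
words.print()

ssc.start()             // nothing is read until the context is started
ssc.awaitTermination()  // only files moved in after start() are processed
```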

Cannot call methods on a stopped SparkContext

三世轮回 submitted on 2020-03-22 10:34:36
Question: When I run the following test, it throws "Cannot call methods on a stopped SparkContext". The problem is possibly that I use TestSuiteBase together with a streaming Spark context. At the line val gridEvalsRDD = ssc.sparkContext.parallelize(gridEvals) I need a SparkContext, which I access via ssc.sparkContext, and that is where the problem occurs (see the warning and error messages below):

class StreamingTest extends TestSuiteBase with BeforeAndAfter {
  test("Test 1") {
    //...
    val gridEvals = for
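A common way around this (a sketch, assuming plain ScalaTest rather than Spark's internal TestSuiteBase, which manages and stops its own contexts between operations) is to own the SparkContext lifecycle in the test itself:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfter, FunSuite}

class GridEvalTest extends FunSuite with BeforeAndAfter {
  private var sc: SparkContext = _

  before {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("grid-eval-test"))
  }

  after {
    sc.stop()   // stop the context yourself, only after all RDD work is finished
  }

  test("Test 1") {
    val gridEvals = Seq(1.0, 2.0, 3.0)            // placeholder for the real grid results
    val gridEvalsRDD = sc.parallelize(gridEvals)  // safe: this context is still running
    assert(gridEvalsRDD.count() == 3)
  }
}
```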

Shut down Spark Structured Streaming gracefully

泪湿孤枕 submitted on 2020-03-22 06:56:09
Question: There is a way to enable graceful shutdown of Spark Streaming by setting the property spark.streaming.stopGracefullyOnShutdown to true and then killing the process with kill -SIGTERM. However, I don't see such an option available for Structured Streaming (SQLContext.scala). Is the shutdown process different in Structured Streaming, or is it simply not implemented yet?

Answer 1: This feature is not implemented yet. But the write-ahead logs of Spark Structured Streaming claim to recover state and
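The answer above notes the built-in flag is missing; a workaround pattern often shared for Structured Streaming (a sketch; the marker path and `df` are placeholders) is to poll for an external stop signal and stop the query only when no micro-batch is in flight:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val stopMarker = new Path("/tmp/stop_streaming")             // hypothetical marker file
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

val query = df.writeStream.format("console").start()        // `df` is your streaming DataFrame

while (query.isActive && !fs.exists(stopMarker)) {
  query.awaitTermination(10000)                              // wait up to 10s, then re-check
}
if (query.isActive) {
  while (query.status.isTriggerActive) Thread.sleep(500)     // let the current batch finish
  query.stop()                                               // then stop between triggers
}
```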

How to partition data dynamically in this use case

一世执手 submitted on 2020-03-21 11:05:53
Question: I am using spark-sql version 2.4.1. I have code something like the below, for the following scenario:

val superDataset = // load the whole dataset of student marks records ... assume we have 10 years of data
val selectedYrsDataset = superDataset.repartition("--GivenYears--") // i.e. the given years are 2010, 2011

On the selectedYrsDataset I need to calculate the year-wise toppers overall: country-wise, state-wise, and college-wise. How to do this kind of use case? Is there any possibility of doing it dynamic
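One way to express "year-wise toppers per country/state/college" without repartitioning by hand (a sketch; the column names year, country, state, college, and marks are assumptions about the schema) is window ranking:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// One window per grouping level, each partitioned by year as well:
val byCountry = Window.partitionBy("year", "country").orderBy(col("marks").desc)
val byState   = Window.partitionBy("year", "state").orderBy(col("marks").desc)
val byCollege = Window.partitionBy("year", "college").orderBy(col("marks").desc)

val ranked = selectedYrsDataset
  .withColumn("countryRank", rank().over(byCountry))
  .withColumn("stateRank",   rank().over(byState))
  .withColumn("collegeRank", rank().over(byCollege))

ranked.filter(col("countryRank") === 1).show()   // country-wise toppers per year
```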

How to process/run list items in parallel in Spark?

让人想犯罪 __ submitted on 2020-03-09 05:29:11
Question: I am using spark-sql version 2.4.1 with Java 8 in my PoC. I have the following student data, standard/class-wise, as below:

public static class Student implements Serializable {
    private String className;
    private String studentName;
    private Integer paperOneMarks;
    private Integer paperTwoMarks;
    private Integer paperThreeMarks;
    private Integer paperFourMarks;
    public Student(String className, String studentName, Integer paperOneMarks, Integer paperTwoMarks, Integer paperThreeMarks, Integer
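A sketch of the usual answer (in Scala, to match the other examples; the question itself is in Java): put the list into a Dataset so Spark parallelizes the per-class work across partitions, instead of looping over the list on the driver. The rows below are placeholder data, and a SparkSession named `spark` is assumed.

```scala
import spark.implicits._

case class Student(className: String, studentName: String,
                   paperOneMarks: Int, paperTwoMarks: Int,
                   paperThreeMarks: Int, paperFourMarks: Int)

val students = Seq(
  Student("std-1", "A", 75, 80, 62, 90),
  Student("std-1", "B", 60, 70, 72, 55),
  Student("std-2", "C", 88, 92, 81, 79)
).toDS()

// A single distributed job computes per-class averages in parallel:
students.groupBy("className")
  .avg("paperOneMarks", "paperTwoMarks", "paperThreeMarks", "paperFourMarks")
  .show()
```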