spark-streaming

How to stop a notebook streaming job gracefully?

泪湿孤枕 submitted on 2020-04-16 13:51:46
Question: I have a streaming application running as a Databricks notebook job (https://docs.databricks.com/jobs.html). I would like to be able to stop the streaming job gracefully using the stop() method of the StreamingQuery class, which is returned by the stream.start() method. That of course requires access either to the mentioned streaming query instance or to the context of the running job itself. In the second case the code could look like this: spark.sqlContext.streams.get(
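One approach that fits the truncated snippet above (a sketch, not the poster's full code): give the query an explicit name when starting it, then look it up later through the session's StreamingQueryManager and stop it from any cell or job step. The name "myQuery" is a placeholder.

```scala
// When starting the stream, name the query so it can be found later:
//   val query = df.writeStream.queryName("myQuery").format("console").start()

// From the running job's context, look the query up and stop it:
spark.streams.active
  .find(_.name == "myQuery")     // all active queries on this SparkSession
  .foreach { q =>
    q.processAllAvailable()      // optionally drain already-arrived data first
    q.stop()                     // stops this query without killing the whole job
  }
```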

Spark Streaming - DStream does not have distinct()

北城余情 submitted on 2020-04-13 17:50:07
Question: I want to count the distinct values of some type of IDs represented as an RDD. In the non-streaming case it's fairly straightforward. Say IDs is an RDD of IDs read from a flat file:

print("number of unique IDs %d" % (IDs.distinct().count()))

But I can't seem to do the same thing in the streaming case. Say streamIDs is a DStream of IDs read from the network:

print("number of unique IDs from stream %d" % (streamIDs.distinct().count()))

This gives me the error: AttributeError:
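A sketch of the usual fix (in Scala, for consistency with the other examples here; the question itself is in Python): DStream has no distinct(), but transform() applies an arbitrary RDD-to-RDD function to every micro-batch, so distinct() can be pushed down to the per-batch RDDs. The socket source and port are placeholders.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))            // assumes an existing SparkContext `sc`
val streamIDs = ssc.socketTextStream("localhost", 9999)   // hypothetical network source

// Apply distinct() to each micro-batch RDD via transform():
val distinctPerBatch = streamIDs.transform(rdd => rdd.distinct())
distinctPerBatch.foreachRDD { rdd =>
  println(s"number of unique IDs in this batch: ${rdd.count()}")
}

ssc.start()
ssc.awaitTermination()
```

Note this deduplicates within each batch only; uniqueness across the whole stream would need a stateful operation such as updateStateByKey.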

Unable to get any data when running a Spark Streaming program with textFileStream as the source

徘徊边缘 submitted on 2020-03-24 00:02:28
Question: I am running the following code in the Spark shell (spark-shell):

scala> import org.apache.spark.streaming._
import org.apache.spark.streaming._
scala> import org.apache.spark._
import org.apache.spark._
scala> object sparkClient {
     |   def main(args: Array[String]) {
     |     val ssc = new StreamingContext(sc, Seconds(1))
     |     val Dstreaminput = ssc.textFileStream("hdfs:///POC/SPARK/DATA/*")
     |     val transformed = Dstreaminput.flatMap(word => word.split(" "))
     |     val mapped = transformed.map(word => if (word.contains(
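Several things commonly cause an empty textFileStream here: defining object sparkClient in the shell never actually invokes main, so ssc.start() is never reached; a wildcard suffix like /* may prevent the directory from being matched in some versions (textFileStream monitors a directory); and only files that appear in the directory after the context starts are picked up. A minimal working sketch, assuming files are atomically moved into the directory while the context is running:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.textFileStream("hdfs:///POC/SPARK/DATA")   // the directory itself, no /* glob
val words = lines.flatMap(_.split(" "))
words.print()

ssc.start()             // nothing is read until the context is started
ssc.awaitTermination()  // only files moved in after start() are processed
```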

Cannot call methods on a stopped SparkContext

三世轮回 submitted on 2020-03-22 10:34:36
Question: When I run the following test, it throws "Cannot call methods on a stopped SparkContext". The problem is possibly that I use TestSuiteBase together with a streaming Spark context. At the line val gridEvalsRDD = ssc.sparkContext.parallelize(gridEvals) I need a SparkContext, which I access via ssc.sparkContext, and that is where the problem occurs (see the warning and error messages below):

class StreamingTest extends TestSuiteBase with BeforeAndAfter {
  test("Test 1") {
    //...
    val gridEvals = for
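A common way around this (a sketch, assuming plain ScalaTest rather than Spark's internal TestSuiteBase, which manages and stops its own contexts between operations) is to own the SparkContext lifecycle in the test itself:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfter, FunSuite}

class GridEvalTest extends FunSuite with BeforeAndAfter {
  private var sc: SparkContext = _

  before {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("grid-eval-test"))
  }

  after {
    sc.stop()   // stop the context yourself, only after all RDD work is finished
  }

  test("Test 1") {
    val gridEvals = Seq(1.0, 2.0, 3.0)            // placeholder for the real grid results
    val gridEvalsRDD = sc.parallelize(gridEvals)  // safe: this context is still running
    assert(gridEvalsRDD.count() == 3)
  }
}
```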

Shut down Spark Structured Streaming gracefully

泪湿孤枕 submitted on 2020-03-22 06:56:09
Question: There is a way to enable graceful shutdown of Spark Streaming by setting the property spark.streaming.stopGracefullyOnShutdown to true and then killing the process with kill -SIGTERM. However, I don't see such an option available for Structured Streaming (SQLContext.scala). Is the shutdown process different in Structured Streaming, or is it simply not implemented yet?

Answer 1: This feature is not implemented yet. But the write-ahead logs of Spark Structured Streaming claim to recover state and
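The answer above notes the built-in flag is missing; a workaround pattern often shared for Structured Streaming (a sketch; the marker path and `df` are placeholders) is to poll for an external stop signal and stop the query only when no micro-batch is in flight:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val stopMarker = new Path("/tmp/stop_streaming")             // hypothetical marker file
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

val query = df.writeStream.format("console").start()        // `df` is your streaming DataFrame

while (query.isActive && !fs.exists(stopMarker)) {
  query.awaitTermination(10000)                              // wait up to 10s, then re-check
}
if (query.isActive) {
  while (query.status.isTriggerActive) Thread.sleep(500)     // let the current batch finish
  query.stop()                                               // then stop between triggers
}
```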

How to partition data dynamically in this use case

一世执手 submitted on 2020-03-21 11:05:53
Question: I am using spark-sql version 2.4.1. I have code something like the below, for the following scenario:

val superDataset = // load the whole dataset of student marks records ... assume we have 10 years of data
val selectedYrsDataset = superDataset.repartition("--GivenYears--") // i.e. the given years are 2010, 2011

On the selectedYrsDataset I need to calculate the year-wise toppers overall: country-wise, state-wise, and college-wise. How to do this kind of use case? Is there any possibility of doing it dynamic
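One way to express "year-wise toppers per country/state/college" without repartitioning by hand (a sketch; the column names year, country, state, college, and marks are assumptions about the schema) is window ranking:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// One window per grouping level, each partitioned by year as well:
val byCountry = Window.partitionBy("year", "country").orderBy(col("marks").desc)
val byState   = Window.partitionBy("year", "state").orderBy(col("marks").desc)
val byCollege = Window.partitionBy("year", "college").orderBy(col("marks").desc)

val ranked = selectedYrsDataset
  .withColumn("countryRank", rank().over(byCountry))
  .withColumn("stateRank",   rank().over(byState))
  .withColumn("collegeRank", rank().over(byCollege))

ranked.filter(col("countryRank") === 1).show()   // country-wise toppers per year
```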

How to process/run list items in parallel in Spark?

让人想犯罪 __ submitted on 2020-03-09 05:29:11
Question: I am using spark-sql version 2.4.1 with Java 8 in my PoC. I have the following student data, standard/class-wise, as below:

public static class Student implements Serializable {
    private String className;
    private String studentName;
    private Integer paperOneMarks;
    private Integer paperTwoMarks;
    private Integer paperThreeMarks;
    private Integer paperFourMarks;
    public Student(String className, String studentName, Integer paperOneMarks, Integer paperTwoMarks, Integer paperThreeMarks, Integer
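A sketch of the usual answer (in Scala, to match the other examples; the question itself is in Java): put the list into a Dataset so Spark parallelizes the per-class work across partitions, instead of looping over the list on the driver. The rows below are placeholder data, and a SparkSession named `spark` is assumed.

```scala
import spark.implicits._

case class Student(className: String, studentName: String,
                   paperOneMarks: Int, paperTwoMarks: Int,
                   paperThreeMarks: Int, paperFourMarks: Int)

val students = Seq(
  Student("std-1", "A", 75, 80, 62, 90),
  Student("std-1", "B", 60, 70, 72, 55),
  Student("std-2", "C", 88, 92, 81, 79)
).toDS()

// A single distributed job computes per-class averages in parallel:
students.groupBy("className")
  .avg("paperOneMarks", "paperTwoMarks", "paperThreeMarks", "paperFourMarks")
  .show()
```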