spark-streaming

Spark 2.0.0 streaming job packaged with sbt-assembly lacks Scala runtime methods

Submitted by £可爱£侵袭症+ on 2019-12-24 07:19:10
Question: When using -> in Spark Streaming 2.0.0 jobs, or when using spark-streaming-kafka-0-8_2.11 v2.0.0, and submitting the job with spark-submit, I get the following error: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 72.0 failed 1 times, most recent failure: Lost task 0.0 in stage 72.0 (TID 37, localhost): java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object; I put a brief illustration of this phenomenon…
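A common cause of this NoSuchMethodError is a Scala version mismatch: bytecode compiled against Scala 2.11 ending up on a Scala 2.10 runtime (or vice versa), for example because the assembly or the cluster carries a conflicting scala-library. A minimal build.sbt sketch under that assumption (the project name and exact library versions are illustrative, not taken from the question) pins the Scala version to the 2.11 line Spark 2.0.0 is built with and marks Spark itself as provided so sbt-assembly does not bundle a second copy:

```scala
// build.sbt -- illustrative sketch; project name and versions are assumptions
name := "streaming-job"

scalaVersion := "2.11.8"  // must match the Scala build of the Spark 2.0.0 distribution

libraryDependencies ++= Seq(
  // Spark classes come from the cluster at runtime, so keep them out of the fat jar
  "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided",
  // The Kafka integration is usually not on the cluster classpath, so it stays in the assembly
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
)
```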

Kafka OffsetOutOfRangeException

Submitted by 你离开我真会死。 on 2019-12-24 06:49:27
Question: I am streaming loads of data through Kafka, and Spark Streaming is consuming these messages. Down the line, Spark Streaming throws this error: kafka.common.OffsetOutOfRangeException. I am aware of what this error means, so I changed the retention policy to 5 days. However, I still encountered the same issue. I then listed all the messages for a topic using --from-beginning in Kafka. Sure enough, a ton of messages from the beginning of the Kafka streaming part were…
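For context, this exception means the consumer asked for an offset that retention has already deleted. A minimal sketch of the consumer-side configuration, assuming the job uses the direct approach from spark-streaming-kafka-0-8 (broker address and topic name are placeholders; this shows where the reset policy is set, not a guaranteed fix for offsets that expire mid-run):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// ssc is the job's StreamingContext; broker list and topic are placeholders.
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "localhost:9092",
  // When no valid stored offset exists, start from the earliest offset still
  // retained ("smallest" is the 0.8 consumer value) instead of failing.
  "auto.offset.reset" -> "smallest"
)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))
```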

Spark Streaming Filtering the Streaming data

Submitted by 余生颓废 on 2019-12-24 06:48:44
Question: I am trying to filter streaming data, and based on the value of the id column I want to save the data to different tables. I have two tables: testTable_odd (id, data1, data2) and testTable_even (id, data1). If the id value is odd I want to save the record to testTable_odd, and if the value is even I want to save the record to testTable_even. The tricky part here is that my two tables have different columns. I have tried multiple ways and considered Scala functions with a return type of Either[obj1, obj2], but I…
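One straightforward option is to split the stream with two filters rather than returning an Either. A minimal sketch under the assumption of a simple case class for parsed records; the Record shape and the saveOdd/saveEven stubs are hypothetical stand-ins for whatever write path the job actually uses:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record shape; the real schema comes from the actual stream.
case class Record(id: Long, data1: String, data2: String)

// Stub writers standing in for the real JDBC/Hive code.
def saveOdd(r: Record): Unit = println(s"testTable_odd  <- $r")
def saveEven(p: (Long, String)): Unit = println(s"testTable_even <- $p")

def route(records: DStream[Record]): Unit = {
  // Odd ids keep all three columns; even ids are projected down to (id, data1).
  val odd  = records.filter(_.id % 2 != 0)
  val even = records.filter(_.id % 2 == 0).map(r => (r.id, r.data1))

  odd.foreachRDD(_.foreach(saveOdd))
  even.foreachRDD(_.foreach(saveEven))
}
```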

Converting pipe-delimited file to spark dataframe to CSV file

Submitted by 送分小仙女□ on 2019-12-24 06:44:34
Question: I have a CSV file with one single column, and the rows are defined as follows: 123 || food || fruit 123 || food || fruit || orange 123 || food || fruit || apple I want to create a CSV file with a single column and distinct row values, such as: orange apple I tried using the following code: val data = sc.textFile("fruits.csv") val rows = data.map(_.split("||")) val rddnew = rows.flatMap( arr => { val text = arr(0) val words = text.split("||") words.map( word => ( word, text ) ) } ) But this code…
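The core issue in the quoted code is that String.split takes a regular expression, so "||" means an empty alternation and splits between every character; the pipes must be escaped. A minimal sketch assuming the goal is the distinct tokens that appear after the "123 || food || fruit" prefix (the output path is a placeholder):

```scala
// "fruits.csv" comes from the question; the rest is a sketch.
val data = sc.textFile("fruits.csv")

val distinctExtras = data
  .map(_.split("\\|\\|").map(_.trim))   // escape the pipes: split() expects a regex
  .flatMap(tokens => tokens.drop(3))    // keep only what follows "123 || food || fruit"
  .filter(_.nonEmpty)
  .distinct()

distinctExtras.saveAsTextFile("fruits_distinct_out")   // placeholder output path
```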

Detecting a lost connection in Spark Streaming

Submitted by 寵の児 on 2019-12-24 02:10:01
Question: I am currently working with Apache Spark Streaming. I want to know how to detect whether the connection to the external data source has been lost, so that we can stop streaming and reconnect to the data source. Thanks in advance for any help. Answer 1: Add a listener to the receiver you have and stop the streaming context when the receiver has stopped. Example: streamContext.addStreamingListener(new StreamingListener() { @Override public void onReceiverStopped…
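A Scala sketch of the same listener approach (the ssc variable name is a placeholder; stopping a context from inside its own listener callback may need extra care in practice, e.g. setting a flag that the driver loop checks instead):

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerReceiverStopped}

def watchReceiver(ssc: StreamingContext): Unit = {
  ssc.addStreamingListener(new StreamingListener {
    override def onReceiverStopped(stopped: StreamingListenerReceiverStopped): Unit = {
      // The receiver lost its source (or was shut down); stop streaming so the
      // driver can rebuild the connection and start a fresh context.
      ssc.stop(stopSparkContext = false, stopGracefully = true)
    }
  })
}
```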

How to package a Spark Scala application

Submitted by 家住魔仙堡 on 2019-12-24 01:56:13
Question: I have developed a standalone Spark Scala application which uses Spark SQL and Spark Streaming. It works fine in Eclipse, which is configured for Spark. I am a newbie with Maven. To package this application using Maven, I followed the tutorial below: http://ryancompton.net/2014/05/19/sample-pomxml-to-build-scala--jar-with-dependenciesjar/ But I ended up with the following error: [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.5:add-source (scala-compile-first) on project…
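Failures in the add-source goal usually come from the scala-maven-plugin not finding a Scala compiler/library consistent with the project setup. A hedged pom.xml fragment (the Scala version and phase binding are illustrative assumptions; the execution id and plugin version come from the error message) showing the usual pairing of an explicit scala-library dependency with that plugin execution:

```xml
<!-- Illustrative fragment only; the project's own coordinates are omitted. -->
<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.4</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.1.5</version>
      <executions>
        <execution>
          <id>scala-compile-first</id>
          <phase>process-resources</phase>
          <goals>
            <goal>add-source</goal>
            <goal>compile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```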

Spark RDD write to HBase

Submitted by 烂漫一生 on 2019-12-24 01:24:22
Question: I am able to read messages from Kafka using the code below: val ssc = new StreamingContext(sc, Seconds(50)) val topicmap = Map("test" -> 1) val lines = KafkaUtils.createStream(ssc,"127.0.0.1:2181", "test-consumer-group",topicmap) But I am trying to read each message from Kafka and put it into HBase. This is my code to write into HBase, but with no success: lines.foreachRDD(rdd => { rdd.foreach(record => { val i = +1 val hConf = new HBaseConfiguration() val hTable = new HTable(hConf, "test")…
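A common pattern, sketched here rather than taken from an accepted answer, is to open the HBase table once per partition instead of once per record and build a Put per message. The table name "test" comes from the question; the column family "cf" and qualifier "col" are assumptions of this sketch:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

// lines is the (key, value) DStream created by KafkaUtils.createStream above.
lines.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One connection per partition instead of per record.
    val hConf  = HBaseConfiguration.create()
    val hTable = new HTable(hConf, "test")   // old-style client API, as in the question
    try {
      records.foreach { case (key, value) =>
        // Assumes the Kafka message key is non-null and usable as the row key.
        val put = new Put(Bytes.toBytes(key))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        hTable.put(put)
      }
    } finally {
      hTable.close()
    }
  }
}
```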

Storing a DataFrame to a Hive partitioned table in Spark

Submitted by 拥有回忆 on 2019-12-24 01:23:40
Question: I'm trying to store a stream of data coming in from a Kafka topic into a Hive partitioned table. I was able to convert the DStream to a DataFrame and created a HiveContext. My code looks like this: val hiveContext = new HiveContext(sc) hiveContext.setConf("hive.exec.dynamic.partition", "true") hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict") newdf.registerTempTable("temp") //newdf is my dataframe newdf.write.mode(SaveMode.Append).format("osv").partitionBy("date")…
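A minimal sketch of one way to append a DataFrame into a partitioned Hive table with dynamic partitioning enabled. The HiveContext setup mirrors the question; the ORC format and the target table name are assumptions of this sketch (the "osv" string in the quoted code is left as the asker wrote it):

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// newdf is the DataFrame built from the DStream, as in the question.
newdf.write
  .mode(SaveMode.Append)
  .format("orc")                          // format choice is an assumption of this sketch
  .partitionBy("date")
  .saveAsTable("my_partitioned_table")    // placeholder Hive table name
```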

KafkaStreams EXACTLY_ONCE guarantee - skipping kafka offsets

Submitted by 你。 on 2019-12-24 00:59:32
Question: I'm using Spark 2.2.0 and the Kafka 0.10 spark-streaming library to read from a topic filled by a Kafka Streams Scala application. The Kafka broker version is 0.11 and the Kafka Streams version is 0.11.0.2. When I set the EXACTLY_ONCE guarantee in the Kafka Streams app: p.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE) I get this error in Spark: java.lang.AssertionError: assertion failed: Got wrong record for spark-executor-<group.id> <topic> 0 even after seeking to offset 24 at scala…
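Background for the assertion: transactional producers (which EXACTLY_ONCE enables) write commit markers into the topic, so consumed offsets are no longer consecutive. A sketch of the Spark consumer side, assuming the direct stream API from spark-streaming-kafka-0-10 (broker, group id and topic are placeholders); it shows where the isolation.level setting lives, not a claim that it resolves the assertion on Spark 2.2:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Broker address, group id and topic are placeholders for this sketch.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-consumer",
  // Only read records from committed transactions; the transaction markers
  // still leave gaps in the offsets, which is what the assertion is about.
  "isolation.level"    -> "read_committed"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Set("my-topic"), kafkaParams)
)
```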

Spark Streaming + Hive

Submitted by ε祈祈猫儿з on 2019-12-24 00:42:31
Question: We are in the process of building an application that takes data from a source system through Flume and then, via the Kafka messaging system, into Spark Streaming for in-memory processing; after processing the data into a data frame we will put the data into Hive tables. The flow will be as follows: Source System -> Flume -> Kafka -> Spark Streaming -> Hive. Is this the correct flow, or do we need to review it? We are taking the discretized stream and converting it into a data frame for SQL-compatible functions. Now we have 14…
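A minimal sketch of the DStream-to-DataFrame-to-Hive step described above. The record shape, the table name, and the use of SparkSession with Hive support are assumptions of the sketch, not details from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

// Hypothetical parsed message; real fields depend on the source system.
case class Event(id: Long, payload: String)

def persist(events: DStream[Event]): Unit = {
  events.foreachRDD { rdd =>
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    import spark.implicits._

    val df = rdd.toDF()                                   // micro-batch -> DataFrame
    df.write.mode("append").saveAsTable("events_table")   // placeholder Hive table
  }
}
```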