spark-streaming

Spark 2.0.0 streaming job packaged with sbt-assembly lacks Scala runtime methods

Submitted by £可爱£侵袭症+ on 2019-12-24 07:19:10
Question: When using -> in Spark Streaming 2.0.0 jobs, or when using spark-streaming-kafka-0-8_2.11 v2.0.0, and submitting the job with spark-submit, I get the following error: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 72.0 failed 1 times, most recent failure: Lost task 0.0 in stage 72.0 (TID 37, localhost): java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object; I put a brief illustration of this phenomenon…
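A common cause of this NoSuchMethodError is a Scala version mismatch: bytecode compiled against Scala 2.11 ending up on a Scala 2.10 runtime (or vice versa), for example because the assembly or the cluster carries a conflicting scala-library. A minimal build.sbt sketch under that assumption (the project name and exact library versions are illustrative, not taken from the question) pins the Scala version to the 2.11 line Spark 2.0.0 is built with and marks Spark itself as provided so sbt-assembly does not bundle a second copy:

```scala
// build.sbt -- illustrative sketch; project name and versions are assumptions
name := "streaming-job"

scalaVersion := "2.11.8"  // must match the Scala build of the Spark 2.0.0 distribution

libraryDependencies ++= Seq(
  // Spark classes come from the cluster at runtime, so keep them out of the fat jar
  "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided",
  // The Kafka integration is usually not on the cluster classpath, so it stays in the assembly
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
)
```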

Kafka OffsetOutOfRangeException

Submitted by 你离开我真会死。 on 2019-12-24 06:49:27
Question: I am streaming loads of data through Kafka, and Spark Streaming is consuming these messages. Down the line, Spark Streaming throws this error: kafka.common.OffsetOutOfRangeException. I am aware of what this error means, so I changed the retention policy to 5 days. However, I still encountered the same issue. I then listed all the messages for a topic using --from-beginning in Kafka. Sure enough, a ton of messages from the beginning of the Kafka streaming part were…
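For context, this exception means the consumer asked for an offset that retention has already deleted. A minimal sketch of the consumer-side configuration, assuming the job uses the direct approach from spark-streaming-kafka-0-8 (broker address and topic name are placeholders; this shows where the reset policy is set, not a guaranteed fix for offsets that expire mid-run):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// ssc is the job's StreamingContext; broker list and topic are placeholders.
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "localhost:9092",
  // When no valid stored offset exists, start from the earliest offset still
  // retained ("smallest" is the 0.8 consumer value) instead of failing.
  "auto.offset.reset" -> "smallest"
)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))
```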

Spark Streaming Filtering the Streaming data

Submitted by 余生颓废 on 2019-12-24 06:48:44
Question: I am trying to filter streaming data, and based on the value of the id column I want to save the data to different tables. I have two tables: testTable_odd (id, data1, data2) and testTable_even (id, data1). If the id value is odd I want to save the record to testTable_odd, and if the value is even I want to save the record to testTable_even. The tricky part here is that my two tables have different columns. I have tried multiple ways and considered Scala functions with a return type of Either[obj1, obj2], but I…
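One straightforward option is to split the stream with two filters rather than returning an Either. A minimal sketch under the assumption of a simple case class for parsed records; the Record shape and the saveOdd/saveEven stubs are hypothetical stand-ins for whatever write path the job actually uses:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record shape; the real schema comes from the actual stream.
case class Record(id: Long, data1: String, data2: String)

// Stub writers standing in for the real JDBC/Hive code.
def saveOdd(r: Record): Unit = println(s"testTable_odd  <- $r")
def saveEven(p: (Long, String)): Unit = println(s"testTable_even <- $p")

def route(records: DStream[Record]): Unit = {
  // Odd ids keep all three columns; even ids are projected down to (id, data1).
  val odd  = records.filter(_.id % 2 != 0)
  val even = records.filter(_.id % 2 == 0).map(r => (r.id, r.data1))

  odd.foreachRDD(_.foreach(saveOdd))
  even.foreachRDD(_.foreach(saveEven))
}
```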

Converting pipe-delimited file to spark dataframe to CSV file

Submitted by 送分小仙女□ on 2019-12-24 06:44:34
Question: I have a CSV file with one single column, and the rows are defined as follows: 123 || food || fruit 123 || food || fruit || orange 123 || food || fruit || apple I want to create a CSV file with a single column and distinct row values, such as: orange apple I tried using the following code: val data = sc.textFile("fruits.csv") val rows = data.map(_.split("||")) val rddnew = rows.flatMap( arr => { val text = arr(0) val words = text.split("||") words.map( word => ( word, text ) ) } ) But this code…
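The core issue in the quoted code is that String.split takes a regular expression, so "||" means an empty alternation and splits between every character; the pipes must be escaped. A minimal sketch assuming the goal is the distinct tokens that appear after the "123 || food || fruit" prefix (the output path is a placeholder):

```scala
// "fruits.csv" comes from the question; the rest is a sketch.
val data = sc.textFile("fruits.csv")

val distinctExtras = data
  .map(_.split("\\|\\|").map(_.trim))   // escape the pipes: split() expects a regex
  .flatMap(tokens => tokens.drop(3))    // keep only what follows "123 || food || fruit"
  .filter(_.nonEmpty)
  .distinct()

distinctExtras.saveAsTextFile("fruits_distinct_out")   // placeholder output path
```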

Detecting a lost connection in Spark Streaming

Submitted by 寵の児 on 2019-12-24 02:10:01
Question: I am currently working with Apache Spark Streaming. I want to know how to detect whether the connection to the external data source has been lost, so that we can stop streaming and reconnect to the data source. Thanks in advance for any help. Answer 1: Add a listener to the receiver you have and stop the streaming context when the receiver has stopped. Example: streamContext.addStreamingListener(new StreamingListener() { @Override public void onReceiverStopped…
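A Scala sketch of the same listener approach (the ssc variable name is a placeholder; stopping a context from inside its own listener callback may need extra care in practice, e.g. setting a flag that the driver loop checks instead):

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerReceiverStopped}

def watchReceiver(ssc: StreamingContext): Unit = {
  ssc.addStreamingListener(new StreamingListener {
    override def onReceiverStopped(stopped: StreamingListenerReceiverStopped): Unit = {
      // The receiver lost its source (or was shut down); stop streaming so the
      // driver can rebuild the connection and start a fresh context.
      ssc.stop(stopSparkContext = false, stopGracefully = true)
    }
  })
}
```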

How to package a Spark Scala application

Submitted by 家住魔仙堡 on 2019-12-24 01:56:13
Question: I have developed a standalone Spark Scala application which uses Spark SQL and Spark Streaming. It works fine in Eclipse, which is configured for Spark. I am a newbie with Maven. To package this application using Maven, I followed the tutorial below: http://ryancompton.net/2014/05/19/sample-pomxml-to-build-scala--jar-with-dependenciesjar/ But I ended up with the following error: [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.5:add-source (scala-compile-first) on project…
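Failures in the add-source goal usually come from the scala-maven-plugin not finding a Scala compiler/library consistent with the project setup. A hedged pom.xml fragment (the Scala version and phase binding are illustrative assumptions; the execution id and plugin version come from the error message) showing the usual pairing of an explicit scala-library dependency with that plugin execution:

```xml
<!-- Illustrative fragment only; the project's own coordinates are omitted. -->
<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.4</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.1.5</version>
      <executions>
        <execution>
          <id>scala-compile-first</id>
          <phase>process-resources</phase>
          <goals>
            <goal>add-source</goal>
            <goal>compile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```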

Spark RDD write to HBase

Submitted by 烂漫一生 on 2019-12-24 01:24:22
Question: I am able to read messages from Kafka using the code below: val ssc = new StreamingContext(sc, Seconds(50)) val topicmap = Map("test" -> 1) val lines = KafkaUtils.createStream(ssc,"127.0.0.1:2181", "test-consumer-group",topicmap) But I am trying to read each message from Kafka and put it into HBase. This is my code to write into HBase, but with no success: lines.foreachRDD(rdd => { rdd.foreach(record => { val i = +1 val hConf = new HBaseConfiguration() val hTable = new HTable(hConf, "test")…
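A common pattern, sketched here rather than taken from an accepted answer, is to open the HBase table once per partition instead of once per record and build a Put per message. The table name "test" comes from the question; the column family "cf" and qualifier "col" are assumptions of this sketch:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

// lines is the (key, value) DStream created by KafkaUtils.createStream above.
lines.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One connection per partition instead of per record.
    val hConf  = HBaseConfiguration.create()
    val hTable = new HTable(hConf, "test")   // old-style client API, as in the question
    try {
      records.foreach { case (key, value) =>
        // Assumes the Kafka message key is non-null and usable as the row key.
        val put = new Put(Bytes.toBytes(key))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        hTable.put(put)
      }
    } finally {
      hTable.close()
    }
  }
}
```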

Storing a DataFrame to a Hive partitioned table in Spark

Submitted by 拥有回忆 on 2019-12-24 01:23:40
Question: I'm trying to store a stream of data coming in from a Kafka topic into a Hive partitioned table. I was able to convert the DStream to a DataFrame and created a HiveContext. My code looks like this: val hiveContext = new HiveContext(sc) hiveContext.setConf("hive.exec.dynamic.partition", "true") hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict") newdf.registerTempTable("temp") //newdf is my dataframe newdf.write.mode(SaveMode.Append).format("osv").partitionBy("date")…
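A minimal sketch of one way to append a DataFrame into a partitioned Hive table with dynamic partitioning enabled. The HiveContext setup mirrors the question; the ORC format and the target table name are assumptions of this sketch (the "osv" string in the quoted code is left as the asker wrote it):

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// newdf is the DataFrame built from the DStream, as in the question.
newdf.write
  .mode(SaveMode.Append)
  .format("orc")                          // format choice is an assumption of this sketch
  .partitionBy("date")
  .saveAsTable("my_partitioned_table")    // placeholder Hive table name
```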

KafkaStreams EXACTLY_ONCE guarantee - skipping kafka offsets

Submitted by 你。 on 2019-12-24 00:59:32
Question: I'm using Spark 2.2.0 and the Kafka 0.10 spark-streaming library to read from a topic filled by a Kafka Streams Scala application. The Kafka broker version is 0.11 and the Kafka Streams version is 0.11.0.2. When I set the EXACTLY_ONCE guarantee in the Kafka Streams app: p.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE) I get this error in Spark: java.lang.AssertionError: assertion failed: Got wrong record for spark-executor-<group.id> <topic> 0 even after seeking to offset 24 at scala…
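Background for the assertion: transactional producers (which EXACTLY_ONCE enables) write commit markers into the topic, so consumed offsets are no longer consecutive. A sketch of the Spark consumer side, assuming the direct stream API from spark-streaming-kafka-0-10 (broker, group id and topic are placeholders); it shows where the isolation.level setting lives, not a claim that it resolves the assertion on Spark 2.2:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Broker address, group id and topic are placeholders for this sketch.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-consumer",
  // Only read records from committed transactions; the transaction markers
  // still leave gaps in the offsets, which is what the assertion is about.
  "isolation.level"    -> "read_committed"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Set("my-topic"), kafkaParams)
)
```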

Spark Streaming + Hive

Submitted by ε祈祈猫儿з on 2019-12-24 00:42:31
Question: We are in the process of building an application that takes data from a source system through Flume and then, via the Kafka messaging system, into Spark Streaming for in-memory processing; after processing the data into a data frame we will put the data into Hive tables. The flow will be as follows: Source System -> Flume -> Kafka -> Spark Streaming -> Hive. Is this the correct flow, or do we need to review it? We are taking the discretized stream and converting it into a data frame for SQL-compatible functions. Now we have 14…
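A minimal sketch of the DStream-to-DataFrame-to-Hive step described above. The record shape, the table name, and the use of SparkSession with Hive support are assumptions of the sketch, not details from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

// Hypothetical parsed message; real fields depend on the source system.
case class Event(id: Long, payload: String)

def persist(events: DStream[Event]): Unit = {
  events.foreachRDD { rdd =>
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    import spark.implicits._

    val df = rdd.toDF()                                   // micro-batch -> DataFrame
    df.write.mode("append").saveAsTable("events_table")   // placeholder Hive table
  }
}
```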