spark-streaming

Cannot connect from Spark Streaming to Kafka: org.apache.spark.SparkException: java.net.SocketTimeoutException

99封情书 submitted on 2019-12-11 12:15:59
Question: I'm trying to read from a Kafka topic with a Spark Streaming direct stream, but I receive the following error:
INFO consumer.SimpleConsumer: Reconnect due to socket error: java.net.SocketTimeoutException
ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: java.net.SocketTimeoutException
java.net.SocketTimeoutException
org.apache.spark.SparkException: java.net.SocketTimeoutException
java.net.SocketTimeoutException
at org.apache.spark.streaming.kafka…
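For reference, a minimal sketch of the kind of direct-stream setup the question describes, using the Kafka 0.8 integration API. The broker address, topic name, batch interval, and timeout value are placeholders, not taken from the original post; a common cause of this timeout is brokers in metadata.broker.list that are not reachable from the driver and executors.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-direct-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // "broker1:9092" and "mytopic" are placeholders; the brokers must be
    // reachable from every executor, otherwise metadata fetches time out.
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "broker1:9092",
      "socket.timeout.ms"    -> "60000" // raise the SimpleConsumer socket timeout
    )
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))

    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()
  }
}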

Can't kill YARN apps using ResourceManager UI after HDP 3.1.0.0-78 upgrade

旧巷老猫 submitted on 2019-12-11 11:06:31
Question: I recently upgraded HDP from 2.6.5 to 3.1.0, which runs YARN 3.1.0, and I can no longer kill applications from the YARN ResourceManager UI, using either the old (:8088/cluster/apps) or new (:8088/ui2/index.html#/yarn-apps/apps) version. I can still kill them from the shell in RHEL 7 with yarn app -kill {app-id}. These applications are submitted via Livy. Here is my workflow: open the ResourceManager UI, open the application, click Settings and choose Kill Application. Notice that the 'Logged in as…

What is the correct way of using memSQL Connection object inside call method of Apache Spark code

一个人想着一个人 submitted on 2019-12-11 10:49:58
Question: I have Spark code where the code inside the Call method makes a call to the memSQL database to read from a table. My code opens a new connection object each time and closes it after the task is done. This call is made from inside the Call method. It works fine, but the execution time of the Spark job becomes high. What would be a better way to do this so that the execution time is reduced? Thank you. Answer 1: You can use one connection per partition, like this: rdd.foreachPartition…
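The answer excerpt is cut off at rdd.foreachPartition. A minimal sketch of the one-connection-per-partition pattern it refers to follows, using plain JDBC (memSQL speaks the MySQL wire protocol); the JDBC URL, credentials, table, and column names are placeholders, not from the original post.

import java.sql.DriverManager
import org.apache.spark.rdd.RDD

def writeCounts(rdd: RDD[(String, Int)]): Unit = {
  rdd.foreachPartition { partition =>
    // One connection per partition instead of one per record.
    val conn = DriverManager.getConnection(
      "jdbc:mysql://memsql-host:3306/mydb", "user", "password")
    try {
      val stmt = conn.prepareStatement("INSERT INTO counts (word, cnt) VALUES (?, ?)")
      partition.foreach { case (word, cnt) =>
        stmt.setString(1, word)
        stmt.setInt(2, cnt)
        stmt.executeUpdate()
      }
    } finally {
      conn.close() // close once per partition, after all records are written
    }
  }
}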

Can spark streaming pick specific files

梦想的初衷 submitted on 2019-12-11 10:43:44
Question: My program continuously reads streams from a Hadoop folder (say /hadoopPath/). It picks up all the files from that folder. Can I pick only specific file types from this folder (like /hadoopPath/*.log)? I have another question related to Spark Streaming: does Spark Streaming work with both "cp" and "mv"? Answer 1: I've been struggling with the same problem for a couple of hours, and although it seemed so easy, I could not find anything online about it. Finally, I found a solution that worked…
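The accepted solution is truncated above; one way to restrict the input to *.log files is fileStream with a path filter, sketched below with the question's directory as a placeholder. On the second question, the usual recommendation is to move ("mv") files into the monitored directory so they appear atomically; copying ("cp") risks a file being picked up while it is still being written.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("log-files-only")
val ssc  = new StreamingContext(conf, Seconds(30))

// fileStream accepts a path filter, so only *.log files are picked up.
val logLines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "/hadoopPath/",
    (path: Path) => path.getName.endsWith(".log"),
    newFilesOnly = true)
  .map { case (_, text) => text.toString }

logLines.print()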

In Spark Streaming how to process old data and delete processed Data

别等时光非礼了梦想. submitted on 2019-12-11 10:35:24
Question: We are running a Spark Streaming job that retrieves files from a directory (using textFileStream). One concern we have is the case where the job is down but files are still being added to the directory. Once the job starts up again, those files are not picked up (since they are not new or changed while the job is running), but we would like them to be processed. 1) Is there a solution for that? Is there a way to keep track of which files have been processed, and can we "force" older…
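The question is cut off, but for the first part (picking up files that were added while the job was down), one option is fileStream with newFilesOnly = false; textFileStream does not expose that flag, so the sketch below drops down to fileStream. The directory, batch interval, and remember-window setting are assumptions, not from the original post.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("reprocess-old-files")
  // how far back existing files are still considered is bounded by the file
  // stream's remember window; widen it if the job can be down for a while
  .set("spark.streaming.fileStream.minRememberDuration", "3600s")
val ssc = new StreamingContext(conf, Seconds(60))

// newFilesOnly = false makes the first batch also include files that already
// exist in the directory when the job starts (within the remember window).
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "/data/incoming",
    (_: Path) => true,
    newFilesOnly = false)
  .map { case (_, text) => text.toString }

lines.print()
ssc.start()
ssc.awaitTermination()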

Why does sbt fail to find KafkaUtils?

风格不统一 submitted on 2019-12-11 10:17:52
Question: I have this error in my code (wordCount from Kafka) compiled with sbt:
[error] /home/hduser/sbt_project/project1/src/main/scala/sparkKafka.scala:4:35: object kafka is not a member of package org.apache.spark.streaming
[error] import org.apache.spark.streaming.kafka.KafkaUtils
not found: value KafkaUtils
[error] val lines = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("customer" -> 2))
The file build.sbt contains these dependencies: libraryDependencies…
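The build.sbt excerpt ends at libraryDependencies. The import org.apache.spark.streaming.kafka.KafkaUtils and the createStream(ssc, zkQuorum, groupId, topics) call live in the separate Kafka integration artifact, which has to be added explicitly. A sketch of what the dependency block typically looks like follows; the Scala, Spark, and artifact versions are assumptions and must match the cluster (for Spark 1.x the artifact is spark-streaming-kafka instead).

// build.sbt sketch: versions are assumptions, align them with your Spark installation.
name := "project1"

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                % "2.3.0" % "provided",
  "org.apache.spark" %% "spark-streaming"           % "2.3.0" % "provided",
  // KafkaUtils.createStream(ssc, zkQuorum, groupId, topics) lives in this artifact:
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.3.0"
)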

Why reading broadcast variable in Spark Streaming got exception after days of running?

一曲冷凌霜 submitted on 2019-12-11 09:53:48
Question: I am using Spark Streaming (Spark v1.6.0) along with HBase in my project, and HBase (HBase v1.1.2) configurations are transferred among executors with a broadcast variable. The Spark Streaming application works at first, but after about 2 days an exception appears.
val hBaseContext: HBaseContext = new HBaseContext(sc, HBaseCock.hBaseConfiguration())
private def _materialDStream(dStream: DStream[(String, Int)], columnName: String, batchSize: Int) = {
  hBaseContext.streamBulkIncrement[(String,…
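The code excerpt is truncated, so the exact failure is unclear, but a pattern commonly suggested when broadcasts stop resolving after long runs (especially when the streaming context is recovered from a checkpoint) is a lazily instantiated, singleton-style broadcast holder. The sketch below follows that pattern; the object and value names are illustrative, not from the original post.

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Lazily created singleton so the broadcast can be (re)built on demand
// instead of being captured once in closures for the lifetime of the job.
object HBaseConfBroadcast {
  @volatile private var instance: Broadcast[Map[String, String]] = _

  def getInstance(sc: SparkContext, conf: Configuration): Broadcast[Map[String, String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          // Broadcast plain key/value pairs; Hadoop Configuration itself is not Serializable.
          import scala.collection.JavaConverters._
          val props = conf.iterator().asScala.map(e => e.getKey -> e.getValue).toMap
          instance = sc.broadcast(props)
        }
      }
    }
    instance
  }
}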

Spark streaming using MQTTutils to subscribe topic from activemq with authentication

那年仲夏 submitted on 2019-12-11 09:09:57
Question: It seems MQTTUtils only provides three methods:
def createStream(jssc: JavaStreamingContext, brokerUrl: String, topic: String, storageLevel: StorageLevel): JavaDStream[String]
  Create an input stream that receives messages pushed by an MQTT publisher.
def createStream(jssc: JavaStreamingContext, brokerUrl: String, topic: String): JavaDStream[String]
  Create an input stream that receives messages pushed by an MQTT publisher.
def createStream(ssc: StreamingContext, brokerUrl: String, topic: String,…
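Since none of these createStream overloads accept credentials, one workaround is a custom receiver that talks to the broker directly with the Eclipse Paho MQTT client, whose MqttConnectOptions does take a username and password. The sketch below is illustrative, not part of the MQTTUtils API; the class name and constructor parameters are assumptions.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import org.eclipse.paho.client.mqttv3._
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

// Hypothetical receiver: connects with credentials, stores each message payload.
class AuthMqttReceiver(brokerUrl: String, topic: String, user: String, pass: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  @transient private var client: MqttClient = _

  def onStart(): Unit = {
    client = new MqttClient(brokerUrl, MqttClient.generateClientId(), new MemoryPersistence())
    val opts = new MqttConnectOptions()
    opts.setUserName(user)
    opts.setPassword(pass.toCharArray)
    client.setCallback(new MqttCallback {
      def messageArrived(t: String, m: MqttMessage): Unit = store(new String(m.getPayload, "UTF-8"))
      def connectionLost(cause: Throwable): Unit = restart("Connection lost", cause)
      def deliveryComplete(token: IMqttDeliveryToken): Unit = ()
    })
    client.connect(opts)
    client.subscribe(topic)
  }

  def onStop(): Unit = if (client != null) client.disconnect()
}

It would then be wired in with ssc.receiverStream(new AuthMqttReceiver("tcp://broker:1883", "mytopic", "user", "pass")), where the URL, topic, and credentials are again placeholders.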

Creating a slice of DStream window

不羁岁月 submitted on 2019-12-11 08:06:30
Question: My context is that I have a Spark custom receiver that receives a data stream from an HTTP endpoint. The HTTP endpoint is updated every 30 seconds with new data, so it does not make sense for my Spark Streaming application to aggregate all the data in the 30-second time frame, as that obviously leads to duplicate data (when I save the DStream as a file, each part file that represents an RDD is exactly the same). To avoid this deduplication process, I want a 5-second slice of this window…
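The question is truncated, but one reading is: with a 5-second batch interval, a 5-second window that slides every 30 seconds emits only the newest slice of each 30-second cycle rather than the full window of identical snapshots. A sketch under that assumption follows, with a socket stream standing in for the custom HTTP receiver; paths and durations are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("window-slice-sketch")
// 5-second batches; the window and slide durations below are multiples of this.
val ssc = new StreamingContext(conf, Seconds(5))

// Stand-in source; in the question this would be the custom HTTP receiver.
val stream = ssc.socketTextStream("localhost", 9999)

// A 5-second window sliding every 30 seconds: each cycle emits only the most
// recent 5 seconds of received data instead of 30 seconds of duplicates.
val slice = stream.window(Seconds(5), Seconds(30))
slice.saveAsTextFiles("hdfs:///tmp/slice")

ssc.start()
ssc.awaitTermination()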

java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access

我怕爱的太早我们不能终老 submitted on 2019-12-11 08:04:20
Question: I have a Scala Spark Streaming application that receives data from the same topic from 3 different Kafka producers. The Spark Streaming application is on the machine with host 0.0.0.179, the Kafka server is on the machine with host 0.0.0.178, and the Kafka producers are on machines 0.0.0.180, 0.0.0.181, and 0.0.0.182. When I try to run the Spark Streaming application I get the error below: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0…
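The stack trace is cut off, but with the kafka-0-10 integration this ConcurrentModificationException is often triggered when several streams in one application share a group.id; the integration guide recommends a separate group.id for each call to createDirectStream. A sketch of a single direct stream with its own group id follows; hosts, topic, and group name are placeholders based on the question.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-010-sketch")
val ssc  = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "0.0.0.178:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-streaming-app-1", // unique per createDirectStream call
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("mytopic"), kafkaParams))

stream.map(_.value).print()
ssc.start()
ssc.awaitTermination()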