spark-streaming

Cannot connect from Spark Streaming to Kafka: org.apache.spark.SparkException: java.net.SocketTimeoutException

99封情书 submitted on 2019-12-11 12:15:59
Question: I'm trying to read from a Kafka topic with a Spark Streaming direct stream, but I receive the following error:
INFO consumer.SimpleConsumer: Reconnect due to socket error: java.net.SocketTimeoutException
ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: java.net.SocketTimeoutException
java.net.SocketTimeoutException
org.apache.spark.SparkException: java.net.SocketTimeoutException
java.net.SocketTimeoutException
at org.apache.spark.streaming.kafka…
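For reference, a minimal sketch of the kind of direct-stream setup the question describes, using the Kafka 0.8 integration API. The broker address, topic name, batch interval, and timeout value are placeholders, not taken from the original post; a common cause of this timeout is brokers in metadata.broker.list that are not reachable from the driver and executors.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-direct-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // "broker1:9092" and "mytopic" are placeholders; the brokers must be
    // reachable from every executor, otherwise metadata fetches time out.
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "broker1:9092",
      "socket.timeout.ms"    -> "60000" // raise the SimpleConsumer socket timeout
    )
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))

    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()
  }
}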

Can't kill YARN apps using ResourceManager UI after HDP 3.1.0.0-78 upgrade

旧巷老猫 submitted on 2019-12-11 11:06:31
Question: I recently upgraded HDP from 2.6.5 to 3.1.0, which runs YARN 3.1.0, and I can no longer kill applications from the YARN ResourceManager UI, using either the old (:8088/cluster/apps) or new (:8088/ui2/index.html#/yarn-apps/apps) version. I can still kill them from the shell in RHEL 7 with yarn app -kill {app-id}. These applications are submitted via Livy. Here is my workflow: open the ResourceManager UI, open the application, click Settings and choose Kill Application. Notice that the 'Logged in as…

What is the correct way of using memSQL Connection object inside call method of Apache Spark code

一个人想着一个人 submitted on 2019-12-11 10:49:58
Question: I have Spark code where the code inside the Call method makes a call to the memSQL database to read from a table. My code opens a new connection object each time and closes it after the task is done. This call is made from inside the Call method. It works fine, but the execution time of the Spark job becomes high. What would be a better way to do this so that the execution time is reduced? Thank you. Answer 1: You can use one connection per partition, like this: rdd.foreachPartition…
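The answer excerpt is cut off at rdd.foreachPartition. A minimal sketch of the one-connection-per-partition pattern it refers to follows, using plain JDBC (memSQL speaks the MySQL wire protocol); the JDBC URL, credentials, table, and column names are placeholders, not from the original post.

import java.sql.DriverManager
import org.apache.spark.rdd.RDD

def writeCounts(rdd: RDD[(String, Int)]): Unit = {
  rdd.foreachPartition { partition =>
    // One connection per partition instead of one per record.
    val conn = DriverManager.getConnection(
      "jdbc:mysql://memsql-host:3306/mydb", "user", "password")
    try {
      val stmt = conn.prepareStatement("INSERT INTO counts (word, cnt) VALUES (?, ?)")
      partition.foreach { case (word, cnt) =>
        stmt.setString(1, word)
        stmt.setInt(2, cnt)
        stmt.executeUpdate()
      }
    } finally {
      conn.close() // close once per partition, after all records are written
    }
  }
}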

Can spark streaming pick specific files

梦想的初衷 submitted on 2019-12-11 10:43:44
Question: My program continuously reads streams from a Hadoop folder (say /hadoopPath/). It picks up all the files from that folder. Can I pick only specific file types from this folder (like /hadoopPath/*.log)? I have another question related to Spark Streaming: does Spark Streaming work with both "cp" and "mv"? Answer 1: I've been struggling with the same problem for a couple of hours, and although it seemed so easy, I could not find anything online about it. Finally, I found a solution that worked…
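The accepted solution is truncated above; one way to restrict the input to *.log files is fileStream with a path filter, sketched below with the question's directory as a placeholder. On the second question, the usual recommendation is to move ("mv") files into the monitored directory so they appear atomically; copying ("cp") risks a file being picked up while it is still being written.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("log-files-only")
val ssc  = new StreamingContext(conf, Seconds(30))

// fileStream accepts a path filter, so only *.log files are picked up.
val logLines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "/hadoopPath/",
    (path: Path) => path.getName.endsWith(".log"),
    newFilesOnly = true)
  .map { case (_, text) => text.toString }

logLines.print()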

In Spark Streaming how to process old data and delete processed Data

别等时光非礼了梦想. submitted on 2019-12-11 10:35:24
Question: We are running a Spark Streaming job that retrieves files from a directory (using textFileStream). One concern we have is the case where the job is down but files are still being added to the directory. Once the job starts up again, those files are not picked up (since they are not new or changed while the job is running), but we would like them to be processed. 1) Is there a solution for that? Is there a way to keep track of which files have been processed, and can we "force" older…
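The question is cut off, but for the first part (picking up files that were added while the job was down), one option is fileStream with newFilesOnly = false; textFileStream does not expose that flag, so the sketch below drops down to fileStream. The directory, batch interval, and remember-window setting are assumptions, not from the original post.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("reprocess-old-files")
  // how far back existing files are still considered is bounded by the file
  // stream's remember window; widen it if the job can be down for a while
  .set("spark.streaming.fileStream.minRememberDuration", "3600s")
val ssc = new StreamingContext(conf, Seconds(60))

// newFilesOnly = false makes the first batch also include files that already
// exist in the directory when the job starts (within the remember window).
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "/data/incoming",
    (_: Path) => true,
    newFilesOnly = false)
  .map { case (_, text) => text.toString }

lines.print()
ssc.start()
ssc.awaitTermination()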

Why does sbt fail to find KafkaUtils?

风格不统一 submitted on 2019-12-11 10:17:52
Question: I have this error in my code (wordCount from Kafka) compiled with sbt:
[error] /home/hduser/sbt_project/project1/src/main/scala/sparkKafka.scala:4:35: object kafka is not a member of package org.apache.spark.streaming
[error] import org.apache.spark.streaming.kafka.KafkaUtils
not found: value KafkaUtils
[error] val lines = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("customer" -> 2))
The file build.sbt contains these dependencies: libraryDependencies…
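The build.sbt excerpt ends at libraryDependencies. The import org.apache.spark.streaming.kafka.KafkaUtils and the createStream(ssc, zkQuorum, groupId, topics) call live in the separate Kafka integration artifact, which has to be added explicitly. A sketch of what the dependency block typically looks like follows; the Scala, Spark, and artifact versions are assumptions and must match the cluster (for Spark 1.x the artifact is spark-streaming-kafka instead).

// build.sbt sketch: versions are assumptions, align them with your Spark installation.
name := "project1"

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                % "2.3.0" % "provided",
  "org.apache.spark" %% "spark-streaming"           % "2.3.0" % "provided",
  // KafkaUtils.createStream(ssc, zkQuorum, groupId, topics) lives in this artifact:
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.3.0"
)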

Why reading broadcast variable in Spark Streaming got exception after days of running?

一曲冷凌霜 submitted on 2019-12-11 09:53:48
Question: I am using Spark Streaming (Spark v1.6.0) along with HBase in my project, and HBase (HBase v1.1.2) configurations are transferred among executors with a broadcast variable. The Spark Streaming application works at first, but after about 2 days an exception appears.
val hBaseContext: HBaseContext = new HBaseContext(sc, HBaseCock.hBaseConfiguration())
private def _materialDStream(dStream: DStream[(String, Int)], columnName: String, batchSize: Int) = {
  hBaseContext.streamBulkIncrement[(String,…
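The code excerpt is truncated, so the exact failure is unclear, but a pattern commonly suggested when broadcasts stop resolving after long runs (especially when the streaming context is recovered from a checkpoint) is a lazily instantiated, singleton-style broadcast holder. The sketch below follows that pattern; the object and value names are illustrative, not from the original post.

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Lazily created singleton so the broadcast can be (re)built on demand
// instead of being captured once in closures for the lifetime of the job.
object HBaseConfBroadcast {
  @volatile private var instance: Broadcast[Map[String, String]] = _

  def getInstance(sc: SparkContext, conf: Configuration): Broadcast[Map[String, String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          // Broadcast plain key/value pairs; Hadoop Configuration itself is not Serializable.
          import scala.collection.JavaConverters._
          val props = conf.iterator().asScala.map(e => e.getKey -> e.getValue).toMap
          instance = sc.broadcast(props)
        }
      }
    }
    instance
  }
}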

Spark streaming using MQTTutils to subscribe topic from activemq with authentication

那年仲夏 submitted on 2019-12-11 09:09:57
Question: It seems MQTTUtils only provides three methods:
def createStream(jssc: JavaStreamingContext, brokerUrl: String, topic: String, storageLevel: StorageLevel): JavaDStream[String]
  Create an input stream that receives messages pushed by an MQTT publisher.
def createStream(jssc: JavaStreamingContext, brokerUrl: String, topic: String): JavaDStream[String]
  Create an input stream that receives messages pushed by an MQTT publisher.
def createStream(ssc: StreamingContext, brokerUrl: String, topic: String,…
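Since none of these createStream overloads accept credentials, one workaround is a custom receiver that talks to the broker directly with the Eclipse Paho MQTT client, whose MqttConnectOptions does take a username and password. The sketch below is illustrative, not part of the MQTTUtils API; the class name and constructor parameters are assumptions.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import org.eclipse.paho.client.mqttv3._
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

// Hypothetical receiver: connects with credentials, stores each message payload.
class AuthMqttReceiver(brokerUrl: String, topic: String, user: String, pass: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  @transient private var client: MqttClient = _

  def onStart(): Unit = {
    client = new MqttClient(brokerUrl, MqttClient.generateClientId(), new MemoryPersistence())
    val opts = new MqttConnectOptions()
    opts.setUserName(user)
    opts.setPassword(pass.toCharArray)
    client.setCallback(new MqttCallback {
      def messageArrived(t: String, m: MqttMessage): Unit = store(new String(m.getPayload, "UTF-8"))
      def connectionLost(cause: Throwable): Unit = restart("Connection lost", cause)
      def deliveryComplete(token: IMqttDeliveryToken): Unit = ()
    })
    client.connect(opts)
    client.subscribe(topic)
  }

  def onStop(): Unit = if (client != null) client.disconnect()
}

It would then be wired in with ssc.receiverStream(new AuthMqttReceiver("tcp://broker:1883", "mytopic", "user", "pass")), where the URL, topic, and credentials are again placeholders.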

Creating a slice of DStream window

不羁岁月 submitted on 2019-12-11 08:06:30
Question: My context is that I have a Spark custom receiver that receives a data stream from an HTTP endpoint. The HTTP endpoint is updated every 30 seconds with new data, so it does not make sense for my Spark Streaming application to aggregate all the data in the 30-second time frame, as that obviously leads to duplicate data (when I save the DStream as a file, each part file that represents an RDD is exactly the same). To avoid this deduplication process, I want a 5-second slice of this window…
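The question is truncated, but one reading is: with a 5-second batch interval, a 5-second window that slides every 30 seconds emits only the newest slice of each 30-second cycle rather than the full window of identical snapshots. A sketch under that assumption follows, with a socket stream standing in for the custom HTTP receiver; paths and durations are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("window-slice-sketch")
// 5-second batches; the window and slide durations below are multiples of this.
val ssc = new StreamingContext(conf, Seconds(5))

// Stand-in source; in the question this would be the custom HTTP receiver.
val stream = ssc.socketTextStream("localhost", 9999)

// A 5-second window sliding every 30 seconds: each cycle emits only the most
// recent 5 seconds of received data instead of 30 seconds of duplicates.
val slice = stream.window(Seconds(5), Seconds(30))
slice.saveAsTextFiles("hdfs:///tmp/slice")

ssc.start()
ssc.awaitTermination()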

java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access

我怕爱的太早我们不能终老 submitted on 2019-12-11 08:04:20
Question: I have a Scala Spark Streaming application that receives data from the same topic from 3 different Kafka producers. The Spark Streaming application is on the machine with host 0.0.0.179, the Kafka server is on the machine with host 0.0.0.178, and the Kafka producers are on machines 0.0.0.180, 0.0.0.181, and 0.0.0.182. When I try to run the Spark Streaming application I get the error below: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0…
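The stack trace is cut off, but with the kafka-0-10 integration this ConcurrentModificationException is often triggered when several streams in one application share a group.id; the integration guide recommends a separate group.id for each call to createDirectStream. A sketch of a single direct stream with its own group id follows; hosts, topic, and group name are placeholders based on the question.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-010-sketch")
val ssc  = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "0.0.0.178:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-streaming-app-1", // unique per createDirectStream call
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("mytopic"), kafkaParams))

stream.map(_.value).print()
ssc.start()
ssc.awaitTermination()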