spark-streaming

Spark Error: invalid log directory /app/spark/spark-1.6.1-bin-hadoop2.6/work/app-20161018015113-0000/3/

Submitted by 放肆的年华 on 2019-12-23 23:23:39
Question: My Spark application is failing with the above error. My Spark program actually writes its logs to that directory, and both stderr and stdout are being written on all the workers. The program used to work fine, but yesterday I changed the folder that SPARK_WORKER_DIR points to, and today I put the old setting back and restarted Spark. Can anyone give me a clue as to why I am getting this error? Answer 1: In my case the problem was caused by the activation of SPARK_WORKER_OPTS="-Dspark.worker

Spark RDD Block Removed Before Use

Submitted by 坚强是说给别人听的谎言 on 2019-12-23 21:48:58
Question: I am using a Future to perform a blocking operation on an RDD, like this:
dStreams.foreach(_.foreachRDD { rdd => Future { writeRDD(rdd) } })
Sometimes I get this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: org.apache.spark.SparkException: Attempted to use BlockRDD[820] at actorStream at Tests.scala:149 after its blocks have been removed!
It seems like Spark has trouble knowing when this RDD should be deleted. Why is this happening and
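A pattern that avoids this race, shown here as a minimal sketch (it assumes writeRDD is the poster's own blocking write helper), is to keep the write synchronous inside foreachRDD so the batch cannot finish and unpersist the BlockRDD while the write is still running:

```scala
// Sketch only: do the blocking write as part of the batch job instead of in a detached Future,
// so the BlockRDD's blocks are still registered when the write actually runs.
dStreams.foreach(_.foreachRDD { rdd =>
  writeRDD(rdd)   // blocking call; writeRDD is assumed to be defined elsewhere by the poster
})
```

If the write really must be asynchronous, another commonly suggested knob is ssc.remember(...) so the streaming context keeps generated RDDs around longer than one batch; treat that as a hint to verify rather than a confirmed fix.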

Spark Streaming: How to get the filename of a processed file in Python

Submitted by 吃可爱长大的小学妹 on 2019-12-23 19:43:48
Question: I'm sort of a noob at Spark (and Python, honestly), so please forgive me if I've missed something obvious. I am doing file streaming with Spark and Python. In the first example I tried, Spark correctly listens to the given directory and counts word occurrences in the files, so I know everything works in terms of listening to the directory. Now I am trying to get the name of the file that is processed, for auditing purposes. I read here http://mail-archives.us.apache.org/mod_mbox/spark
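For reference, the workaround usually pointed to is written against the Scala API (PySpark does not expose the underlying input split), so the following is only a hedged Scala sketch of that idea, not the mailing-list answer itself; ssc is assumed to be an existing StreamingContext and inputDir is a placeholder path:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.{NewHadoopRDD, UnionRDD}

// Each batch RDD of a fileStream is a union of one NewHadoopRDD per newly detected file,
// so the file name can be recovered from the input split backing each partition.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](inputDir)
val linesWithFileName = lines.transform { rdd =>
  new UnionRDD(rdd.context, rdd.dependencies.map { dep =>
    dep.rdd.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
      .mapPartitionsWithInputSplit { (split, iter) =>
        val path = split.asInstanceOf[FileSplit].getPath.toString
        iter.map { case (_, line) => (path, line.toString) }
      }
  })
}
```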

Handle database connection inside spark streaming

Submitted by ⅰ亾dé卋堺 on 2019-12-23 16:24:20
Question: I am not sure whether I understand correctly how Spark handles database connections, and how to reliably run a large number of database update operations inside Spark without potentially breaking the job. This is a code snippet I have been using (simplified for illustration):
val driver = new MongoDriver
val hostList: List[String] = conf.getString("mongo.hosts").split(",").toList
val connection = driver.connection(hostList)
val mongodb = connection(conf.getString("mongo.db"))
val dailyInventoryCol =
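The pattern that answers to this kind of question usually converge on is to obtain the client on the executors rather than the driver, typically inside foreachPartition, often backed by a lazily initialized singleton so each executor JVM reuses one connection. A minimal sketch of that idea follows; dailyInventoryStream stands in for the poster's DStream, and MongoClientHolder and writeDoc are hypothetical helper names, not part of the question's code:

```scala
// Sketch: obtain/reuse the connection on the executor side, inside foreachPartition,
// so no non-serializable driver-side connection object is captured by the closure.
dailyInventoryStream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val collection = MongoClientHolder.dailyInventoryCollection  // hypothetical lazy singleton per executor JVM
    records.foreach(record => writeDoc(collection, record))      // hypothetical write helper
  }
}
```

This keeps the number of open connections proportional to the number of executors rather than to the number of records, which is usually what goes wrong when the connection is created on the driver or per record.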

Error in starting Spark streaming context

Submitted by 空扰寡人 on 2019-12-23 12:42:13
Question: I am new to Spark Streaming and am writing code for a Twitter connector. When I run this code more than once, it throws the following exception. I have to create a new HDFS directory for checkpointing each time to make it run successfully, and moreover it doesn't stop.
ERROR StreamingContext: Error starting the context, marking it as stopped org.apache.spark.SparkException: org.apache.spark.streaming.dstream.WindowedDStream@532d0784 has not been initialized at org.apache.spark.streaming
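The usual explanation for "has not been initialized" on a restart from a checkpoint is that the DStream graph is built outside the factory function passed to StreamingContext.getOrCreate, so the recovered context cannot match it. A minimal sketch of the expected shape, assuming a placeholder checkpoint path and batch interval (the Twitter DStream and window setup are elided):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/me/checkpoints/twitter-app"  // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("TwitterConnector")
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // Build the Twitter DStream, window operations and output actions HERE, inside the
  // factory, so a context recovered from the checkpoint sees the same graph.
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```

As for the job not stopping, shutdown is normally done with ssc.stop() or by enabling spark.streaming.stopGracefullyOnShutdown, but without more of the code that is only a guess.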

Spark Streaming - java.io.IOException: Lease timeout of 0 seconds expired

Submitted by 。_饼干妹妹 on 2019-12-23 12:38:52
Question: I have a Spark Streaming application that writes checkpoints to HDFS. Does anyone know the solution? Previously we were using kinit to specify the principal and keytab, and got the suggestion to pass these via the spark-submit command instead of kinit, but we still get this error and it brings the Spark Streaming application down.
spark-submit --principal sparkuser@HADOOP.ABC.COM --keytab /home/sparkuser/keytab/sparkuser.keytab --name MyStreamingApp --master yarn-cluster --conf "spark.driver.extraJavaOptions=-XX:

About an error accessing a field inside Tuple2

Submitted by 寵の児 on 2019-12-23 12:37:10
Question: I am trying to access a field within a Tuple2 and the compiler returns an error. The software pushes a case class into a Kafka topic; I then want to recover it using Spark Streaming so I can feed a machine-learning algorithm and save the results in a Mongo instance. Solved! I finally solved my problem, and I am going to post the final solution. This is the GitHub project: https://github.com/alonsoir/awesome-recommendation-engine/tree/develop
build.sbt
name := "my-recommendation
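For readers who land here with the same compiler error: the two fields of a Scala Tuple2 are reached with ._1 and ._2, or destructured with a pattern match. A small generic illustration, not the poster's actual code:

```scala
// Plain Tuple2 field access:
val pair: (String, Double) = ("product-123", 4.5)
val productId: String = pair._1
val rating: Double    = pair._2

// The same access via a pattern match, which often reads better inside Spark transformations:
val described = Seq(pair).map { case (id, score) => s"$id rated $score" }
```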

java.lang.NoSuchMethodError: net.jpountz.util.Utils.checkRange

Submitted by ⅰ亾dé卋堺 on 2019-12-23 09:42:10
Question: I use spark-streaming 2.2.0 with Python and read data from a Kafka (2.11-0.10.0.0) cluster. I submit a Python script with
spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.2.0.jar hodor.py
and Spark reports this error message:
17/08/04 10:52:00 ERROR Utils: Uncaught exception in thread stdout writer for python java.lang.NoSuchMethodError: net.jpountz.util.Utils.checkRange([BII)V at org.apache.kafka.common.message.KafkaLZ4BlockInputStream.read(KafkaLZ4BlockInputStream.java:176) at

Spark streaming is not working in Standalone cluster deployed in VM

Submitted by 南笙酒味 on 2019-12-23 09:25:47
Question: I have written a Kafka streaming program in Scala and am executing it on a Spark standalone cluster. The code works fine locally. I have set up Kafka, Cassandra and Spark on an Azure VM, and I have opened all inbound and outbound ports to avoid port blocking.
Started the master: sbin> ./start-master.sh
Started the slave: sbin# ./start-slave.sh spark://vm-hostname:7077
I have verified the status in the master web UI.
Submitted the job: bin# ./spark-submit --class x.y.StreamJob --master spark://vm-hostname:7077 /home/user/appl
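Two things worth checking when a streaming job works locally but does nothing on a standalone cluster: the application should not hard-code a local master URL (a .setMaster set in code overrides the --master passed to spark-submit), and a receiver-based Kafka stream needs more than one core or the receiver starves the processing. A minimal sketch of the first point; the app name and batch interval are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Leave the master out of the code so spark-submit's --master spark://vm-hostname:7077 takes effect;
// a hard-coded .setMaster("local[*]") would quietly keep the whole job on the driver machine.
val conf = new SparkConf().setAppName("StreamJob")
val ssc  = new StreamingContext(conf, Seconds(5))
```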

Spark Streaming from Kafka has error numRecords must not be negative

Submitted by 可紊 on 2019-12-23 07:55:50
Question: It's kind of a strange error, because I am still pushing data to Kafka and consuming messages from it, and Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: numRecords must not be negative is kind of strange too. I searched and could not find any related resource. Let me explain my cluster: I have one server that acts as both master and slave running Mesos, and on it I set up 3 Kafka brokers. Then I run the Spark job on that cluster. I am using Spark 1.5.2.
brokers: id: 0 active: true
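One commonly reported cause of "numRecords must not be negative" with the direct Kafka stream is a stale checkpoint whose stored offsets are ahead of, or no longer present on, the brokers (for example after a topic was deleted and recreated), so the computed record count goes negative. A hedged sketch of how the direct stream is typically created on Spark 1.5.x after clearing the stale checkpoint directory; broker addresses and the topic name are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Spark 1.5.x / Kafka 0.8 direct API sketch; ssc is assumed to be an existing StreamingContext.
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092,broker2:9092,broker3:9092",
  "auto.offset.reset"    -> "smallest"  // fall back to the earliest offsets the brokers still hold
)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("your-topic"))
```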