spark-streaming

Spark custom streaming dropping most of the data

回眸只為那壹抹淺笑 submitted on 2019-12-13 07:58:18
Question: I'm following the example for Spark Streaming using a custom receiver, as given on the Spark site (Spark custom receiver). However, the job seems to drop most of my data. Whatever amount of data I stream, it is successfully received at the consumer; but when I do any map/flatMap operations on it, I only ever see 10 rows of data, no matter how much I stream. I have modified this program to read from an ActiveMQ queue. If I look at the ActiveMQ web interface, …
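For reference, a minimal sketch of the custom-receiver pattern being described, assuming a JMS/ActiveMQ consumer; the class name, broker URL, and queue name are placeholders, not the asker's actual code. The key points are that every record must be handed to store() and that the blocking receive loop runs on its own thread.

```scala
import javax.jms.{Connection, Session, TextMessage}
import org.apache.activemq.ActiveMQConnectionFactory
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Minimal custom receiver that pulls text messages from an ActiveMQ queue.
class ActiveMQReceiver(brokerUrl: String, queueName: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  @volatile private var connection: Connection = _

  def onStart(): Unit = {
    // Run the blocking consume loop on its own thread so onStart returns quickly.
    new Thread("ActiveMQ Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    if (connection != null) connection.close()
  }

  private def receive(): Unit = {
    try {
      val factory = new ActiveMQConnectionFactory(brokerUrl)
      connection = factory.createConnection()
      connection.start()
      val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
      val consumer = session.createConsumer(session.createQueue(queueName))
      while (!isStopped()) {
        consumer.receive(1000) match {
          case text: TextMessage => store(text.getText) // hand each record to Spark
          case _                 =>                     // receive timeout or non-text message
        }
      }
    } catch {
      case e: Exception => restart("Error receiving from ActiveMQ", e)
    }
  }
}
```

Such a receiver would be plugged in with ssc.receiverStream(new ActiveMQReceiver(brokerUrl, queueName)) before applying the map/flatMap operations.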

Spark Streaming textFileStream watch output of RDD.saveAsTextFile

青春壹個敷衍的年華 submitted on 2019-12-13 07:56:44
Question: Running Spark 1.6.2 (YARN mode). Firstly, I have some code from this post to get filenames within Spark Streaming, so that could be the issue, but hopefully not. Basically, I have this first job:

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{StreamingContext, Seconds}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

def getStream(ssc: StreamingContext, dir: String): DStream[String] = {
  ssc.fileStream …
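The definition is cut off above. A hedged sketch of how such a helper is usually completed with the keyed fileStream API (not necessarily the asker's exact code); this version only returns the lines, and recovering the source file names, which is the point of the linked post, would additionally require going through the underlying Hadoop RDDs:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// A text fileStream exposes (byte offset, line) pairs; mapping away the key
// yields the same DStream[String] that textFileStream would give.
def getStream(ssc: StreamingContext, dir: String): DStream[String] = {
  ssc.fileStream[LongWritable, Text, TextInputFormat](dir)
    .map { case (_, line) => line.toString }
}
```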

HDFS : java.io.FileNotFoundException : File does not exist: name._COPYING

怎甘沉沦 submitted on 2019-12-13 07:05:42
Question: I'm working with Spark Streaming using Scala. I need to read a .csv file dynamically from an HDFS directory with this line:

val lines = ssc.textFileStream("/user/root/")

I use the following command to put the file into HDFS:

hdfs dfs -put ./head40k.csv

It works fine with a relatively small file. When I try with a larger one, I get this error:

org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/root/head800k.csv._COPYING

I can understand why, but …
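For context, a common workaround (an assumption here, not necessarily the accepted answer to this question) is either to upload into a staging directory and rename the file into the watched directory once the copy completes, or to ignore in-flight files with the fileStream overload that takes a path filter:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Watch /user/root/ but skip files that hdfs dfs -put is still writing
// (they carry a temporary ._COPYING_ suffix until the copy completes).
def csvLines(ssc: StreamingContext): DStream[String] =
  ssc.fileStream[LongWritable, Text, TextInputFormat](
      "/user/root/",
      (path: Path) => !path.getName.endsWith("._COPYING_"),
      newFilesOnly = true)
    .map(_._2.toString)
```

The rename approach works because an HDFS rename is atomic, so the watched directory only ever sees complete files.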

Spark Direct Streaming - consume same message in multiple consumers

余生长醉 submitted on 2019-12-13 07:03:20
Question: How do I consume Kafka topic messages in multiple consumers using the Direct Stream approach? Is it possible, given that the Direct Stream approach doesn't have the consumer-group concept? What happens if I pass group.id in the kafkaParams for the DirectStream method? The code below works with group.id in the Kafka params and also without it. Sample code:

val kafkaParams = Map(
  "group.id" -> "group1",
  CommonClientConfigs.SECURITY_PROTOCOL_CONFIG -> sasl,
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache…
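For reference, a sketch of the spark-streaming-kafka-0-10 direct stream the question appears to be using; broker address, topic, and group name are placeholders, and ssc is assumed to be an existing StreamingContext. With this integration, group.id mainly determines where consumed offsets are tracked, and a second application using a different group.id independently receives the same messages.

```scala
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Direct stream against the Kafka 0.10 integration.
val kafkaParams = Map[String, Object](
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092", // placeholder broker
  ConsumerConfig.GROUP_ID_CONFIG -> "group1",
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Set("my-topic"), kafkaParams)
)
```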

Pyspark - Transfer control out of Spark Session (sc)

萝らか妹 submitted on 2019-12-13 07:01:20
Question: This is a follow-up question to Pyspark filter operation on Dstream. To keep a count of how many error/warning messages have come through for, say, a day or an hour, how does one design the job? What I have tried:

from __future__ import print_function
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def counts():
    counter += 1
    print(counter.value)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", …

Error while trying to connect cassandra database using spark streaming

耗尽温柔 submitted on 2019-12-13 05:52:01
Question: I'm working on a project which uses Spark Streaming, Apache Kafka, and Cassandra, with the streaming-kafka integration. In Kafka I have a producer which sends data using this configuration:

props.put("metadata.broker.list", KafkaProperties.ZOOKEEPER);
props.put("bootstrap.servers", KafkaProperties.SERVER);
props.put("client.id", "DemoProducer");

where ZOOKEEPER = localhost:2181 and SERVER = localhost:9092. Once I send data I can receive it with Spark, and I can consume it too. My Spark …
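The Spark-side error itself is cut off above. For reference only, a minimal self-contained write path with the DataStax spark-cassandra-connector might look roughly like the sketch below; the host, keyspace, table, and the socket source standing in for the Kafka stream are all placeholders.

```scala
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The connector reads its contact point from spark.cassandra.connection.host.
val conf = new SparkConf()
  .setAppName("kafka-to-cassandra")
  .setMaster("local[2]")
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
val ssc = new StreamingContext(conf, Seconds(5))

// Any DStream of tuples or case classes can be written directly; assumes a
// table demo.lines(text text PRIMARY KEY, length int) already exists.
ssc.socketTextStream("localhost", 9999)
  .map(line => (line, line.length))
  .saveToCassandra("demo", "lines", SomeColumns("text", "length"))

ssc.start()
ssc.awaitTermination()
```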

Should spark broadcast variables' type be number or string when I try to restart a job from checkpoint

▼魔方 西西 submitted on 2019-12-13 05:47:56
Question: When I set a collection as a broadcast variable, it always returns a serialization error. I have already tried Map, HashMap, and Array; all failed.

Answer 1: It's a known bug in Spark: https://issues.apache.org/jira/browse/SPARK-5206. You can use a singleton object to let each executor load the data itself. See https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.java for a full example:

class JavaWordBlacklist { …
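The Java snippet in the answer is cut off. The same lazily-initialized singleton idea, sketched in Scala after the pattern used in Spark's RecoverableNetworkWordCount example (the blacklist contents below are placeholders):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Lazily re-creates the broadcast variable after a driver restart instead of
// trying to read it back from the checkpoint.
object WordBlacklist {

  @volatile private var instance: Broadcast[Seq[String]] = null

  def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(Seq("a", "b", "c")) // placeholder contents
        }
      }
    }
    instance
  }
}
```

Inside foreachRDD the code would call WordBlacklist.getInstance(rdd.sparkContext), so the broadcast is rebuilt on recovery rather than deserialized from the checkpoint.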

Spark : How to append to cached rdd?

不羁的心 submitted on 2019-12-13 05:18:37
Question: Distinct values are cached with every streamed batch of data. How do I build up the cache by adding the distinct values from the next batch to the already-cached RDD?

Answer 1: You cannot directly append data to an RDD because it is immutable. Use union to create a new RDD and then cache it.

Source: https://stackoverflow.com/questions/34077905/spark-how-to-append-to-cached-rdd
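A rough sketch of the union-and-recache pattern the answer describes; ssc and all names below are illustrative, not taken from the question.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Keeps a running, cached RDD of the distinct values seen so far.
var distinctSoFar: RDD[String] = ssc.sparkContext.emptyRDD[String]

def accumulateDistinct(batches: DStream[String]): Unit = {
  batches.foreachRDD { batch =>
    val updated = distinctSoFar.union(batch).distinct().cache()
    updated.count()            // materialise the new cache
    distinctSoFar.unpersist()  // drop the previous generation
    distinctSoFar = updated
  }
}
```

Because the lineage of the accumulated RDD grows with every batch, checkpointing it periodically is usually advisable.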

Error while running standalone app example in python using spark

核能气质少年 submitted on 2019-12-13 04:57:12
Question: I am just getting started with Spark and am running it in standalone mode on an Amazon EC2 instance. I was trying the examples mentioned in the documentation, and while going through the Simple App example I keep getting this error: NameError: name 'numAs' is not defined

from pyspark import SparkContext

logFile = "$YOUR_SPARK_HOME/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' …

Kafka uncommitted message not getting consumed again

烈酒焚心 submitted on 2019-12-13 03:46:29
Question: I am processing Kafka messages and inserting them into a Kudu table using Spark Streaming with manual offset commits. Here is my code:

val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
  ConsumerConfig.GROUP_ID_CONFIG -> groupId,
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG …
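The snippet stops partway through kafkaParams. With the 0.10 integration, the commit side of the manual-offset pattern typically looks like the sketch below, where stream is the direct stream created from those params and the Kudu write is elided. Note that commitAsync only stores offsets in Kafka for use after a restart; a running stream keeps consuming from the offsets it already holds in memory, which is generally why an uncommitted batch is not re-read until the application restarts.

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Commit offsets back to Kafka only after the batch has been written out.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... upsert rdd into the Kudu table here ...

  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```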