spark-streaming-kafka

Avoid writing files for empty partitions in Spark Streaming

Posted by 无人久伴 on 2019-12-12 01:26:19
Question: I have a Spark Streaming job which reads data from Kafka partitions (one executor per partition). I need to save the transformed values to HDFS, but I need to avoid creating empty files. I tried using isEmpty, but this doesn't help when not all partitions are empty. P.S. repartition is not an acceptable solution due to performance degradation.

Answer 1: The code works for PairRDD only. Code for text: val conf = ssc.sparkContext.hadoopConfiguration conf.setClass("mapreduce.output.lazyoutputformat
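The answer's snippet is cut off above. A minimal Scala sketch of the LazyOutputFormat approach it points to might look like the following; stream (a DStream[String] of already-transformed values), ssc, and basePath are assumed names.

    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.OutputFormat
    import org.apache.hadoop.mapreduce.lib.output.{LazyOutputFormat, TextOutputFormat}

    // Register the real output format behind LazyOutputFormat so that part files
    // are only created when a partition actually writes a record.
    val hadoopConf = ssc.sparkContext.hadoopConfiguration
    hadoopConf.setClass(
      "mapreduce.output.lazyoutputformat.outputformat",
      classOf[TextOutputFormat[Text, NullWritable]],
      classOf[OutputFormat[Text, NullWritable]])

    stream.foreachRDD { (rdd, time) =>
      // The values must be turned into a pair RDD before saveAsNewAPIHadoopFile.
      rdd.map(value => (new Text(value), NullWritable.get))
        .saveAsNewAPIHadoopFile(
          s"$basePath/batch-${time.milliseconds}",
          classOf[Text],
          classOf[NullWritable],
          classOf[LazyOutputFormat[Text, NullWritable]],
          hadoopConf)
    }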

Spark Streaming Kafka stream

Posted by 扶醉桌前 on 2019-12-08 19:36:23
Question: I'm having some issues while trying to read from Kafka with Spark Streaming. My code is: val sparkConf = new SparkConf().setMaster("local[2]").setAppName("KafkaIngestor") val ssc = new StreamingContext(sparkConf, Seconds(2)) val kafkaParams = Map[String, String]( "zookeeper.connect" -> "localhost:2181", "group.id" -> "consumergroup", "metadata.broker.list" -> "localhost:9092", "zookeeper.connection.timeout.ms" -> "10000" //"kafka.auto.offset.reset" -> "smallest" ) val topics = Set("test") val
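The snippet above is truncated. A minimal sketch of how it might continue, assuming the 0.8-style spark-streaming-kafka API that matches the parameters shown (zookeeper.connect is only needed by the receiver-based API; the direct approach below only uses metadata.broker.list):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Direct (receiver-less) stream over the "test" topic; each element is a
    // (key, value) pair of strings.
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    messages.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()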

Spark Streaming Kafka Integration direct Approach EOFException

Posted by 最后都变了- on 2019-12-08 13:22:26
Question: When I run the Spark Streaming example org.apache.spark.examples.streaming.JavaDirectKafkaWordCount, I get the EOFException below. How can I resolve it? Exception in thread "main" org.apache.spark.SparkException: java.io.EOFException: Received -1 when reading from channel, socket has likely been closed. java.io.EOFException: Received -1 when reading from channel, socket has likely been closed. java.io.EOFException: Received -1 when reading from channel, socket has likely been closed. at org
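The stack trace is cut off above. With the direct approach, "Received -1 when reading from channel" often means the broker list points at the wrong port (for example ZooKeeper's 2181 instead of a Kafka broker's 9092) or at a broker that is not reachable. A hedged sketch of how the example is usually launched, with hypothetical host names:

    # the broker list must be host:port pairs of Kafka brokers, not ZooKeeper
    bin/run-example streaming.JavaDirectKafkaWordCount kafka-host:9092 test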

Extract the timestamp from Kafka messages in Spark Streaming?

Posted by 天涯浪子 on 2019-12-07 19:01:25
Question: I'm trying to read from a Kafka source. I want to extract the timestamp from each received message in order to do structured Spark streaming. Kafka (version 0.10.0.0), Spark Streaming (version 2.0.1).

Answer 1: I'd suggest a couple of things: Suppose you create a stream via the latest Kafka streaming API (0.10 Kafka). E.g. you use the dependency: "org.apache.spark" %% "spark-streaming-kafka-0-10" % 2.0.1 Then you create a stream, according to the docs above: val kafkaParams = Map[String, Object]( "bootstrap.servers" -> "broker1:9092
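The answer's snippet is cut off above. A minimal Scala sketch of the rest of that approach, assuming the spark-streaming-kafka-0-10 integration it names (the topic name and group id are made up): ConsumerRecord in Kafka 0.10 exposes the message timestamp directly.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "timestamp-example",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("my-topic"), kafkaParams))

    // Each element is a ConsumerRecord[String, String]; record.timestamp() is the
    // Kafka message timestamp (create time or log-append time, depending on the broker).
    val withTimestamps = stream.map(record => (record.timestamp(), record.value()))
    withTimestamps.print()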

Spark Streaming Kafka createDirectStream - Spark UI shows input event size as zero

Posted by 送分小仙女□ on 2019-12-05 21:20:20
I have implemented Spark Streaming using createDirectStream. My Kafka producer is sending several messages every second to a topic with two partitions. On the Spark Streaming side, I read Kafka messages every second and then window them with a 5-second window size and frequency. The Kafka messages are properly processed; I'm seeing the right computations and prints. But in the Spark Web UI, under the Streaming section, the number of events per window is shown as zero. Please see this image: I'm puzzled why it is showing zero; shouldn't it show the number of Kafka messages being fed into the Spark stream? Updated:
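For reference, a minimal Scala sketch of the setup described (1-second batches, 5-second window size and slide over a direct stream; the stream names, topic, and kafkaParams are assumptions):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))

    // 5-second window, sliding every 5 seconds, over the message values.
    stream.map(_._2)
      .window(Seconds(5), Seconds(5))
      .count()
      .print()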

Spark submit failed with Spark Streaming wordcount Python code

Posted by 时光总嘲笑我的痴心妄想 on 2019-12-04 19:25:12
I just copied the Spark Streaming wordcount Python code and used spark-submit to run it in a Spark cluster, but it shows the following errors: py4j.protocol.Py4JJavaError: An error occurred while calling o23.loadClass. : java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) I did build the jar spark
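The description is cut off above, but this ClassNotFoundException usually means the Kafka integration assembly jar was not on the classpath when the job was submitted. A hedged sketch of one way the job is typically submitted (the jar name, version, and script arguments are assumptions):

    bin/spark-submit \
      --jars spark-streaming-kafka-assembly_2.10-1.6.3.jar \
      kafka_wordcount.py localhost:2181 test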

Spark Streaming Kafka offset management

Posted by 被刻印的时光 ゝ on 2019-12-04 18:58:45
I have been running Spark Streaming jobs that consume and produce data through Kafka. I used a direct DStream, so I have to manage offsets myself; we adopted Redis to write and read offsets. Now there is one problem: when I launch my client, it needs to get the offsets from Redis, not the offsets that exist in Kafka itself. How should I write my code? I have written my code below: kafka_stream = KafkaUtils.createDirectStream( ssc, topics=[config.CONSUME_TOPIC, ], kafkaParams={"bootstrap.servers": config.CONSUME_BROKERS, "auto.offset.reset": "largest"}, fromOffsets=read_offset_range(config.OFFSET_KEY
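The PySpark call above is truncated, so here is a hedged Scala sketch of the same pattern for reference: read the last committed offsets from Redis into a fromOffsets map before creating the direct stream, then write the offset ranges of each finished batch back to Redis. readOffsetsFromRedis, saveOffsetsToRedis, and consumeTopic are hypothetical names.

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // Load the last saved offsets for the topic from Redis (hypothetical helper).
    val fromOffsets: Map[TopicAndPartition, Long] = readOffsetsFromRedis(consumeTopic)

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc,
      kafkaParams,
      fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch ...
      // Persist the processed offsets back to Redis (hypothetical helper).
      saveOffsetsToRedis(offsetRanges)
    }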

Kafka Producer - org.apache.kafka.common.serialization.StringSerializer could not be found

Posted by 柔情痞子 on 2019-11-30 20:48:15
I am creating a simple Kafka producer and consumer. I am using kafka_2.11-0.9.0.0. Here is my producer code: public class KafkaProducerTest { public static String topicName = "test-topic-2"; public static void main(String[] args) { // TODO Auto-generated method stub Properties props = new Properties(); props.put("bootstrap.servers", "localhost:9092"); props.put("acks", "all"); props.put("retries", 0); props.put("batch.size", 16384); props.put("linger.ms", 1); props.put("buffer.memory", 33554432); props.put("key.serializer", StringSerializer.class.getName()); props.put("value.serializer",
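A frequent cause of this error is that the kafka-clients jar, which contains org.apache.kafka.common.serialization.StringSerializer, is missing from the runtime classpath even though the project compiles. If the build happens to use sbt, the dependency might look like the sketch below (the version is chosen to match the kafka_2.11-0.9.0.0 broker mentioned above, an assumption):

    // kafka-clients ships the producer/consumer API and the String serializers.
    libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.9.0.0"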

Spark Streaming - read and write on Kafka topic

Posted by 笑着哭i on 2019-11-27 17:05:19
I am using Spark Streaming to process data between two Kafka queues, but I cannot seem to find a good way to write to Kafka from Spark. I have tried this: input.foreachRDD(rdd => rdd.foreachPartition(partition => partition.foreach { case x: String => { val props = new HashMap[String, Object]() props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers) props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer") props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer") println(x) val
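The snippet above is cut off, and it also builds a new producer for every record. A common refinement is to create the producer once per partition inside foreachPartition and close it when the partition is done; a hedged Scala sketch (brokers and outputTopic are assumed names):

    import java.util.HashMap
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

    input.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // One producer per partition: KafkaProducer is not serializable, so it has
        // to be built on the executor, but it should not be rebuilt per record.
        val props = new HashMap[String, Object]()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        partition.foreach { message =>
          producer.send(new ProducerRecord[String, String](outputTopic, message))
        }
        producer.close()
      }
    }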