spark-streaming-kafka

Avoid writing files for empty partitions in Spark Streaming

Posted by 无人久伴 on 2019-12-12 01:26:19
Question: I have a Spark Streaming job which reads data from Kafka partitions (one executor per partition). I need to save the transformed values to HDFS, but I need to avoid creating empty files. I tried using isEmpty, but this doesn't help when not all partitions are empty. P.S. repartition is not an acceptable solution due to performance degradation.

Answer 1: The code works for PairRDD only. Code for text: val conf = ssc.sparkContext.hadoopConfiguration conf.setClass("mapreduce.output.lazyoutputformat
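The answer's snippet is cut off above. A minimal Scala sketch of the LazyOutputFormat approach it points to might look like the following; stream (a DStream[String] of already-transformed values), ssc, and basePath are assumed names.

    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.OutputFormat
    import org.apache.hadoop.mapreduce.lib.output.{LazyOutputFormat, TextOutputFormat}

    // Register the real output format behind LazyOutputFormat so that part files
    // are only created when a partition actually writes a record.
    val hadoopConf = ssc.sparkContext.hadoopConfiguration
    hadoopConf.setClass(
      "mapreduce.output.lazyoutputformat.outputformat",
      classOf[TextOutputFormat[Text, NullWritable]],
      classOf[OutputFormat[Text, NullWritable]])

    stream.foreachRDD { (rdd, time) =>
      // The values must be turned into a pair RDD before saveAsNewAPIHadoopFile.
      rdd.map(value => (new Text(value), NullWritable.get))
        .saveAsNewAPIHadoopFile(
          s"$basePath/batch-${time.milliseconds}",
          classOf[Text],
          classOf[NullWritable],
          classOf[LazyOutputFormat[Text, NullWritable]],
          hadoopConf)
    }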

Spark Streaming Kafka stream

Posted by 扶醉桌前 on 2019-12-08 19:36:23
Question: I'm having some issues while trying to read from Kafka with Spark Streaming. My code is: val sparkConf = new SparkConf().setMaster("local[2]").setAppName("KafkaIngestor") val ssc = new StreamingContext(sparkConf, Seconds(2)) val kafkaParams = Map[String, String]( "zookeeper.connect" -> "localhost:2181", "group.id" -> "consumergroup", "metadata.broker.list" -> "localhost:9092", "zookeeper.connection.timeout.ms" -> "10000" //"kafka.auto.offset.reset" -> "smallest" ) val topics = Set("test") val
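The snippet above is truncated. A minimal sketch of how it might continue, assuming the 0.8-style spark-streaming-kafka API that matches the parameters shown (zookeeper.connect is only needed by the receiver-based API; the direct approach below only uses metadata.broker.list):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Direct (receiver-less) stream over the "test" topic; each element is a
    // (key, value) pair of strings.
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    messages.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()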

Spark Streaming Kafka Integration direct Approach EOFException

Posted by 最后都变了- on 2019-12-08 13:22:26
Question: When I run the Spark Streaming example org.apache.spark.examples.streaming.JavaDirectKafkaWordCount, I get the EOFException below. How can I resolve it? Exception in thread "main" org.apache.spark.SparkException: java.io.EOFException: Received -1 when reading from channel, socket has likely been closed. java.io.EOFException: Received -1 when reading from channel, socket has likely been closed. java.io.EOFException: Received -1 when reading from channel, socket has likely been closed. at org
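The stack trace is cut off above. With the direct approach, "Received -1 when reading from channel" often means the broker list points at the wrong port (for example ZooKeeper's 2181 instead of a Kafka broker's 9092) or at a broker that is not reachable. A hedged sketch of how the example is usually launched, with hypothetical host names:

    # the broker list must be host:port pairs of Kafka brokers, not ZooKeeper
    bin/run-example streaming.JavaDirectKafkaWordCount kafka-host:9092 test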

Extract the timestamp from Kafka messages in Spark Streaming?

Posted by 天涯浪子 on 2019-12-07 19:01:25
Question: I'm trying to read from a Kafka source. I want to extract the timestamp from each received message in order to do structured Spark streaming. Kafka (version 0.10.0.0), Spark Streaming (version 2.0.1).

Answer 1: I'd suggest a couple of things: Suppose you create a stream via the latest Kafka streaming API (0.10 Kafka). E.g. you use the dependency: "org.apache.spark" %% "spark-streaming-kafka-0-10" % 2.0.1 Then you create a stream, according to the docs above: val kafkaParams = Map[String, Object]( "bootstrap.servers" -> "broker1:9092
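The answer's snippet is cut off above. A minimal Scala sketch of the rest of that approach, assuming the spark-streaming-kafka-0-10 integration it names (the topic name and group id are made up): ConsumerRecord in Kafka 0.10 exposes the message timestamp directly.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "timestamp-example",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("my-topic"), kafkaParams))

    // Each element is a ConsumerRecord[String, String]; record.timestamp() is the
    // Kafka message timestamp (create time or log-append time, depending on the broker).
    val withTimestamps = stream.map(record => (record.timestamp(), record.value()))
    withTimestamps.print()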

Spark Streaming Kafka createDirectStream - Spark UI shows input event size as zero

Posted by 送分小仙女□ on 2019-12-05 21:20:20
I have implemented Spark Streaming using createDirectStream. My Kafka producer is sending several messages every second to a topic with two partitions. On the Spark Streaming side, I read Kafka messages every second and then window them with a 5-second window size and frequency. The Kafka messages are properly processed; I'm seeing the right computations and prints. But in the Spark Web UI, under the Streaming section, the number of events per window is shown as zero. Please see this image: I'm puzzled why it is showing zero; shouldn't it show the number of Kafka messages being fed into the Spark stream? Updated:
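For reference, a minimal Scala sketch of the setup described (1-second batches, 5-second window size and slide over a direct stream; the stream names, topic, and kafkaParams are assumptions):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))

    // 5-second window, sliding every 5 seconds, over the message values.
    stream.map(_._2)
      .window(Seconds(5), Seconds(5))
      .count()
      .print()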

Spark submit failed with Spark Streaming wordcount Python code

Posted by 时光总嘲笑我的痴心妄想 on 2019-12-04 19:25:12
I just copied the Spark Streaming wordcount Python code and used spark-submit to run it in a Spark cluster, but it shows the following errors: py4j.protocol.Py4JJavaError: An error occurred while calling o23.loadClass. : java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) I did build the jar spark
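The description is cut off above, but this ClassNotFoundException usually means the Kafka integration assembly jar was not on the classpath when the job was submitted. A hedged sketch of one way the job is typically submitted (the jar name, version, and script arguments are assumptions):

    bin/spark-submit \
      --jars spark-streaming-kafka-assembly_2.10-1.6.3.jar \
      kafka_wordcount.py localhost:2181 test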

Spark Streaming Kafka offset management

Posted by 被刻印的时光 ゝ on 2019-12-04 18:58:45
I have been running Spark Streaming jobs that consume and produce data through Kafka. I used a direct DStream, so I have to manage offsets myself; we adopted Redis to write and read offsets. Now there is one problem: when I launch my client, it needs to get the offsets from Redis, not the offsets that exist in Kafka itself. How should I write my code? I have written my code below: kafka_stream = KafkaUtils.createDirectStream( ssc, topics=[config.CONSUME_TOPIC, ], kafkaParams={"bootstrap.servers": config.CONSUME_BROKERS, "auto.offset.reset": "largest"}, fromOffsets=read_offset_range(config.OFFSET_KEY
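The PySpark call above is truncated, so here is a hedged Scala sketch of the same pattern for reference: read the last committed offsets from Redis into a fromOffsets map before creating the direct stream, then write the offset ranges of each finished batch back to Redis. readOffsetsFromRedis, saveOffsetsToRedis, and consumeTopic are hypothetical names.

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // Load the last saved offsets for the topic from Redis (hypothetical helper).
    val fromOffsets: Map[TopicAndPartition, Long] = readOffsetsFromRedis(consumeTopic)

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc,
      kafkaParams,
      fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch ...
      // Persist the processed offsets back to Redis (hypothetical helper).
      saveOffsetsToRedis(offsetRanges)
    }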

Kafka Producer - org.apache.kafka.common.serialization.StringSerializer could not be found

Posted by 柔情痞子 on 2019-11-30 20:48:15
I am creating a simple Kafka producer and consumer. I am using kafka_2.11-0.9.0.0. Here is my producer code: public class KafkaProducerTest { public static String topicName = "test-topic-2"; public static void main(String[] args) { // TODO Auto-generated method stub Properties props = new Properties(); props.put("bootstrap.servers", "localhost:9092"); props.put("acks", "all"); props.put("retries", 0); props.put("batch.size", 16384); props.put("linger.ms", 1); props.put("buffer.memory", 33554432); props.put("key.serializer", StringSerializer.class.getName()); props.put("value.serializer",
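A frequent cause of this error is that the kafka-clients jar, which contains org.apache.kafka.common.serialization.StringSerializer, is missing from the runtime classpath even though the project compiles. If the build happens to use sbt, the dependency might look like the sketch below (the version is chosen to match the kafka_2.11-0.9.0.0 broker mentioned above, an assumption):

    // kafka-clients ships the producer/consumer API and the String serializers.
    libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.9.0.0"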

Spark Streaming - read and write on Kafka topic

Posted by 笑着哭i on 2019-11-27 17:05:19
I am using Spark Streaming to process data between two Kafka queues, but I cannot seem to find a good way to write to Kafka from Spark. I have tried this: input.foreachRDD(rdd => rdd.foreachPartition(partition => partition.foreach { case x: String => { val props = new HashMap[String, Object]() props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers) props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer") props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer") println(x) val
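The snippet above is cut off, and it also builds a new producer for every record. A common refinement is to create the producer once per partition inside foreachPartition and close it when the partition is done; a hedged Scala sketch (brokers and outputTopic are assumed names):

    import java.util.HashMap
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

    input.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // One producer per partition: KafkaProducer is not serializable, so it has
        // to be built on the executor, but it should not be rebuilt per record.
        val props = new HashMap[String, Object]()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        partition.foreach { message =>
          producer.send(new ProducerRecord[String, String](outputTopic, message))
        }
        producer.close()
      }
    }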