How can I write the results of a JavaPairDStream to an output Kafka topic in Spark Streaming?

既然无缘 2021-01-07 03:08

I'm looking for a way to write a DStream to an output Kafka topic, but only when the micro-batch RDDs actually produce some output.

I'm using Spark Streaming and spark-streaming

  • 2021-01-07 03:33

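    If dStream contains the data that you want to send to Kafka:

    dStream.foreachRDD(rdd -> {
        rdd.foreachPartition(iter -> {
            Producer producer = createKafkaProducer();
            while (iter.hasNext()) {
                sendToKafka(producer, iter.next());  //sendToKafka is a placeholder for your own send logic
            }
        });
    });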

    This way, you create one producer per RDD partition.
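The per-partition pattern can be tried without a broker. The following is a minimal sketch, not the answerer's code: `RecordSink` and `InMemorySink` are hypothetical stand-ins for a Kafka producer, and plain lists stand in for RDD partitions. It also skips empty partitions, so nothing is opened when a micro-batch produces no output, and it closes the sink once the partition is drained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Broker-free sketch: RecordSink is a hypothetical stand-in for a Kafka producer.
class PerPartitionSend {

    interface RecordSink extends AutoCloseable {
        void send(String record);

        @Override
        void close();
    }

    // Keeps "sent" records in a list so the behaviour can be inspected.
    static final class InMemorySink implements RecordSink {
        final List<String> sent;
        boolean closed = false;

        InMemorySink(List<String> sent) {
            this.sent = sent;
        }

        @Override
        public void send(String record) {
            sent.add(record);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Sends every record of one partition through a single sink.
    // Returns the number of records sent; an empty partition opens no sink at all.
    static int sendPartition(Iterator<String> records, Supplier<RecordSink> sinkFactory) {
        if (!records.hasNext()) {
            return 0;
        }
        int count = 0;
        try (RecordSink sink = sinkFactory.get()) {
            while (records.hasNext()) {
                sink.send(records.next());
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"),
                Collections.<String>emptyList(),
                Arrays.asList("c"));
        List<String> sent = new ArrayList<>();
        int[] opened = {0};
        for (List<String> partition : partitions) {
            sendPartition(partition.iterator(), () -> {
                opened[0]++;
                return new InMemorySink(sent);
            });
        }
        // prints "sinks opened: 2, records sent: [a, b, c]"
        System.out.println("sinks opened: " + opened[0] + ", records sent: " + sent);
    }
}
```

In a real job, the body of `sendPartition` would sit inside `rdd.foreachPartition(...)`, with the sink replaced by a `KafkaProducer`.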

  • 2021-01-07 03:36

    In my example I forward events from one Kafka topic to another, with a simple word count in between: I read data from the input Kafka topic, count the words, and write the counts to an output Kafka topic. Remember that the goal is to write the results of a JavaPairDStream to an output Kafka topic using Spark Streaming.
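To make the expected output concrete, here is a small plain-Java sketch of the word count applied to one micro-batch. It does not use Spark; `BatchWordCount` and `countWords` are names made up for illustration, and the empty-token filter is an addition that the answer's code does not have. Each call starts from an empty map, which mirrors how `reduceByKey` on a DStream counts within a single batch rather than across batches.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Plain-Java illustration of the per-batch word count (no Spark involved).
class BatchWordCount {

    private static final Pattern SPACE = Pattern.compile(" ");

    // Counts the words of one micro-batch; nothing is carried over between calls.
    static Map<String, Integer> countWords(List<String> batchLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : batchLines) {
            for (String word : SPACE.split(line)) {
                if (!word.isEmpty()) { // skip empty tokens from blank lines
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(countWords(Arrays.asList("to be or", "not to be")));
        // prints {be=1, to=1}: the second batch does not see the first one
        System.out.println(countWords(Arrays.asList("to be")));
    }
}
```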

    //Spark Configuration
    SparkConf sparkConf = new SparkConf().setAppName("SendEventsToKafka");
    String brokerUrl = "localhost:9092";
    String inputTopic = "receiverTopic";
    String outputTopic = "producerTopic";
    //Create the java streaming context
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
    //Prepare the list of topics we listen for
    Set<String> topicList = new TreeSet<>();
    topicList.add(inputTopic);
    //Kafka direct stream parameters
    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", brokerUrl);
    kafkaParams.put("group.id", "kafka-cassandra" + new SecureRandom().nextInt(100));
    kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    //Kafka output topic specific properties
    Properties props = new Properties();
    props.put("bootstrap.servers", brokerUrl);
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("acks", "1");
    props.put("retries", "3");
    //Assumed key: linger.ms (how long the producer waits to batch records, in milliseconds)
    props.put("linger.ms", 5);
    //Here we create a direct stream for kafka input data.
    final JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topicList, kafkaParams));
    //Map each Kafka record to a (key, value) pair
    JavaPairDStream<String, String> results = messages
            .mapToPair(new PairFunction<ConsumerRecord<String, String>, String, String>() {
                public Tuple2<String, String> call(ConsumerRecord<String, String> record) {
                    return new Tuple2<>(record.key(), record.value());
                }
            });
    //Keep only the message value of each pair
    JavaDStream<String> lines = results.map(new Function<Tuple2<String, String>, String>() {
        public String call(Tuple2<String, String> tuple2) {
            return tuple2._2();
        }
    });
    //Split each line into words
    final Pattern SPACE = Pattern.compile(" ");
    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
        public Iterator<String> call(String x) {
            log.info("Line retrieved {}", x);
            return Arrays.asList(SPACE.split(x)).iterator();
        }
    });
    //Count the occurrences of each word within the micro-batch
    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String s) {
            log.info("Word to count {}", s);
            return new Tuple2<>(s, 1);
        }
    }).reduceByKey(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer i1, Integer i2) {
            log.info("Count with reduceByKey {}", i1 + i2);
            return i1 + i2;
        }
    });
    //Here we iterate over the JavaPairDStream to write words and their count into kafka
    wordCounts.foreachRDD(new VoidFunction<JavaPairRDD<String, Integer>>() {
        public void call(JavaPairRDD<String, Integer> arg0) throws Exception {
            Map<String, Integer> wordCountMap = arg0.collectAsMap();
            List<WordOccurence> topicList = new ArrayList<>();
            for (String key : wordCountMap.keySet()) {
                 //Here we send event to kafka output topic
                 publishToKafka(key, wordCountMap.get(key), outputTopic, props);
            }
            //Cassandra write; WordOccurence, keyspace and table are not defined in this snippet
            JavaRDD<WordOccurence> WordOccurenceRDD = jssc.sparkContext().parallelize(topicList);
            CassandraJavaUtil.javaFunctions(WordOccurenceRDD)
                    .writerBuilder(keyspace, table, CassandraJavaUtil.mapToRow(WordOccurence.class))
                    .saveToCassandra();
            log.info("Words successfully added : {}, keyspace {}, table {}", words, keyspace, table);
        }
    });
    //Start the streaming job and block until it is stopped
    jssc.start();
    jssc.awaitTermination();

    The wordCounts variable is of type JavaPairDStream<String, Integer>. I just iterate over it using foreachRDD and write to Kafka with a dedicated function:

    public static void publishToKafka(String word, Integer count, String topic, Properties props) {
        //Note: this creates (and closes) a new producer for every word
        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);
        try {
            ObjectMapper mapper = new ObjectMapper();
            String jsonInString = mapper.writeValueAsString(word + " " + count);
            String event = "{\"word_stats\":" + jsonInString + "}";
            log.info("Message to send to kafka : {}", event);
            producer.send(new ProducerRecord<String, String>(topic, event));
            log.info("Event : " + event + " published successfully to kafka!!");
        } catch (Exception e) {
            log.error("Problem while publishing the event to kafka : " + e.getMessage());
        } finally {
            //close() waits for pending sends to complete before releasing the producer
            producer.close();
        }
    }
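The message format built by publishToKafka can be checked in isolation, without Kafka or Jackson. The sketch below assumes the same `{"word_stats": ...}` envelope; `WordStatsEvents`, `quote` and `toEvent` are hypothetical names, and `quote` is a hand-rolled stand-in for `ObjectMapper.writeValueAsString` on a String that only handles the most common escapes.

```java
// Builds the event payload for one (word, count) pair without external libraries.
class WordStatsEvents {

    // Minimal JSON string quoting: escapes quotes, backslashes, newlines,
    // carriage returns and tabs. Not a complete JSON encoder.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':
                    sb.append("\\\"");
                    break;
                case '\\':
                    sb.append("\\\\");
                    break;
                case '\n':
                    sb.append("\\n");
                    break;
                case '\r':
                    sb.append("\\r");
                    break;
                case '\t':
                    sb.append("\\t");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Wraps "word count" in the word_stats envelope used by the answer.
    static String toEvent(String word, int count) {
        return "{\"word_stats\":" + quote(word + " " + count) + "}";
    }

    public static void main(String[] args) {
        // prints {"word_stats":"hello 2"}
        System.out.println(toEvent("hello", 2));
    }
}
```

Note that the value is a single JSON string ("hello 2"), not an object with separate word and count fields, so a consumer has to split it again.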

    Hope that helps!
