kafka streaming behaviour for more than one partition


Question


I am consuming from a Kafka topic. This topic has 3 partitions. I am using foreachRDD to process each batch RDD (using a processData method to process each RDD, and ultimately creating a DataSet from it).

Now, you can see that I have a count variable, and I am incrementing this count variable in the "processData" method to check how many actual records I have processed. (I understand that each RDD is a collection of Kafka topic records, and that the number depends on the batch interval size.)

Now, the output is something like this:

1 1 1 2 3 2 4 3 5 ....

This makes me think that it's because I might have 3 consumers (as I have 3 partitions), and each of these will call the "foreachRDD" method separately, so the same count is being printed more than once, as each consumer might have cached its own copy of count.

But the final output DataSet that I get has all the records.

So, does Spark internally union all the data? How does it work out what to union? I am trying to understand the behaviour, so that I can form my logic.

int count = 0;

messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<K, V>>>() {
            public void call(JavaRDD<ConsumerRecord<K, V>> rdd) {
                System.out.println("Number of elements in RDD: " + rdd.count());

                // processData returns a List<Row> for each record; reduce merges
                // all of those lists into a single list on the driver
                List<Row> rows = rdd.map(record -> processData(record))
                        .reduce((rows1, rows2) -> {
                            rows1.addAll(rows2);
                            return rows1;
                        });

                StructType schema = DataTypes.createStructType(fields);
                Dataset<Row> ds = ss.createDataFrame(rows, schema);
                ds.createOrReplaceTempView("trades");
                ds.show();
            }
        });

Answer 1:


The assumptions are not completely accurate. foreachRDD is one of the so-called output operations in Spark Streaming. The function of output operations is to schedule the provided closure at the interval dictated by the batch interval. The code in that closure executes once per batch interval on the Spark driver; it is not distributed across the cluster.

In particular, foreachRDD is a general purpose output operation that provides access to the underlying RDD within the DStream. Operations applied on that RDD will execute on the Spark cluster.

So, coming back to the code in the original question, code in the foreachRDD closure such as System.out.println("Number of elements in RDD: " + rdd.count()); executes on the driver. That's also why we can see the output in the console. Note that rdd.count() in this print will trigger a count of the RDD on the cluster, so count is a distributed operation that returns a value to the driver; then, on the driver, the print takes place.
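
A minimal sketch of where each piece runs, reusing the messages stream from the question (the lambda form is just shorthand for the same foreachRDD call):

messages.foreachRDD(rdd -> {
    // The body of this closure runs on the driver, once per batch interval.
    long total = rdd.count();  // count() runs on the cluster; only the resulting number comes back
    System.out.println("Records in this batch: " + total);  // printed in the driver's console
});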

Now comes a transformation of the RDD:

rdd.map(record -> processData(record))

As we mentioned, operations applied to the RDD will execute on the cluster. That execution follows the Spark execution model; that is, transformations are assembled into stages and applied to each partition of the underlying dataset. Given that we are dealing with a Kafka topic that has 3 partitions, we will have 3 corresponding partitions in Spark. Hence, the map that calls processData will run as one task per partition.
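
As a quick sanity check (not in the original answer; it reuses the rdd variable from the question's closure), the partition count of each batch RDD can be printed inside the same foreachRDD block:

// Expect 3 here: one Spark partition per Kafka topic partition.
System.out.println("Partitions in this RDD: " + rdd.getNumPartitions());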

So, does Spark internally union all the data? How does it work out what to union?

In the same way that we have output operations in Spark Streaming, we have actions in Spark. Actions potentially apply an operation to the data and bring the results to the driver. The simplest one is collect, which brings the complete dataset to the driver, with the risk that it might not fit in memory. Another common action, count, summarizes the number of records in the dataset and returns a single number to the driver.
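
For illustration (a sketch reusing rdd and processData from the question; the variable names are made up), both of the following are actions, i.e. they trigger distributed work on the cluster and ship their result back to the driver:

// collect() materializes the whole mapped dataset in driver memory
List<List<Row>> allRows = rdd.map(record -> processData(record)).collect();

// count() ships only a single number back to the driver
long recordCount = rdd.count();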

In the code above, we are using reduce, which is also an action: it applies the provided function and brings the resulting data to the driver. It's the use of that action that "internally unions all the data", as expressed in the question. In the reduce expression, we are actually collecting all the data that was distributed into a single local collection. It would be roughly equivalent to doing: rdd.map(record -> processData(record)).collect()

If the intention is to create a Dataset, we should avoid "moving" all the data to the driver first.

A better approach would be:

// flatMap keeps the rows distributed: each List<Row> produced by processData
// is flattened into the resulting JavaRDD<Row> on the executors
JavaRDD<Row> rows = rdd.flatMap(record -> processData(record).iterator());
Dataset<Row> df = ss.createDataFrame(rows, schema);
...

In this case, the data of each partition remains local to the executor where it is located.

Note that moving data to the driver should be avoided. It is slow, and in the case of large datasets it will probably crash the job, as the driver typically cannot hold all the data available in a cluster.



Source: https://stackoverflow.com/questions/44441068/kafka-streaming-behaviour-for-more-than-one-partition
