Question
I would like to understand how communication between the Kafka and Spark (Streaming) nodes takes place. I have the following questions.
- If the Kafka brokers and Spark nodes are in two separate clusters, how does the communication take place? What steps are needed to configure them?
- If both are in the same cluster but on different nodes, how does the communication happen?
By communication I mean: is it RPC or socket communication? I would like to understand the internal anatomy.
Any help appreciated.
Thanks in advance.
Answer 1:
First of all, it doesn't matter whether the Kafka brokers and Spark nodes are in the same cluster or not, but they must be able to connect to each other (open the relevant ports in the firewall).
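Concretely, "able to connect" means every Spark executor host must be able to open a TCP connection to every broker's listener port (9092 by default), whether or not the two clusters are separate. A minimal sketch of the consumer-side parameters you would hand to Spark Streaming; the host names here are invented placeholders:

```python
# Hypothetical broker addresses: the only hard requirement is that the
# Spark executors can open TCP connections to these host:port pairs,
# regardless of cluster boundaries.
kafka_params = {
    "metadata.broker.list": "broker1.kafka.example:9092,broker2.kafka.example:9092",
    "group.id": "spark-consumer-group",
}

# The receiver-based (older) API additionally needs the ZooKeeper
# ensemble to be reachable from the Spark cluster:
zk_quorum = "zk1.kafka.example:2181,zk2.kafka.example:2181"
```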
There are two ways to read from Kafka with Spark Streaming: the older KafkaUtils.createStream() API and the newer KafkaUtils.createDirectStream() method.
I won't go into the differences between them; that is well documented here (in short, the direct stream is better).
Addressing your question of how the communication happens (the internal anatomy): the best way to find out is to look at the Spark source code.
The createStream() API uses a set of Kafka consumers taken directly from the official org.apache.kafka packages. These Kafka consumers have their own client, called the NetworkClient, which you can check here. In short, the NetworkClient communicates over sockets.
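To make "sockets" concrete: the Kafka wire protocol is just length-prefixed binary messages exchanged over TCP. Below is a small sketch (not a real client) of how a request is framed on the wire, following the classic Kafka request header layout: an int16 api_key, int16 api_version, int32 correlation_id, and a length-prefixed client_id string, all preceded by a 4-byte big-endian size prefix.

```python
import struct

def frame(payload: bytes) -> bytes:
    """Kafka-style framing: a 4-byte big-endian length prefix, then the payload."""
    return struct.pack(">i", len(payload)) + payload

def request_header(api_key: int, api_version: int,
                   correlation_id: int, client_id: str) -> bytes:
    """Classic Kafka request header: int16 api_key, int16 api_version,
    int32 correlation_id, then client_id as an int16-length-prefixed string."""
    cid = client_id.encode("utf-8")
    return struct.pack(">hhih", api_key, api_version, correlation_id, len(cid)) + cid

# An ApiVersions request (api_key 18) as it would appear on the TCP stream:
wire_bytes = frame(request_header(18, 0, 1, "demo"))
```

Everything the NetworkClient does ultimately reduces to writing byte sequences like wire_bytes to a socket and reading length-prefixed responses back.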
The createDirectStream() API uses the Kafka SimpleConsumer (from the legacy kafka.consumer package). The SimpleConsumer class reads from Kafka through a java.nio.channels.ReadableByteChannel, an interface that java.nio.channels.SocketChannel implements, so in the end it is done with sockets as well, just a bit more indirectly, via Java's non-blocking I/O (NIO) convenience APIs.
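Java's NIO pattern (a non-blocking SocketChannel registered with a Selector) can be mirrored in a few lines with Python's selectors module. This is only an analogy of that pattern, using a local socket pair in place of a real broker connection:

```python
import selectors
import socket

# A minimal analogue of Java NIO's Selector/SocketChannel usage:
# register a non-blocking socket, wait until it is readable, then read.
sel = selectors.DefaultSelector()
server_side, client_side = socket.socketpair()

client_side.setblocking(False)  # like SocketChannel.configureBlocking(false)
sel.register(client_side, selectors.EVENT_READ)

server_side.sendall(b"fetch response")  # pretend the broker wrote a response

# The event loop: block (up to 1s) until the channel is readable.
for key, events in sel.select(timeout=1.0):
    data = key.fileobj.recv(1024)
    print(data)  # b'fetch response'

sel.unregister(client_side)
client_side.close()
server_side.close()
```

The same readiness-based loop is what lets one consumer thread multiplex connections to many brokers without blocking on any single socket.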
So to answer your question: it is done with sockets.
Source: https://stackoverflow.com/questions/36027963/spark-streaming-how-spark-and-kafka-communication-happens