Kafka: Consumer API vs Streams API

Question

I recently started learning Kafka and ended up with these questions.

What is the difference between Consumer and Stream? To me, any tool or application that consumes messages from Kafka is a consumer in the Kafka world.
How is Stream different, given that it also consumes messages from and produces messages to Kafka? And why is it needed, when we can write our own consumer application using the Consumer API and process the messages as needed, or send them to Spark from the consumer application?

I searched Google for this but did not find any good answers. Sorry if this question is too trivial.

Answer 1:

Update April 09, 2018: Nowadays you can also use KSQL, the streaming SQL engine for Kafka, to process your data in Kafka. KSQL is built on top of Kafka's Streams API, and it too comes with first-class support for "streams" and "tables". Think of it like the SQL brother of Kafka Streams where you don't have to write any programming code in Java or Scala.

What is the difference between Consumer API and Streams API?

Kafka's Streams API (https://kafka.apache.org/documentation/streams/) is built on top of Kafka's producer and consumer clients. It's significantly more powerful and also more expressive than the Kafka consumer client. Here are some of the features of the Kafka Streams API:

Supports exactly-once processing semantics (Kafka versions 0.11+)
Supports fault-tolerant stateful (as well as stateless, of course) processing including streaming joins, aggregations, and windowing. In other words, it supports management of your application's processing state out-of-the-box.
Supports event-time processing as well as processing based on processing-time and ingestion-time
Has first-class support for both streams and tables, which is where stream processing meets databases; in practice, most stream processing applications need both streams AND tables for implementing their respective use cases, so if a stream processing technology lacks either of the two abstractions (say, no support for tables) you are either stuck or must manually implement this functionality yourself (good luck with that...)
Supports interactive queries (also called 'queryable state') to expose the latest processing results to other applications and services
Is more expressive: it ships with (1) a functional programming style DSL with operations such as map, filter, reduce as well as (2) an imperative style Processor API for e.g. doing complex event processing (CEP), and (3) you can even combine the DSL and the Processor API.
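To make the "streams and tables" point from the list above a bit more concrete, here is a minimal plain-Java sketch of the idea. It deliberately does not use the Kafka Streams library, and the class name, method names, and the (user, region) records are all made up for illustration. Replaying a stream of updates and keeping the latest value per key gives you a table, and an aggregation over that table gives you a derived result. In Kafka Streams these roles are played by the KStream and KTable abstractions, which additionally take care of partitioning, persistence, and fault tolerance.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java collections standing in for a stream and a table.
class StreamTableSketch {

    // Replays a changelog-style stream of (key, value) records into a table
    // that keeps only the latest value seen for each key (upsert semantics).
    static Map<String, String> toTable(List<Map.Entry<String, String>> stream) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Map.Entry<String, String> record : stream) {
            table.put(record.getKey(), record.getValue());
        }
        return table;
    }

    // Aggregates the table: counts how many keys currently map to each value.
    static Map<String, Long> countByValue(Map<String, String> table) {
        Map<String, Long> counts = new TreeMap<>();
        for (String value : table.values()) {
            counts.merge(value, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical stream of (user, region) updates.
        List<Map.Entry<String, String>> stream = List.of(
                Map.entry("alice", "europe"),
                Map.entry("bob", "asia"),
                Map.entry("alice", "asia")); // alice moves: an update, not a new row
        Map<String, String> table = toTable(stream);
        System.out.println("table:  " + table);
        System.out.println("counts: " + countByValue(table));
    }
}
```

The stream holds three records but the table ends up with two rows, because the second record for alice overwrites the first. That difference is the reason both abstractions are useful side by side.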

See http://docs.confluent.io/current/streams/introduction.html for a more detailed but still high-level introduction to the Kafka Streams API, which should also help you to understand the differences to the lower-level Kafka consumer client. There's also a Docker-based tutorial for the Kafka Streams API, which I blogged about earlier this week.

So how is the Kafka Streams API different, given that it also consumes messages from and produces messages to Kafka?

Yes, the Kafka Streams API can both read data from and write data to Kafka.

and why is it needed as we can write our own consumer application using Consumer API and process them as needed or send them to Spark from the consumer application?

Yes, you could write your own consumer application -- as I mentioned, the Kafka Streams API uses the Kafka consumer client (plus the producer client) itself -- but you'd have to manually implement all the unique features that the Streams API provides. See the list above for everything you get "for free". It is thus rather a rare circumstance that a user would pick the low-level consumer client rather than the more powerful Kafka Streams API.
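To give a feel for what "manually implement" means, here is a rough plain-Java sketch of just one such feature: a tumbling-window count keyed by event time, the kind of logic you would have to write yourself inside a consumer poll loop. The class and its methods are hypothetical and use no Kafka library. It assumes non-negative millisecond timestamps and keeps its state only in memory, so it covers none of the hard parts (state that survives a crash, rebalancing across instances, a cutoff for late records, exactly-once output).

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a hand-rolled tumbling-window counter over in-memory state.
class ManualWindowedCounter {

    private final long windowSizeMs;
    // State: key -> (window start timestamp -> count). It lives only in this
    // process's memory, so it is lost if the application crashes.
    private final Map<String, TreeMap<Long, Long>> counts = new TreeMap<>();

    ManualWindowedCounter(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Assigns the record to a window by its event timestamp (when the event
    // happened), not by the time it happens to be processed.
    // Assumes eventTimeMs >= 0.
    void process(String key, long eventTimeMs) {
        long windowStart = eventTimeMs - (eventTimeMs % windowSizeMs);
        counts.computeIfAbsent(key, k -> new TreeMap<>())
              .merge(windowStart, 1L, Long::sum);
    }

    // Returns the current count for the given key and window, or 0 if none.
    long count(String key, long windowStart) {
        TreeMap<Long, Long> windows = counts.get(key);
        if (windows == null) {
            return 0L;
        }
        return windows.getOrDefault(windowStart, 0L);
    }

    public static void main(String[] args) {
        ManualWindowedCounter counter = new ManualWindowedCounter(60_000L); // 1-minute windows
        counter.process("page-a", 5_000L);
        counter.process("page-a", 61_000L);
        counter.process("page-a", 10_000L); // arrives out of order, still counted in the first window
        System.out.println("window starting at 0:     " + counter.count("page-a", 0L));
        System.out.println("window starting at 60000: " + counter.count("page-a", 60_000L));
    }
}
```

Even this toy version needs a decision about how to bucket timestamps and where to keep state. The Streams API makes those decisions for you and backs the state with Kafka itself.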

Answer 2:

The Kafka Streams component is built to support ETL-style message transformation: it reads an input stream from a topic, transforms it, and writes the output to another topic. It supports real-time processing and, at the same time, advanced analytic features such as aggregation, windowing, and joins.

"Kafka Streams simplifies application development by building on the Kafka producer and consumer libraries and leveraging the native capabilities of Kafka to offer data parallelism, distributed coordination, fault tolerance, and operational simplicity."

Below are the key architectural features of Kafka Streams.

Stream Partitions and Tasks: Kafka Streams uses the concepts of partitions and tasks as logical units of its parallelism model based on Kafka topic partitions.
Threading Model: Kafka Streams allows the user to configure the number of threads that the library can use to parallelize processing within an application instance.
Local State Stores: Kafka Streams provides so-called state stores, which can be used by stream processing applications to store and query data, which is an important capability when implementing stateful operations.
Fault Tolerance: Kafka Streams builds on fault-tolerance capabilities integrated natively within Kafka. Kafka partitions are highly available and replicated; so when stream data is persisted to Kafka it is available even if the application fails and needs to re-process it.
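Several of the features above are switched on through configuration rather than code. As a hedged sketch, the snippet below builds such a configuration using plain string keys, so it compiles without the Kafka jars. The keys are standard Kafka Streams settings (the StreamsConfig class defines constants for them), but the helper class, the application id, and the broker address are made up for illustration. The value "exactly_once_v2" applies to newer releases; older releases used "exactly_once", so check the documentation for your version.

```java
import java.util.Properties;

// Illustrative only: assembles the settings a Kafka Streams application
// would be started with, without depending on the Kafka libraries.
class StreamsConfigSketch {

    static Properties buildConfig(String applicationId, String bootstrapServers, int threads) {
        Properties props = new Properties();
        // Identifies the application; instances sharing this id split the work.
        props.setProperty("application.id", applicationId);
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Threading model: number of processing threads per application instance.
        props.setProperty("num.stream.threads", Integer.toString(threads));
        // Fault tolerance: keep a warm copy of each local state store on another instance.
        props.setProperty("num.standby.replicas", "1");
        // Processing guarantee; the default is "at_least_once".
        props.setProperty("processing.guarantee", "exactly_once_v2");
        return props;
    }

    public static void main(String[] args) {
        // "demo-app" and "localhost:9092" are placeholder values.
        Properties props = buildConfig("demo-app", "localhost:9092", 4);
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

In a real application this Properties object would be passed to the KafkaStreams constructor together with the topology.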

Based on my understanding, the key differences are listed below. I am open to updating this if any point is missing or misleading.

Where to use Consumer - Producer:

If there is a single consumer that consumes and processes messages but does not write the results to another topic.
Similarly to the first point, if you only have a producer publishing messages, you don't need Kafka Streams.
If you consume messages from one Kafka cluster but publish to a topic in a different Kafka cluster. You can still use Kafka Streams in that case, but you have to use a separate producer to publish messages to the other cluster. Or simply use the plain Kafka consumer and producer.
Batch processing: if there is a requirement to collect messages or do some kind of batch processing, it is fine to use the traditional consumer and producer approach.

Where to use Kafka Streams:

If you consume messages from one topic, transform them, and publish to another topic, Kafka Streams is best suited.
Real-time processing, real-time analytics, and machine learning.
Stateful transformations such as aggregations, joins, and windowing.
When you plan to use a local state store or a mounted store such as Portworx.
To achieve exactly-once processing semantics and built-in fault tolerance.

Source: https://stackoverflow.com/questions/44014975/kafka-consumer-api-vs-streams-api

Tags

apache-kafka

kafka-consumer-api

apache-kafka-streams