How to pick a Kafka transaction.id

梦谈多话 2020-12-28 17:05

I wonder if I could get some help understanding transactions in Kafka, and in particular how to use transaction.id. Here's the context:

  1. My Kafka application follow
3 Answers
  • 2020-12-28 17:41

    When using the Streams API (in contrast to the regular Kafka producers) you do not have to worry about setting a unique transactional.id per instance of your stream application. When you enable Streams exactly_once semantics, the Streams API will generate the proper/unique transactional.id based on the topic/partition (see the config sketch after the links below).

    Check out what exactly is done here: https://github.com/axbaretto/kafka/blob/fe51708ade3cdf4fe9640c205c66e3dd1a110062/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java#L455

    The Task (referring to TaskId in the code) is explained here: https://docs.confluent.io/current/streams/architecture.html#stream-partitions-and-tasks
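
    For illustration, here is a minimal sketch of what "just enable it" looks like with the Streams API. The broker address, application id and topic names below are made up, and on newer clients the constant is StreamsConfig.EXACTLY_ONCE_V2 rather than EXACTLY_ONCE:

    ```java
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class EosStreamsSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-eos-app");        // hypothetical app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
            // Enabling exactly-once: Streams derives and manages the transactional.id per task
            // internally, so there is nothing to configure per application instance.
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic").to("output-topic"); // hypothetical topic names

            new KafkaStreams(builder.build(), props).start();
        }
    }
    ```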

  • 2020-12-28 17:58

    The blog article you mentioned has all the information you're looking for, although it's rather dense.

    From the Why Transactions? section in the aforementioned article.

    Using vanilla Kafka producers and consumers configured for at-least-once delivery semantics, a stream processing application could lose exactly once processing semantics in the following ways:

    1. The producer.send() could result in duplicate writes of message B due to internal retries. This is addressed by the idempotent producer and is not the focus of the rest of this post.

    2. We may reprocess the input message A, resulting in duplicate B messages being written to the output, violating the exactly once processing semantics. Reprocessing may happen if the stream processing application crashes after writing B but before marking A as consumed. Thus when it resumes, it will consume A again and write B again, causing a duplicate.

    3. Finally, in distributed environments, applications will crash or—worse!—temporarily lose connectivity to the rest of the system. Typically, new instances are automatically started to replace the ones which were deemed lost. Through this process, we may have multiple instances processing the same input topics and writing to the same output topics, causing duplicate outputs and violating the exactly once processing semantics. We call this the problem of “zombie instances.” [emphasis added]

    From the Transactional Semantics section in the same article.

    Zombie fencing

    We solve the problem of zombie instances by requiring that each transactional producer be assigned a unique identifier called the transactional.id. This is used to identify the same producer instance across process restarts. [emphasis added]

    The API requires that the first operation of a transactional producer should be to explicitly register its transactional.id with the Kafka cluster. When it does so, the Kafka broker checks for open transactions with the given transactional.id and completes them. It also increments an epoch associated with the transactional.id. The epoch is an internal piece of metadata stored for every transactional.id.

    Once the epoch is bumped, any producers with the same transactional.id and an older epoch are considered zombies and are fenced off, i.e. future transactional writes from those producers are rejected. [emphasis added]
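
    To make the fencing concrete, here is a small sketch (not from the article) using the plain producer API: the second initTransactions() call with the same transactional.id bumps the epoch, and the older producer's next transactional operation fails with a ProducerFencedException. The broker address, transactional.id and topic name are illustrative.

    ```java
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.ProducerFencedException;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class FencingSketch {
        private static KafkaProducer<String, String> newProducer(String txId) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, txId); // same id => same logical producer
            return new KafkaProducer<>(props);
        }

        public static void main(String[] args) {
            KafkaProducer<String, String> zombie = newProducer("order-processor-p0"); // hypothetical id
            zombie.initTransactions();    // registers the id with the coordinator, epoch = N

            KafkaProducer<String, String> successor = newProducer("order-processor-p0");
            successor.initTransactions(); // completes any pending txn, bumps the epoch to N + 1

            try {
                zombie.beginTransaction(); // the zombie does not yet know it has been fenced
                zombie.send(new ProducerRecord<>("output-topic", "key", "value"));
                zombie.commitTransaction(); // rejected: old epoch => ProducerFencedException
            } catch (ProducerFencedException fenced) {
                zombie.close();             // the only sensible reaction: close and stop
            }
        }
    }
    ```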

    And from the Data flow section in the same article.

    A: the producer and transaction coordinator interaction

    When executing transactions, the producer makes requests to the transaction coordinator at the following points:

    1. The initTransactions API registers a transactional.id with the coordinator. At this point, the coordinator closes any pending transactions with that transactional.id and bumps the epoch to fence out zombies. This happens only once per producer session. [emphasis added]

    2. When the producer is about to send data to a partition for the first time in a transaction, the partition is registered with the coordinator first.

    3. When the application calls commitTransaction or abortTransaction, a request is sent to the coordinator to begin the two phase commit protocol.
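
    Put together, a consume-transform-publish loop over the plain client API follows the three steps above roughly as in this sketch; the group id, topic names and transactional.id are made up, and error handling is reduced to the essentials:

    ```java
    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.errors.ProducerFencedException;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ConsumeTransformPublishSketch {
        public static void main(String[] args) {
            Properties cProps = new Properties();
            cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");         // hypothetical group
            cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // offsets go through the txn
            cProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");   // ignore aborted writes
            cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
            consumer.subscribe(Collections.singletonList("orders"));               // hypothetical topic

            Properties pProps = new Properties();
            pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            pProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-processor-0"); // stable across restarts
            pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            KafkaProducer<String, String> producer = new KafkaProducer<>(pProps);

            producer.initTransactions(); // step 1: register the id, close pending txns, bump the epoch

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;
                try {
                    producer.beginTransaction();
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> record : records) {
                        // step 2: the first send to a partition registers it with the coordinator
                        producer.send(new ProducerRecord<>("processed-orders", record.key(), record.value()));
                        offsets.put(new TopicPartition(record.topic(), record.partition()),
                                    new OffsetAndMetadata(record.offset() + 1));
                    }
                    // the consumed offsets are committed as part of the same transaction
                    producer.sendOffsetsToTransaction(offsets, "order-processor");
                    producer.commitTransaction(); // step 3: coordinator runs the two-phase commit
                } catch (ProducerFencedException fenced) {
                    producer.close();             // another instance took over this transactional.id
                    break;
                } catch (KafkaException e) {
                    producer.abortTransaction();  // any other failure: abort and reprocess the batch
                }
            }
        }
    }
    ```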

    Hope this helps!

  • 2020-12-28 18:02

    Consider the situation where the consumer group population is in flux (new consumers are coming online or going offline) or a failure scenario causes the rebalancing of topic-partition assignments within a consumer group.

    Now assume a consumer C0 had previously been assigned partition P0. This consumer is happily chugging away and processing messages, publishing new ones, etc. (The standard consume-transform-publish pattern.) A rebalance event occurs, resulting in P0 being unceremoniously (always wanted to use that word) revoked from C0 and assigned to C1. From the perspective of C0, it might still have a backlog of messages to churn through, and it is oblivious to the reassignment. You end up in a situation where, for a very brief period of time, both C0 and C1 may believe they 'own' P0 and will act accordingly, creating duplicate messages in the outgoing topic and, worse, potentially having those duplicates appear out of order.

    The use of transactional.id enables the 'fencing' that the original blog refers to. As part of the reassignment, the new producer will act under the incremented epoch number, while the existing one will still use the old epoch. Fencing is then trivial; drop messages where the epoch has lapsed.

    There are a few gotchas with Kafka transactions:

    • The inbound and outbound topics must be on the same cluster for transactions to work.
    • The naming of transactional.id is crucial for producer 'handover', even if you don't care about zombie fencing. The emergence of the new producer will instigate the tidying up of any orphaned in-flight transactions for the lapsed producer, hence the requirement for the ID to be stable/repeatable across producer sessions. Do not use random IDs for this; not only will this lead to incomplete transactions (which block every consumer in READ_COMMITTED mode), but it will also accumulate additional state on the transaction coordinator (running on the broker). By default, this state will be persisted for 7 days, so you don't want to spawn arbitrarily named transactional producers on a whim.
    • Ideally, transactional.id reflects the combination of both the inbound topic and partition. (Unless, of course, you have a single-partition topic.) In practice, this means creating a new transactional producer for every partition assigned to the consumer; see the sketch after this list. (Remember, in a consume-transform-publish scenario, a producer is also a consumer, and consumer partition assignments will vary with each rebalancing event.) Have a look at the spring-kafka implementation, which lazily creates a new producer for each inbound partition. (There is something to be said about the safety of this approach, and whether producers should be cleaned up on partition reassignment, but that's another matter.)
    • The fencing mechanism only operates at Kafka level. In other words, it isolates the lapsed producer from Kafka, but not from the rest of the world. This means that if your producer also has to update some external state (in a database, cache, etc.) as part of the consume-transform-publish cycle, it is the responsibility of the application to fence itself from the database upon partition reassignment, or otherwise ensure the idempotency of the update.
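
    As a sketch of the per-partition approach above (the class name, broker address and id scheme are illustrative, not a library API), a producer can be created lazily for each inbound topic-partition, with its transactional.id derived from that partition:

    ```java
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringSerializer;

    /** Keeps one transactional producer per assigned inbound partition. */
    public class PartitionProducers {
        private final Map<TopicPartition, KafkaProducer<String, String>> producers = new HashMap<>();
        private final String applicationId; // e.g. "order-processor" (hypothetical)

        public PartitionProducers(String applicationId) {
            this.applicationId = applicationId;
        }

        /** Lazily creates a producer whose transactional.id encodes the inbound topic-partition. */
        public KafkaProducer<String, String> producerFor(TopicPartition inbound) {
            return producers.computeIfAbsent(inbound, tp -> {
                Properties props = new Properties();
                props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
                props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
                props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
                // Stable and repeatable across restarts: whichever instance picks up this partition
                // next re-registers the same id, fencing (and tidying up after) its predecessor.
                props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG,
                          applicationId + "-" + tp.topic() + "-" + tp.partition());
                KafkaProducer<String, String> producer = new KafkaProducer<>(props);
                producer.initTransactions();
                return producer;
            });
        }
    }
    ```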

    Just for completeness, it's worth pointing out that this is not the only way to achieve fencing. The Kafka consumer API does provide the user the ability to register a ConsumerRebalanceListener, which gives the displaced consumer a last chance to drain (or shed) any outstanding backlog before the partitions are reassigned to the new consumer. The callback is blocking; when it returns, it is assumed that the handler has fenced itself off locally. Then, and only then, will the new consumer resume processing.
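
    A minimal sketch of such a listener, with the topic name hypothetical and the actual draining left as an application-specific placeholder:

    ```java
    import java.util.Collection;
    import java.util.Collections;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class DrainOnRevoke {
        public static void subscribe(KafkaConsumer<String, String> consumer) {
            consumer.subscribe(Collections.singletonList("orders"), // hypothetical topic
                    new ConsumerRebalanceListener() {
                        @Override
                        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                            // Blocking callback: finish (or shed) in-flight work and commit/abort any
                            // local transactions for these partitions before the new owner takes over.
                            // drainBacklog(partitions); // application-specific, not shown
                        }

                        @Override
                        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                            // Only runs after every revoked-partitions handler has returned.
                        }
                    });
        }
    }
    ```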
