Kafka Streams - How to scale Kafka store generated changelog topics

Question

I have multiple redundant app instances that each need to consume all the events of a topic and store them independently for on-disk lookup (via RocksDB).

For the sake of argument, let's assume these redundant consumers serve stateless HTTP requests, so the load is not shared using Kafka; rather, Kafka is used to replicate data from a producer into each instance's local store.

Looking at the generated topics, each consuming app created 3 extra topics:

{topicname}STATE-STORE-0000000000-changelog
{application-name}-{storename}-changelog
{application-name}-{storename}-repartition

But each of these generated topics is as big as the compacted view of the original topic, meaning each consuming store multiplies the size of the original topic (which was already compacted) by 3.

Why does the Kafka store require these 3 topics? Couldn't we simply configure the stream to reload from the last consumed offset when reconciling the on-disk store?
Is the idea that each instance of the redundant consuming apps gets its own unique set of 3 "store generated topics", or should they be configured to share the same set of changelog topics? In other words, should they share the same applicationId, or not, since each of them needs to consume all the events of all the partitions?

In short, I am concerned about storage scalability as we grow the number of consuming apps, each of which would spawn more changelog topics...

Here is the code that creates the store:

public class ProgramMappingEventStoreFactory {
  private static final Logger logger = Logger.getLogger(ProgramMappingEventStoreFactory.class.getName());
  private final static String STORE_NAME = "program-mapping-store";
  private final static String APPLICATION_NAME = "epg-mapping-catalog_program-mapping";

  public static ReadOnlyKeyValueStore<ProgramMappingEventKey, ProgramMappingEvent> newInstance(String kafkaBootstrapServerUrl,
                                                                                               String avroRegistryUrl,
                                                                                               String topic,
                                                                                               String storeDirectory)
  {
    Properties kafkaConfig = new KafkaConfigBuilder().withBootstrapServers(kafkaBootstrapServerUrl)
                                                     .withSchemaRegistryUrl(avroRegistryUrl)
                                                     .withApplicationId(createApplicationId(APPLICATION_NAME))
                                                     .withGroupId(UUID.randomUUID().toString())
                                                     .withClientId(UUID.randomUUID().toString())
                                                     .withDefaultKeySerdeClass(SpecificAvroSerde.class)
                                                     .withDefaultValueSerdeClass(SpecificAvroSerde.class)
                                                     .withStoreDirectory(storeDirectory)
                                                     .build();

    StreamsBuilder streamBuilder = new StreamsBuilder();
    bootstrapStore(streamBuilder, topic);
    KafkaStreams streams = new KafkaStreams(streamBuilder.build(), kafkaConfig);
    streams.start();
    try {
      return getStoreAndBlockUntilQueryable(STORE_NAME,
                                            QueryableStoreTypes.keyValueStore(),
                                            streams);
    } catch (InterruptedException e) {
      throw new IllegalStateException("Failed to create the LiveMediaPolicyIdStore", e);
    }
  }

  private static <T> T getStoreAndBlockUntilQueryable(String storeName,
                                                      QueryableStoreType<T> queryableStoreType,
                                                      KafkaStreams streams)
    throws InterruptedException
  {
    while (true) {
      try {
        return streams.store(storeName, queryableStoreType);
      } catch (InvalidStateStoreException ignored) {
        Thread.sleep(100);
      }
    }
  }

  private static void bootstrapStore(StreamsBuilder builder, String topic) {
    KTable<ProgramMappingEventKey, ProgramMappingEvent> table = builder.table(topic);

    table.groupBy((k, v) -> KeyValue.pair(k, v)).reduce((newValue, aggValue) -> newValue,
                                                        (newValue, aggValue) -> null,
                                                        Materialized.as(STORE_NAME));

  }

  private static String createApplicationId(String applicationName) {
    try {
      return String.format("%s-%s", applicationName, InetAddress.getLocalHost().getHostName());
    } catch (UnknownHostException e) {
      logger.warning(() -> "Failed to find the hostname, generating a unique applicationId");
      return String.format("%s-%s", applicationName, UUID.randomUUID());
    }
  }

}

Answer 1:

If you want to load the same state into multiple instances, you should use a GlobalKTable (builder.globalTable()) and a single application.id shared by all instances.
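As a rough illustration of that suggestion, here is a minimal sketch of a topology that materializes the topic as a global table. The class name GlobalStoreSketch, the helper buildTopology, and the topic name "program-mapping" are made up for this example; the store name is taken from the question, and the key/value serdes are assumed to come from the default serde configuration, as in the original code. To my knowledge a global store is restored from the source topic itself, so this should not generate extra changelog or repartition topics.

```java
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class GlobalStoreSketch {

  // Hypothetical helper: builds a topology whose only element is a global table
  // backed by a named, queryable key-value store. Serdes are not set here, so the
  // defaults from the streams configuration are assumed.
  static <K, V> Topology buildTopology(String topic, String storeName) {
    StreamsBuilder builder = new StreamsBuilder();
    builder.globalTable(topic, Materialized.<K, V, KeyValueStore<Bytes, byte[]>>as(storeName));
    return builder.build();
  }

  public static void main(String[] args) {
    // "program-mapping" is an assumed topic name, used only for illustration.
    Topology topology = GlobalStoreSketch.<String, String>buildTopology("program-mapping", "program-mapping-store");
    // Prints the topology description; no broker connection is needed for this step.
    System.out.println(topology.describe());
  }
}
```

The resulting Topology can be passed to new KafkaStreams(topology, kafkaConfig) with the same application.id on every instance, and the store can then be queried through streams.store(...) in the same way as in the question's getStoreAndBlockUntilQueryable.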

If you use a KTable, the data is partitioned across instances, which forces you to use a different application.id for each instance. This can be considered an anti-pattern.

I am also not sure why you do groupBy((k, v) -> KeyValue.pair(k, v)).reduce(); this results in an unnecessary repartition topic.
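If a partitioned KTable is kept anyway, one way to drop that step is to materialize the table directly under the store name. The sketch below is only an illustration: DirectTableSketch, buildTopology, and the topic name "program-mapping" are hypothetical names, and default serdes are assumed. It removes the groupBy/reduce stage and with it the repartition topic; whether a separate changelog topic is still created for the store depends on the Kafka Streams version and configuration.

```java
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class DirectTableSketch {

  // Hypothetical helper: reads the topic as a KTable and materializes it directly
  // into a named store, without any groupBy/reduce stage in between.
  static <K, V> Topology buildTopology(String topic, String storeName) {
    StreamsBuilder builder = new StreamsBuilder();
    builder.table(topic, Materialized.<K, V, KeyValueStore<Bytes, byte[]>>as(storeName));
    return builder.build();
  }

  public static void main(String[] args) {
    // "program-mapping" is an assumed topic name, used only for illustration.
    Topology topology = DirectTableSketch.<String, String>buildTopology("program-mapping", "program-mapping-store");
    // The printed description should list a source and a table processor, with no sink topic.
    System.out.println(topology.describe());
  }
}
```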

As for the changelog topics generated for the table() operator, there is a known bug in the 1.0 and 1.1 releases if StreamsBuilder is used (KStreamBuilder is not affected). It's fixed in the 2.0 release (https://issues.apache.org/jira/browse/KAFKA-6729).

Source: https://stackoverflow.com/questions/50936571/kafka-streams-how-to-scale-kafka-store-generated-changelog-topics

Tags

apache-kafka

apache-kafka-streams