Flush size when using kafka-connect-transform-archive with HdfsSinkConnector


Question


I have data in a Kafka topic which I want to preserve on my data lake.

Before worrying about the keys, I was able to save the Avro values in files on the data lake using the HdfsSinkConnector. The number of message values in each file was determined by the "flush.size" property of the HdfsSinkConnector.
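For reference, the relevant sink properties look roughly like this (topic, HDFS URL and flush.size here are placeholder values, not my actual settings):

"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"topics": "my-topic",
"hdfs.url": "hdfs://namenode:8020",
"format.class": "io.confluent.connect.hdfs.avro.AvroFormat",
"flush.size": "1000",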

All good. Next I wanted to preserve the keys as well. To do this I used kafka-connect-transform-archive, which wraps the String key and Avro value into a new Avro schema.

This works great, except that the flush.size for the HdfsSinkConnector is now being ignored: each file saved in the data lake contains exactly one message.

So, the two cases are 1) save values only, with the number of values in each file determined by flush.size, and 2) save keys and values, with each file containing exactly one message and flush.size being ignored.

The only difference between the two situations is the configuration for the HdfsSinkConnector, which specifies the archive transform.

"transforms": "tran",
"transforms.tran.type": "com.github.jcustenborder.kafka.connect.archive.Archive"

Does the kafka-connect-transform-archive ignore flush.size by design, or is there some additional configuration I need in order to save multiple key/value messages per file on the data lake?


Answer 1:


I had the same problem when using the Kafka GCS sink connector.

In the com.github.jcustenborder.kafka.connect.archive.Archive code, a new Schema is created for every message. Because storage sink connectors like the HDFS one rotate to a new output file whenever they detect a schema change, each record ends up in its own file and flush.size is never reached.

private R applyWithSchema(R r) {
    final Schema schema = SchemaBuilder.struct()
        .name("com.github.jcustenborder.kafka.connect.archive.Storage")
        .field("key", r.keySchema())
        .field("value", r.valueSchema())
        .field("topic", Schema.STRING_SCHEMA)
        .field("timestamp", Schema.INT64_SCHEMA);
    Struct value = new Struct(schema)
        .put("key", r.key())
        .put("value", r.value())
        .put("topic", r.topic())
        .put("timestamp", r.timestamp());
    return r.newRecord(r.topic(), r.kafkaPartition(), null, null, schema, value, r.timestamp());
}

If you look at the Kafka InsertField$Value transform, you will see that it uses a SynchronizedCache in order to retrieve the same schema every time.

https://github.com/axbaretto/kafka/blob/ba633e40ea77f28d8f385c7a92ec9601e218fb5b/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/InsertField.java#L170

So, you just need to create the schema once (outside the apply function), or reuse the same SynchronizedCache approach.
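A minimal sketch of that second option, written as a standalone transform (the class name CachedArchive is made up here; only the caching pattern is taken from InsertField, and it assumes records carry key and value schemas, e.g. Avro):

import java.util.Map;

import org.apache.kafka.common.cache.Cache;
import org.apache.kafka.common.cache.LRUCache;
import org.apache.kafka.common.cache.SynchronizedCache;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

public class CachedArchive<R extends ConnectRecord<R>> implements Transformation<R> {

    // One wrapping schema per distinct input value schema, shared across records,
    // so the sink no longer sees a "schema change" on every message.
    private Cache<Schema, Schema> schemaCache;

    @Override
    public void configure(Map<String, ?> configs) {
        schemaCache = new SynchronizedCache<>(new LRUCache<Schema, Schema>(16));
    }

    @Override
    public R apply(R r) {
        Schema schema = schemaCache.get(r.valueSchema());
        if (schema == null) {
            // Built only the first time this value schema is seen.
            schema = SchemaBuilder.struct()
                .name("com.github.jcustenborder.kafka.connect.archive.Storage")
                .field("key", r.keySchema())
                .field("value", r.valueSchema())
                .field("topic", Schema.STRING_SCHEMA)
                .field("timestamp", Schema.INT64_SCHEMA)
                .build();
            schemaCache.put(r.valueSchema(), schema);
        }
        Struct value = new Struct(schema)
            .put("key", r.key())
            .put("value", r.value())
            .put("topic", r.topic())
            .put("timestamp", r.timestamp());
        return r.newRecord(r.topic(), r.kafkaPartition(), null, null, schema, value, r.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void close() {
    }
}

Keying the cache on the record's value schema mirrors what InsertField does; if the key schema can also vary per record, you would need to key the cache on both schemas instead.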



Source: https://stackoverflow.com/questions/55865349/flush-size-when-using-kafka-connect-transform-archive-with-hdfssinkconnector
