From Postgres to Kafka with change tracking


Question


This question follows this one.

The main task is to perform joins on the KSQL side. The example below illustrates it. Incident messages arrive in a Kafka topic. The structure of those messages:

[
    {
        "name": "from_ts", 
        "type": "bigint"
    },
    {
        "name": "to_ts", 
        "type": "bigint"
    },
    {
        "name": "rulenode_id",
        "type": "int"
    }
]

And there is a Postgres table rulenode:

id | name | description 

Data from both sources needs to be joined on rulenode_id = rulenode.id so as to get a single record with the fields from_ts, to_ts, rulenode_id, rulenode_name, rulenode_description.
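
For illustration (hypothetical values), an incident message

{"from_ts": 1585000000000, "to_ts": 1585000600000, "rulenode_id": 7}

joined with the rulenode row (7, 'cpu_spike', 'Fires on CPU spikes') should yield

{"from_ts": 1585000000000, "to_ts": 1585000600000, "rulenode_id": 7, "rulenode_name": "cpu_spike", "rulenode_description": "Fires on CPU spikes"}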

I want to do this by means of KSQL rather than in the backend, as it is done now.

Right now, data from the Postgres table is transferred to Kafka by JdbcSourceConnector. But there is one little problem: as you might guess, the data in the Postgres table may change, and of course these changes should appear in the KSQL stream or table too.

In the comments I was asked why a KTable and not a KStream. Well, please visit this page and look at the first GIF: there, the table's records are updated as new data arrives. That behaviour is what I need (except that instead of the names Alice and Bob I have the primary key id of the Postgres table rulenode). That's why I chose a KTable.

The bulk mode of JdbcSourceConnector copies the whole table on every poll. So, as you can see, all rows arrive in the Kafka topic on top of the previous snapshots of the Postgres table.


As suggested, I created a connector with this config:

{
  "name": "from-pg",
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "errors.log.enable": "true",
  "connection.url": "connection.url",
  "connection.user": "postgres",
  "connection.password": "*************",
  "table.whitelist": "rulenode",
  "mode": "bulk",
  "poll.interval.ms": "5000",
  "topic.prefix": "pg."
}

Then created a stream:

create stream rulenodes 
    with (kafka_topic='pg.rules_rulenode', value_format='avro', key='id');

and now trying to create a table:

create table rulenodes_unique 
    as select * from rulenodes;

but that didn't work with error:

Invalid result type. Your SELECT query produces a STREAM. Please use CREATE STREAM AS SELECT statement instead.

I read that tables are used to store aggregated info, for example aggregates computed with the COUNT function:

create table rulenodes_unique 
    as select id, count(*) from rulenodes group by id;

Can you say please how to handle that error?


Answer 1:


You can create a STREAM or a TABLE on top of a Kafka topic with ksqlDB - it's to do with how you want to model the data. From your question it is clear that you need to model it as a table (because you want to join to the latest version of a key). So you need to do this:

create table rulenodes 
    with (kafka_topic='pg.rules_rulenode', value_format='avro');

Now there is one more thing you have to do: ensure that the data in your topic is correctly keyed. You cannot just specify key='id' and have it happen automagically - the KEY parameter is only a hint. You must make sure that the messages in the Kafka topic have the id field in their key. See the reference docs for full details.
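
To check how the messages are currently keyed, you can print the topic from ksqlDB; each row is shown with its key, which should be the id value rather than null:

PRINT 'pg.rules_rulenode' FROM BEGINNING;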

You can do this with a Single Message Transform in Kafka Connect:

"transforms":"createKey,extractInt",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id",
"transforms.extractInt.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field":"id"

Or you can do it in ksqlDB by changing the key - and because we want to process every event, we first model the topic as a stream (!) and then declare the table over the re-keyed topic:

create stream rulenodes_source 
    with (kafka_topic='pg.rules_rulenode', value_format='avro');

CREATE STREAM RULENODES_REKEY AS SELECT * FROM rulenodes_source PARTITION BY id;

CREATE TABLE rulenodes WITH (kafka_topic='RULENODES_REKEY', value_format='avro');

I would go the Single Message Transform route because it is neater and simpler overall.
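
For completeness, once rulenodes exists as a table, the join described in the question could be written along these lines (a sketch; the incidents stream name and its declaration are assumptions, since the question only shows the message structure):

-- stream-table join: each incident is enriched with the latest rulenode row for its key
CREATE STREAM incidents_enriched AS
    SELECT i.from_ts,
           i.to_ts,
           i.rulenode_id,
           r.name AS rulenode_name,
           r.description AS rulenode_description
    FROM incidents i
    LEFT JOIN rulenodes r ON i.rulenode_id = r.id;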




Answer 2:


It's not clear which statement throws the error, but the message is misleading if it comes from the table definition.

You can create tables from topics directly; there is no need to go through a stream.

https://docs.confluent.io/current/ksql/docs/developer-guide/create-a-table.html
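
For example, with the columns from the question's rulenode table, a table could be declared straight over the topic (a sketch, using the KEY property syntax of the KSQL version referenced here):

CREATE TABLE rulenodes (
    id INT,
    name VARCHAR,
    description VARCHAR
) WITH (KAFKA_TOPIC='pg.rules_rulenode', VALUE_FORMAT='AVRO', KEY='id');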

If you want to go through a stream as well, then, as the docs say:

Use the CREATE TABLE AS SELECT statement to create a table with query results from an existing table or stream.

You may want to use consistent, case-sensitive values in the statements:

CREATE STREAM rulenodes WITH (
    KAFKA_TOPIC ='pg.rules_rulenode', 
    VALUE_FORMAT='AVRO', 
    KEY='id'
);


CREATE TABLE rulenodes_unique AS
    SELECT id, COUNT(*) FROM rulenodes 
    GROUP BY id;


Source: https://stackoverflow.com/questions/60491786/from-postgres-to-kafka-with-changes-tracking
