ClickHouse: Usage of hash and internal_replication in Distributed & Replicated tables


Question


I have read the following in the Distributed engine documentation about the internal_replication setting:

If this parameter is set to ‘true’, the write operation selects the first healthy replica and writes data to it. Use this alternative if the Distributed table “looks at” replicated tables. In other words, if the table where data will be written is going to replicate them itself.

If it is set to ‘false’ (the default), data is written to all replicas. In essence, this means that the Distributed table replicates data itself. This is worse than using replicated tables, because the consistency of replicas is not checked, and over time they will contain slightly different data.
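For context, internal_replication is not a table-level setting; it is configured per shard in the cluster definition (remote_servers) in the server config. A minimal sketch of such a definition, assuming a cluster named my-cluster with two shards of two replicas each; hostnames and ports are placeholders:

<remote_servers>
    <my-cluster>
        <shard>
            <!-- let the ReplicatedMergeTree tables replicate themselves;
                 the Distributed table writes to one healthy replica only -->
            <internal_replication>true</internal_replication>
            <replica><host>ch-s1-r1</host><port>9000</port></replica>
            <replica><host>ch-s1-r2</host><port>9000</port></replica>
        </shard>
        <shard>
            <internal_replication>true</internal_replication>
            <replica><host>ch-s2-r1</host><port>9000</port></replica>
            <replica><host>ch-s2-r2</host><port>9000</port></replica>
        </shard>
    </my-cluster>
</remote_servers>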

I am using the typical Kafka engine with materialized view (MV) setup, plus Distributed tables. I have a cluster of instances with ReplicatedReplacingMergeTree tables and Distributed tables on top of them, as you can see below:


CREATE TABLE IF NOT EXISTS pageviews_kafka (
  -- .. fields
) ENGINE = Kafka
SETTINGS
  kafka_broker_list = '%%BROKER_LIST%%',
  kafka_topic_list = 'pageviews',
  kafka_group_name = 'clickhouse-%%DATABASE%%-pageviews',
  kafka_format = 'JSONEachRow',
  kafka_row_delimiter = '\n';

CREATE TABLE IF NOT EXISTS pageviews (
   -- fields
) ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/%%DATABASE%%/pageviews', '{replica}', processingTimestampNs)
PARTITION BY toYYYYMM(dateTime)
ORDER BY (clientId, toDate(dateTime), userId, pageviewId);

CREATE TABLE IF NOT EXISTS pageviews_d AS pageviews ENGINE = Distributed('my-cluster', %%DATABASE%%, pageviews, sipHash64(toString(pageviewId)));

CREATE MATERIALIZED VIEW IF NOT EXISTS pageviews_mv TO pageviews_d AS
SELECT
  -- fields
FROM pageviews_kafka;

Questions:

  1. I am using the default value of internal_replication in the Distributed table, which is false. Does this mean that the Distributed table is writing all data to all replicas? So, if I set internal_replication to true, will each instance of ReplicatedReplacingMergeTree have only its share of the whole table instead of the whole dataset, hence optimizing data storage? If that is the case, wouldn't replication be compromised too - how can you define a certain number of replicas?

  2. I am using the id of the entity as the sharding key. In the ClickHouse Kafka Engine FAQ by Altinity, under the question "Q. How can I use a Kafka engine table in a cluster?", I read the following:

Another possibility is to flush data from a Kafka engine table into a Distributed table. It requires more careful configuration, though. In particular, the Distributed table needs to have some sharding key (not a random hash). This is required in order for the deduplication of ReplicatedMergeTree to work properly. Distributed tables will retry inserts of the same block, and those can be deduped by ClickHouse.

However, I am using a semi-random hash here (the entity id, the idea being that different copies of the same entity instance - a pageview, in this example - are grouped together). What is the actual problem with it? Why is it discouraged?


Answer 1:


I am using the default value of internal_replication in the Distributed table, which is false.

You SHOULD NOT. It MUST BE TRUE. You have been lucky: the data has not been duplicated yet only because of insert deduplication. Eventually it will be duplicated, because your Distributed table performs two identical inserts into the two replicas, while the replicated table also replicates the inserted data to the other replica (in your case the second replica skips the insert from the Distributed table only because you are lucky and the identical blocks get deduplicated).

then each instance of ReplicatedReplacingMergeTree will have only its share of the whole table

You are mistaken.

Distributed (internal_replication=true) inserts into ALL SHARDS.

Distributed (internal_replication=false) inserts into ALL SHARDS + into ALL REPLICAS.
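
You can verify the fan-out targets by inspecting the cluster layout. A quick check against the standard system.clusters table (cluster name as in the DDL above):

-- Each (shard_num, replica_num) row is a host the Distributed table can
-- write to. With internal_replication = false an insert goes to every
-- replica of every shard; with true, to one healthy replica per shard.
SELECT cluster, shard_num, replica_num, host_name, is_local
FROM system.clusters
WHERE cluster = 'my-cluster';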


It requires more careful configuration

I am using a semi-random hash here

It requires more careful configuration, and you did exactly that by using sipHash64(toString(pageviewId)).

You made the row-to-shard mapping stable: if an insert is repeated, the same rows go to the same shard, because the shard number for a row is calculated from pageviewId, not rand().
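
A quick illustration of that determinism (a two-shard cluster is assumed, so the modulo 2 stands in for ClickHouse dividing the sharding expression by the total shard weight):

-- The same pageviewId always hashes to the same shard, so a retried
-- insert reproduces the same per-shard blocks, which ReplicatedMergeTree
-- can deduplicate; rand() would scatter retries across shards instead.
SELECT
    id AS pageviewId,
    sipHash64(toString(id)) % 2 AS shard
FROM (SELECT arrayJoin(['pv-001', 'pv-002', 'pv-001']) AS id);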



Source: https://stackoverflow.com/questions/62024754/clickhouse-usage-of-hash-and-internal-replication-in-distributed-replicated-t
