apache-kafka

What is the difference between Kafka partitions and Kafka replicas?

Submitted by 时光毁灭记忆、已成空白 on 2021-02-05 09:21:46
Question: I created a 3-broker Kafka setup with broker ids 20, 21, and 22, then created this topic:

bin/kafka-topics.sh --zookeeper localhost:2181 \
  --create --topic zeta --partitions 4 --replication-factor 3

which resulted in:

When a producer sends the message "hello world" to topic zeta, to which partition does Kafka first write the message? Does the "hello world" message get replicated in all 4 partitions? Does each of the 3 brokers contain all 4 partitions? How is that related to replica
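A message is written to exactly one of the 4 partitions; replication-factor 3 then copies that single partition's log to 3 brokers, so replicas duplicate a partition across brokers rather than spreading a message across partitions. A minimal producer sketch illustrating partition choice, assuming kafka-python and a placeholder broker address (neither appears in the question):

```python
from kafka import KafkaProducer

# Assumed broker address; the question's brokers have ids 20, 21, 22.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# No key: the default partitioner spreads records across partitions 0-3,
# but each record still lands in only ONE partition.
producer.send("zeta", b"hello world")

# With a key: hash(key) % num_partitions pins the record to one partition,
# so all records sharing a key stay together.
producer.send("zeta", key=b"user-42", value=b"hello world")
producer.flush()
```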

Spark structured streaming with kafka leads to only one batch (Pyspark)

Submitted by 扶醉桌前 on 2021-02-05 08:47:26
Question: I have the following code and I'm wondering why it generates only one batch:

df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "IP").option("subscribe", "Topic").option("startingOffsets", "earliest").load()
# groupby on sliding windows
query = slidingWindowsDF.writeStream.queryName("bla").outputMode("complete").format("memory").start()

The application is launched with the following parameters:

spark.streaming.backpressure.initialRate 5
spark.streaming.backpressure
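One plausible explanation, offered as a hedged sketch rather than as the post's accepted answer: the spark.streaming.backpressure.* settings apply only to the old DStream API and are ignored by Structured Streaming, so with startingOffsets set to "earliest" the first micro-batch reads the entire backlog at once. In Structured Streaming, batch size from Kafka is capped with maxOffsetsPerTrigger:

```python
# Same reader as the question, plus maxOffsetsPerTrigger; "IP" and "Topic"
# are the question's own placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "IP")
      .option("subscribe", "Topic")
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", 10000)  # at most 10k offsets per micro-batch
      .load())
```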

ClickHouse JSON parse exception: Cannot parse input: expected ',' before

Submitted by 别等时光非礼了梦想. on 2021-02-05 08:34:06
Question: I'm trying to add JSON data to ClickHouse from Kafka. Here's the simplified JSON:

{ ... "sendAddress": { "sendCommChannelTypeId": 4, "sendCommChannelTypeCode": "SMS", "sendAddress": "789345345945" }, ... }

Here are the steps: create a table in ClickHouse, create another table using the Kafka engine, create a MATERIALIZED VIEW connecting the two tables, and connect ClickHouse with Kafka. Creating the first table:

CREATE TABLE tab ( ... sendAddress Tuple (sendCommChannelTypeId Int32,
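One possible workaround, purely my assumption rather than anything from the post: flatten the nested sendAddress object on the producer side, so the Kafka-engine table only parses flat JSONEachRow fields instead of a Tuple. The client library (kafka-python), topic name, and flattened field names below are all hypothetical:

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")  # placeholder broker

def flatten(event: dict) -> dict:
    """Lift the nested sendAddress object into top-level scalar fields."""
    addr = event.pop("sendAddress", {})
    event["sendCommChannelTypeId"] = addr.get("sendCommChannelTypeId")
    event["sendCommChannelTypeCode"] = addr.get("sendCommChannelTypeCode")
    event["sendAddressValue"] = addr.get("sendAddress")
    return event

event = {"sendAddress": {"sendCommChannelTypeId": 4,
                         "sendCommChannelTypeCode": "SMS",
                         "sendAddress": "789345345945"}}
producer.send("ch_topic", json.dumps(flatten(event)).encode("utf-8"))
producer.flush()
```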

How to distribute data evenly in Kafka producing messages through Spark?

Submitted by 依然范特西╮ on 2021-02-05 08:10:45
Question: I have a streaming job that writes data into Kafka, and I've noticed that one of the Kafka partitions (#3) takes more data than the others:

+-----------+-----------+-----------------+-------------+
| partition | messages  | earliest offset | next offset |
+-----------+-----------+-----------------+-------------+
| 1         | 166522754 | 5861603324      | 6028126078  |
| 2         | 152251127 | 6010226633      | 6162477760  |
| 3         | 382935293 | 6332944925      | 6715880218  |
| 4         | 188126274 | 6171311709      | 6359437983  |
| 5         | 188270700 | 6100140089      |
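A common cause of this skew is an uneven message key, since the default partitioner maps each key hash to a fixed partition. A hedged PySpark sketch of one remedy (the broker, topic, and choice of a null key are my assumptions, not the post's answer): records written with a null key are spread across partitions by Kafka's partitioner instead of being pinned by key.

```python
# Write each row as JSON with a null key so Kafka balances partitions itself.
# Broker and topic names are placeholders.
(df.selectExpr("CAST(null AS STRING) AS key", "to_json(struct(*)) AS value")
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "broker1:9092")
   .option("topic", "events")
   .save())
```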

Kafka 1.1.0 keeps getting partition leader epoch warnings

Submitted by 巧了我就是萌 on 2021-02-05 07:29:04
Question: I have a problem with Kafka. I upgraded Kafka from version 0.11.0.1 to 1.1.0. After the upgrade, I keep getting the warning below:

[2018-06-19 13:34:45,377] WARN Received a PartitionLeaderEpoch assignment for an epoch < latestEpoch. This implies messages have arrived out of order. New: {epoch:0, offset:350280659}, Current: {epoch:4, offset:126401625} for Partition: __consumer_offsets-48 (kafka.server.epoch.LeaderEpochFileCache)
[2018-06-19 13:34:45,386] WARN Received a

What does a dash represent in CURRENT-OFFSET?

Submitted by 帅比萌擦擦* on 2021-02-04 21:15:47
Question: Referring to the consumer-group description in the screenshot below, I am trying to understand what "-" means for CURRENT-OFFSET. Does it mean that no messages have been consumed from partitions 1 and 3, even though those partitions are allocated to a consumer? The LOG-END-OFFSET for partitions 1 and 3 is 281 and 277, respectively.

Answer 1: CURRENT-OFFSET means the current max offset of the consumed messages of the partition for this consumer instance, whereas LOG-END-OFFSET is the offset of the latest message in
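In practice, "-" appears when the group has not yet committed an offset for that partition. The same check can be made in code; this sketch assumes kafka-python and placeholder group/topic names, whereas the post only uses the CLI:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(group_id="my-group",            # placeholder group
                         bootstrap_servers="localhost:9092",
                         enable_auto_commit=False)
tp = TopicPartition("my-topic", 1)                       # placeholder topic
# committed() returns None until the group commits an offset for partition 1,
# which is exactly when the CLI shows "-" in CURRENT-OFFSET.
print(consumer.committed(tp))
```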

Consuming multiple Kafka topics with a regex

Submitted by 女生的网名这么多〃 on 2021-02-04 19:31:26
Question:

consumer.subscribe(Pattern.compile(".*"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> clctn) {
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> clctn) {
    }
});

How do I consume all topics with a regex in apache/kafka? I tried the code above, but it didn't work.

Answer 1: For regex use the following signature: KafkaConsumer.subscribe(Pattern pattern, ConsumerRebalanceListener listener). E.g. the following code snippet enables the
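For comparison, the same regex subscription in Python; kafka-python, the broker address, and the group id are my substitutions for the question's Java client:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",  # placeholder broker
                         group_id="all-topics-group")         # placeholder group
consumer.subscribe(pattern=".*")  # matches every topic the group may read

for record in consumer:
    print(record.topic, record.partition, record.offset)
```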

How to stream data from Kafka topic to Delta table using Spark Structured Streaming

Submitted by 纵饮孤独 on 2021-02-04 18:09:05
Question: I'm trying to understand Databricks Delta and am planning a POC using Kafka. The plan is to consume data from Kafka and insert it into a Databricks Delta table. These are the steps that I did:

Create a Delta table on Databricks:

%sql
CREATE TABLE hazriq_delta_trial2 (
  value STRING
)
USING delta
LOCATION '/delta/hazriq_delta_trial2'

Consume data from Kafka:

import org.apache.spark.sql.types._
val kafkaBrokers = "broker1:port,broker2:port,broker3:port"
val kafkaTopic = "kafkapoc"
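A hedged sketch of the remaining step, written in PySpark rather than the post's Scala; the checkpoint path is a placeholder I introduced, while the broker and topic values come from the post:

```python
# Stream the raw Kafka value column straight into the Delta location.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:port,broker2:port,broker3:port")
      .option("subscribe", "kafkapoc")
      .load()
      .selectExpr("CAST(value AS STRING) AS value"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/delta/hazriq_delta_trial2/_checkpoints")  # assumed path
   .start("/delta/hazriq_delta_trial2"))
```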

Spark Structured Streaming with Confluent Cloud Kafka connectivity issue

Submitted by 落爺英雄遲暮 on 2021-02-04 16:41:16
Question: I am writing a Spark Structured Streaming application in PySpark to read data from Kafka in Confluent Cloud. The documentation for the Spark readStream() function is shallow and doesn't say much about the optional parameters, especially the auth mechanism. I am not sure which parameter is wrong and breaks the connectivity. Can anyone with experience in Spark help me establish this connection?

Required Parameter

> Consumer({'bootstrap.servers':
> 'cluster.gcp.confluent.cloud:9092
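A hedged sketch of the options Confluent Cloud typically needs (the topic name and credentials are placeholders, and this is a common pattern rather than necessarily the post's accepted answer): Kafka client settings pass through the "kafka." option prefix, with SASL_SSL/PLAIN configured via a JAAS string.

```python
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "cluster.gcp.confluent.cloud:9092")
      .option("subscribe", "my-topic")  # placeholder topic
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      # API key/secret from the Confluent Cloud console go here.
      .option("kafka.sasl.jaas.config",
              'org.apache.kafka.common.security.plain.PlainLoginModule required '
              'username="<API_KEY>" password="<API_SECRET>";')
      .load())
```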