I'm trying to write a data frame with about 230 million records to Kafka, more specifically to a Kafka-enabled Azure Event Hub, but I'm not sure whether that detail is actually relevant here.
Finally figured it out (mostly):
Turns out the default batch.size of 16384 bytes was too large for the endpoint. After I set the batch.size option down to 5000, it worked and is now writing about 700k messages per minute to the Event Hub. Also, the timeout parameter in my original attempt had the wrong name and was just being silently ignored; the correct option is kafka.request.timeout.ms.
The only remaining issue is that it still occasionally runs into timeouts and apparently restarts from the beginning, so I end up with duplicates. I'll open a separate question for that; a rough mitigation idea is sketched after the code below.
# Options prefixed with "kafka." are passed straight through to the
# underlying Kafka producer config.
(dfKafka
    .write
    .format("kafka")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.jaas.config", EH_SASL)
    # The default batch.size (16384 bytes) was too large for the endpoint.
    .option("kafka.batch.size", 5000)
    .option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093")
    # Correct option name; the misspelled variant was silently ignored.
    .option("kafka.request.timeout.ms", 120000)
    .option("topic", "raw")
    # Note: checkpointLocation only applies to streaming writes and is
    # ignored by this batch write.
    .option("checkpointLocation", "/mnt/telemetry/cp.txt")
    .save())
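For reference, the Kafka sink expects the DataFrame to already contain a string or binary value column (and optionally a key) before the write above runs. A minimal sketch of that preparation step, assuming a hypothetical source frame dfSource with a deviceId column (both names are mine, not from the original code):

from pyspark.sql import functions as F

# Minimal sketch: serialize each record into the single "value"
# column the Kafka sink requires; "dfSource" and "deviceId" are
# assumed names for illustration only.
dfKafka = dfSource.select(
    F.col("deviceId").cast("string").alias("key"),
    F.to_json(F.struct(*dfSource.columns)).alias("value"),
)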
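As for the duplicates from restarted writes: one rough idea, a sketch rather than a tested fix, is to give every message a deterministic key derived from its payload, so a downstream consumer can deduplicate by keeping one message per key:

from pyspark.sql import functions as F

# Sketch under assumptions: hash the JSON payload into the Kafka
# message key; consumers can then drop repeated keys to undo the
# duplicates caused by restarted writes.
dfKafka = dfKafka.withColumn("key", F.sha2(F.col("value"), 256))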