Writing a large DataFrame from PySpark to Kafka runs into timeouts

Asked by 故里飘歌, 2021-01-06 05:10

I'm trying to write a DataFrame with about 230 million records to Kafka, more specifically to a Kafka-enabled Azure Event Hub, but I'm not sure if that's actually…

1 Answer

逝去的感伤, answered 2021-01-06 05:26

    Finally figured it out (mostly):

    It turns out the default batch size of roughly 16,000 messages was too large for the endpoint. After I set the batch.size parameter to 5000, the write succeeded and now pushes about 700k messages per minute to the Event Hub. Also, the timeout parameter I had used originally was wrong and was simply being ignored; the correct name is kafka.request.timeout.ms.

    The only remaining issue is that it still randomly runs into timeouts and apparently starts over from the beginning, leaving me with duplicates. I will open another question for that.

    # dfKafka must already contain a string or binary "value" column (and
    # optionally a "key" column); EH_SASL is the SASL PLAIN JAAS configuration
    # string for the Event Hubs namespace.
    dfKafka \
    .write \
    .format("kafka") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.batch.size", 5000) \
    .option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
    .option("kafka.request.timeout.ms", 120000) \
    .option("topic", "raw") \
    .option("checkpointLocation", "/mnt/telemetry/cp.txt") \
    .save()
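
    For context, here is a minimal sketch of how the two pieces referenced above could be prepared before running the write: the EH_SASL JAAS string and a DataFrame with the key/value columns the Kafka sink expects. The dfSource and deviceId names, the <...> placeholders, and the unshaded JAAS class path are assumptions for illustration, not from the original post (Databricks runtimes use the shaded kafkashaded.org.apache.kafka... class name instead).

    from pyspark.sql import functions as F

    # Kafka-enabled Event Hubs authenticates via SASL PLAIN with the literal
    # username "$ConnectionString" and the namespace connection string as the
    # password; fill in the <...> placeholders with your shared access policy.
    EH_SASL = (
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="$ConnectionString" '
        'password="Endpoint=sb://myeventhub.servicebus.windows.net/;'
        'SharedAccessKeyName=<policy-name>;SharedAccessKey=<policy-key>";'
    )

    # The Kafka sink needs a "value" column (string or binary) and accepts an
    # optional "key". Serializing each row to JSON and carrying a stable key
    # (a hypothetical deviceId column here) also gives downstream consumers
    # something to deduplicate on when a failed write is retried.
    dfKafka = dfSource.select(
        F.col("deviceId").cast("string").alias("key"),
        F.to_json(F.struct(*[F.col(c) for c in dfSource.columns])).alias("value"),
    )

    Using the connection string as the SASL password is the standard pattern for the Kafka endpoint of Event Hubs; carrying a deduplication key is just one way to cope with the retry-induced duplicates mentioned above.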
    
