Spark Structured Streaming with RabbitMQ source

Asked by 情话喂你 on 2021-01-13 09:18

I am trying to write a custom receiver for Structured Streaming that will consume messages from RabbitMQ. Spark recently released Data

1 Answer
  • 2021-01-13 09:29
    > I am returning a list of factories, and hope that every instance in the list will be used to create a reader, which will also be a consumer. Is that approach correct?

    The [socket][1] source implementation has one thread pushing messages into the internal ListBuffer. In other words, there is one consumer (the thread) filling up the internal ListBuffer, which is **then** divided up into partitions by `planInputPartitions` (`createDataReaderFactories` got [renamed][2] to `planInputPartitions`).

    Also, according to the Javadoc of [MicroBatchReadSupport][3]:

    > The execution engine will create a micro-batch reader at the start of a streaming query, alternate calls to setOffsetRange and createDataReaderFactories for each batch to process, and then call stop() when the execution is complete. Note that a single query may have multiple executions due to restart or failure recovery.

    In other words, `createDataReaderFactories` should be called **multiple** times, which to my understanding suggests that each `DataReader` is responsible for a static input partition, which implies that the DataReader shouldn't be a consumer.

    ----------

    > However, the commit method is a part of MicroBatchReader, not DataReader ... If so, what is the purpose of commit function then?

    Perhaps part of the rationale for the commit function is to prevent the internal buffer of the MicroBatchReader from getting too big. By committing an Offset, you can effectively remove elements below that Offset from the buffer, as you are making a commitment not to process them anymore. You can see this happening in the socket source code with `batches.trimStart(offsetDiff)`.
    I'm unsure about implementing a reliable receiver, so I hope a more experienced Spark guy comes around and grabs your question as I'm interested too! Hope this helps!
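    The buffer/commit mechanics described above can be sketched without Spark at all. Below is a minimal, hypothetical Python model (the class and method names are my own, loosely mirroring `setOffsetRange`, `planInputPartitions`, and `commit` from the real Scala/Java API): a single consumer fills a buffer, each micro-batch takes static slices of it, and committing an offset trims the buffer, just like `batches.trimStart(offsetDiff)` in the socket source.

```python
from threading import Lock


class BufferedSource:
    """Sketch of the socket source's buffer/commit mechanics (no Spark).

    One consumer thread fills `_batches`; each micro-batch slices it
    into static partitions; commit() drops everything already processed,
    the analogue of `batches.trimStart(offsetDiff)`.
    """

    def __init__(self):
        self._lock = Lock()
        self._batches = []      # internal buffer (the ListBuffer analogue)
        self._start_offset = 0  # offset of _batches[0]
        self._range = (0, 0)

    def receive(self, msg):
        # Called by the single consumer thread as messages arrive.
        with self._lock:
            self._batches.append(msg)

    def latest_offset(self):
        with self._lock:
            return self._start_offset + len(self._batches)

    def set_offset_range(self, start, end):
        # The engine alternates set_offset_range / plan_input_partitions
        # once per micro-batch.
        with self._lock:
            self._range = (start, end)

    def plan_input_partitions(self, num_partitions):
        # Each partition is a *static* slice of already-buffered data,
        # not a live consumer.
        with self._lock:
            start, end = self._range
            data = self._batches[start - self._start_offset:
                                 end - self._start_offset]
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        return [data[i:i + size] for i in range(0, len(data), size)]

    def commit(self, end):
        # Everything below `end` will never be requested again, so drop
        # it to keep the buffer from growing without bound.
        with self._lock:
            offset_diff = end - self._start_offset
            del self._batches[:offset_diff]
            self._start_offset = end

    def buffered_count(self):
        with self._lock:
            return len(self._batches)
```

    Note that committing offset 2 after buffering four messages leaves two messages in the buffer, while `latest_offset` is unaffected: offsets are absolute, only the backing storage shrinks.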

    EDIT

    I had only studied the socket and wiki-edit sources. These sources are not production-ready, which is not what the question was looking for. Instead, the Kafka source is the better starting point; unlike the aforementioned sources, it has multiple consumers, as the author was looking for.

    However, if an unreliable source is acceptable, the socket and wiki-edit sources above provide a less complicated solution.
