Question
I'm new to distributed stream processing (Spark). I've read some tutorials/examples which cover how backpressure results in the producer(s) slowing down in response to overloaded consumers. The classic example given is ingesting and analyzing tweets. When there is an unexpected spike in traffic such that the consumers are unable to handle the load, they apply backpressure and the producer responds by adjusting its rate lower.
What I don't really see covered is what approaches are used in practice to deal with the massive amount of incoming real-time data which cannot be immediately processed due to the lower capacity of the entire pipeline?
I imagine the answer to this is business domain dependent. For some problems it might be fine to just drop that data, but in this question I would like to focus on a case where we don't want to lose any data.
Since I will be working in an AWS environment, my first thought would be to "buffer" the excess data in an SQS queue or a Kinesis stream. Is it as simple as this in practice, or is there a more standard streaming solution to this problem (perhaps as part of Spark itself)?
Answer 1:
"Is there a more standard streaming solution?" - Maybe. There are many different ways to do this, and it's not yet clear that any one of them is "the standard". This is just an opinion, though, and you're unlikely to get a concrete answer for this part.
"Is it as simple as this in practice?" - SQS and Kinesis have different usage patterns:
- Use SQS if you want to always process all messages, AND have a single logical consumer
  - think of this like a classic queue, where each message must be "consumed" (received, processed, then deleted) from the queue
  - definitely a simpler model to understand and get going with, but it essentially acts as a buffer
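The queue-as-buffer behaviour can be sketched with a small in-memory model (the class and method names are illustrative, not the real SQS API; in practice you'd use the `boto3` SQS client):

```python
from collections import deque

class BufferQueue:
    """Toy model of SQS semantics: a single logical consumer,
    every message buffered until it is explicitly deleted."""

    def __init__(self):
        self._messages = deque()

    def send(self, body):
        self._messages.append(body)      # producer never blocks, burst is absorbed

    def receive(self):
        return self._messages[0] if self._messages else None

    def delete(self):
        if self._messages:
            self._messages.popleft()     # "consume" only after successful processing

# Producer spikes faster than the consumer can drain:
q = BufferQueue()
for tweet in ["t1", "t2", "t3"]:
    q.send(tweet)                        # nothing is dropped during the spike

processed = []
while (msg := q.receive()) is not None:
    processed.append(msg)                # slow consumer catches up later
    q.delete()
```

The key property for the no-data-loss requirement is that `send` always succeeds and a message only leaves the queue after an explicit `delete`, so a slow consumer just means a temporarily deeper buffer.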
- Use Kinesis if you want to easily skip messages, OR have multiple logical consumers
  - think of this like a stream with "skip" functionality, where multiple consumers can be at different points in the stream, or filtering to consume specific message types
  - requires more effort to use, but provides more options for message consumption and for dealing with backpressure
  - note that (by default) Kinesis will drop data in the stream once it is more than 24h old, though this retention can be increased to 7 days
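The stream model can be sketched similarly (again an illustrative in-memory model, not the real Kinesis API; retention here is by record count rather than by age, purely to keep the example short):

```python
class StreamLog:
    """Toy model of Kinesis semantics: an append-only log that multiple
    consumers read independently via their own offsets, with old records
    trimmed once they fall outside a retention window."""

    def __init__(self, retention=3):
        self.records = []            # list of (sequence_number, data)
        self.retention = retention   # stand-in for the 24h/7d time window
        self.next_seq = 0

    def put(self, data):
        self.records.append((self.next_seq, data))
        self.next_seq += 1
        # trim records outside the retention window (data loss if unread!)
        self.records = self.records[-self.retention:]

    def read_from(self, offset):
        """Each consumer tracks its own offset, so consumers can sit at
        different points in the stream or skip ahead independently."""
        return [data for (seq, data) in self.records if seq >= offset]

log = StreamLog(retention=3)
for x in ["a", "b", "c", "d"]:
    log.put(x)

# Two independent logical consumers at different positions:
analytics = log.read_from(0)   # "a" has already been trimmed away
alerts = log.read_from(3)      # a consumer that skipped ahead
```

This also shows the trade-off the answer notes: unlike the queue model, records are trimmed on a schedule whether or not they have been read, so consumers must keep up within the retention window to avoid losing data.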
For your use case where you have a "massive amount of incoming real-time data which cannot be immediately processed", I'd focus your efforts on Kinesis over SQS, as the Kinesis model also aligns better with other streaming mechanisms like Spark / Kafka.
Source: https://stackoverflow.com/questions/49721023/avoiding-data-loss-when-slow-consumers-force-backpressure-in-stream-processing