Max number of tuple replays on Storm Kafka Spout

前端未结

关注

 5  1396

We’re using Storm with the Kafka Spout. When we fail messages, we’d like to replay them, but in some cases bad data or code errors will cause messages to always fail a Bolt,

相关标签:

5条回答

灰色年华

2021-01-17 18:02

We also face the similar data where we have bad data coming in causing the bolt to fail infinitely.

In order to resolve this on runtime, we have introduced one more bolt naming it as "DebugBolt" for reference. So the spout sends the message to this bolt first and then this bolts does the required data fix for the bad messages and then emits them to the required bolt. This way one can fix the data errors on the fly.

Also, if you need to delete some messages, you can actually pass an ignoreFlag from your DebugBolt to your original Bolt and your original bolt should just send an ack to spout without processing if the ignoreFlag is True.

0 讨论(0)
发布评论:

提交评论
- 加载中...
不思量自难忘°

2021-01-17 18:08
Storm itself does not provide any support for your problem. Thus, a customized solution is the only way to go. Even if you do not want to patch KafkaSpout, I think, introducing a counter and breaking the replay cycle in it, would be the best approach. As an alternative, you could also inherit from KafkaSpout and put a counter in your subclass. This is of course somewhat similar to a patch, but might be less intrusive and easier to implement.

If you want to use a Bolt, you could do the following (which also requires some changes to the KafkaSpout or a subclass of it).
- Assign an unique IDs as an additional attribute to each tuple (maybe, there is already a unique ID available; otherwise, you could introduce a "counter-ID" or just the whole tuple, ie, all attributes, to identify each tuple).
- Insert a bolt after KafkaSpout via fieldsGrouping on the ID (to ensure that a tuple that is replayed is streamed to the same bolt instance).
- Within your bolt, use a HashMap<ID,Counter> that buffers all tuples and counts the number of (re-)tries. If the counter is smaller than your threshold value, forward the input tuple so it gets processed by the actual topology that follows (of course, you need to anchor the tuple appropriately). If the count is larger than your threshold, ack the tuple to break the cycle and remove its entry from the HashMap (you might also want to LOG all failed tuples).
- In order to remove successfully processed tuples from the HashMap, each time a tuple is acked in KafkaSpout you need to forward the tuple ID to the bolt so that it can remove the tuple from the HashMap. Just declare a second output stream for your KafkaSpout subclass and overwrite Spout.ack(...) (of course you need to call super.ack(...) to ensure KafkaSpout gets the ack, too).
This approach might consume a lot of memory though. As an alternative to have an entry for each tuple in the HashMap you could also use a third stream (that is connected to the bolt as the other two), and forward a tuple ID if a tuple fails (ie, in Spout.fail(...)). Each time, the bolt receives a "fail" message from this third stream, the counter is increase. As long as no entry is in the HashMap (or the threshold is not reached), the bolt simply forwards the tuple for processing. This should reduce the used memory but requires some more logic to be implemented in your spout and bolt.

Both approaches have the disadvantage, that each acked tuple results in an additional message to your newly introduces bolt (thus, increasing network traffic). For the second approach, it might seem that you only need to send a "ack" message to the bolt for tuples that failed before. However, you do not know which tuples did fail and which not. If you want to get rid of this network overhead, you could introduce a second HashMap in KafkaSpout that buffers the IDs of failed messages. Thus, you can only send an "ack" message if a failed tuple was replayed successfully. Of course, this third approach makes the logic to be implemented even more complex.

Without modifying KafkaSpout to some extend, I see no solution for your problem. I personally would patch KafkaSpout or would use the third approach with a HashMap in KafkaSpout subclass and the bolt (because it consumed little memory and does not put a lot of additional load on the network compared to the first two solutions).
0 讨论(0)
发布评论:

提交评论
- 加载中...
爱一瞬间的悲伤

2021-01-17 18:14
Basically it works like this:
1. If you deploy topologies they should be production grade (this is, a certain level of quality is expected, and the number of tuples low).
2. If a tuple fails, check if the tuple is actually valid.
3. If a tuple is valid (for example failed to be inserted because it's not possible to connect to an external database, or something like this) reply it.
4. If a tuple is miss-formed and can never be handled (for example an database id which is text and the database is expecting an integer) it should be ack, you will never be able to fix such thing or insert it into the database.
5. New kinds of exceptions, should be logged (as well as the tuple contents itself). You should check these logs and generate the rule to validate tuples in the future. And eventually add code to correctly process them (ETL) in the future.
6. Don't log everything, otherwise your log files will be huge, be very selective on what do you log. The contents of the log files should be useful and not a pile of rubbish.
7. Keep doing this, and eventually you will only cover all cases.
0 讨论(0)
发布评论:

提交评论
- 加载中...
情深已故

2021-01-17 18:18

We simply had our bolt emit the bad tuple on an error stream and acked it. Another bolt handled the error by writing it back to a Kafka topic specifically for errors. This allows us to easily direct normal vs. error data flow through the topology.

The only case where we fail a tuple is because some required resource is offline, such as a network connection, DB, ... These are retriable errors. Anything else is directed to the error stream to be fixed or handled as is appropriate.

This all assumes of course, that you don't want to incur any data loss. If you only want to attempt a best effort and ignore after a few retries, then I would look at other options.

0 讨论(0)
发布评论:

提交评论
- 加载中...

野的像风

2021-01-17 18:25

As per my knowledge Storm doesn't provide built-in support for this.

I have applied below-mentioned implementation:

public class AuditMessageWriter extends BaseBolt {

        private static final long serialVersionUID = 1L;
        Map<Object, Integer> failedTuple = new HashMap<>();

        public AuditMessageWriter() {

        }

        /**
         * {@inheritDoc}
         */
        @Override
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            //any initialization if u want
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public void execute(Tuple input) {
            try {

            //Write your processing logic
            collector.ack(input);
            } catch (Exception e2) {
            //In case of any exception save the tuple in failedTuple map with a count 1
            //Before adding the tuple in failedTuple map check the count and increase it and fail the tuple

            //if failure count reaches the limit (message reprocess limit) log that and remove from map and acknowledge the tuple
            log(input);
            ExceptionHandler.LogError(e2, "Message IO Exception");
            }

        }

        void log(Tuple input) {

            try {
                //Here u can pass result to dead queue or log that
//And ack the tuple 
            } catch (Exception e) {
                ExceptionHandler.LogError(e, "Exception while logging");
            }
        }

        @Override
        public void cleanup() {
            // To declare output fields.Not required in this alert.
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // To declare output fields.Not required in this alert.
        }

        @Override
        public Map<String, Object> getComponentConfiguration() {
            return null;
        }

    }

0 讨论(0)