I am currently working on a project where I have set up a Storm cluster across four Unix hosts.
The topology itself is as follows:
- The JMS Spout listens to an MQ for new messages
- The JMS Spout parses each message and emits the result to an Esper Bolt
- The Esper Bolt processes the event and emits a result to a JMS Bolt
- The JMS Bolt publishes the message back onto the MQ on a different topic
I realize that Storm is an "at-least-once" framework. However, if I receive 5 events and pass them on to the Esper Bolt for counting, I receive 5 count results in the JMS Bolt (all with the same value).
Ideally, I want to receive a single result. Is there some way I can tell Storm to ignore duplicate tuples?
I think this has something to do with the parallelism I have set up, because it works as expected when I have just a single thread:
TopologyBuilder builder = new TopologyBuilder();
// Spout: 2 executors, 2 tasks
builder.setSpout(JMS_DATA_SPOUT, new JMSDataSpout(), 2).setNumTasks(2);
// Esper bolt: 6 executors, 6 tasks, grouped by the "eventGrouping" field
builder.setBolt("esperBolt", new EsperBolt.Builder().build(), 6).setNumTasks(6)
       .fieldsGrouping(JMS_DATA_SPOUT, new Fields("eventGrouping"));
// JMS bolt: 2 executors, 2 tasks, grouped by the "eventName" field
builder.setBolt("jmsBolt", new JMSBolt(), 2).setNumTasks(2)
       .fieldsGrouping("esperBolt", new Fields("eventName"));
I have also looked at Trident for "exactly-once" semantics, but I am not fully convinced it would solve this issue.
If your Esper Bolt does not explicitly ack() each tuple at the end of its execute() method, or use an IBasicBolt implementation (which acks automatically), then each tuple it receives will eventually be replayed by your originating JMS Spout after the message timeout.
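A minimal sketch of the explicit-ack approach, assuming a recent Storm release under the org.apache.storm package (older releases use backtype.storm) and a hypothetical AckingEsperBolt class standing in for your Esper Bolt:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical stand-in for the Esper Bolt that anchors and acks every tuple.
public class AckingEsperBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        // ... feed the event into Esper and compute the result here ...
        // Emit anchored to the input tuple so the ack tree stays intact.
        collector.emit(tuple, new Values(tuple.getValue(0)));
        // Ack the tuple; without this the spout replays it after the message timeout.
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("eventName"));
    }
}

Extending BaseBasicBolt instead removes the need to ack manually, since the framework acks each tuple when execute() returns normally.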
Alternatively, if you want your bolt to process only unique messages, consider adding that behavior to your execute() method: it could first check a local Guava cache for the tuple value's uniqueness, then process accordingly.
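A sketch of that idea, assuming Guava is on the classpath and using a hypothetical DedupBolt with "eventName" as the uniqueness key (adjust the key and the cache size/expiry to your data). Because fieldsGrouping routes equal values of the grouping field to the same executor, a per-executor cache can catch the duplicates that executor sees:

import java.util.Map;
import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical de-duplicating bolt; BaseBasicBolt acks each tuple automatically.
public class DedupBolt extends BaseBasicBolt {
    // Per-executor cache of recently seen keys; size and expiry are illustrative.
    private transient Cache<String, Boolean> seen;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        seen = CacheBuilder.newBuilder()
                .maximumSize(10000)
                .expireAfterWrite(10, TimeUnit.MINUTES)
                .build();
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String key = tuple.getStringByField("eventName"); // hypothetical uniqueness key
        if (seen.getIfPresent(key) == null) {             // first occurrence: process it
            seen.put(key, Boolean.TRUE);
            collector.emit(new Values(key));
        }
        // Duplicates fall through and are acked without being re-emitted.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("eventName"));
    }
}

Note that a task restart clears the cache, so this suppresses duplicates on a best-effort basis rather than guaranteeing exactly-once output.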
Source: https://stackoverflow.com/questions/23764293/storm-cluster-duplicate-tuples