We use Lucene to process live streams from the internet. It has a native Java API.
http://lucene.apache.org/java/docs/
You can then use Mahout, a collection of machine learning algorithms that can work on top of Lucene.
http://lucene.apache.org/mahout/