If you're dealing with large amounts of data, maybe Apache's Lucene will help with what you need.
Otherwise it might be easiest to just create your own Analyzer class that leans heavily on the standard Pattern class. That way, you can control what text is considered a word, boundary, number, date, etc. E.g., is 20110723 a date or number? You might need to implement a multiple-pass parsing algorithm to better "understand" the data.