For example - you might use some classes from standard library java.text
, or use StreamTokenizer
(you might customize it according to your requirements). But as you know - text data from internet sources is usually has many orthographical mistakes and for better performance you have to use something like fuzzy tokenizer - java.text and other standart utils has too limited capabilities in such context.
So, I'd advice you to use regular expressions (java.util.regex) and create own kind of tokenizer according to your needs.
P.S.
According to your needs - you might create state-machine parser for recognizing templated parts in raw texts. You might see simple state-machine recognizer on the picture below (you can construct more advanced parser, which could recognize much more complex templates in text).