There are dozens of screen-scraping libraries written in Java. To cite just a few:
- TagSoup - a SAX-compliant parser written in Java that, instead
of parsing well-formed or valid XML,
parses HTML as it is found in the
wild: nasty and brutish, though quite
often far from short. TagSoup is
designed for people who have to
process this stuff using some
semblance of a rational application
design. By providing a SAX interface,
it allows standard XML tools to be
applied to even the worst HTML.
- Jericho HTML Parser - Jericho HTML Parser is a simple but powerful
Java library allowing analysis and
manipulation of parts of an HTML
document, including some common
server-side tags, while reproducing
verbatim any unrecognised or invalid
HTML. It also provides high-level HTML
form manipulation functions. It is
neither an event-based nor a tree-based
parser, but rather uses a combination
of simple text search, efficient tag
recognition and a tag position cache.
The text of the whole source document
is first loaded into memory, and then
only the relevant segments searched
for the relevant characters of each
search operation.
- HTML Cleaner - HtmlCleaner reorders individual elements and
produces well-formed XML from dirty
HTML. It follows rules similar to those
that most web browsers use in order
to create the document object model. A
user may provide a custom tag and rule
set for tag filtering and balancing.
- NekoHTML - NekoHTML is a simple HTML scanner and tag balancer that
enables application programmers to
parse HTML documents and access the
information using standard XML
interfaces. The parser can scan HTML
files and "fix up" many common
mistakes that human (and computer)
authors make in writing HTML
documents. NekoHTML adds missing
parent elements; automatically closes
elements with optional end tags; and
can handle mismatched inline element
tags.
And many more are listed at HTML Screen Scraping Tools written in Java. But IMO these are the best at dealing with any kind of content (read: all kinds of crap), as I mentioned in this previous answer. This might not be an issue for you, though.
Just in case, maybe check out the thread Nokogiri pure Java status.
Update: A new project, jsoup, was released on 2010-01-31. It offers a selector syntax to find elements. See its website for more details and/or this answer from its author.
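
To make the "SAX interface" point from the TagSoup description a bit more concrete, here is a minimal sketch of what scraping through SAX can look like. It uses only the JDK: the `LinkExtractor` and `LinkHandler` names are mine (hypothetical, not from any of the libraries above), and the demo feeds well-formed XHTML to the JDK's built-in parser. The idea is that any `XMLReader` can be passed in, so a TagSoup parser should be a drop-in replacement for messy real-world HTML, but I have not shown or tested that here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects link targets from any SAX event stream.
final class LinkExtractor {

    /** Records the href attribute of every <a> element the parser reports. */
    static class LinkHandler extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // The JDK parser is not namespace-aware by default, so match on qName.
            if ("a".equalsIgnoreCase(qName)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    /** Parses the markup with the given reader and returns the hrefs in document order. */
    static List<String> extractLinks(XMLReader reader, String markup) throws Exception {
        LinkHandler handler = new LinkHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the JDK's own strict XML parser stands in for a lenient HTML parser.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String xhtml = "<html><body><p>"
                + "<a href=\"http://example.com/a\">A</a> "
                + "<a href=\"/b\">B</a>"
                + "</p></body></html>";
        System.out.println(extractLinks(reader, xhtml)); // prints [http://example.com/a, /b]
    }
}
```

The handler never knows which parser produced the events, which is the whole appeal of the SAX route: the scraping logic stays the same and only the `XMLReader` changes when the input gets ugly.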