I don't know Java, but I think XPath is far better than classic regular expressions for selecting one or more HTML elements.
It is also easier to write and to read.
<html>
<body>
<a href="1.html">1</a>
<a href="2.html">2</a>
<a href="3.html">3</a>
</body>
</html>
With the HTML above, the expression "/html/body/a" selects all three a elements, and "/html/body/a/@href" selects their href attributes.
Here's a good step-by-step tutorial: http://www.zvon.org/xxl/XPathTutorial/General/examples.html
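For what it's worth, here is a minimal sketch of the XPath approach in Java, using only the XML classes that ship with the JDK. The class name XPathLinks and the method hrefs are made up for illustration, and the sketch assumes the input is well-formed markup like the sample above; real-world HTML often is not, and the parser would reject it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class XPathLinks {
    // Returns the value of every href attribute selected by /html/body/a/@href.
    // Assumes the markup is well-formed XML; otherwise parse() throws.
    static List<String> hrefs(String wellFormedHtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedHtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("/html/body/a/@href", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
                + "<a href=\"1.html\">1</a>"
                + "<a href=\"2.html\">2</a>"
                + "<a href=\"3.html\">3</a>"
                + "</body></html>";
        System.out.println(hrefs(html)); // prints [1.html, 2.html, 3.html]
    }
}
```

Changing the expression to "/html/body/a" would return the a elements themselves, from which the link text can be read with getTextContent().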
A quick Google search turned up a nice-looking possibility: TagSoup.
Use XmlSlurper to parse the HTML as an XML document, then call findAll with an appropriate closure to select the a tags, and call list() on the resulting GPathResult to get them as a list. You should then be able to read each link's text from its GPathResult.
HTML parser + regular expressions. Any language would do, though I'd say Perl is the fastest solution.
Parsing with XmlSlurper only works if the HTML is well-formed.
If your HTML page has tags that are not well-formed, use a regex to parse the page instead.
Ex: <a href="www.google.com">
Here, 'a' is not closed, so the markup is not well-formed.
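new URL(url).eachLine { line ->
    // (?i) makes the match case-insensitive, so <a href="..."> is found as well as <A HREF="...">
    (line =~ /(?i)<a\s[^>]*?href="(.*?)"/).each { whole, href ->
        // process href (whole is the full matched text)
    }
}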
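To decide between the two approaches, one option is to check up front whether the page parses as XML at all. The following is only a sketch in plain Java with the JDK's built-in parser; WellFormedCheck and isWellFormed are names invented here, and a strict XML parser is a stand-in for what XmlSlurper does by default.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library.
class WellFormedCheck {
    // Returns true if the markup parses as XML, false if the parser rejects it.
    static boolean isWellFormed(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised for unclosed or mismatched tags, among other errors.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // The unclosed <a> from the example above makes the document invalid XML.
        System.out.println(isWellFormed("<html><body><a href=\"www.google.com\"></body></html>"));
        System.out.println(isWellFormed("<html><body><a href=\"www.google.com\">Google</a></body></html>"));
    }
}
```

The first call reports false and the second true, so a page like the first one would need the regex route (or a lenient parser such as TagSoup).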
Try a regular expression. Something like this should work:
(html =~ /<a[^>]*href=['"](.*?)['"][^>]*>(.*?)<\/a>/).each { whole, url, text ->
    // whole is the full match; url and text are the two capture groups
    // do something with url and text
}
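If you'd rather do this from Java, the same idea can be sketched with java.util.regex. The class name AnchorRegex and the method links are invented for this example, and the pattern is an assumption about what typical anchors look like: it accepts single or double quotes around the href and ignores case, but like any regex it will miss unusual markup (unquoted attributes, nested tags, and so on).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class AnchorRegex {
    // Group 1: the opening quote, group 2: the href value, group 3: the link text.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*\\bhref\\s*=\\s*(['\"])(.*?)\\1[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one {url, text} pair per anchor found, in document order.
    static List<String[]> links(String html) {
        List<String[]> result = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            result.add(new String[] { m.group(2), m.group(3) });
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href='a.html'>A</a> <A HREF=\"b.html\" class=\"x\">B</A>";
        for (String[] link : links(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

The backreference \1 makes the closing quote match the opening one, so an apostrophe inside a double-quoted URL does not end the match early.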
Take a look at "Groovy - Tutorial 4 - Regular expressions basics" and "Anchor Tag Regular Expression Breaking".