I'm scraping values from HTML pages using XPath inside of a java program to get to a specific tag and occasionally using regular expressions to clean up the data I receive.
Regarding this:
I could use HTML Cleaner to clean to XML, serialize it back to a string, and use that with another XPath library, but I can't find a good java XPath evaluator that works on a string.
This is exactly what I would do, except that you don't need to operate on a string (see below).
A lot of HTML parsers try to do too much. HtmlCleaner, for example, does not fully implement the XPath 1.0 spec (contains, for example, is an XPath 1.0 function). The good news is that you don't need it to. All you need from HtmlCleaner is for it to parse the malformed input. Once it has done that, it's better to use the standard XML interfaces to deal with the resulting (now well-formed) document.
First, convert the document into a standard org.w3c.dom.Document like this:
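// Parse the malformed HTML into HtmlCleaner's own tree.
TagNode tagNode = new HtmlCleaner().clean(
        "<div><table><td id='1234 foo 5678'>Hello</td>");
// Convert that tree into a standard W3C DOM document.
org.w3c.dom.Document doc = new DomSerializer(
        new CleanerProperties()).createDOM(tagNode);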
And then use the standard JAXP interfaces to query it:
// XPathConstants.STRING yields the string value of the first matching node.
XPath xpath = XPathFactory.newInstance().newXPath();
String str = (String) xpath.evaluate("//div//td[contains(@id, 'foo')]/text()",
        doc, XPathConstants.STRING);
System.out.println(str);
Output:
Hello
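If you do end up with a string rather than a DOM (for instance, after serializing cleaned markup), the same JAXP interfaces can evaluate XPath against it directly through an InputSource. The following is only a minimal sketch: the class name XPathOnString and the helpers first and all are made up for illustration, it uses nothing but the JDK, and it assumes the string is already well-formed XML, since the JDK parser will reject raw malformed HTML. It also shows XPathConstants.NODESET, which you would probably want when more than one node can match.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative, not from any library.
final class XPathOnString {

    // Returns the string value of the first node matched by expr.
    // Assumes xml is well-formed; malformed input raises XPathExpressionException.
    static String first(String xml, String expr) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
    }

    // Returns the text content of every node matched by expr, in document order.
    static List<String> all(String xml, String expr) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws XPathExpressionException {
        String xml = "<div><table><tr>"
                + "<td id='1234 foo 5678'>Hello</td>"
                + "<td id='foo 2'>World</td>"
                + "</tr></table></div>";
        System.out.println(first(xml, "//td[contains(@id, 'foo')]/text()"));
        System.out.println(all(xml, "//td[contains(@id, 'foo')]"));
    }
}
```

Each call re-parses the string, so for repeated queries against the same page it is likely cheaper to build the org.w3c.dom.Document once, as in the answer above, and evaluate every expression against that.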