I need to scrape a web page using Java, and I've read that regex is a pretty inefficient way of doing it; one should instead parse the page into a DOM Document and navigate that.
Here is a working example that uses JTidy on the web page you provided to extract all file names from the table.
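import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

// The class name is arbitrary; it only makes the snippet compilable
public class TableScraper {
    public static void main(String[] args) throws Exception {
        // Create a new JTidy instance and ask it to produce XHTML
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        // Parse the HTML page into a DOM document (the tidied markup is also written to System.out)
        URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");
        Document doc = tidy.parseDOM(url.openStream(), System.out);
        // Use XPath to obtain whatever you want from the (X)HTML
        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xpath.compile("//td[@valign = 'top']/a/text()");
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        // Collect the text of every matching link
        List<String> filenames = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            filenames.add(nodes.item(i).getNodeValue());
        }
        System.out.println(filenames);
    }
}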
The result, as expected, will be [Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:].
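If you want to try the XPath step on its own, without JTidy or a network connection, something like the following rough sketch should work. It assumes the markup is already well-formed XHTML, so the JDK's own parser can read it. The sample markup, the class name XPathDemo and the method name extractLinkTexts are made up for illustration and are not taken from the real page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathDemo {
    // Hypothetical, well-formed sample markup (an assumption, not the real page)
    static final String SAMPLE =
        "<html><body><table>"
        + "<tr><td valign=\"top\"><a href=\"a.html\">First link</a></td><td>notes</td></tr>"
        + "<tr><td valign=\"top\"><a href=\"b.html\">Second link</a></td><td>notes</td></tr>"
        + "<tr><td valign=\"bottom\"><a href=\"c.html\">Skipped link</a></td></tr>"
        + "</table></body></html>";

    // Returns the text of every <a> that sits directly inside a top-aligned <td>
    static List<String> extractLinkTexts(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            "//td[@valign = 'top']/a/text()", doc, XPathConstants.NODESET);

        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getNodeValue());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractLinkTexts(SAMPLE)); // prints [First link, Second link]
    }
}
```

Real-world HTML is rarely well-formed, which is why the JTidy pass comes first: it cleans the page up so it can be treated as a DOM document, and the XPath part stays the same.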
Another cool tool you can use is Web Harvest. It does basically everything I did above, but uses an XML file to configure the extraction pipeline.