Some help scraping a page in Java

小鲜肉 2021-01-12 15:27

I need to scrape a web page using Java, and I've read that regex is a pretty inefficient way of doing it and that one should instead load the page into a DOM Document to navigate it.

4 Answers
  •  野趣味 (OP)
    2021-01-12 16:27

    1. Transform the web page you are trying to scrape into an XHTML document. There are several Java tools for this, such as JTidy and HTMLCleaner, and both will also automatically fix malformed HTML (e.g., close unclosed tags). They work very well, but I prefer JTidy because it integrates better with Java's DOM API (a rough HTMLCleaner equivalent is sketched after the JTidy example below).
    2. Extract the required information using XPath expressions.

    Here is a working example using JTidy and the web page you provided; it extracts all file names from the table.

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathFactory;

    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.w3c.tidy.Tidy;

    public class JTidyScraper {

        public static void main(String[] args) throws Exception {
            // Create a new JTidy instance and tell it to produce XHTML
            Tidy tidy = new Tidy();
            tidy.setXHTML(true);

            // Parse the HTML page into a DOM document
            // (the second argument is where JTidy echoes the tidied markup)
            URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");
            Document doc = tidy.parseDOM(url.openStream(), System.out);

            // Use XPath to obtain whatever you want from the (X)HTML
            XPath xpath = XPathFactory.newInstance().newXPath();
            XPathExpression expr = xpath.compile("//td[@valign = 'top']/a/text()");
            NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);

            List<String> filenames = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                filenames.add(nodes.item(i).getNodeValue());
            }

            System.out.println(filenames);
        }
    }

    The result will be [Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:] as expected.
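
    If you would rather use HTMLCleaner (mentioned in step 1), the idea is the same: clean the page, turn it into a standard DOM Document, and run the same XPath query. The sketch below is untested and assumes a recent HtmlCleaner version that ships DomSerializer (clean(URL) plus createDOM(TagNode)), so adjust it to whatever version you have.

    import java.net.URL;

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;

    import org.htmlcleaner.CleanerProperties;
    import org.htmlcleaner.DomSerializer;
    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class HtmlCleanerScraper {

        public static void main(String[] args) throws Exception {
            // Clean the page into HTMLCleaner's own tree model
            HtmlCleaner cleaner = new HtmlCleaner();
            TagNode root = cleaner.clean(new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html"));

            // Convert the cleaned tree into a standard org.w3c.dom.Document
            Document doc = new DomSerializer(new CleanerProperties()).createDOM(root);

            // Same XPath query as in the JTidy example above
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("//td[@valign = 'top']/a/text()", doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                System.out.println(nodes.item(i).getNodeValue());
            }
        }
    }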

    Another cool tool you can use is Web Harvest. It basically does everything I did above, but uses an XML file to configure the extraction pipeline.
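
    I haven't re-checked the exact Web Harvest API here, so treat the class names and the config file name below as assumptions based on its documented usage; the point is just that the whole extraction pipeline lives in the XML file and the Java side only runs it.

    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.Scraper;

    public class WebHarvestRunner {

        public static void main(String[] args) throws Exception {
            // "scrape-config.xml" is a hypothetical pipeline definition (http -> html-to-xml -> xpath);
            // the second argument is the working directory Web Harvest writes its output to.
            ScraperConfiguration config = new ScraperConfiguration("scrape-config.xml");
            Scraper scraper = new Scraper(config, "work");
            scraper.execute();
        }
    }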
