Extract links from a web page

遇见更好的自我 2020-12-01 08:22

Using Java, how can I extract all the links from a given web page?

6 Answers
  • 2020-12-01 08:40

    Download the page as plain text/HTML and pass it through Jsoup or HtmlCleaner. Both are similar and can parse even malformed HTML 4.0 syntax, after which you can use the popular HTML DOM parsing methods like getElementsByName("a"). In Jsoup it's even cooler: you can simply use

    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
    
    Elements links = doc.select("a[href]");       // a elements with an href attribute
    Elements pngs = doc.select("img[src$=.png]"); // img elements whose src ends in .png
    
    Element masthead = doc.select("div.masthead").first();
    

    and find all the links, then get the details using

    String linkhref=links.attr("href");
    

    Taken from http://jsoup.org/cookbook/extracting-data/selector-syntax

    The selectors have the same syntax as jQuery; if you know jQuery function chaining, you will certainly love it.

    EDIT: If you want more tutorials, you can try this one by mkyong:

    http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/
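
    For completeness, here is a minimal end-to-end sketch (assuming jsoup is on the classpath; the URL is just a placeholder) that fetches a page over HTTP and prints the absolute URL of every link:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    
    public class LinkExtractor {
        public static void main(String[] args) throws Exception {
            // Fetch and parse the page in one step; jsoup remembers the base URL
            // so relative hrefs can be resolved to absolute ones below.
            Document doc = Jsoup.connect("http://example.com/").get();
    
            for (Element link : doc.select("a[href]")) {
                // "abs:href" resolves the href against the document's base URL
                System.out.println(link.attr("abs:href"));
            }
        }
    }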

  • 2020-12-01 08:42

    Either use regular expressions and the appropriate classes, or use an HTML parser. Which one you want to use depends on whether you want to be able to handle the whole web or just a few specific pages whose layout you know and can test against.

    A simple regex which would match 99% of pages could be this:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    // Returns every <a ...>...</a> element found in the page, as raw HTML
    static List<String> extractAnchorTags(String htmlPage) {
        Pattern linkPattern = Pattern.compile("(<a[^>]+>.+?</a>)",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher pageMatcher = linkPattern.matcher(htmlPage);
        List<String> links = new ArrayList<String>();
        while (pageMatcher.find()) {
            links.add(pageMatcher.group());
        }
        // links now contains each link in the page as a full HTML tag,
        // i.e. <a att1="val1" ...>Text inside tag</a>
        return links;
    }
    

    You can edit it to match more and be more standards-compliant, but at that point you would want a real parser. If you are only interested in the href value and the text in between, you can also use this regex:

    Pattern linkPattern = Pattern.compile("<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    

    Then access the link part with .group(1) and the text part with .group(2).
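
    A quick usage sketch (the sample HTML string is made up for illustration):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class RegexLinkDemo {
        public static void main(String[] args) {
            String html = "<p>See <a href=\"http://example.com/\">Example</a> for details.</p>";
            Pattern linkPattern = Pattern.compile(
                    "<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
            Matcher m = linkPattern.matcher(html);
            while (m.find()) {
                System.out.println(m.group(1)); // prints: http://example.com/
                System.out.println(m.group(2)); // prints: Example
            }
        }
    }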

  • 2020-12-01 08:43

    You would probably need to use regular expressions on the HTML link tags, i.e. everything from <a href=...> to </a>.

  • 2020-12-01 08:53

    You can use the HTML Parser library to achieve this:

    import java.util.LinkedList;
    import java.util.List;
    
    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;
    
    public static List<String> getLinksOnPage(final String url) {
        final List<String> result = new LinkedList<String>();
    
        try {
            final Parser htmlParser = new Parser(url);
            // Collect every node that is a LinkTag, i.e. an <a> element
            final NodeList tagNodeList = htmlParser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
            for (int j = 0; j < tagNodeList.size(); j++) {
                final LinkTag loopLink = (LinkTag) tagNodeList.elementAt(j);
                final String loopLinkStr = loopLink.getLink();
                result.add(loopLinkStr);
            }
        } catch (ParserException e) {
            e.printStackTrace(); // TODO handle error
        }
    
        return result;
    }
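
    A hypothetical call, just to show the shape of the result (the URL is only an example):

    List<String> links = getLinksOnPage("http://example.com/");
    for (String link : links) {
        System.out.println(link);
    }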
    
  • 2020-12-01 08:54
    import java.io.*;
    import java.net.*;
    
    // Streams the page and prints every raw line that contains "href=".
    // Note: this prints whole source lines rather than extracted URLs, and it
    // will miss links whose href attribute is split across lines.
    public class NameOfProgram {
        public static void main(String[] args) {
            URL url;
            InputStream is = null;
            BufferedReader br;
            String line;
    
            try {
                url = new URL("http://www.stackoverflow.com");
                is = url.openStream();  // throws an IOException
                br = new BufferedReader(new InputStreamReader(is));
    
                while ((line = br.readLine()) != null) {
                    if(line.contains("href="))
                        System.out.println(line.trim());
                }
            } catch (MalformedURLException mue) {
                 mue.printStackTrace();
            } catch (IOException ioe) {
                 ioe.printStackTrace();
            } finally {
                try {
                    if (is != null) is.close();
                } catch (IOException ioe) {
                    //exception
                }
            }
        }
    }
    
  • 2020-12-01 09:01

    This simple example seems to work, using a regex from here

    import java.util.ArrayList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    // Matches absolute http(s)/ftp/file URLs anywhere in the text,
    // not only those inside <a href="..."> attributes.
    public ArrayList<String> extractUrlsFromString(String content)
    {
        ArrayList<String> result = new ArrayList<String>();
    
        String regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
    
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(content);
        while (m.find())
        {
            result.add(m.group());
        }
    
        return result;
    }
    

    And if you need it, this also seems to work for getting the HTML of a URL as a String, returning null if the page can't be fetched. It works fine with https URLs as well.

    import java.net.URL;
    
    import org.apache.commons.io.IOUtils;
    
    public String getUrlContentsAsString(String urlAsString)
    {
        try
        {
            URL url = new URL(urlAsString);
            String result = IOUtils.toString(url);
            return result;
        }
        catch (Exception e)
        {
            return null;
        }
    }
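
    Putting the two together, a minimal sketch (the URL is just an example, and commons-io is assumed to be on the classpath):

    String html = getUrlContentsAsString("http://example.com/");
    if (html != null) {
        for (String link : extractUrlsFromString(html)) {
            System.out.println(link);
        }
    }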
    