Full Link Extraction Using Java

感情败类 2021-01-16 04:45

My goal is to always get the same string (the URI, in my case) when reading the href attribute of a link. Example: suppose an HTML file has so many li…

2 Answers
  • 2021-01-16 05:15

    Use the java.net.URL constructor that resolves a spec against a context URL:

    URL url = new URL(URL context, String spec)

    Here's an example:

    import java.net.*;

    public class Test {
        public static void main(String[] args) throws Exception {
            // The page the link was found on.
            URL base = new URL("http://www.java.com/dit/index.html");
            // Resolve the relative href against that page.
            URL url = new URL(base, "../hello.html");

            System.out.println(base);
            System.out.println(url);
        }
    }
    

    It will print:

    http://www.java.com/dit/index.html
    http://www.java.com/hello.html
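
    Since the goal is to get the same string no matter how the href is written, the same resolution can also be sketched with java.net.URI, whose resolve() and normalize() methods are part of the standard library. The class name LinkResolver and the helper toAbsolute are hypothetical names chosen for illustration, and the sketch assumes every href is a syntactically valid URI reference (for example, it contains no unescaped spaces).

```java
import java.net.URI;

final class LinkResolver {

    // Hypothetical helper: resolves an href against the URI of the page it
    // was found on and removes "." and ".." segments from the result.
    static String toAbsolute(String pageUri, String href) {
        URI base = URI.create(pageUri);
        return base.resolve(href).normalize().toString();
    }

    public static void main(String[] args) {
        String page = "http://www.java.com/dit/index.html";
        // Three spellings of the same target: relative, root-relative, absolute.
        System.out.println(toAbsolute(page, "../hello.html"));
        System.out.println(toAbsolute(page, "/hello.html"));
        System.out.println(toAbsolute(page, "http://www.java.com/hello.html"));
    }
}
```

    All three calls should yield http://www.java.com/hello.html. URI.create() throws IllegalArgumentException for malformed input, so real-world hrefs may need escaping or a try/catch first.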
    
  • 2021-01-16 05:18

    You can do this with a full-fledged HTML parser such as Jsoup. Its Node#absUrl() method does exactly what you want.

    package com.stackoverflow.q3394298;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    
    public class Test {
        
        public static void main(String... args) throws Exception {
            // Jsoup.connect() takes the address as a String, not a java.net.URL.
            String url = "https://stackoverflow.com/questions/3394298/";
            Document document = Jsoup.connect(url).get();
            Element link = document.select("a.question-hyperlink").first();
            System.out.println(link.attr("href"));
            System.out.println(link.absUrl("href"));
        }
        
    }
    

    This prints the following for the title link of this question:

    /questions/3394298/full-link-extraction-using-java
    https://stackoverflow.com/questions/3394298/full-link-extraction-using-java
    

    Jsoup may offer other advantages for your purpose as well.

    Related questions:

    • What are the pros and cons of the leading HTML parsers in Java?

    Update: if you want to select all links in the document, then do as follows:

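            // Requires: import org.jsoup.select.Elements;
            Elements links = document.select("a");
            for (Element link : links) {
                System.out.println(link.attr("href"));    // href as written in the page
                System.out.println(link.absUrl("href"));  // href resolved against the document's base URI
            }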
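
    When looping over every link, a page usually also contains hrefs that are not web addresses (mailto: links, malformed values) and several spellings of the same target. As a rough, standard-library-only sketch of how the collected href strings could be reduced to distinct absolute URLs, the hypothetical helper below (LinkCollector and absoluteLinks are illustrative names, not part of Jsoup) resolves each value with java.net.URI and keeps only http and https results.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class LinkCollector {

    // Hypothetical helper: resolves each raw href against the page URI and
    // keeps the distinct http/https results in the order they were seen.
    static Set<String> absoluteLinks(String pageUri, List<String> hrefs) {
        URI base = URI.create(pageUri);
        Set<String> result = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (href == null || href.trim().isEmpty()) {
                continue; // nothing to resolve
            }
            URI resolved;
            try {
                resolved = base.resolve(href.trim()).normalize();
            } catch (IllegalArgumentException e) {
                continue; // not a syntactically valid URI reference
            }
            String scheme = resolved.getScheme();
            if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                result.add(resolved.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "../hello.html", "/hello.html",
                "mailto:someone@example.com", "about.html", "bad href");
        for (String link : absoluteLinks("http://www.java.com/dit/index.html", hrefs)) {
            System.out.println(link);
        }
    }
}
```

    With Jsoup, the input list would be the link.attr("href") values gathered in the loop above. The scheme filter is a design choice of this sketch; drop it if other schemes matter for your use case.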
    