How to find URLs in HTML using Java

终归单人心 2021-01-25 20:22

I have the following... I wouldn't say problem, but situation.

I have some HTML with tags and everything. I want to search the HTML for every URL. I'm doing it now by

4 answers
  • 2021-01-25 20:45

    Try using an HTML parsing library, then search for <a> tags in the HTML document.

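    // input is a File or InputStream holding the HTML; the last argument is the base URI
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
    Elements links = doc.select("a[href]"); // <a> elements that have an href attribute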
    

    "Not all URLs are in tags; some are in text and some are in links or other tags."

    You shouldn't scan the HTML source to achieve this.

    You will end up with link elements that are not necessarily in the 'text' of the page, i.e., you could end up with 'links' to JS scripts in the page, for example.

    The best approach is still to use a tool made for the job.

    You should grab the HTML tags most likely to have 'links' inside them (say, <h1>, <p>, <div>, etc.). HTML parsers provide regex-like functionality for filtering the content of those tags, similar to your "starts with HTTP" logic.

    [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. select("[href*=/path/]")

    See: jsoup.

  • 2021-01-25 20:54

    Use a DOM parser to extract all <a href> tags, and, if desired, additionally scan the source for http:// outside of those tags.
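
    The two-step idea above can be sketched with nothing but the JDK. This is a minimal sketch, not a definitive implementation: it uses the Swing HTML parser that ships with the JDK (a callback-style parser rather than a true DOM), and the class name `LinkAndTextUrlFinder`, the method `findUrls`, and the loose `TEXT_URL` pattern are all assumptions made for illustration. A library such as jsoup could fill the same role.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects href values of <a> tags plus URLs in plain text.
final class LinkAndTextUrlFinder {
    // Deliberately loose pattern (an assumption): scheme, "://", then
    // everything up to whitespace, a quote, or an angle bracket.
    private static final Pattern TEXT_URL = Pattern.compile("https?://[^\\s\"'<>]+");

    static Set<String> findUrls(String html) {
        Set<String> urls = new LinkedHashSet<>(); // keeps first-seen order, drops duplicates
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Step 1: links that live in <a href="...">
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        urls.add(href.toString());
                    }
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Step 2: URLs written out as text between tags
                Matcher m = TEXT_URL.matcher(new String(data));
                while (m.find()) {
                    urls.add(m.group());
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }
}
```

    Calling `LinkAndTextUrlFinder.findUrls(html)` returns each URL once, whether it appeared as a link target or as text. Scanning only the text nodes, rather than the raw source, avoids matching the same href twice.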

  • 2021-01-25 20:57

    The best way is probably to search for existing regexes. Here is one example:

        /^(https?):\/\/((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})(?::(\d+))?((?:\/(?:[a-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)(?:\?((?:[a-z0-9\-._~!$&'()*+,;=:\/?@]|%[0-9A-F]{2})*))?(?:#((?:[a-z0-9\-._~!$&'()*+,;=:\/?@]|%[0-9A-F]{2})*))?$/i
    

    I found it in a Hacker News article. As far as I can follow it, it looks good. But as far as I know, there is no formal regex for this problem, so the best solution is to search for a few and try which one matches most of what you want.
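
    One thing to watch when moving such a pattern to Java: the regex quoted above is anchored with ^ and $, so it validates a whole string rather than finding URLs inside a larger text. For searching, an unanchored pattern with `Matcher.find()` is needed, and the trailing `/i` flag becomes `Pattern.CASE_INSENSITIVE`. The sketch below uses a much looser pattern than the one quoted (an assumption chosen for readability); the class `RegexUrlScanner` and method `scan` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans raw text or HTML source for http(s) URLs.
final class RegexUrlScanner {
    // Loose, unanchored pattern (an assumption, not the regex quoted above):
    // scheme, "://", then everything up to whitespace, a quote, or an angle bracket.
    private static final Pattern URL =
            Pattern.compile("https?://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    static List<String> scan(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = URL.matcher(source);
        while (m.find()) { // find() searches within the text; matches() would need the whole string to match
            // Drop trailing punctuation that usually belongs to the sentence, not the URL
            found.add(m.group().replaceAll("[.,;:!?)]+$", ""));
        }
        return found;
    }
}
```

    A loose pattern like this will accept strings that are not valid URLs, so each candidate can then be checked against a stricter, anchored regex if that matters.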

  • 2021-01-25 21:05

    You may want to have a look at XPath or Regular Expressions.
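
    For the XPath route, here is a minimal sketch using the JDK's built-in `javax.xml.xpath` API. It assumes the input is well-formed XML (XHTML without a namespace declaration or DOCTYPE); ordinary real-world HTML usually is not, and would have to be cleaned up by an HTML parser first. The class `XPathLinkExtractor` and method `hrefs` are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls href attribute values out of well-formed XHTML.
final class XPathLinkExtractor {
    static List<String> hrefs(String xhtml) {
        try {
            // Build a DOM tree; this fails on markup that is not well-formed XML
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Select the href attribute of every <a> element anywhere in the tree
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the document", e);
        }
    }
}
```

    The same expression style extends to other attributes, for example `//img/@src`. URLs that appear only as plain text are not reachable this way and still need a regex pass over the text nodes.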
