Regular expression for parsing links from a webpage?

南旧 2020-11-27 20:02

I'm looking for a .NET regular expression to extract all the URLs from a webpage, but I haven't found one comprehensive enough to cover all the different ways you can specify a link.

9 Answers
  • 2020-11-27 20:38

    Look at the URI specification (RFC 3986); it could help you a lot. As far as performance goes, a regex can pretty much extract all the HTTP links in a modest web page without trouble. By modest I definitely do not mean a single page that holds an entire HTML manual, like the ELisp manual. Performance is a touchy topic, so my advice is to measure it first and then decide whether to extract all the links with one single regex or with several simpler ones.

    http://gbiv.com/protocols/uri/rfc/rfc3986.html
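
    One way to take that measurement is a quick Stopwatch comparison. This is only a rough sketch: the three patterns, the class name RegexTiming, and the synthetic page are assumptions made up for the comparison, and the timings will vary by machine and input.

    ```csharp
    using System;
    using System.Diagnostics;
    using System.Text;
    using System.Text.RegularExpressions;

    public static class RegexTiming
    {
        // Hypothetical patterns, chosen only to compare one regex against two.
        private static readonly Regex Combined = new Regex(@"(?:https?|ftp)://\S+", RegexOptions.Compiled);
        private static readonly Regex HttpOnly = new Regex(@"https?://\S+", RegexOptions.Compiled);
        private static readonly Regex FtpOnly = new Regex(@"ftp://\S+", RegexOptions.Compiled);

        // Count forces the lazy MatchCollection to scan the whole input.
        public static int CountCombined(string text) => Combined.Matches(text).Count;

        public static int CountSeparate(string text) =>
            HttpOnly.Matches(text).Count + FtpOnly.Matches(text).Count;

        public static void Main()
        {
            // Build a synthetic "page" with one http and one ftp link per line.
            var sb = new StringBuilder();
            for (int i = 0; i < 10000; i++)
            {
                sb.Append("text http://example.com/").Append(i)
                  .Append(" more ftp://example.org/").Append(i).Append('\n');
            }
            string page = sb.ToString();

            var sw = Stopwatch.StartNew();
            int combined = CountCombined(page);
            sw.Stop();
            Console.WriteLine($"one regex:   {combined} links in {sw.ElapsedMilliseconds} ms");

            sw.Restart();
            int separate = CountSeparate(page);
            sw.Stop();
            Console.WriteLine($"two regexes: {separate} links in {sw.ElapsedMilliseconds} ms");
        }
    }
    ```

    Run it several times against pages that resemble your real input before drawing conclusions; a single run mostly measures JIT and regex compilation cost.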

  • 2020-11-27 20:41

    I don't have time to try and think of a regex that probably won't work, but I wanted to comment that you should most definitely break up your regex, at least if it gets to this level of ugliness:

    (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
    )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
    \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
    ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
    \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
    ....*SNIP*....
    *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
    +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
    .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
    |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
    ?:\r\n)?[ \t])*))*)?;\s*)
    

    (this supposedly matches email addresses)

    Edit: I can't even fit it on one post it's so nasty....

  • 2020-11-27 20:43

    According to RFC 3986 (http://tools.ietf.org/html/rfc3986), this extracts URLs from any text, not only HTML:

    (http\\://[:/?#\\[\\]@!%$&'()*+,;=a-zA-Z0-9._\\-~]+)
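
    As a rough sketch of how this could be used from C#: the doubled backslashes above look like escaping for a regular C# string literal, so the version below uses a verbatim string instead. The class name Rfc3986UrlExtractor and the "https?" prefix are assumptions added here; the pattern as posted only matches http.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public static class Rfc3986UrlExtractor
    {
        // Same character class as the pattern above, written as a verbatim string.
        // "https?" is an added assumption so that https links match too.
        private static readonly Regex UrlPattern = new Regex(
            @"https?://[:/?#\[\]@!%$&'()*+,;=a-zA-Z0-9._\-~]+",
            RegexOptions.Compiled);

        public static List<string> Extract(string text)
        {
            var urls = new List<string>();
            foreach (Match m in UrlPattern.Matches(text))
            {
                urls.Add(m.Value);
            }
            return urls;
        }

        public static void Main()
        {
            string sample =
                "Docs at http://example.com/a?b=1 and <a href=\"https://example.org/x\">x</a>";
            foreach (string url in Extract(sample))
            {
                Console.WriteLine(url);
            }
        }
    }
    ```

    Because characters such as '.', ',' and ')' are legal in a URI, a match can swallow trailing punctuation when the URL sits at the end of a sentence, so some post-trimming may be needed.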
    
  • 2020-11-27 20:44
    ((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)
    

    I took this from regexlib.com

    [editor's note: the {1} has no real function in this regex; see this post]
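
    To illustrate the editor's note, here is a small sketch that runs the pattern as posted next to a simplified form with the {1} and the escaped colons dropped. The class name SchemePrefixDemo, the Simplified pattern, and the sample text are assumptions for this example only.

    ```csharp
    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static class SchemePrefixDemo
    {
        // The pattern as posted, and a hypothetical simplified equivalent.
        public const string Original = @"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)";
        public const string Simplified = @"(mailto:|(news|(ht|f)tps?)://)\S+";

        public static string[] Extract(string pattern, string text)
        {
            return Regex.Matches(text, pattern)
                        .Cast<Match>()
                        .Select(m => m.Value)
                        .ToArray();
        }

        public static void Main()
        {
            string text = "Write to mailto:me@example.com or see " +
                          "https://example.com/page and ftp://files.example.com/pub";

            string[] withQuantifier = Extract(Original, text);
            string[] withoutQuantifier = Extract(Simplified, text);

            Console.WriteLine(string.Join("\n", withQuantifier));
            // Both patterns find the same matches on this input.
            Console.WriteLine(withQuantifier.SequenceEqual(withoutQuantifier));
        }
    }
    ```

    Note that \S+ runs until the next whitespace, so on raw HTML such as <a href="http://example.com/">home</a> the match continues past the closing quote into the tag. This pattern is a better fit for plain text than for markup.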

  • 2020-11-27 20:49

    This will capture the URLs from all <a> tags, as long as the author of the HTML put the href value in double quotes:

    <a[^>]+href="([^"]+)"[^>]*>
    

    I made an example here.
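
    For use from C#, a sketch along these lines might work. It is a variation on the pattern above rather than the same regex: the class name AnchorHrefExtractor and the named groups are assumptions, and the pattern is loosened to accept single or double quotes and any letter case. Unquoted attribute values are still not handled.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public static class AnchorHrefExtractor
    {
        // Hypothetical variation: "q" captures the opening quote character and
        // \k<q> requires the same character to close the attribute value.
        private static readonly Regex AnchorHref = new Regex(
            @"<a\s+(?:[^>]*?\s)?href\s*=\s*(?<q>[""'])(?<url>.*?)\k<q>",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);

        public static List<string> ExtractHrefs(string html)
        {
            var hrefs = new List<string>();
            foreach (Match m in AnchorHref.Matches(html))
            {
                hrefs.Add(m.Groups["url"].Value);
            }
            return hrefs;
        }

        public static void Main()
        {
            string html = "<a href=\"a.html\">A</a> <A class='x' HREF='b.html'>B</A>";
            foreach (string href in ExtractHrefs(html))
            {
                Console.WriteLine(href);
            }
        }
    }
    ```

    Requiring whitespace before href keeps attributes such as data-href from being picked up by mistake.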

  • 2020-11-27 20:50

    With Html Agility Pack, you can use:

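    using HtmlAgilityPack;

    HtmlDocument doc = new HtmlDocument();
    doc.Load("file.htm");
    // SelectNodes returns null when nothing matches, so guard before iterating.
    HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            Response.Write(link.GetAttributeValue("href", string.Empty));
        }
    }
    doc.Save("file.htm"); // only needed if the document was modified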
    