Regular Expression to Extract the URL out of the Anchor Tag

隐瞒了意图╮ 2020-12-11 12:54

I want to extract the HTTP link from inside anchor tags. Only links to WMV files (the .wmv extension) should be extracted.

3 Answers
  • 2020-12-11 12:55

    I wouldn't do this with regex - I would probably use jQuery:

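    // The attribute value must be quoted, since it starts with a dot.
    // Note: .attr('href') returns the href of the first matching anchor only.
    jQuery('a[href$=".wmv"]').attr('href')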
    

    Compare this to chaos's simplified regex example, which (as stated) doesn't deal with fussy/complex markup, and you'll hopefully understand why a DOM parser is better than a regex for this type of problem.
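
    The jQuery one-liner above returns only the first matching href. As a rough sketch of collecting every .wmv link through the DOM without jQuery, something like the following could work. The helper name collectWmvLinks is hypothetical, and it assumes `root` is `document` or any object exposing querySelectorAll. Unlike the `$=` selector, the extension check here is case-insensitive.

```javascript
// Hypothetical helper, not part of any library.
// `root` is assumed to expose querySelectorAll (e.g. `document` in a browser).
function collectWmvLinks(root) {
  // Take every anchor that has an href attribute at all.
  const anchors = Array.from(root.querySelectorAll('a[href]'));
  return anchors
    .map((a) => a.getAttribute('href'))
    // Keep only hrefs that end in .wmv, ignoring case.
    .filter((href) => /\.wmv$/i.test(href));
}
```

    In a browser this would be called as collectWmvLinks(document); the parser has already dealt with quoting, comments, and attribute order, so none of that needs handling here.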

  • 2020-12-11 12:59

    Because HTML's syntactic rules are so loose, it's pretty difficult to do with any reliability (unless, say, you know for absolute certain that all your tags will use double quotes around their attribute values). Here's some fairly general regex-based code for the purpose:

    function extract_urls($html) {
        $urls = array();
        // Strip HTML comments so commented-out links are ignored.
        $html = preg_replace('/<!--.*?-->/s', '', $html);
        // href values in order: double-quoted, single-quoted, unquoted.
        $patterns = array(
            '/<a\s+[^>]*href="([^"]+)"[^>]*>/is',
            '/<a\s+[^>]*href=\'([^\']+)\'[^>]*>/is',
            '/<a\s+[^>]*href=([^"\'][^> ]*)[^>]*>/is',
        );
        foreach ($patterns as $pattern) {
            preg_match_all($pattern, $html, $matches);
            foreach ($matches[1] as $url) {
                // Decode &amp; and keep each .wmv URL once.
                $url = str_replace('&amp;', '&', trim($url));
                if (preg_match('/\.wmv\b/i', $url) && !in_array($url, $urls))
                    $urls[] = $url;
            }
        }
        return $urls;
    }
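
    The three quoting styles can also be handled in a single pass by using alternation inside one pattern. The following JavaScript is only a sketch of that idea, under the same assumptions as the PHP code (it is still a regex, not an HTML parser); the function name extractWmvUrls is made up for illustration.

```javascript
// Hypothetical helper illustrating one-pass extraction with alternation.
function extractWmvUrls(html) {
  // Drop HTML comments so commented-out links are ignored.
  const withoutComments = html.replace(/<!--[\s\S]*?-->/g, '');
  // Group 1: double-quoted value, group 2: single-quoted, group 3: unquoted.
  const hrefPattern =
    /<a\s+[^>]*?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/gi;
  const urls = [];
  for (const m of withoutComments.matchAll(hrefPattern)) {
    // Exactly one of the three groups is defined for each match.
    const raw = m[1] ?? m[2] ?? m[3];
    const url = raw.trim().replace(/&amp;/g, '&');
    // Keep .wmv URLs (a query string may follow) and skip duplicates.
    if (/\.wmv\b/i.test(url) && !urls.includes(url)) {
      urls.push(url);
    }
  }
  return urls;
}
```

    Unlike the PHP version, results come back in document order rather than grouped by quoting style.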
    
  • 2020-12-11 13:21

    Regex:

    <a\s+href\s*=\s*(?<quote>["'])(?<link>[^"']*\.wmv)\k<quote>\s*>(?<name>.*?)\s*</a>
    

    [Note: \s* is used in several places to match the extra whitespace that can occur in the HTML; \s+ after <a requires at least one space before href, and \k<quote> requires the closing quote to match the opening one.]

    Sample C# code:

    // Requires: using System.Text.RegularExpressions;

    /// <summary>
    /// Assigns proper values to link and name, if htmlATag matches the pattern
    /// Matches only for .wmv files
    /// </summary>
    /// <returns>true if success, false otherwise</returns>
    public static bool TryGetHrefDetailsWMV(string htmlATag, out string wmvLink, out string name)
    {
        wmvLink = null;
        name = null;
    
        // Verbatim string: backslashes are literal and "" is an escaped double quote.
        string pattern = @"<a\s+href\s*=\s*(?<quote>[""'])(?<link>[^""']*\.wmv)\k<quote>\s*>(?<name>.*?)\s*</a>";
    
        // Match once, case-insensitively, and reuse the result.
        Match m = Regex.Match(htmlATag, pattern, RegexOptions.IgnoreCase);
        if (!m.Success)
            return false;
    
        wmvLink = m.Groups["link"].Value;
        name = m.Groups["name"].Value;
        return true;
    }
    
    MyRegEx.TryGetHrefDetailsWMV("<td><a href='/path/to/file'>Name of File</a></td>", 
                    out wmvLink, out name); // No match
    MyRegEx.TryGetHrefDetailsWMV("<td><a href='/path/to/file.wmv'>Name of File</a></td>",
                    out wmvLink, out name); // Match
    MyRegEx.TryGetHrefDetailsWMV("<td><a    href='/path/to/file.wmv'   >Name of File</a></td>", out wmvLink, out name); // Match
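
    For comparison, here is a JavaScript sketch of the same named-group approach. It uses a backreference so that the closing quote has to be the same character as the opening one, escapes the dot before wmv, and matches the link text lazily. The constant and function names are hypothetical, chosen to mirror the C# method above.

```javascript
// Hypothetical names; mirrors the C# helper for illustration only.
const WMV_ANCHOR =
  /<a\s+href\s*=\s*(?<quote>["'])(?<link>[^"']*\.wmv)\k<quote>\s*>(?<name>.*?)\s*<\/a>/i;

// Returns { link, name } for the first matching anchor, or null if none matches.
function tryGetHrefDetailsWmv(htmlATag) {
  const m = WMV_ANCHOR.exec(htmlATag);
  if (m === null) {
    return null;
  }
  return { link: m.groups.link, name: m.groups.name };
}
```

    Like the C# pattern, this only recognizes anchors where href is the first and only attribute, so it suits tightly controlled markup rather than arbitrary pages.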
    