Regular Expression to Extract the URL out of the Anchor Tag


How can I extract the HTTP link from inside anchor tags? Only links to WMV files should be extracted.

3 Answers

抹茶落季

2020-12-11 12:55
I wouldn't do this with regex - I would probably use jQuery:
```
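// Returns the href of the first anchor whose href ends in ".wmv".
// The attribute value must be quoted because it contains a dot.
jQuery('a[href$=".wmv"]').attr('href')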
```
Compare this to chaos's simplified regex example, which (as stated) doesn't deal with fussy/complex markup, and you'll hopefully understand why a DOM parser is better than a regex for this type of problem.
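For readers without jQuery, the same DOM-based idea can be sketched in plain JavaScript. This is only an illustration, not part of the original answer: the function name `collectWmvLinks` and the `baseUrl` parameter are assumptions. It collects every matching link rather than just the first, and it checks the URL path so that links such as `clip.wmv?x=1` still qualify.

```javascript
// Collect the href of every anchor under `root` whose URL path ends in ".wmv".
// `root` is any object exposing querySelectorAll (for example `document`);
// `baseUrl` (hypothetical parameter) is used only to resolve relative links.
function collectWmvLinks(root, baseUrl) {
  const links = [];
  for (const a of root.querySelectorAll('a[href]')) {
    const href = a.getAttribute('href').trim();
    let path;
    try {
      // Parse the URL so query strings and fragments are ignored.
      path = new URL(href, baseUrl).pathname;
    } catch (e) {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    if (path.toLowerCase().endsWith('.wmv') && !links.includes(href)) {
      links.push(href);
    }
  }
  return links;
}
```

In a browser this might be called as `collectWmvLinks(document, location.href)`.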

终归单人心

2020-12-11 12:59

Because HTML's syntactic rules are so loose, it's pretty difficult to do with any reliability (unless, say, you know for absolute certain that all your tags will use double quotes around their attribute values). Here's some fairly general regex-based code for the purpose:

function extract_urls($html) {
    $urls = array();
    // Strip HTML comments first so commented-out links are ignored.
    $html = preg_replace('/<!--.*?-->/s', '', $html);
    preg_match_all('/<a\s+[^>]*href="([^"]+)"[^>]*>/is', $html, $matches);
    foreach($matches[1] as $url) {
        $url = str_replace('&amp;', '&', trim($url));
        if(preg_match('/\.wmv\b/i', $url) && !in_array($url, $urls))
            $urls[] = $url;
    }
    preg_match_all('/<a\s+[^>]*href=\'([^\']+)\'[^>]*>/is', $html, $matches);
    foreach($matches[1] as $url) {
        $url = str_replace('&amp;', '&', trim($url));
        if(preg_match('/\.wmv\b/i', $url) && !in_array($url, $urls))
            $urls[] = $url;
    }
    preg_match_all('/<a\s+[^>]*href=([^"\'][^> ]*)[^>]*>/is', $html, $matches);
    foreach($matches[1] as $url) {
        $url = str_replace('&amp;', '&', trim($url));
        if(preg_match('/\.wmv\b/i', $url) && !in_array($url, $urls))
            $urls[] = $url;
    }
    return $urls;
}
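
The three passes in the function above differ only in how the attribute value is quoted, so they can be folded into a single pattern with one alternative per quoting style. The following is a rough JavaScript sketch of that variation, not the answer's own code; the name `extractWmvUrls` is an assumption, and it shares the regex approach's limits on messy markup.

```javascript
// Extract unique .wmv URLs from anchor tags using one regex pass.
function extractWmvUrls(html) {
  // Drop HTML comments so commented-out links are ignored.
  const cleaned = html.replace(/<!--[\s\S]*?-->/g, '');
  // One alternative per quoting style: "double", 'single', or unquoted.
  const anchor = /<a\s+[^>]*?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/gi;
  const urls = [];
  for (const m of cleaned.matchAll(anchor)) {
    // Exactly one of the three groups is defined for each match.
    const raw = m[1] ?? m[2] ?? m[3];
    const url = raw.trim().replace(/&amp;/g, '&');
    if (/\.wmv\b/i.test(url) && !urls.includes(url)) {
      urls.push(url);
    }
  }
  return urls;
}
```

Like the PHP version, this decodes only `&amp;` and would also pick up attributes whose names merely end in `href`.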


北海茫月

2020-12-11 13:21

Regex:

<a\\s+href\\s*=\\s*([\"'])(?<link>[^\"']*\\.wmv)\\1\\s*>(?<name>.*?)\\s*</a>

[Note: the pattern is shown escaped as a C# string literal. \s* is used in several places to match the extra whitespace that can occur in the HTML. The dot before wmv is escaped so it matches a literal dot, and \1 requires the closing quote to be the same character as the opening one.]

Sample C# code:

/// <summary>
/// Assigns values to wmvLink and name if htmlATag matches the pattern.
/// Matches only .wmv files.
/// </summary>
/// <returns>true on success, false otherwise</returns>
public static bool TryGetHrefDetailsWMV(string htmlATag, out string wmvLink, out string name)
{
    wmvLink = null;
    name = null;

    string pattern = "<a\\s+href\\s*=\\s*([\"'])(?<link>[^\"']*\\.wmv)\\1\\s*>(?<name>.*?)\\s*</a>";

    // Match once, case-insensitively, and reuse the result for both groups.
    Match m = Regex.Match(htmlATag, pattern, RegexOptions.IgnoreCase);
    if (!m.Success)
        return false;

    wmvLink = m.Groups["link"].Value;
    name = m.Groups["name"].Value;
    return true;
}

string wmvLink, name;
MyRegEx.TryGetHrefDetailsWMV("<td><a href='/path/to/file'>Name of File</a></td>",
                out wmvLink, out name); // No match
MyRegEx.TryGetHrefDetailsWMV("<td><a href='/path/to/file.wmv'>Name of File</a></td>",
                out wmvLink, out name); // Match
MyRegEx.TryGetHrefDetailsWMV("<td><a    href='/path/to/file.wmv'   >Name of File</a></td>",
                out wmvLink, out name); // Match
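
For comparison, the named-group technique used by TryGetHrefDetailsWMV can be sketched in JavaScript as well. This is an illustrative variant under assumptions rather than a translation of the answer: the name `getHrefDetailsWmv` is made up, the dot before `wmv` is escaped, and a backreference makes the closing quote match the opening one. It returns `null` where the C# method returns false.

```javascript
// Return { link, name } for a single <a> tag pointing at a .wmv file, or null.
function getHrefDetailsWmv(htmlATag) {
  // (["']) captures the opening quote and \1 requires the same closing quote;
  // (?<link>...) and (?<name>...) are named capture groups.
  const pattern =
    /<a\s+href\s*=\s*(["'])(?<link>[^"']*\.wmv)\1\s*>(?<name>.*?)\s*<\/a>/is;
  const m = pattern.exec(htmlATag);
  return m ? { link: m.groups.link, name: m.groups.name } : null;
}
```

As with the C# pattern, `href` must be the first and only attribute of the tag for a match.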
