How do you parse an HTML string for image tags to get at the SRC information?

清酒与你 2020-12-08 20:26

Currently I use .NET's WebBrowser.Document.Images() to do this. It requires the WebBrowser to load the document, which is messy and takes up resources.

4 Answers
  • 2020-12-08 20:35

    The big issue with any HTML parsing is the "well formed" part. You've seen the crap HTML out there - how much of it is really well formed? I needed to do something similar: parse out all the links in a document and, in my case, update each one with a rewritten link. I found the Html Agility Pack over on CodePlex. It rocks (and handles malformed HTML).

    Here's a snippet for iterating over links in a document:

    // Requires: using HtmlAgilityPack;
    HtmlDocument doc = new HtmlDocument();
    doc.Load(@"C:\Sample.HTM");
    // Select every <a> element that has an href attribute.
    HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
    
    // SelectNodes returns null when nothing matches, so run only if there are links.
    if (linkNodes != null)
    {
        foreach (HtmlNode linkNode in linkNodes)
        {
            HtmlAttribute attrib = linkNode.Attributes["href"];
            // Do whatever else you need here
        }
    }
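
    The same approach should carry over to the image case in the question. Here is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class and method names (ImageSourceExtractor, GetImageSources) are made up for illustration and are not part of the library:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class ImageSourceExtractor
{
    // Hypothetical helper: returns the src value of every <img> that has one.
    public static List<string> GetImageSources(string html)
    {
        var sources = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses an in-memory string; no file or browser control needed

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection imgNodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (imgNodes == null)
        {
            return sources;
        }

        foreach (HtmlNode img in imgNodes)
        {
            // The second argument is the fallback used if the attribute is missing.
            sources.Add(img.GetAttributeValue("src", string.Empty));
        }
        return sources;
    }
}
```

    Calling ImageSourceExtractor.GetImageSources(html) on the raw page text gives you the src strings as written in the markup, so relative paths stay relative.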
    


  • 2020-12-08 20:39

    If it's valid XHTML, you could do this:

    XmlDocument doc = new XmlDocument();
    doc.LoadXml(html);
    XmlNodeList results = doc.SelectNodes("//img/@src");
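
    One thing to watch for: real XHTML usually declares the default namespace http://www.w3.org/1999/xhtml, and a plain //img will not match namespaced elements. A rough sketch that handles both cases follows; the class name, method name and the "x" prefix are arbitrary choices for illustration:

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XhtmlImageSources
{
    // Hypothetical helper: collects img/@src values from well-formed XHTML.
    public static List<string> GetImageSources(string xhtml)
    {
        var doc = new XmlDocument();
        doc.XmlResolver = null; // do not fetch external DTDs over the network
        doc.LoadXml(xhtml);     // throws XmlException if the input is not well-formed

        // Map an arbitrary prefix to the XHTML namespace so XPath can see those elements.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        var sources = new List<string>();
        // The union matches <img> both with and without the XHTML namespace.
        XmlNodeList attrs = doc.SelectNodes("//x:img/@src | //img/@src", ns);
        foreach (XmlNode attr in attrs)
        {
            sources.Add(attr.Value);
        }
        return sources;
    }
}
```

    This still only works on well-formed input; anything with unclosed tags will throw in LoadXml.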
    
  • 2020-12-08 20:48

    If your input string is valid XHTML you can treat it as XML, load it into an XmlDocument, and do XPath magic :) But that's not always the case.

    Otherwise you can try this function, which returns all the image links found in htmlSource:

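    // Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
    public List<Uri> FetchLinksFromSource(string htmlSource)
    {
        List<Uri> links = new List<Uri>();
        // Group 1 captures the value of the src attribute.
        string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
        MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        foreach (Match m in matchesImgSrc)
        {
            string src = m.Groups[1].Value;
            // new Uri(src) throws on relative paths such as "images/a.png",
            // so accept relative and absolute values and skip anything unparsable.
            Uri uri;
            if (Uri.TryCreate(src, UriKind.RelativeOrAbsolute, out uri))
            {
                links.Add(uri);
            }
        }
        return links;
    }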
    

    And you can use it like this:

    HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.example.com");
    request.Credentials = System.Net.CredentialCache.DefaultCredentials;
    // Dispose the response as well as the reader so the connection is released.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        if (response.StatusCode == HttpStatusCode.OK)
        {
            using (StreamReader sr = new StreamReader(response.GetResponseStream()))
            {
                List<Uri> links = FetchLinksFromSource(sr.ReadToEnd());
            }
        }
    }
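
    Since src values are often relative ("images/logo.png", "/img/a.gif"), you will probably want to resolve them against the address the page was fetched from. A small sketch of that step, using the framework's Uri.TryCreate(Uri, string, out Uri) overload; the ImageUriResolver class and ResolveAll method are hypothetical names, not part of the answer above:

```csharp
using System;
using System.Collections.Generic;

public static class ImageUriResolver
{
    // Hypothetical helper: turns raw src strings into absolute URIs.
    public static List<Uri> ResolveAll(Uri baseUri, IEnumerable<string> srcValues)
    {
        var result = new List<Uri>();
        foreach (string src in srcValues)
        {
            Uri absolute;
            // Combines baseUri with a relative src; an absolute src is kept as it is.
            if (Uri.TryCreate(baseUri, src, out absolute))
            {
                result.Add(absolute);
            }
        }
        return result;
    }
}
```

    Here baseUri would be the page address (for example response.ResponseUri in the snippet above), and srcValues the strings pulled out of the markup.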
    
  • 2020-12-08 20:53

    If all you need is images I would just use a regular expression. Something like this should do the trick:

    // [^>]*? keeps the match inside a single <img ...> tag; assumes a double-quoted src.
    Regex rg = new Regex(@"<img[^>]*?src=""(.*?)""", RegexOptions.IgnoreCase);
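
    To actually pull the values out, you would loop over the matches and read the first capture group. A minimal sketch, assuming double-quoted src attributes and using [^>] so a match cannot run past the end of the tag; the class and method names are invented for the example:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexImageSources
{
    // Assumption: src is double-quoted. Single-quoted or unquoted values are not matched.
    private static readonly Regex ImgSrc =
        new Regex(@"<img[^>]*?src=""(.*?)""", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the captured src value of each matching <img> tag.
    public static List<string> GetImageSources(string html)
    {
        var sources = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
        {
            sources.Add(m.Groups[1].Value); // group 1 is the text between the quotes
        }
        return sources;
    }
}
```

    This is fine for quick jobs on markup you control, but it will miss or mis-capture things a real parser handles, such as a data-src attribute appearing before src.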
    