Iterate through an HTML string to find all img tags and replace the src attribute values

你的背包 2021-01-01 04:59

I have HTML code as a string. I need to find all img tags in that string, read the value of each src attribute, and pass it to a function. That function returns an entire

2 Answers
  • 2021-01-01 05:37

    If I understand your need correctly, you can use HtmlAgilityPack for this purpose. Using regex on HTML may cause unwanted behavior. Can you try the code below?

    // Requires: using System; using System.IO; using System.Linq; using System.Net;
    //           using System.Net.Mail; using System.Net.Mime; using HtmlAgilityPack;
    public static string DoIt()
    {
        string htmlString;
        using (WebClient client = new WebClient())
        {
            // Example page containing base64 img src values; replace with your own source.
            htmlString = client.DownloadString("http://dean.edwards.name/my/base64-ie.html");
        }

        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(htmlString);

        // Select only the img tags whose src is an inline data URI.
        var dataUriImages = document.DocumentNode.Descendants("img")
            .Where(e => e.GetAttributeValue("src", "").StartsWith("data:image"))
            .ToList();

        foreach (HtmlNode img in dataUriImages)
        {
            string currentSrcValue = img.GetAttributeValue("src", "");
            string base64Part = currentSrcValue.Split(',')[1]; // Base64 part of the data URI
            byte[] imageData = Convert.FromBase64String(base64Part);
            string contentId = Guid.NewGuid().ToString();
            LinkedResource inline = new LinkedResource(new MemoryStream(imageData), "image/jpeg");
            inline.ContentId = contentId;
            inline.TransferEncoding = TransferEncoding.Base64;
            // Keep `inline` and add it to your AlternateView.LinkedResources so the cid can be resolved.

            img.SetAttributeValue("src", "cid:" + inline.ContentId);
        }

        // The original snippet computed this value but never returned it.
        return document.DocumentNode.OuterHtml;
    }
    

    You can retrieve HtmlAgilityPack from https://www.nuget.org/packages/HtmlAgilityPack
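
    The loop above creates a LinkedResource per image, but a cid: reference only resolves if that resource travels with the message. As a minimal sketch of that last step, assuming you collect the resources in a list while rewriting the HTML, something like the following could build the mail. The class name InlineImageMail and the method Build are hypothetical names chosen for illustration; only the System.Net.Mail types come from the framework.

    ```csharp
    using System.Collections.Generic;
    using System.Net.Mail;
    using System.Net.Mime;

    // Hypothetical helper, not part of the answer's code.
    public static class InlineImageMail
    {
        // Builds a MailMessage whose HTML view carries the given linked resources.
        // `html` is assumed to already contain src="cid:<ContentId>" for each resource.
        public static MailMessage Build(string from, string to, string subject,
                                        string html, IEnumerable<LinkedResource> resources)
        {
            AlternateView view = AlternateView.CreateAlternateViewFromString(
                html, null, MediaTypeNames.Text.Html);

            foreach (LinkedResource resource in resources)
            {
                view.LinkedResources.Add(resource);
            }

            MailMessage message = new MailMessage(from, to) { Subject = subject };
            message.AlternateViews.Add(view);
            return message;
        }
    }
    ```

    You would call it with the rewritten HTML and the collected resources, for example InlineImageMail.Build(sender, recipient, "Report", result, resources), and pass the returned message to your SmtpClient.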

    Hope this helps

  • 2021-01-01 05:44

    I think you need to run your code once for each img fetched from the string. The following code gives you a list of the src values of all the img tags:

    // Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
    public static List<string> FetchImgsFromSource(string htmlSource)
    {
        List<string> listOfImgdata = new List<string>();
        // Group 1 captures the src value of each img tag.
        string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
        MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        foreach (Match m in matchesImgSrc)
        {
            string src = m.Groups[1].Value;
            listOfImgdata.Add(src);
        }
        return listOfImgdata;
    }
    

    Use this list with your own logic in a loop:

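    List<string> listOfImgdata = FetchImgsFromSource(htmlBody);
    foreach (var item in listOfImgdata)
    {
        // Only inline data URIs carry base64 data; skip ordinary URLs.
        if (!item.StartsWith("data:image")) continue;
        // Strip the "data:image/...;base64," prefix before decoding.
        var imageData = Convert.FromBase64String(item.Substring(item.IndexOf(',') + 1));
        var contentId = Guid.NewGuid().ToString();
        LinkedResource inline = new LinkedResource(new MemoryStream(imageData), "image/jpeg");
        inline.ContentId = contentId;
        inline.TransferEncoding = TransferEncoding.Base64;
        // Replace only this image's src value, so each img tag gets its own content id.
        htmlBody = htmlBody.Replace(item, "cid:" + inline.ContentId);
    }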
    

    Hope it works for you.

    Also, the best way to parse the HTML DOM is to use HtmlAgilityPack, as mentioned by others.
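
    If you do stay with regex, the "pass each src to a function" part of the question can be sketched with Regex.Replace and a match evaluator, which rewrites each img tag individually instead of replacing all of them at once. This is only a sketch under some assumptions: the src value is quoted, and the class name ImgSrcRewriter, the method Rewrite, and the transform delegate are hypothetical names standing in for your own function. It shares the usual limits of regex on HTML, so it will not cope with every markup variation.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    // Hypothetical helper, not part of the answer's code.
    public static class ImgSrcRewriter
    {
        // Matches an img tag up to and including a quoted src value.
        // Named groups keep the text before the value and the quote character used.
        private static readonly Regex ImgSrc = new Regex(
            @"(?<before><img\b[^>]*?\ssrc\s*=\s*)(?<quote>[""'])(?<src>.*?)\k<quote>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Calls `transform` once per matched img tag and substitutes its result for the src value.
        public static string Rewrite(string html, Func<string, string> transform)
        {
            return ImgSrc.Replace(html, m =>
                m.Groups["before"].Value
                + m.Groups["quote"].Value
                + transform(m.Groups["src"].Value)
                + m.Groups["quote"].Value);
        }
    }
    ```

    A call such as ImgSrcRewriter.Rewrite(htmlBody, src => "cid:" + RegisterImage(src)) would then swap every src for whatever your function returns, where RegisterImage is a placeholder for the code that creates the LinkedResource and returns its content id.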
