Html doesn't get updated with Html Agility Pack

感情迁移 posted on 2020-05-29 11:50:40

Question


I'm trying to remove the img and map elements from a piece of HTML.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

var oldHtml = doc.DocumentNode.InnerHtml;

if (doc.DocumentNode.SelectNodes("//img[@usemap]") != null)
{
    HtmlNode img = doc.DocumentNode.SelectSingleNode("//img[@usemap]");
    img.ParentNode.RemoveChild(img);
}

if (doc.DocumentNode.SelectNodes("//map") != null)
{
    HtmlNode map = doc.DocumentNode.SelectSingleNode("//map");
    map.ParentNode.RemoveChild(map);
}

var newHtml = doc.DocumentNode.InnerHtml;

newHtml still contains the img and map elements. Do I need to do something else before the HTML is updated?

Here is the HTML that I'm trying to strip:

<p><img src="/media/8301/HD00_498x299.jpg"  width="498"  height="299" alt="HD00.JPG" usemap="#imgmap201392714219"/><br />
<br />
 <a title="Download ZIP DWG"
href="/media/8103/detailtekeningen-dwg-unidek-aero.zip"
target="_blank">Klik hier om alle DWG&nbsp;bestanden in
een&nbsp;zipfile te downloaden.</a><br />
 <a title="Download DXF"
href="/media/8104/detailtekeningen-dxf-unidek-aero.zip"
target="_blank">Klik hier om alle DXF bestanden in een zipfile te
downloaden.</a><br />
 <a title="Download PDF"
href="/media/8116/detailtekeningen-pdf-unidek-aero.zip"
target="_blank">Klik hier om alle PDF bestanden in een zipfile te
downloaden.</a><br />
<br />
 <strong><a title="Bouwdetails berekende psi-waarden"
href="/{localLink:8014}" target="_blank">Link naar de technische
bouwdetails met verbeterde eigen ψ-waarden<br />
</a></strong> &nbsp;<map name="imgmap2012104102243"
id="imgmap2012104102243">
<area title="" href="/nl/producten/hellend-dak/unidek-aero/1"
shape="rect" coords="194,419,219,439" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/2"
shape="rect" coords="221,420,246,439" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/3"
shape="rect" coords="200,302,226,320" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/4"
shape="rect" coords="209,167,234,185" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/6"
shape="rect" coords="68,46,98,67" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/7"
shape="rect" coords="102,203,129,224" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/8"
shape="rect" coords="273,339,302,360" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/9"
shape="rect" coords="387,350,417,372" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/10"
shape="rect" coords="324,341,354,363" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/11"
shape="rect" coords="223,369,252,390" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/12"
shape="rect" coords="62,270,89,294" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/13"
shape="rect" coords="93,270,119,294" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/14"
shape="rect" coords="31,94,60,114" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/14"
shape="rect" coords="79,161,106,182" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/15"
shape="rect" coords="19,150,50,171" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/15"
shape="rect" coords="82,113,110,134" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/16"
shape="rect" coords="176,231,205,253" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/17"
shape="rect" coords="147,179,176,200" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/18"
shape="rect" coords="139,235,166,257" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/19"
shape="rect" coords="204,56,231,78" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/20"
shape="rect" coords="125,135,153,157" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/21"
shape="rect" coords="265,263,290,284" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/23"
shape="rect" coords="9,202,36,225" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/24"
shape="rect" coords="39,202,65,225" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/25"
shape="rect" coords="158,80,184,101" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/26"
shape="rect" coords="188,80,213,102" target="_blank" alt="" />
</map><map id="imgmap201392714219">
<area title="" href="/nl/producten/hellend-dak/unidek-aero/1"
shape="rect" coords="265,463,279,480" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/2"
shape="rect" coords="282,466,297,480" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/3"
shape="rect" coords="213,339,237,358" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/4"
shape="rect" coords="206,204,227,220" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/6"
shape="rect" coords="113,105,135,121" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/7"
shape="rect" coords="134,246,154,262" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/8"
shape="rect" coords="299,369,319,386" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/9"
shape="rect" coords="432,409,453,425" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/10"
shape="rect" coords="363,394,385,413" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/11"
shape="rect" coords="254,406,276,422" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/12"
shape="rect" coords="105,298,122,314" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/13"
shape="rect" coords="122,298,139,314" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/14"
shape="rect" coords="53,121,77,139" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/15"
shape="rect" coords="49,165,72,182" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/16"
shape="rect" coords="195,272,214,288" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/17"
shape="rect" coords="152,212,175,230" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/18"
shape="rect" coords="160,276,180,293" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/19"
shape="rect" coords="234,88,255,105" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/20"
shape="rect" coords="132,155,158,174" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/21"
shape="rect" coords="299,294,321,311" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/23"
shape="rect" coords="40,234,55,250" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/24"
shape="rect" coords="56,233,73,251" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/25"
shape="rect" coords="185,108,202,127" target="_blank" alt="" />
<area title="" href="/nl/producten/hellend-dak/unidek-aero/26"
shape="rect" coords="203,109,219,127" target="_blank" alt="" />
</map></p>

When I debug, the img and map elements are found, but calling RemoveChild doesn't change the HTML at all. Changing an attribute or anything else has no effect either.


Answer 1:


This works for me:

var doc = new HtmlDocument();
doc.LoadHtml(html);

var root = doc.DocumentNode;
if (root != null)
{
    var replace = false;

    var images = root.SelectNodes("//img[@usemap]");
    if (images != null)
    {
        foreach (var image in images)
        {
            image.ParentNode.RemoveChild(image);
        }

        replace = true;
    }

    if (replace)
    {
        html = root.OuterHtml;
    }
}

var newhtml = html;

The image is removed from the html.




Answer 2:


I've just discovered that there is a bug in Html Agility Pack: you can only read .InnerHtml once. After that, it does not update. You are reading it twice:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

var oldHtml = doc.DocumentNode.InnerHtml;

if (doc.DocumentNode.SelectNodes("//img[@usemap]") != null)
{
    HtmlNode img = doc.DocumentNode.SelectSingleNode("//img[@usemap]");
    img.ParentNode.RemoveChild(img);
}

if (doc.DocumentNode.SelectNodes("//map") != null)
{
    HtmlNode map = doc.DocumentNode.SelectSingleNode("//map");
    map.ParentNode.RemoveChild(map);
}

var newHtml = doc.DocumentNode.InnerHtml;

If you get rid of this line:

var oldHtml = doc.DocumentNode.InnerHtml;

it should work. This appears to be a bug in HtmlAgilityPack.

Sniffdk's solution works because he only gets .OuterHtml once. The HtmlAgilityPack maintainers need to fix that.
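
If you still want a "before" value to compare against, one option is to keep the input string itself instead of reading a serialized property before the edit. The following is only a minimal sketch under the assumption that the behavior described in this answer applies to your HtmlAgilityPack version; the class and method names (BeforeAfterDemo, StripImageMapImage) are made up for illustration and are not part of the library.

```csharp
using HtmlAgilityPack;

public static class BeforeAfterDemo
{
    // Hypothetical helper: removes the first <img usemap="..."> and returns
    // the original input together with the serialized result.
    public static (string Before, string After) StripImageMapImage(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Use the input string as the "before" snapshot, so neither InnerHtml
        // nor OuterHtml is read before the document is modified.
        var before = html;

        // SelectSingleNode returns null when nothing matches.
        var img = doc.DocumentNode.SelectSingleNode("//img[@usemap]");
        if (img != null)
        {
            img.ParentNode.RemoveChild(img);
        }

        // Serialize exactly once, after the edit.
        var after = doc.DocumentNode.OuterHtml;
        return (before, after);
    }
}
```

Usage would look like `var (oldHtml, newHtml) = BeforeAfterDemo.StripImageMapImage(html);`. Note that the "before" value here is the raw input, which may differ slightly from what the parser would serialize.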




Answer 3:


So far, I need to do this in Umbraco before the Html Agility Pack works:

var documents = Document.GetDocumentsOfDocumentType(5125);
var document = documents.Where(x => x.Id == 5127).First();

var html = document.getProperty("content").Value.ToString();
html = html.Replace("\r\n", "");
html = umbraco.library.RemoveFirstParagraphTag(html);

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);



Answer 4:


It seems that HtmlAgilityPack doesn't update the HtmlDocument.DocumentNode.InnerHtml property after nodes are removed. The easiest workaround is to use the OuterHtml property instead of InnerHtml:

var newHtml = doc.DocumentNode.OuterHtml;

Until now I have always used the OuterHtml property to check whether my changes produce the expected result, and I only just noticed this behavior of InnerHtml.

UPDATE :

The HTML sample you posted contains two <map> elements. Your code only removes one. Try this to remove all <img> and <map> nodes:

if (doc.DocumentNode.SelectNodes("//img[@usemap]") != null)
{
    HtmlNodeCollection imgs = doc.DocumentNode.SelectNodes("//img[@usemap]");
    foreach (HtmlNode img in imgs)
    {
        img.ParentNode.RemoveChild(img);
    }
}

if (doc.DocumentNode.SelectNodes("//map") != null)
{
    HtmlNodeCollection maps = doc.DocumentNode.SelectNodes("//map");
    foreach (HtmlNode map in maps)
    {
        map.ParentNode.RemoveChild(map);
    }
}
var newHtml = doc.DocumentNode.OuterHtml;

[.NET Fiddle demo]
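
The two loops above follow the same pattern, so they could be folded into one small helper. This is only a sketch of that idea, not something from the original answer: the HtmlCleaner class and its RemoveNodes method are hypothetical names, and the code assumes the HtmlAgilityPack NuGet package is referenced. It uses HtmlNode.Remove(), which detaches a node from its parent, in place of ParentNode.RemoveChild.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HtmlCleaner
{
    // Hypothetical helper: removes every node matched by any of the given
    // XPath expressions and returns the resulting markup.
    public static string RemoveNodes(string html, params string[] xpaths)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (var xpath in xpaths)
        {
            // SelectNodes returns null (not an empty collection) when nothing matches.
            var matches = doc.DocumentNode.SelectNodes(xpath);
            if (matches == null)
            {
                continue;
            }

            // Iterate over a copy so the loop does not depend on the collection
            // that is being modified.
            foreach (var node in matches.ToList())
            {
                node.Remove();
            }
        }

        // Read OuterHtml once, after all edits, as suggested in this answer.
        return doc.DocumentNode.OuterHtml;
    }
}
```

For the markup in the question, the call would be `var newHtml = HtmlCleaner.RemoveNodes(html, "//img[@usemap]", "//map");`.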



Source: https://stackoverflow.com/questions/25784761/html-doesnt-get-updated-with-html-agility-pack
