How to clean up poorly formed HTML using HTML Agility Pack

Submitted by 瘦欲@ on 2019-12-12 07:57:04

Question


I am attempting to replace the god-awful collection of regular expressions currently used to clean up blocks of poorly formed HTML, and I stumbled upon the HTML Agility Pack for C#. It looks very powerful, yet I couldn't find an example of the use I have in mind, which I would expect to be built in. I am probably just missing a suitable method in the documentation.

Let me explain. Say I have the following HTML:

<p class="someclass">
    <font size="3">
        <font face="Times New Roman">
            this is some text
            <a href="somepage.html">Some link</a>
        </font>
    </font>
</p>

... that I want to look like:

<p>
    this is some text
    <a href="somepage.html">Some link</a>
</p>

When I use the HtmlNode.Remove() method, it removes the node plus all its children. Is there a way to remove the node while preserving the children?


Answer 1:


On HtmlNode, the method RemoveChild has this overload:

public HtmlNode RemoveChild(HtmlNode oldChild, bool keepGrandChildren);

So this is how you would do it:

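HtmlDocument doc = new HtmlDocument();
doc.Load("yourfile.htm");

// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection fonts = doc.DocumentNode.SelectNodes("//font");
if (fonts != null)
{
    foreach (HtmlNode font in fonts)
    {
        // true = keep the grandchildren and reattach them to the parent
        font.ParentNode.RemoveChild(font, true);
    }
}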

EDIT: It looks like the RemoveChild overload with the keepGrandChildren option is not working as expected, so here is an alternate implementation:

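// Requires: using System; using System.Collections.Generic;
//           using System.Linq; using HtmlAgilityPack;
public static HtmlNode RemoveChild(HtmlNode parent, HtmlNode oldChild, bool keepGrandChildren)
{
    if (oldChild == null)
        throw new ArgumentNullException("oldChild");

    if (oldChild.HasChildNodes && keepGrandChildren)
    {
        HtmlNode prev = oldChild.PreviousSibling;
        // Snapshot the children so the live collection is not modified while iterating.
        List<HtmlNode> nodes = new List<HtmlNode>(oldChild.ChildNodes.Cast<HtmlNode>());
        // Sort in reverse document order: each node is inserted right after prev,
        // so inserting the last child first leaves them in their original order.
        nodes.Sort(new StreamPositionComparer());
        foreach (HtmlNode grandchild in nodes)
        {
            parent.InsertAfter(grandchild, prev);
        }
    }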
    parent.RemoveChild(oldChild);
    return oldChild;
}

// This helper class sorts nodes by their position in the file, last node first
// (x and y are deliberately swapped in the comparison).
private class StreamPositionComparer : IComparer<HtmlNode>
{
    int IComparer<HtmlNode>.Compare(HtmlNode x, HtmlNode y)
    {
        return y.StreamPosition.CompareTo(x.StreamPosition);
    }
}
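
The same idea can also be sketched without the comparer, by inserting each child before the node being removed instead of after its previous sibling. The following is only a sketch under assumptions: the class name FontStripper and the method Unwrap are hypothetical names chosen for illustration, not part of HTML Agility Pack, and the behaviour has not been checked against every version of the library.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class FontStripper
{
    // Hypothetical helper (not part of HTML Agility Pack): removes every
    // element matched by the XPath expression but keeps its children,
    // in their original order, at the position the element occupied.
    public static string Unwrap(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection matches = doc.DocumentNode.SelectNodes(xpath);
        if (matches == null)
            return doc.DocumentNode.OuterHtml;

        foreach (HtmlNode node in matches)
        {
            HtmlNode parent = node.ParentNode;
            if (parent == null)
                continue;

            // Copy the child list first, then move each child in front of
            // the node; inserting before the node preserves document order.
            foreach (HtmlNode child in node.ChildNodes.ToList())
            {
                parent.InsertBefore(child, node);
            }

            // The node's former children now belong to the parent,
            // so only the wrapper element itself is removed here.
            parent.RemoveChild(node);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling FontStripper.Unwrap(html, "//font") on the markup from the question should leave the text and the link directly inside the paragraph. The class attribute on the paragraph is not touched by this sketch and would need to be removed separately.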



Answer 2:


You could try using AngleSharp instead.

var parser = new HtmlParser();
// In AngleSharp 0.10 and later this method is named ParseDocument.
var document = parser.Parse(html);

using (var writer = new StringWriter())
{
    document.ToHtml(writer, new PrettyMarkupFormatter());
    return writer.ToString();
}



Answer 3:


Once you find the <font> element, read its text through the InnerText property, then remove the element and insert the text in its place.



Source: https://stackoverflow.com/questions/5372856/how-to-clean-up-poorly-formed-html-using-html-agility-pack
