HTML Agility pack removes break tag close

蹲街弑〆低调 posted on 2019-12-21 07:03:20

Question


I am creating an HTML document using the HTML Agility Pack. I load a template file, then append content to it. All of this works, but when I view the output file, the closing slash has been removed from my <br/> tags, so they come out as <br>. What is causing this?

Dim doc As New HtmlDocument()
doc.Load(Server.MapPath("Template.htm"))

Dim title As HtmlNode = doc.DocumentNode.SelectSingleNode("//title")

title.InnerHtml = title.InnerHtml & "CEU Classes"
Dim topContent As HtmlAgilityPack.HtmlNode = doc.GetElementbyId("topContent")

topContent.InnerHtml = html.ToString
doc.OptionWriteEmptyNodes = True
doc.Save(outputFileName, Encoding.UTF8)

More info:

It was also removing the closing of my image tags. After I added doc.OptionWriteEmptyNodes = True, it quit doing that.

Update

This is my code as it stands now. It still removes the closing of the BR tag:

Dim html As String = "Words<br/>more words"
Dim doc As New HtmlDocument()
Dim title As HtmlNode
Dim topContent As HtmlNode

HtmlNode.ElementsFlags("br") = HtmlElementFlag.Empty
doc.Load(Server.MapPath("Template.htm"))

title = doc.DocumentNode.SelectSingleNode("//title")
title.InnerHtml = title.InnerHtml & "CEU Classes"

topContent = doc.GetElementbyId("topContent")
topContent.InnerHtml = html.ToString

doc.OptionWriteEmptyNodes = True
doc.Save(outputFileName, Encoding.UTF8)

Update 2

I ended up reading my template file in as a plain string, inserting the content into it, and then loading the HTML like this:

Dim TemplateHTML As String = File.ReadAllText(Server.MapPath("Template.htm"))

TemplateHTML = TemplateHTML.Insert(TemplateHTML.IndexOf("<div id=""topContent"">") + "<div id=""topContent"">".Length, _
                                   html.ToString)

doc.LoadHtml(TemplateHTML)
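
One caveat with the string approach above: IndexOf returns -1 when the marker is not found, and the Insert call would then put the content at the wrong position without any error. Below is a minimal C# sketch of the same idea with that case guarded. TemplateFiller and InsertAfterMarker are hypothetical names used only for illustration; they are not part of the Html Agility Pack.

```csharp
using System;

// Hypothetical helper (not part of any library): inserts content
// right after a literal marker string inside a template.
public static class TemplateFiller
{
    public static string InsertAfterMarker(string template, string marker, string content)
    {
        // Ordinal comparison: the marker is matched byte-for-byte, not by culture rules.
        int start = template.IndexOf(marker, StringComparison.Ordinal);
        if (start < 0)
        {
            // Fail loudly instead of inserting at a bogus offset.
            throw new ArgumentException("Marker not found in template: " + marker, nameof(marker));
        }

        // Insert immediately after the end of the marker.
        return template.Insert(start + marker.Length, content);
    }
}
```

For example, calling TemplateFiller.InsertAfterMarker with the template <div id="topContent"></div>, the marker <div id="topContent"> and the content Words<br/>more words returns <div id="topContent">Words<br/>more words</div>. The match is literal, so it breaks if the template writes the tag with different attribute order, quoting, or spacing.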

Answer 1:


It happens because the Html Agility Pack handles BR in a special way. It still supports the old HTML 3.2 syntax (which is still found on the web today), where BR can be declared without any closing tag at all. Browsers still handle that gracefully too, by the way.

To change this default behavior, you need to modify the static HtmlNode.ElementsFlags collection, like this:

Dim doc As New HtmlDocument()
HtmlNode.ElementsFlags("br") = HtmlElementFlag.Empty
doc.LoadHtml("<test>before<br/>after</test>")
doc.OptionWriteEmptyNodes = True   
doc.Save(Console.Out)

which will display:

<test>before<br />after</test>



Answer 2:


As per @Simon Mourier, the following C# code works in version 1.4:

var doc = new HtmlDocument();
HtmlNode.ElementsFlags["br"] = HtmlElementFlag.Empty;
doc.OptionWriteEmptyNodes = true;
doc.LoadHtml("Lorem ipsum dolor sit<br/>Lorem ipsum dolor sit");

var postParsed = doc.DocumentNode.WriteTo();

After this, postParsed has the following string value:

"Lorem ipsum dolor sit<br />Lorem ipsum dolor sit"



Answer 3:


This seems to be a standard setting in the Html Agility Pack. By default, its output does not conform to XHTML, and many tags are not closed.

There are two ways to change this. At the document level, you can do the following, which turns on closing for ALL tags. (This is my preferred method.)

HtmlDocument doc = new HtmlDocument();
doc.OptionWriteEmptyNodes = true;
doc.LoadHtml(content);

However, this may not be desirable. There is another way to do it, at the node level:

// The indexer adds the key if it is missing and overwrites it otherwise,
// so no ContainsKey check is needed.
HtmlNode.ElementsFlags["img"] = HtmlElementFlag.Closed;



Answer 4:


I encountered the same kind of problem and solved it by manually re-parsing the HTML chunk with a new HtmlDocument object that has the correct settings.

The problem, as I see it, is that HtmlDocument has all those nice settings to let you close tags and so on, but when you select a node or do some other sort of operation with nodes and then use their OuterHtml or InnerHtml, some of those closing tags are lost (probably because those properties do not use the same settings as the document itself, or maybe there is some other reason). So when you get that incorrect HTML string from InnerHtml or OuterHtml, you can just re-parse it with HtmlDocument again and use document.DocumentNode.InnerHtml to get the correct HTML string.
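
The re-parse step could look roughly like the C# sketch below. It assumes the HtmlAgilityPack package is referenced. FragmentNormalizer and Normalize are hypothetical names used only for illustration. It reads the result back with DocumentNode.WriteTo(), as in Answer 2, rather than InnerHtml, because WriteTo() is the call this thread shows honoring the settings.

```csharp
using HtmlAgilityPack;

// Hypothetical helper (not part of the Html Agility Pack): re-parses an
// HTML fragment and re-serializes it with empty nodes written as <br />.
public static class FragmentNormalizer
{
    public static string Normalize(string fragment)
    {
        // Static, process-wide setting: treat <br> as an empty element
        // (same setting as in Answers 1 and 2).
        HtmlNode.ElementsFlags["br"] = HtmlElementFlag.Empty;

        // A fresh document carrying the desired output option.
        var doc = new HtmlDocument();
        doc.OptionWriteEmptyNodes = true;
        doc.LoadHtml(fragment);

        // Serialize the whole parsed fragment with this document's options.
        return doc.DocumentNode.WriteTo();
    }
}
```

Going by the output reported in Answer 2, Normalize("Words<br/>more words") should come back as Words<br />more words. Keep in mind that ElementsFlags is static, so the change affects every document parsed afterwards in the same process.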



Source: https://stackoverflow.com/questions/5556089/html-agility-pack-removes-break-tag-close
