InnerText=InnerHtml - How to extract readable text with HtmlAgilityPack

Question

I need to extract text from some very badly written HTML.

I'm trying to do this using VB.NET and HtmlAgilityPack.

The tag I need to parse has InnerText equal to InnerHtml, and both are:

Name:<!--b>&#61;</b--> Albert E<!--span-->instein  s<!--i>&#89;</i-->ection: 3 room: -

While debugging, I can read it using the "Html viewer", which shows:

Name: Albert Einstein section: 3 room: -

How can I get this into a string variable?

EDIT:

I use this code to get the node:

Dim ElePs As HtmlNodeCollection = _
    mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
    'Here I need to get EleP.InnerText "normalized"
Next

Answer 1:

Notice that this mess is actually just HTML comments, which should be ignored, so selecting only the text nodes and concatenating them with string.Join is enough:

var text = string.Join("", htmlDoc.DocumentNode
                                  .SelectNodes("//text()[normalize-space()]")
                                  .Select(t => t.InnerText));

VB.NET:

 Dim text = String.Join("", From t In htmlDoc.DocumentNode.SelectNodes("//text()[normalize-space()]")
                                   Select t.InnerText)

The HTML is valid and there is nothing wrong with it; it's just written by someone without a soul.

Based on your update, this should do:

Dim ElePs As HtmlNodeCollection = mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
    ' Join only the text nodes under this <p>; comment nodes are not selected by text()
    Dim text = String.Join("", From t In EleP.SelectNodes(".//text()[normalize-space()]")
                               Select t.InnerText).Trim()
Next
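
As a minimal sketch, the per-node logic above could be wrapped in a small helper. The module and function names (TextHelpers, ReadableText) are hypothetical and not part of HtmlAgilityPack. The sketch also makes two additions that the answer does not: it guards against SelectNodes returning Nothing when no text node matches, and it decodes entities and collapses runs of whitespace, on the assumption that this is what "normalized" means here.

```vbnet
Imports System.Linq
Imports System.Text.RegularExpressions
Imports HtmlAgilityPack

Module TextHelpers
    ' Hypothetical helper (not part of HtmlAgilityPack): returns the visible
    ' text under a node, without the contents of HTML comments.
    Function ReadableText(node As HtmlNode) As String
        ' SelectNodes returns Nothing, not an empty collection, when nothing matches.
        Dim textNodes = node.SelectNodes(".//text()[normalize-space()]")
        If textNodes Is Nothing Then Return String.Empty

        ' Only text nodes were selected, so comment contents never reach the output.
        Dim raw = String.Concat(textNodes.Select(Function(t) t.InnerText))

        ' Assumption: "normalized" means entities decoded and whitespace collapsed.
        Dim decoded = HtmlEntity.DeEntitize(raw)
        Return Regex.Replace(decoded, "\s+", " ").Trim()
    End Function
End Module
```

Inside the loop this would be called as Dim text = TextHelpers.ReadableText(EleP). For the paragraph in the question it should produce the same single-spaced string that the Html viewer shows.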

Note the .// prefix: it searches the descendants of the current node, unlike //, which always starts from the document root no matter which node you call it on.
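
The difference is easy to check with a small experiment. The following is only an illustrative sketch: the module and function names (XPathScopeDemo, CountTextNodes) are made up for this example. It runs an XPath expression from the first <p> of a fragment and counts the text nodes it matches.

```vbnet
Imports HtmlAgilityPack

Module XPathScopeDemo
    ' Hypothetical helper: evaluates xpath from the first <p> element of the
    ' given HTML and returns how many nodes it selects.
    Function CountTextNodes(html As String, xpath As String) As Integer
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' Context node for the query: the first <p> in the document.
        Dim firstP = doc.DocumentNode.SelectSingleNode("//p")

        ' SelectNodes returns Nothing when there is no match.
        Dim nodes = firstP.SelectNodes(xpath)
        Return If(nodes Is Nothing, 0, nodes.Count)
    End Function
End Module
```

With the input "<p>one</p><p>two</p>", the expression ".//text()" should count 1 node (only the text inside the first paragraph), while "//text()" should count 2, because it is evaluated from the document root even though it was called on the first <p>.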

Source: https://stackoverflow.com/questions/35744250/innertext-innerhtml-how-to-extract-readable-text-with-htmlagilitypack

Tags

html

vb.net

html-agility-pack

innerhtml

innertext