Question
I need to extract text from some very messy HTML.
I'm trying to do this using VB.NET and HtmlAgilityPack.
For the tag I need to parse, InnerText is identical to InnerHtml, and both are:
Name:<!--b>=</b--> Albert E<!--span-->instein s<!--i>Y</i-->ection: 3 room: -
While debugging, I can read it using the "HTML viewer", which shows:
Name: Albert Einstein section: 3 room: -
How can I get this into a string variable?
EDIT:
I use this code to get the node:
Dim ElePs As HtmlNodeCollection = _
mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
'Here I need to get EleP.InnerText "normalized"
Next
Answer 1:
If you look closely, this mess is actually just HTML comments, and they should be ignored. So selecting only the text nodes and concatenating them with String.Join is enough:
C#
var text = string.Join("", htmlDoc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText));
VB.NET
Dim text = String.Join("", From t In htmlDoc.DocumentNode.SelectNodes("//text()[normalize-space()]")
Select t.InnerText)
The HTML is valid; there is nothing wrong with it. It's just written by someone without a soul.
Based on your update, this should do:
Dim ElePs As HtmlNodeCollection = mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
' Concatenate the non-blank text nodes under this <p>; HTML comments are not text nodes, so they are skipped
Dim text = String.Join("", From t In EleP.SelectNodes(".//text()[normalize-space()]")
Select t.InnerText).Trim()
Next
Note the leading .// in the XPath: it searches only the descendants of the current node, unlike //, which always starts from the top of the document.
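To see the difference between the two prefixes, here is a minimal sketch that runs both XPath expressions from the same node. The two-paragraph HTML string and the module name RelativeXPathDemo are made up for this illustration and are not from the original page; it assumes a console project that references HtmlAgilityPack.

```vbnet
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module RelativeXPathDemo
    Sub Main()
        ' Hypothetical document with two paragraphs, invented for this illustration.
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<div id='div_main'><p>Na<!--x-->me: A</p><p>Na<!--y-->me: B</p></div>")

        ' Take the first <p> as the context node.
        Dim firstP As HtmlNode = doc.DocumentNode.SelectSingleNode("//div[@id='div_main']//p")

        ' Relative path (.//): evaluated against the descendants of firstP only.
        Dim relativeText = String.Join("", From t In firstP.SelectNodes(".//text()[normalize-space()]")
                                           Select t.InnerText)

        ' Absolute path (//): evaluated from the document root, even though it is called on firstP,
        ' so the text of the second <p> is picked up as well.
        Dim absoluteText = String.Join("", From t In firstP.SelectNodes("//text()[normalize-space()]")
                                           Select t.InnerText)

        Console.WriteLine("relative: " & relativeText)
        Console.WriteLine("absolute: " & absoluteText)
    End Sub
End Module
```

Keep in mind that SelectNodes returns Nothing rather than an empty collection when nothing matches, so real code may want to check for that before running the query.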
Source: https://stackoverflow.com/questions/35744250/innertext-innerhtml-how-to-extract-readable-text-with-htmlagilitypack