I need to get plain text from XHTML documents.
I am sure I read somewhere here that XDocument on WP7 does not support DTDs, but I cannot find it now. When I try to parse XHTML that has a DTD using XDocument, it throws NotSupportedException. The last call in the stack trace is System.Xml.XmlTextReaderImpl.ParseDoctypeDecl().
The result is exactly the same even if I supply a dummy XmlResolver; it never actually gets called (I tried this following an answer to another question).
So I assume that WP7 really doesn't support it.
Still, I need to parse XHTML docs. So far I have come up with two (more or less workable) solutions:
I can parse the document if I remove the DTD declaration. But the XHTML may contain character entities, and an exception is thrown for any entity that is not one of the predefined XML entities.
So that solution works only for some XHTML documents.
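To make that approach concrete, here is a minimal sketch (not tested on a WP7 device). It strips the DOCTYPE and rewrites known named entities as numeric character references, which the XML parser accepts without a DTD. The class name `XhtmlText`, the method `ToPlainText`, and the tiny entity table are hypothetical; a real table would have to cover every entity the XHTML DTDs define, and the DOCTYPE pattern assumes there is no internal subset.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class XhtmlText
{
    // Hypothetical, deliberately incomplete table: entity name -> code point.
    // The five predefined XML entities (amp, lt, gt, quot, apos) are left out
    // on purpose, because the parser already understands them.
    private static readonly Dictionary<string, int> NamedEntities =
        new Dictionary<string, int>
        {
            { "nbsp", 160 }, { "copy", 169 }, { "eacute", 233 },
            { "ndash", 8211 }, { "hellip", 8230 }
        };

    public static string ToPlainText(string xhtml)
    {
        // Drop the DOCTYPE declaration (assumes it contains no '>' inside,
        // i.e. no internal subset).
        string noDoctype = Regex.Replace(xhtml, @"<!DOCTYPE[^>]*>", string.Empty);

        // Rewrite known named entities as numeric references, e.g. &eacute; -> &#233;
        // Names not in the table are left untouched.
        string numeric = Regex.Replace(
            noDoctype,
            @"&([A-Za-z][A-Za-z0-9]*);",
            m =>
            {
                int code;
                return NamedEntities.TryGetValue(m.Groups[1].Value, out code)
                    ? "&#" + code + ";"
                    : m.Value;
            });

        // Root.Value concatenates every text node under the root element,
        // so text inside <title> or <script> is included as well.
        return XDocument.Parse(numeric).Root.Value;
    }
}
```

Calling `XhtmlText.ToPlainText(xhtmlString)` returns the concatenated text. An entity missing from the table still makes `XDocument.Parse` throw, so the limitation described above only shrinks as the table grows.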
I thought of using Regex. It is quite easy to remove all the HTML tags, but the 'entity problem' remains, as I don't think doing a replace for every entity is a real/good solution.
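For what it's worth, the entity half of the Regex idea may not need a hand-written replace at all, since the framework ships an HTML entity decoder. The sketch below uses `System.Net.WebUtility.HtmlDecode`, which exists in desktop .NET 4 and later; on WP7 I am assuming the counterpart is `System.Net.HttpUtility.HtmlDecode`, so treat that name as unverified. The class name `RegexTextExtractor` is hypothetical, and the tag pattern is naive: it breaks on a `>` inside comments, attribute values, or script blocks.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class RegexTextExtractor
{
    public static string StripTags(string markup)
    {
        // Naive pattern: removes anything between '<' and the next '>',
        // which covers tags, the XML declaration, and a simple DOCTYPE.
        string withoutTags = Regex.Replace(markup, @"<[^>]+>", string.Empty);

        // Decode named and numeric character entities in the remaining text.
        // Assumption: on WP7, swap in System.Net.HttpUtility.HtmlDecode.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

This never goes through an XML parser, so malformed documents do not throw, but the content of `<script>` and `<style>` elements ends up in the output along with the visible text.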
Has anyone faced or solved this? Can you give me some advice, or correct me if I am wrong about something? Thanks.
HTML Agility Pack is a library for parsing HTML documents. According to its forum, there is a version for WP7.
Source: https://stackoverflow.com/questions/5316078/parsing-xhtml-with-dtd-using-xdocument