Making XmlReaderSettings CheckCharacters work for xml string

Posted by 荒凉一梦 on 2019-12-11 07:36:24

Question


I have an XML string coming from Adobe PDF AcroForms, which apparently allows form field names that start with numeric characters. I'm trying to parse this string into an XDocument:

XDocument xDocument = XDocument.Parse(xmlString);

But whenever I encounter such a form field where the name starts with a numeric char, the xml parsing throws an XmlException:

Name cannot begin with the 'number' character

Other solutions I found were about using: XmlReaderSettings.CheckCharacters

using (XmlReader xmlReader = XmlReader.Create(new StringReader(xmlString), new XmlReaderSettings() { CheckCharacters = false }))
{
    XDocument xDocument = XDocument.Load(xmlReader);
}

But this didn't work either. Some articles attribute the failure to this note in the MSDN documentation:

If the XmlReader is processing text data, it always checks that the XML names and text content are valid, regardless of the property setting. Setting CheckCharacters to false turns off character checking for character entity references.

So I tried using:

using(MemoryStream memoryStream = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(xmlString)))
using (XmlReader xmlReader = XmlReader.Create(memoryStream, new XmlReaderSettings() { CheckCharacters = false }))
{
    XDocument xDocument = XDocument.Load(xmlReader);
}

This didn't work either. Can anyone help me figure out how to parse an XML string containing elements whose names start with numeric characters? And how is the XmlReaderSettings.CheckCharacters flag supposed to be used?


Answer 1:


You can't make a standard XML parser accept your format just because it "looks like" XML, so stop trying. Standards-compliant XML parsers are not allowed to parse XML that is not well-formed. This was a deliberate design decision, made in light of all the problems quirks mode caused with HTML parsing.
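To illustrate what CheckCharacters actually relaxes, here is a minimal sketch. The `TryLoad` helper and the sample strings are made up for this illustration. It loads two inputs with the flag on and off: one containing a character reference to U+0001 (not a legal XML 1.0 character), and one with an element name that starts with a digit.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class CheckCharactersDemo
{
    // Hypothetical helper: true when the string loads into an XDocument
    // with the given CheckCharacters setting, false on XmlException.
    public static bool TryLoad(string xml, bool checkCharacters)
    {
        var settings = new XmlReaderSettings { CheckCharacters = checkCharacters };
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml), settings))
            {
                XDocument.Load(reader);
                return true;
            }
        }
        catch (XmlException)
        {
            return false;
        }
    }

    static void Main()
    {
        string charRef = "<field>&#x1;</field>";   // character reference to U+0001
        string badName = "<1field>text</1field>";  // name starting with a digit

        Console.WriteLine(TryLoad(charRef, true));   // False: the reference is checked
        Console.WriteLine(TryLoad(charRef, false));  // True: checking of references is off
        Console.WriteLine(TryLoad(badName, true));   // False
        Console.WriteLine(TryLoad(badName, false));  // False: names are always checked
    }
}
```

The flag only changes the outcome for the character reference. The element name is rejected either way, which matches the documentation note quoted in the question.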

Writing your own parser isn't that hard. XML is very strict and, unless you need advanced features, the syntax is simple.

  1. An LL parser can be written by hand. Both the lexer and the parser are simple.

  2. A parser can be generated with ANTLR from a simple grammar (note that ANTLR generates LL-style parsers, not LR ones). Most likely, you'll even find example XML grammars.

  3. You can also take the source code of either of the .NET XML parsers and remove the validation you don't need. Both XmlDocument and XDocument can be found in .NET Core's repository on GitHub.
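As a rough idea of the first option, here is a minimal hand-written recursive-descent sketch. `LenientNode` and `LenientXmlParser` are hypothetical names invented for this example, not part of any library. It assumes a small XML-like subset: elements, quoted attributes, and text only. There is no XML declaration, comments, CDATA, or entity decoding. It uses its own node type because XName rejects names that start with a digit.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type for this sketch.
sealed class LenientNode
{
    public string Name = "";
    public string Text = "";
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
    public List<LenientNode> Children = new List<LenientNode>();
}

// Hypothetical recursive-descent parser for a small XML-like subset.
sealed class LenientXmlParser
{
    private readonly string _s;
    private int _i;

    private LenientXmlParser(string s) { _s = s; }

    public static LenientNode Parse(string s)
    {
        var p = new LenientXmlParser(s);
        p.SkipWhitespace();
        LenientNode root = p.ParseElement();
        p.SkipWhitespace();
        if (p._i != s.Length) throw new FormatException("Unexpected content at position " + p._i);
        return root;
    }

    private char Peek() => _i < _s.Length ? _s[_i] : '\0';

    private void SkipWhitespace() { while (_i < _s.Length && char.IsWhiteSpace(_s[_i])) _i++; }

    private void Expect(char c)
    {
        if (Peek() != c) throw new FormatException("Expected '" + c + "' at position " + _i);
        _i++;
    }

    // Unlike real XML, a name may start with a digit here.
    private string ParseName()
    {
        int start = _i;
        while (_i < _s.Length && (char.IsLetterOrDigit(_s[_i]) || "_-.:".IndexOf(_s[_i]) >= 0)) _i++;
        if (_i == start) throw new FormatException("Expected a name at position " + start);
        return _s.Substring(start, _i - start);
    }

    private LenientNode ParseElement()
    {
        Expect('<');
        var node = new LenientNode { Name = ParseName() };
        SkipWhitespace();
        while (Peek() != '>' && Peek() != '/')        // attributes: name="value"
        {
            string key = ParseName();
            SkipWhitespace(); Expect('='); SkipWhitespace();
            char quote = Peek();
            if (quote != '"' && quote != '\'') throw new FormatException("Expected a quote at position " + _i);
            int end = _s.IndexOf(quote, _i + 1);
            if (end < 0) throw new FormatException("Unterminated attribute value");
            node.Attributes[key] = _s.Substring(_i + 1, end - _i - 1);
            _i = end + 1;
            SkipWhitespace();
        }
        if (Peek() == '/') { _i++; Expect('>'); return node; }   // self-closing element
        Expect('>');
        while (string.CompareOrdinal(_s, _i, "</", 0, 2) != 0)   // content until the end tag
        {
            if (_i >= _s.Length) throw new FormatException("Missing end tag for " + node.Name);
            if (Peek() == '<') { node.Children.Add(ParseElement()); continue; }
            int next = _s.IndexOf('<', _i);
            if (next < 0) next = _s.Length;
            node.Text += _s.Substring(_i, next - _i);            // raw text, entities not decoded
            _i = next;
        }
        _i += 2;                                                  // skip "</"
        if (ParseName() != node.Name) throw new FormatException("Mismatched end tag for " + node.Name);
        SkipWhitespace();
        Expect('>');
        return node;
    }
}

static class LenientDemo
{
    static void Main()
    {
        LenientNode root = LenientXmlParser.Parse(
            "<form><1stField type=\"text\">hello</1stField><2ndField/></form>");
        foreach (LenientNode child in root.Children)
            Console.WriteLine(child.Name + " = \"" + child.Text + "\"");
    }
}
```

Anything beyond this subset (namespaces, entities, encodings, DTDs) would need extra work. That is the point at which a generated grammar or a trimmed copy of the .NET parser source becomes the more attractive route.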



Source: https://stackoverflow.com/questions/48532818/making-xmlreadersettings-checkcharacters-work-for-xml-string
