How do you keep .NET XML parsers from expanding parameter entities in XML?

Posted by 那年仲夏 on 2019-12-20 03:08:05

Question


When I parse the XML below (using the code below), I keep getting <sgml>&question;&signature;</sgml>

expanded to

<sgml>Why couldn’t I publish my books directly in standard SGML? — William Shakespeare.</sgml>

OR

<sgml></sgml>

Since I am working on an XML 3-way merging algorithm, I would like to retrieve the unexpanded <sgml>&question;&signature;</sgml>.

I have tried:

  • Parsing the XML normally (this results in the expanded sgml tag)
  • Removing the DOCTYPE from the beginning of the XML (this results in an empty sgml tag)
  • Various XmlReader DTD settings

I have the following XML file:

<!DOCTYPE sgml [
  <!ELEMENT sgml ANY>
  <!ENTITY  std       "standard SGML">
  <!ENTITY  signature " &#x2014; &author;.">
  <!ENTITY  question  "Why couldn&#x2019;t I publish my books directly in &std;?">
  <!ENTITY  author    "William Shakespeare">
]>
<sgml>&question;&signature;</sgml>

Here is the code I have tried (several attempts):

using System; // needed for Console
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Reflection;

class Program
{
    static void Main(string[] args)
    {
        string xml = @"C:\src\Apps\Wit\MergingAlgorithmTest\MergingAlgorithmTest\Tests\XMLMerge-DocTypeExpansion\DocTypeExpansion.0.xml";
        var xmlSettingsIgnore = new XmlReaderSettings 
            {
                CheckCharacters = false,
                DtdProcessing = DtdProcessing.Ignore
            };

        var xmlSettingsParse = new XmlReaderSettings
        {
            CheckCharacters = false,
            DtdProcessing = DtdProcessing.Parse
        };

        using (var fs = File.Open(xml, FileMode.Open, FileAccess.Read))
        {
            using (var xmlReaderIgnore = XmlReader.Create(fs, xmlSettingsIgnore))
            {
                // Prevents Exception "Reference to undeclared entity 'question'"
                PropertyInfo propertyInfo = xmlReaderIgnore.GetType().GetProperty("DisableUndeclaredEntityCheck", BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
                propertyInfo.SetValue(xmlReaderIgnore, true, null);

                var doc = XDocument.Load(xmlReaderIgnore);

                Console.WriteLine(doc.Root.ToString()); // outputs <sgml></sgml> not <sgml>&question;&signature;</sgml>
            }// using xml ignore

            fs.Position = 0;
            using (var xmlReaderParse = XmlReader.Create(fs, xmlSettingsParse))
            {
                var doc = XDocument.Load(xmlReaderParse);
                Console.WriteLine(doc.Root.ToString()); // outputs <sgml>Why couldn't I publish my books directly in standard SGML? - William Shakespeare.</sgml> not <sgml>&question;&signature;</sgml>
            }

            fs.Position = 0;
            string parseXmlString = String.Empty;
            using (StreamReader sr = new StreamReader(fs))
            {
                for (int i = 0; i < 7; ++i) // Skip DocType
                    sr.ReadLine();

                parseXmlString = sr.ReadLine();
            }

            using (XmlReader xmlReaderSkip = XmlReader.Create(new StringReader(parseXmlString),xmlSettingsParse))
            {
                // Prevents Exception "Reference to undeclared entity 'question'"
                PropertyInfo propertyInfo = xmlReaderSkip.GetType().GetProperty("DisableUndeclaredEntityCheck", BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
                propertyInfo.SetValue(xmlReaderSkip, true, null);

                var doc2 = XDocument.Load(xmlReaderSkip); // Empty sgml tag

            }
        }//using FileStream
    }
}

Answer 1:


LINQ to XML does not support modeling of entity references; they are automatically expanded to their values (source 1, source 2). There is simply no subclass of XObject defined for a general entity reference.

However, assuming your XML is valid (i.e. the entity references are declared in the DTD, which they are in your example), you can use the older XML Document Object Model (XmlDocument) to parse your XML. It inserts XmlEntityReference nodes into the DOM tree rather than expanding the entity references into plain text:

        // Requires: using System.Diagnostics; using System.IO; using System.Xml;
        // 'xml' is the file path from the question's code.
        using (var sr = new StreamReader(xml))
        using (var xtr = new XmlTextReader(sr))
        {
            // Expand character entities only; general entities are returned as XmlNodeType.EntityReference nodes.
            xtr.EntityHandling = EntityHandling.ExpandCharEntities;
            var oldDoc = new XmlDocument();
            oldDoc.Load(xtr);
            Debug.WriteLine(oldDoc.DocumentElement.OuterXml); // Outputs <sgml>&question;&signature;</sgml>
            Debug.Assert(oldDoc.DocumentElement.OuterXml.Contains("&question;")); // The entity reference is still there, so the assertion passes
            Debug.Assert(oldDoc.DocumentElement.OuterXml.Contains("&signature;")); // The entity reference is still there, so the assertion passes
        }

The ChildNodes of each XmlEntityReference hold the text value of the general entity. If a general entity refers to other general entities, as happens in your case, the corresponding inner XmlEntityReference nodes are nested in the ChildNodes of the outer one. You can then compare the old and new XML using the old XmlDocument API.
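
To make the nesting concrete, here is a minimal sketch that loads the question's document from an in-memory string and walks the entity reference nodes recursively. The class name EntityReferenceDemo and the helper Dump are hypothetical names introduced only for this illustration, and the exact way the text is split into child nodes may differ between runtime versions.

```csharp
using System;
using System.IO;
using System.Xml;

static class EntityReferenceDemo
{
    // The XML from the question, embedded as a verbatim string ("" is an escaped quote).
    const string Xml = @"<!DOCTYPE sgml [
  <!ELEMENT sgml ANY>
  <!ENTITY  std       ""standard SGML"">
  <!ENTITY  signature "" &#x2014; &author;."">
  <!ENTITY  question  ""Why couldn&#x2019;t I publish my books directly in &std;?"">
  <!ENTITY  author    ""William Shakespeare"">
]>
<sgml>&question;&signature;</sgml>";

    // Hypothetical helper: prints each child node, descending into entity references.
    static void Dump(XmlNode node, int depth)
    {
        foreach (XmlNode child in node.ChildNodes)
        {
            bool isEntityRef = child.NodeType == XmlNodeType.EntityReference;
            string label = isEntityRef ? "&" + child.Name + ";" : child.Value;
            Console.WriteLine(new string(' ', depth * 2) + child.NodeType + ": " + label);

            // The children of an entity reference hold its replacement content,
            // including any nested entity references.
            if (isEntityRef)
                Dump(child, depth + 1);
        }
    }

    static void Main()
    {
        using (var xtr = new XmlTextReader(new StringReader(Xml)))
        {
            // Keep general entities as EntityReference nodes.
            xtr.EntityHandling = EntityHandling.ExpandCharEntities;

            var doc = new XmlDocument();
            doc.Load(xtr);
            Dump(doc.DocumentElement, 0);
        }
    }
}
```

For a merge algorithm, the top-level entity reference names are probably what you want to compare, and the nested children are only needed when you have to show the expanded text.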

Note that this only works with the old XmlTextReader, with EntityHandling set to EntityHandling.ExpandCharEntities.
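
If you do not need a DOM at all, the same setting can be observed at the reader level. The following is only a sketch under that assumption: it uses a shortened, made-up document with a single entity, and the class name ReaderNodeDemo is hypothetical. It lists the entity reference and text nodes the reader reports.

```csharp
using System;
using System.IO;
using System.Xml;

static class ReaderNodeDemo
{
    static void Main()
    {
        // Shortened sample document, made up for this illustration.
        const string xml = @"<!DOCTYPE sgml [
  <!ELEMENT sgml ANY>
  <!ENTITY author ""William Shakespeare"">
]>
<sgml>by &author;</sgml>";

        using (var xtr = new XmlTextReader(new StringReader(xml)))
        {
            // General entities are reported as EntityReference nodes instead of being expanded.
            xtr.EntityHandling = EntityHandling.ExpandCharEntities;

            while (xtr.Read())
            {
                if (xtr.NodeType == XmlNodeType.EntityReference)
                    Console.WriteLine("entity reference: &" + xtr.Name + ";");
                else if (xtr.NodeType == XmlNodeType.Text)
                    Console.WriteLine("text: " + xtr.Value);
            }
        }
    }
}
```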



Source: https://stackoverflow.com/questions/30598841/how-do-you-keep-net-xml-parsers-from-expanding-parameter-entities-in-xml
