Fixing bad XML file (e.g. unescaped & etc.) [duplicate]

Submitted by 大憨熊 on 2019-11-29 09:22:25

The key message below is that unless you know the exact format of the input file, and have guarantees that any deviation from XML is consistent, you can't fix it programmatically without risking that your fixes will be incorrect.

Fixing it by replacing & with &amp; is an acceptable solution if and only if:

  1. There is no acceptable well-formed source of these data.

    • As @Darin Dimitrov comments, try to find a better provider, or get this provider to fix it.
    • JSON (for example) is preferable to poorly formed XML, even if you aren't using JavaScript.
  2. This is a one-off (or at least extremely infrequent) import.

    • If you have to fetch the data at runtime, then this solution will not work.
  3. You can keep iterating through, devising new fixes for it, adding a solution to each problem as you come across it.

    • You will probably find that once you have "fixed" it by escaping & characters, there will be other errors.
  4. You have the resources to manually check the integrity of the "fixed" data.

    • The errors you "fix" may be more subtle than you realise.
  5. There are no correctly formatted entities in the document.

    • Simply replacing & with &amp; will erroneously change &quot; to &amp;quot;. You may be able to get around this, but don't be naive about how tricky it might be (entities may be defined in a DTD, may refer to a Unicode code point, and so on).

    • If it is a particular element that misbehaves, you could consider wrapping the content of the element in <![CDATA[ ]]>, but that still relies on you being able to find the start and end tags reliably.
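As a rough illustration of the "get around this" case in point 5, here is a minimal sketch that escapes only those ampersands that do not already look like the start of an entity reference. The class and method names (AmpersandFixer, EscapeBareAmpersands) are hypothetical, and the approach assumes that anything shaped like &name; or &#123; really is an entity. It cannot tell whether a named entity is actually declared, and it will leave text such as "AT&T;" untouched, so the manual checks described above still apply.

```csharp
using System.Text.RegularExpressions;

public static class AmpersandFixer
{
    // Matches '&' that is NOT followed by something shaped like an entity
    // reference: a named one (&name;), a decimal one (&#38;) or a hex one (&#x26;).
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Replaces every bare ampersand with &amp; and leaves entity-shaped text alone.
    public static string EscapeBareAmpersands(string text)
    {
        return BareAmpersand.Replace(text, "&amp;");
    }
}
```

Calling AmpersandFixer.EscapeBareAmpersands on "Tom & Jerry &quot;x&quot;" escapes the lone ampersand and leaves both &quot; references as they are.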

Start by changing your mindset. The input is not XML, so don't call it XML. Don't even use "xml" to tag your questions about it. The fact that it isn't XML means that you can't use any XML tools with it, and you can't get any of the benefits of XML data interchange. You're dealing with a proprietary format that comes without a specification and without any tools. Treat it as you would any other proprietary format - try to discover a specification for what you are getting, and write a parser for it.

If you know the tags of the file and want to "okay" the text inside the tags that could have suspect data, you could do something like this:

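private static string FixBadXmlText(string xmlText)
{
    // Tags whose text content is known to contain unescaped characters.
    var unreliableTextTags = new[] { "message", "otherdata", "stacktrace", "innerexception" };

    foreach (var tag in unreliableTextTags)
    {
        // Only exact, attribute-free tags such as <message> are matched.
        string openTag = "<" + tag + ">";
        string closeTag = "</" + tag + ">";

        // Wrap the element's content in a CDATA section so the parser treats it as plain text.
        xmlText = xmlText.Replace(openTag, openTag + "<![CDATA[").Replace(closeTag, "]]>" + closeTag);
    }

    return xmlText;
}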

Anything inside a CDATA section (<![CDATA[ {your text here} ]]>) is not interpreted as markup by an XML parser, so it doesn't need to be escaped. The one exception is the sequence ]]>, which ends the section. This helped me when I needed to parse some poorly made XML that didn't properly escape its input.
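The one sequence a CDATA section cannot contain is ]]>, since that is what closes it. The following is a sketch of a variant that guards against that case by splitting the sequence across two adjacent sections. CDataWrapper and WrapInCData are hypothetical names, and the sketch assumes the tag has no attributes, is not nested inside itself, is not already CDATA-wrapped, and that its closing tag never appears literally inside the text.

```csharp
using System.Text.RegularExpressions;

public static class CDataWrapper
{
    // Wraps the text between <tag> and </tag> in a CDATA section.
    public static string WrapInCData(string xmlText, string tag)
    {
        string name = Regex.Escape(tag);
        // Singleline lets '.' match newlines, so the content may span several lines.
        var pattern = new Regex("<" + name + ">(.*?)</" + name + ">", RegexOptions.Singleline);

        return pattern.Replace(xmlText, m =>
        {
            // "]]>" would end the section early, so split it across two sections.
            string safe = m.Groups[1].Value.Replace("]]>", "]]]]><![CDATA[>");
            return "<" + tag + "><![CDATA[" + safe + "]]></" + tag + ">";
        });
    }
}
```

A parser reading the result sees two adjacent CDATA sections whose text joins back into the original content, including the ]]> sequence.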

Since your starting XML is not well-formed, you can't use an XmlReader on it: the reader will fail as soon as it reaches the first error.

If the only problem is that the values of the XML nodes aren't HTML-encoded, then you'd have to read the file manually line by line, parse each line (get the XML node name and its value), encode the value, and write the result to a new file.

We often end up in a similar situation, so I understand your pain. Most of the time, though, the errors follow some "rule". My guess here is that they didn't encode the business name (and maybe the street name), so you can just search for the string <naziv> and its closing tag </naziv>, and HtmlEncode everything in between. Also, since it's a business name, it won't contain line breaks, which can make your life quite a bit easier.

You could try something with regular expressions depending on how complex the structure is:

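// Requires: using System; using System.Text.RegularExpressions;
// Singleline lets '.' match newlines, so a <komitent> block may span several lines.
Regex mainSplitter = new Regex("<komitent ID=\"([0-9]*)\">(.*?)</komitent>", RegexOptions.Singleline);
Regex nazivFinder = new Regex("<naziv>(.*?)</naziv>");

// 'test' holds the raw text of the file.
foreach (Match item in mainSplitter.Matches(test))
{
    Console.WriteLine(item);

    string naziv = null;

    // Regex.Match never returns null; check Success to see whether it matched.
    Match node = nazivFinder.Match(item.Groups[2].Value);
    if (node.Success)
        naziv = node.Groups[1].Value;
}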
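Building on the same idea, here is a sketch of how the text found between <naziv> and </naziv> could be escaped and the result then checked with a real XML parser. NazivFixer and its methods are hypothetical names. The sketch assumes that <naziv> is the only element with unescaped content and that none of its text has been escaped already, since escaped text would otherwise be escaped twice.

```csharp
using System.Security;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class NazivFixer
{
    private static readonly Regex NazivContent =
        new Regex("<naziv>(.*?)</naziv>", RegexOptions.Singleline);

    // Escapes &, <, >, " and ' in the raw text inside every <naziv> element.
    public static string EscapeNaziv(string rawText)
    {
        return NazivContent.Replace(rawText,
            m => "<naziv>" + SecurityElement.Escape(m.Groups[1].Value) + "</naziv>");
    }

    // Throws System.Xml.XmlException if the repaired text is still not well-formed,
    // which signals that some other kind of error remains in the input.
    public static XDocument FixAndParse(string rawText)
    {
        return XDocument.Parse(EscapeNaziv(rawText));
    }
}
```

Parsing the repaired text straight away is a cheap check: if FixAndParse throws, the assumption that only <naziv> is broken was wrong and the input needs another look.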

You can handle the file as XPL and even use the XPL parser to transform such files into valid XML. XPL (eXtensible Process Language) is just like XML, but the parser allows XML's "special characters" in text fields. So you can in fact run an invalid XML file (invalid due to special characters) through the XPL process. In some cases, you can use the XPL processor instead of an XML processor. You can also use it to preprocess the invalid files without any performance loss. See: Artificial Intelligence, XML, and Java Concurrency
