How To Parse XML With Invalid Characters in Node Name?

旧巷老猫 submitted on 2020-01-14 14:33:30

Question


So I'm trying to parse some XML, the creation of which is not under my control. The trouble is, they've somehow got nodes that look like this:

<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(MORNINGSTAR) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(QUARTERSTAFF) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(SCYTHE) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(TRATNYR) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(TRIPLE-HEADED_FLAIL) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(WARAXE) />

Visual Studio and .NET both feel that the '(' and ')' characters, as used above, are totally invalid. Unfortunately, I need to process these files! Is there any way to get the Xml Reader classes to not freak out at seeing these characters, or dynamically escape them or something? I could do some sort of pre-processing on the whole file, but I DO want the '(' and ')' characters if they appear inside the node in some valid way, so I don't want to just remove them all...


Answer 1:


That simply isn't valid XML. Pre-processing is your best bet, perhaps with a regex, something like:

// Drops the parentheses from element names (needs System.Text.RegularExpressions).
string output = Regex.Replace(input, @"(<\w+)\((\w+)\)([ >/])", "$1$2$3");

Edit: a slightly more complex version that also replaces the "-" inside the brackets:

string output = Regex.Replace(input, @"(<\w+)\(([-\w]+)\)([ >/])",
    delegate(Match match) {
        return match.Groups[1].Value + match.Groups[2].Value.Replace('-', '_')
             + match.Groups[3].Value;
    });
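
For anyone who wants to try the second regex end to end, here is a minimal sketch that wraps it in a helper and feeds the cleaned text to XDocument. The class name NodeNameCleaner and the sample input are made up for illustration. Note that the pattern only matches opening and self-closing tags as in the question; a closing tag such as </NAME_(X)> would need its own handling.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper (not part of the original answer).
public static class NodeNameCleaner
{
    // Same pattern as above: tag start, "(", bracketed part, ")", then the
    // character that ends the name (space, '>' or '/').
    private static readonly Regex BadName =
        new Regex(@"(<\w+)\(([-\w]+)\)([ >/])");

    // Removes the parentheses from element names and turns '-' inside them
    // into '_'. Parentheses in text content are left alone because the
    // pattern has to start at a '<'.
    public static string Clean(string input)
    {
        return BadName.Replace(input, m =>
            m.Groups[1].Value
            + m.Groups[2].Value.Replace('-', '_')
            + m.Groups[3].Value);
    }

    // Cleans the text and parses it, so a failure surfaces as an XmlException.
    public static XDocument Parse(string input)
    {
        return XDocument.Parse(Clean(input));
    }
}
```

Usage would be something like `XDocument doc = NodeNameCleaner.Parse(File.ReadAllText(path));`, where `path` is whatever file you are reading.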



Answer 2:


If it isn't syntactically valid, it's not XML.

XML is very strict about this.

If you can't get the sending application to send correct XML, then just let them know that whatever downstream process sees this will fail, whether it's yours or some other app in the future.

If preprocessing isn't an option, another mechanism is to wrap the Stream object that is passed to the parser with a custom stream. That stream could look for < characters and, when it sees one, set a flag. Until a > character is seen, it could eat any ( or ) characters. We've used something like this to get rid of NUL and ^Z characters added to an XML file by a legacy transport mechanism. (The only gotcha there might be > characters inside of an attribute value, since they don't have to be escaped there; only < characters do.)
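
A rough sketch of that wrapping idea follows. It is an assumption-laden illustration, not the code we actually used: it wraps a TextReader rather than a Stream so that character decoding is not an issue, and the class names ParenStrippingReader and ParenStrippingDemo are invented. It also strips ( and ) from anything between < and >, which includes attribute values, comments, and CDATA sections, so it only fits files where parentheses never legitimately appear in those places.

```csharp
using System;
using System.IO;
using System.Xml.Linq;

// Hypothetical wrapper: drops '(' and ')' that occur between '<' and '>'
// and passes every other character through unchanged.
public sealed class ParenStrippingReader : TextReader
{
    private readonly TextReader _inner;
    private bool _inTag; // true after '<' until the next '>'

    public ParenStrippingReader(TextReader inner)
    {
        _inner = inner;
    }

    public override int Read()
    {
        while (true)
        {
            int c = _inner.Read();
            if (c == -1) return -1;               // end of input
            if (c == '<') _inTag = true;
            else if (c == '>') _inTag = false;
            else if (_inTag && (c == '(' || c == ')')) continue; // eat it
            return c;
        }
    }

    // Fill the buffer one character at a time so everything is filtered.
    public override int Read(char[] buffer, int index, int count)
    {
        int n = 0;
        while (n < count)
        {
            int c = Read();
            if (c == -1) break;
            buffer[index + n++] = (char)c;
        }
        return n;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class ParenStrippingDemo
{
    // Parses text whose element names contain parentheses.
    public static XDocument Load(string xml)
    {
        using (var reader = new ParenStrippingReader(new StringReader(xml)))
        {
            return XDocument.Load(reader);
        }
    }
}
```

For a file on disk, the same wrapper can go around a StreamReader, e.g. `XDocument.Load(new ParenStrippingReader(new StreamReader(path)))`, with `path` standing in for your own file name.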



Source: https://stackoverflow.com/questions/1069114/how-to-parse-xml-with-invalid-characters-in-node-name
