How to remove non-ASCII characters from XML data

Submitted by 青春壹個敷衍的年華 on 2019-12-20 06:17:32

Question


I have some XML data in the following format. My application is supposed to read it with an XMLReader and do some processing on it. For that to work, however, I need to remove or replace the leading portion of each line, specifically the <���.

<���<XML>....data....</XML>
<���<XML>....data....</XML>
<���<XML>....data....</XML>
and so on...

I tried the following after looking at some posts on SO, but with no success so far. Any help would be appreciated!

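using System.Text.RegularExpressions;

// Matches unpaired surrogates, control characters (except tab, LF, CR),
// and the code points U+FEFF, U+FFFE and U+FFFF.
private static Regex _invalidXMLChars = new Regex(
    @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
    RegexOptions.Compiled);

// Removes every character matched by _invalidXMLChars.
static string ReplaceHexadecimalSymbols(string txt)
{
    return _invalidXMLChars.Replace(txt, string.Empty);
}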

Note: My XML data is stored in a .txt file. I tried calling the function on each line, but it did not work; the characters were still there afterwards.


Answer 1:


I would investigate why these characters are there in the first place. It looks like an encoding problem somewhere between the original XML documents and your file.
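
One way to start that investigation is to look at which code points the stray characters really are. The following is only a minimal sketch; the class name PrefixInspector and the method DescribePrefix are hypothetical helpers, not part of the original question or of any library.

```csharp
using System;
using System.Linq;

public static class PrefixInspector
{
    // Hypothetical helper: lists the first `count` UTF-16 code units
    // of `line` as space-separated "U+XXXX" values.
    public static string DescribePrefix(string line, int count)
    {
        return string.Join(" ",
            line.Take(count).Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

If the prefix turns out to be U+FFFD (the Unicode replacement character), the bytes were probably decoded with the wrong encoding when the file was read. That would also explain why the regex in the question has no effect: its character class lists \uFEFF, \uFFFE and \uFFFF, but not \uFFFD.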

Anyway, when you read a line, just drop all the characters before the <XML>.
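
A minimal sketch of that suggestion, assuming every valid line contains the literal tag <XML> as in the sample data. The names XmlLineCleaner and DropPrefix are hypothetical, chosen here for illustration.

```csharp
using System;

public static class XmlLineCleaner
{
    // Hypothetical helper: returns the part of `line` that starts at the
    // first occurrence of `rootTag`. A line that does not contain the tag
    // (or already starts with it) is returned unchanged.
    public static string DropPrefix(string line, string rootTag)
    {
        int start = line.IndexOf(rootTag, StringComparison.Ordinal);
        return start > 0 ? line.Substring(start) : line;
    }
}
```

Each line read from the .txt file could be passed through XmlLineCleaner.DropPrefix(line, "<XML>") before it is handed to the XML reader. Searching for the tag rather than for specific bad characters means the cleanup does not depend on knowing exactly which code points ended up in the prefix.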



Source: https://stackoverflow.com/questions/31327410/how-to-remove-non-ascii-characters-from-xml-data
