Question
I have some XML data in the following format. My application is supposed to read it with an XmlReader and process it. For that to work, however, I need to remove or replace the first portion of each line, specifically the <���.
<���<XML>....data....</XML>
<���<XML>....data....</XML>
<���<XML>....data....</XML>
and so on...
I tried the following after looking at some Stack Overflow posts, but with no success so far. Any help would be appreciated!
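    using System.Text.RegularExpressions;

    // Matches unpaired surrogates, most control characters, and U+FEFF/U+FFFE/U+FFFF.
    private static Regex _invalidXMLChars = new Regex(
        @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
        RegexOptions.Compiled);

    // Removes every character matched by _invalidXMLChars.
    static string ReplaceHexadecimalSymbols(string txt)
    {
        return _invalidXMLChars.Replace(txt, string.Empty);
    }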
Note: My XML data is stored in a .txt file. I called the function on each line, but it did not work: the characters were still there afterwards.
Answer 1:
I would investigate why these characters are there in the first place. It looks like an encoding problem somewhere between the original XML files and your file.
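One way to start that investigation is to print the code points of whatever precedes the root tag. The sketch below is only an illustration: the class PrefixInspector, the method DescribePrefix, and the default tag name "<XML>" are hypothetical names, not part of the original question.

```csharp
using System;
using System.Linq;

static class PrefixInspector
{
    // Hypothetical helper: lists the UTF-16 code units that precede the
    // first occurrence of rootTag, formatted as U+XXXX.
    public static string DescribePrefix(string line, string rootTag = "<XML>")
    {
        // Ordinal comparison so that invisible characters such as U+FEFF
        // are not skipped by a culture-sensitive search.
        int start = line.IndexOf(rootTag, StringComparison.Ordinal);
        if (start <= 0) return string.Empty; // no prefix, or tag not found
        return string.Join(" ", line.Substring(0, start).Select(c => $"U+{(int)c:X4}"));
    }
}
```

If the result is U+FEFF, the prefix is probably a byte order mark. If it is U+00EF U+00BB U+00BF, the UTF-8 byte order mark was likely decoded with a single-byte encoding, which would also explain why a pattern that only targets U+FEFF leaves it in place.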
Anyway, when you read a line, just drop all the characters before the <XML>.
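A minimal sketch of that suggestion, assuming every line contains the literal root tag "<XML>" as in the sample data. The class XmlLineCleaner and the method DropPrefix are hypothetical names chosen for illustration.

```csharp
using System;

static class XmlLineCleaner
{
    // Hypothetical helper: drops everything before the first rootTag.
    public static string DropPrefix(string line, string rootTag = "<XML>")
    {
        // Ordinal comparison so that invisible characters such as U+FEFF
        // are not ignored during the search.
        int start = line.IndexOf(rootTag, StringComparison.Ordinal);
        // Leave the line untouched when the tag is missing or already first.
        return start > 0 ? line.Substring(start) : line;
    }
}
```

Each line read from the file (for example with File.ReadLines) could be passed through DropPrefix before it is handed to the XML reader. Because this does not depend on which characters make up the prefix, it works whether or not the encoding problem has been identified.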
Source: https://stackoverflow.com/questions/31327410/how-to-remove-non-ascii-characters-from-xml-data