I have a Windows desktop app written in C# that loops through a bunch of XML files stored on disk, created by a third-party program. Most of the files load and process fine, but a few throw an exception about an invalid character when loaded with XDocument.Load.
The referenced file contains a character that is valid for a filename, but invalid in an XML attribute. You have a few options.
Because XmlDocument loads the entire document in one go, it aborts the whole process as soon as it runs into an unencoded character. If you want to process what you can and skip/log the duff bits, look at XmlTextReader. An XmlTextReader created over a FileStream reads one node at a time, so it also uses a lot less memory. You could even get clever and split the work up and parallelise the processing.
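As a rough sketch (the file name and the element handling are placeholders), reading node by node and logging where a bad file falls over might look something like this:

using System;
using System.IO;
using System.Xml;

string inFileName = "input.xml"; // placeholder

using (FileStream fs = File.OpenRead(inFileName))
using (XmlTextReader reader = new XmlTextReader(fs))
{
    try
    {
        // Read() advances one node at a time, so the whole document
        // is never held in memory at once.
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                Console.WriteLine("Element: {0}", reader.Name);
            }
        }
    }
    catch (XmlException ex)
    {
        // Log the offending position and move on to the next file;
        // everything read before the bad character has already been processed.
        Console.WriteLine("Skipping {0}: line {1}, position {2}: {3}",
            inFileName, ex.LineNumber, ex.LinePosition, ex.Message);
    }
}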
When I've hit this, it's been things like accented characters in there: graves, acutes, umlauts, and such.
I don't have any automated processes, so usually I just load the file in Visual Studio and edit the bad characters out until there are no squigglies left. The theory is sound, though.
In order to control the encoding (once you know what it is), you can load the files using the Load method override that accepts a TextReader. Then you can create a new StreamReader against your file, specifying the appropriate Encoding in its constructor.
For example, to open the file using Western European encoding, replace the following line of code in the question:
XDocument xmlDoc = XDocument.Load(inFileName);
with this code:
XDocument xmlDoc = null;
// Read the file as ISO-8859-1 (Latin-1) so accented characters are
// decoded correctly before the XML parser sees them.
using (StreamReader oReader = new StreamReader(inFileName, Encoding.GetEncoding("ISO-8859-1"))) {
    xmlDoc = XDocument.Load(oReader);
}
The list of supported encodings can be found in the MSDN documentation.
Not sure if this is your case, but this can be related to invalid byte sequences for a given encoding. Example: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences.
Try filtering invalid sequences from the file while loading.
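One way to do that (a minimal sketch, assuming the files are meant to be UTF-8; inFileName is the variable from the question) is to decode with a replacement fallback, so undecodable byte sequences become '?' instead of blowing up the parse:

using System.IO;
using System.Text;
using System.Xml.Linq;

// Substitute '?' for invalid byte sequences instead of throwing.
Encoding lenientUtf8 = Encoding.GetEncoding(
    "utf-8",
    new EncoderReplacementFallback("?"),
    new DecoderReplacementFallback("?"));

XDocument xmlDoc;
using (StreamReader reader = new StreamReader(inFileName, lenientUtf8))
{
    xmlDoc = XDocument.Load(reader);
}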