How to prevent System.Xml.XmlException: Invalid character in the given encoding

心在旅途 2020-11-30 11:35

I have a Windows desktop app written in C# that loops through a bunch of XML files stored on disk and created by a 3rd party program. Most all the files are loaded and proce

4 answers
  • 2020-11-30 11:40

    The referenced file contains a character that is valid for a filename, but invalid in an XML attribute. You have a few options.

    1. You could change the filename and rerun your third-party script.
    2. You could work with the vendor to provide a patch that safely encodes the offending characters.
    3. You could pre-validate the XML documents and remove the offending entries prior to processing.
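    Option 3 could be sketched roughly as follows. This is only a sketch under stated assumptions: the XmlPreValidator class and its method names are made up for illustration, and it assumes the files sit in one folder with an .xml extension. It relies on XmlReader throwing XmlException when it reaches bytes it cannot decode, and it flags whole files rather than removing individual entries.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class XmlPreValidator
{
    // Returns null when the file parses cleanly, otherwise a description
    // of the first problem the parser reported.
    public static string FindFirstError(string path)
    {
        try
        {
            // Default reader settings; note that these reject documents
            // containing a DOCTYPE declaration.
            using (XmlReader reader = XmlReader.Create(path))
            {
                // Read() walks the document node by node. Encoding and
                // well-formedness problems surface as XmlException.
                while (reader.Read()) { }
            }
            return null;
        }
        catch (XmlException ex)
        {
            return string.Format("line {0}, position {1}: {2}",
                ex.LineNumber, ex.LinePosition, ex.Message);
        }
    }

    // Returns the paths that parse cleanly and records the others,
    // with their error text, in 'rejected'.
    public static List<string> FilterLoadable(string folder, IDictionary<string, string> rejected)
    {
        var loadable = new List<string>();
        foreach (string path in Directory.GetFiles(folder, "*.xml"))
        {
            string error = FindFirstError(path);
            if (error == null)
                loadable.Add(path);
            else
                rejected[path] = error;
        }
        return loadable;
    }
}
```

    The rejected map gives you a list of file names and positions to send to the vendor or to fix by hand, while the main loop only ever sees files that are known to parse.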
  • 2020-11-30 11:44

    XmlDocument loads the whole document in one go, so as soon as it runs into a character that is invalid for the encoding, the entire load aborts. If you want to process what you can and skip or log the duff bits, look at XmlTextReader. An XmlTextReader created over a FileStream loads one node at a time, so it also uses a lot less memory. You could even get clever, split the work up, and parallelise the processing.

    When I've had this, it's been things like accented characters in there: graves, acutes, umlauts, and such.

    I don't have any automated processes, so usually I just load the file in Visual Studio and edit the bad guys out until there are no squigglies left. The theory is sound though.
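    A minimal sketch of the node-at-a-time idea, assuming you only need to salvage what comes before the first bad byte. It uses XmlReader.Create rather than constructing XmlTextReader directly, and the PartialXmlReader class and method name are invented for this example. Once the reader throws it cannot resume, so this keeps what was read up to that point and reports the error rather than skipping over the bad character.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class PartialXmlReader
{
    // Collects element names one node at a time. If the parser fails,
    // 'error' receives the message and the names read so far are returned.
    public static List<string> ReadElementNames(string path, out string error)
    {
        var names = new List<string>();
        error = null;

        using (FileStream stream = File.OpenRead(path))
        using (XmlReader reader = XmlReader.Create(stream))
        {
            try
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element)
                        names.Add(reader.Name);
                }
            }
            catch (XmlException ex)
            {
                // The reader is unusable after this point; keep what we have.
                error = ex.Message;
            }
        }
        return names;
    }
}
```

    In a real loop you would do your own per-node processing where the sketch collects names, and write the error text to a log along with the file name.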

  • 2020-11-30 12:02

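    To control the encoding (once you know what it is), load the files with the Load overload that accepts a TextReader instead of the one that takes a file name.

    Create a StreamReader over your file, passing the appropriate Encoding to its constructor, and hand that reader to Load. The bytes are then decoded with the encoding you chose rather than the one the parser would have picked.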

    For example, to open the file using Western European encoding, replace the following line of code in the question:

    XDocument xmlDoc = XDocument.Load(inFileName);
    

    with this code:

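    // Needs: using System.IO; using System.Text; using System.Xml.Linq;
    XDocument xmlDoc = null;

    // Decode the file as ISO-8859-1 (Latin-1) instead of letting the parser choose.
    using (StreamReader oReader = new StreamReader(inFileName, Encoding.GetEncoding("ISO-8859-1"))) {
        xmlDoc = XDocument.Load(oReader);
    }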
    

    The list of supported encodings can be found in the MSDN documentation.
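    If only some of the files are in the legacy encoding, one possible arrangement is to try the normal load first and retry with an explicit encoding when it fails. This is a sketch, not something from the question: the FallbackLoader class is hypothetical, and ISO-8859-1 as the fallback is an assumption you would need to confirm against the real files, since that encoding accepts any byte and will happily mis-decode data that was really in another code page.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class FallbackLoader
{
    // Tries the default load; if the parser rejects the file, retries
    // with the named encoding (assumed default: ISO-8859-1).
    public static XDocument Load(string path, string fallbackEncodingName = "ISO-8859-1")
    {
        try
        {
            return XDocument.Load(path);
        }
        catch (XmlException)
        {
            Encoding fallback = Encoding.GetEncoding(fallbackEncodingName);
            using (var reader = new StreamReader(path, fallback))
            {
                // A genuinely malformed file will throw XmlException again here.
                return XDocument.Load(reader);
            }
        }
    }
}
```

    Files that already load keep their current behaviour, and only the failures go through the StreamReader path.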

  • 2020-11-30 12:03

    Not sure if this is your case, but this can be related to invalid byte sequences for a given encoding. Example: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences.

    Try filtering invalid sequences from the file while loading.
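    One way this filtering might look, assuming the files are meant to be UTF-8: decode them with a UTF-8 encoding whose decoder fallback substitutes a replacement string for undecodable bytes instead of throwing. The LenientUtf8Loader class and the "?" default are choices made for this sketch. Be aware that this silently alters the data, so it is worth logging which files needed it.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class LenientUtf8Loader
{
    // Loads the file as UTF-8, substituting 'replacement' for any byte
    // sequence that is not valid UTF-8.
    public static XDocument Load(string path, string replacement = "?")
    {
        Encoding lenientUtf8 = Encoding.GetEncoding(
            "utf-8",
            EncoderFallback.ExceptionFallback,
            new DecoderReplacementFallback(replacement));

        using (var reader = new StreamReader(path, lenientUtf8))
        {
            return XDocument.Load(reader);
        }
    }
}
```

    Because the text reaches the parser already decoded, the invalid bytes never trigger the XmlException. Any remaining failure means the file is malformed in some other way.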
