I have been working on code that parses external XML-files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream
I used a similar piece of code:
illegalxml = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')
...
illegalxml.sub("?",mystring)
...
However, this did not work for all possible strings (400+MB string).
For a final solution I used decoding/encoding as follows:
outxml = "C:/path_to/xml_output_file.xml"
with open(outxml, "w") as out:
valid_xmlstring = mystring.encode('latin1','xmlcharrefreplace').decode('utf8','xmlcharrefreplace')
out.write(valid_xmlstring)
Since the problem is being caused by illegal XML characters, in this case the 0x19 byte, I decided to strip them off. I found the following regular expression on this site:
invalid_xml = re.compile(u'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]')
And I wrote this piece of code that removes illegal bytes while saving an xml feed:
conn = urllib2.urlopen(xmlfeed)
xmlfile = open('output', 'w')
while True:
data = conn.read(4096)
if data:
newdata, count = invalid_xml.subn('', data)
if count > 0 :
print 'Removed %s illegal characters from XML feed' % count
xmlfile.write(newdata)
else:
break
xmlfile.close()
If the problems are actual character encoding problems, rather than malformed XML, the easiest, and probably most efficient, solution is to deal with it at the file reading point. Like this:
import codecs
from lxml import etree
events = ("start", "end")
reader = codecs.EncodedFile(xmlfile, 'utf8', 'utf8', 'replace')
context = etree.iterparse(reader, events=events)
This will cause the non-UTF8-readable bytes to be replaced by '?'. There are a few other options; see the documentation for the codecs module for more.
I had a similar problem with char "" in my xml file, which is also invalid xmlchar. This is because in the xml version 1.0, the characters like �,  are not allowed. And the rule is that all character composition as regular expression '&#x[0-1]?[0-9A-E]' are not allowed. My purpose it to correct the invalid char in a huge xml file, based on Rik's answer, I improved it as below :
import re
invalid_xml = re.compile(r'&#x[0-1]?[0-9a-eA-E];')
new_file = open('new_file.xml','w')
with open('old_file.xml') as f:
for line in f:
nline, count = invalid_xml.subn('',line)
new_file.write(nline)
new_file.close()