I have been working on code that parses external XML files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream.
I used a similar piece of code:
import re
illegalxml = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')
...
mystring = illegalxml.sub("?", mystring)  # sub() returns a new string; it does not modify mystring in place
...
However, this did not work for all possible strings (for example, a string of more than 400 MB).
As a final solution, I used encoding followed by decoding, as follows:
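outxml = "C:/path_to/xml_output_file.xml"
with open(outxml, "w") as out:
    # Characters outside Latin-1 become numeric character references (&#NNNN;).
    # Note: 'xmlcharrefreplace' is only defined for encoding, so in the decode
    # step it cannot repair bytes that are not valid UTF-8.
    valid_xmlstring = mystring.encode('latin1','xmlcharrefreplace').decode('utf8','xmlcharrefreplace')
    out.write(valid_xmlstring)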
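
If the data is too large to clean as a single string, the regex approach can also be applied chunk by chunk. The following is only a sketch and not the approach used above: `clean_stream`, `chunk_size`, and `replacement` are names made up for illustration, and it assumes Python 3 text streams. Because the pattern matches one character at a time, a match can never straddle a chunk boundary.

```python
import io
import re

# Same character class as the regex above: control characters other than
# tab, LF and CR, plus surrogates and the noncharacters U+FFFE and U+FFFF.
ILLEGAL_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")


def clean_stream(src, dst, chunk_size=1 << 20, replacement="?"):
    """Copy text from src to dst, replacing characters that are illegal in XML 1.0.

    src and dst are text-mode file-like objects (hypothetical helper,
    not part of any library).
    """
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty string signals end of stream
            break
        dst.write(ILLEGAL_XML.sub(replacement, chunk))


# Small in-memory demonstration; real use would pass open file objects.
src = io.StringIO("<a>ok\x00bad\x0b</a>")
dst = io.StringIO()
clean_stream(src, dst, chunk_size=4)
print(dst.getvalue())  # <a>ok?bad?</a>
```

With real files, `src` and `dst` would be opened in text mode with an explicit encoding, so only one chunk is held in memory at a time.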