Parsing huge, badly encoded XML files in Python

我寻月下人不归 2021-01-11 15:14

I have been working on code that parses external XML files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream.
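For context, stream parsing in the standard library is usually done with `xml.etree.ElementTree.iterparse`. The following is only a minimal sketch of that idea, not the asker's code; the helper name `count_elements` and the sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count elements named `tag` without building the whole tree in memory."""
    count = 0
    # "end" events fire once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's children and text once it has been handled.
            elem.clear()
    return count


# Hypothetical in-memory sample standing in for a large file on disk.
sample = io.BytesIO(b"<root><item>1</item><item>2</item><other/></root>")
n_items = count_elements(sample, "item")
```

Note that `iterparse` still raises `ParseError` on characters that are illegal in XML, which is why the input has to be cleaned first.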

4 answers
  •  说谎
    说谎 (original poster)
    2021-01-11 15:36

    I used a similar piece of code:

    import re

    # Control characters, surrogates and non-characters that XML 1.0 does not allow.
    illegalxml = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')
    

    ...

    illegalxml.sub("?", mystring)  # returns a new string; mystring itself is unchanged
    

    ...

    However, this did not work for all possible strings (one of them was over 400 MB).
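If the obstacle is the size of the string rather than its content, one option is to apply the same substitution chunk by chunk instead of to one 400+ MB string. This is a sketch under that assumption, not what the answer ended up using; `clean_chunks` and the sample input are hypothetical names. Because the pattern matches single characters, splitting the text at arbitrary chunk boundaries does not change the result.

```python
import io
import re

# Same character class as above: characters that XML 1.0 does not allow.
ILLEGAL_XML = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]')


def clean_chunks(stream, chunk_size=1 << 20):
    """Yield text from `stream` with illegal XML characters replaced by '?'."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield ILLEGAL_XML.sub("?", chunk)


# Hypothetical input; a tiny chunk size is used here to exercise the chunking.
dirty = io.StringIO("<a>ok\x00bad\x0bchars</a>")
cleaned = "".join(clean_chunks(dirty, chunk_size=4))
```

In real use, `stream` would be a file opened in text mode, and each cleaned chunk would be written to an output file rather than joined in memory.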

    For a final solution I used decoding/encoding as follows:

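    outxml = "C:/path_to/xml_output_file.xml"
    # Write UTF-8 explicitly instead of relying on the platform default encoding.
    with open(outxml, "w", encoding="utf-8") as out:
        # 'xmlcharrefreplace' is an encode-only error handler, so decode uses 'replace'.
        valid_xmlstring = mystring.encode('latin1', 'xmlcharrefreplace').decode('utf8', 'replace')
        out.write(valid_xmlstring)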
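To see what this round trip does, here is a small sketch on a made-up string. It assumes the text is mojibake, that is, UTF-8 bytes that were earlier misread as Latin-1; the answer does not say this explicitly, so treat it as an assumption. Characters that fit in Latin-1 are turned back into bytes and re-read as UTF-8, while characters outside Latin-1 become numeric character references.

```python
# Hypothetical input: "café €" where the UTF-8 bytes of "é" were misread as Latin-1.
mystring = "caf\u00c3\u00a9 \u20ac"

# "Ã©" becomes the bytes C3 A9 again; "€" is not in Latin-1, so it becomes "&#8364;".
as_bytes = mystring.encode("latin1", "xmlcharrefreplace")

# Re-reading the bytes as UTF-8 restores "é"; invalid sequences would become U+FFFD.
repaired = as_bytes.decode("utf8", "replace")
```

The numeric references are valid XML, so a parser reading the output file will resolve `&#8364;` back to the euro sign.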
    
