Parsing huge, badly encoded XML files in Python

前端 未结 4 1366
我寻月下人不归
我寻月下人不归 2021-01-11 15:14

I have been working on code that parses external XML-files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream

相关标签:
4条回答
  • 2021-01-11 15:36

    I used a similar piece of code:

     illegalxml = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')
    

    ...

    illegalxml.sub("?",mystring)
    

    ...

    However, this did not work for all possible strings (400+MB string).

    For a final solution I used decoding/encoding as follows:

    outxml = "C:/path_to/xml_output_file.xml"
    with open(outxml, "w") as out:
        valid_xmlstring = mystring.encode('latin1','xmlcharrefreplace').decode('utf8','xmlcharrefreplace')
        out.write(valid_xmlstring) 
    
    0 讨论(0)
  • 2021-01-11 15:42

    Since the problem is being caused by illegal XML characters, in this case the 0x19 byte, I decided to strip them off. I found the following regular expression on this site:

    invalid_xml = re.compile(u'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]')
    

    And I wrote this piece of code that removes illegal bytes while saving an xml feed:

    conn = urllib2.urlopen(xmlfeed)
    xmlfile = open('output', 'w')
    
    while True:
        data = conn.read(4096)
        if data:
            newdata, count = invalid_xml.subn('', data)
            if count > 0 :
                print 'Removed %s illegal characters from XML feed' % count
            xmlfile.write(newdata)
    
        else:
            break
    
    xmlfile.close()
    
    0 讨论(0)
  • 2021-01-11 15:43

    If the problems are actual character encoding problems, rather than malformed XML, the easiest, and probably most efficient, solution is to deal with it at the file reading point. Like this:

    import codecs
    from lxml import etree
    events = ("start", "end")
    reader = codecs.EncodedFile(xmlfile, 'utf8', 'utf8', 'replace')
    context = etree.iterparse(reader, events=events)
    

    This will cause the non-UTF8-readable bytes to be replaced by '?'. There are a few other options; see the documentation for the codecs module for more.

    0 讨论(0)
  • 2021-01-11 15:46

    I had a similar problem with char "" in my xml file, which is also invalid xmlchar. This is because in the xml version 1.0, the characters like &#x0, &#xE are not allowed. And the rule is that all character composition as regular expression '&#x[0-1]?[0-9A-E]' are not allowed. My purpose it to correct the invalid char in a huge xml file, based on Rik's answer, I improved it as below :

    import re
    
    invalid_xml = re.compile(r'&#x[0-1]?[0-9a-eA-E];')
    
    new_file = open('new_file.xml','w') 
    with open('old_file.xml') as f:
        for line in f:
            nline, count = invalid_xml.subn('',line)
            new_file.write(nline) 
    new_file.close()
    
    0 讨论(0)
提交回复
热议问题