Trouble parsing a very large XML file with Python


I have a large XML file (about 84 MB) which is in this form:


    ...
    ....
    ...


                      
3 answers

感动是毒 (2021-01-15 06:04)

Try lxml, which is easier to use.

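#!/usr/bin/env python3
from lxml import etree

# Open in binary mode so lxml can apply the encoding declared in the file.
with open("myfile.xml", "rb") as fp:
    tree = etree.parse(fp)

root = tree.getroot()
print(root.tag)

# Iterate over the root's direct children and print each one's text.
for book in root:
    print(book.text)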

                                                                        
                                                        
            
            
              
                

轮回少年 (2021-01-15 06:13)

I would strongly recommend using a SAX parser here.  I wouldn't recommend using minidom on any XML document larger than a few megabytes; I've seen it use about 400MB of RAM reading in an XML document that was about 10MB in size.  I suspect the problems you are having are being caused by minidom requesting too much memory.

Python comes with an XML SAX parser.  To use it, do something like the following.

from xml.sax import parse
from xml.sax.handler import ContentHandler

class MyContentHandler(ContentHandler):
    # Override ContentHandler methods as needed, for example
    # startElement, endElement and characters.
    pass


handler = MyContentHandler()
parse("mydata.xml", handler)


Your ContentHandler subclass will override various methods in ContentHandler (such as startElement, startElementNS, endElement, endElementNS or characters). These handle the events generated by the SAX parser as it reads your XML document.

SAX is a more 'low-level' way to handle XML than DOM; in addition to pulling out the relevant data from the document, your ContentHandler will need to do work keeping track of what elements it is currently inside.  On the upside, however, as SAX parsers don't keep the whole document in memory, they can handle XML documents of potentially any size, including those larger than yours.
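
As a rough sketch of the bookkeeping described above, the handler below keeps a stack of the currently open element names and collects the text of each title. The element names (library, book, title), the class name TitleCollector and the inline sample are made up for illustration, since the real structure of your file isn't shown.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TitleCollector(ContentHandler):
    """Collects the text of every <title> directly inside a <book> (hypothetical names)."""

    def __init__(self):
        super().__init__()
        self.path = []    # stack of element names that are currently open
        self.buffer = []  # text chunks of the <title> being read
        self.titles = []  # finished titles

    def _in_title(self):
        return self.path[-2:] == ["book", "title"]

    def startElement(self, name, attrs):
        self.path.append(name)
        if self._in_title():
            self.buffer = []

    def characters(self, content):
        # characters() can be called several times for a single text node,
        # so the chunks are buffered and joined in endElement().
        if self._in_title():
            self.buffer.append(content)

    def endElement(self, name):
        if self._in_title():
            self.titles.append("".join(self.buffer))
        self.path.pop()


# Tiny stand-in document; a real run would call xml.sax.parse("mydata.xml", handler).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<library>
  <book><title>First</title></book>
  <book><title>Second</title></book>
</library>"""

handler = TitleCollector()
xml.sax.parseString(SAMPLE, handler)
print(handler.titles)
```

Only the stack and the titles collected so far are held in memory, which is why this approach should scale to files much larger than the sample.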

I haven't tried using other DOM parsers such as lxml on XML documents of this size, but I suspect that lxml will still take considerable time and memory to parse your XML document.  That could slow down your development if you have to wait for an 84MB XML document to be read in every time you run your code.

Finally, I don't believe the Greek, Spanish and Arabic characters you mention will cause a problem.
                                                                        
                                                        
            
            
              
                

一生所求 (2021-01-15 06:14)

There are two kinds of XML parsers (this applies to any language).


DOM parsing (which is what you are using). In this type the whole XML file is read into an in-memory structure and then accessed through methods.
SAX parsing. This is a parsing approach that reads the XML piece by piece, in a step-wise fashion. This technique would allow you to better detect and deal with errors.
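
To illustrate the point about errors, here is a minimal sketch using Python's built-in xml.sax on a deliberately broken document. The sample string and the NameRecorder class are invented for the example. The parser reports where it stopped, and the handler has already received the elements that came before the problem.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class NameRecorder(ContentHandler):
    """Records the name of every element the parser opens."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)


# Deliberately malformed: <b> is never closed.
BROKEN = b"<root><a></a><b></root>"

recorder = NameRecorder()
error_position = None
try:
    xml.sax.parseString(BROKEN, recorder)
except xml.sax.SAXParseException as exc:
    # The exception carries the position at which parsing failed.
    error_position = (exc.getLineNumber(), exc.getColumnNumber())
    print("parse error:", exc.getMessage(), "at", error_position)

# Elements seen before the error are still available.
print(recorder.seen)
```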


In general DOM is easier than SAX because a lot of the gritty details are dealt with by its native methods.
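
For comparison, a small DOM sketch with the standard library's xml.dom.minidom. The element names and the sample string are hypothetical. The whole tree is built first, and then a single method call finds the nodes of interest.

```python
from xml.dom import minidom

# Hypothetical sample; a real file would be loaded with minidom.parse("myfile.xml").
SAMPLE = (
    "<library>"
    "<book><title>First</title></book>"
    "<book><title>Second</title></book>"
    "</library>"
)

doc = minidom.parseString(SAMPLE)  # the entire document is now in memory

# getElementsByTagName walks the tree for us; firstChild is the text node.
titles = [node.firstChild.data for node in doc.getElementsByTagName("title")]
print(titles)
```

The convenience comes at the price of holding the full tree in memory, which is the cost that matters for a file of your size.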

SAX is a bit more of a challenge because you have to code the methods that the SAX parser "runs" during its walk of the XML document.
                                                                        
                                                        
            
            
              
                