I am trying to parse the Stack Overflow dump file (Posts.xml, 17 GB). It consists of <row> elements, one per post, with attributes such as PostTypeId and ParentId.
Using PHP's XMLReader seems to be the right thing to do.
Reason: because of your statement:
I have to 'group' each question with its answers. Basically: find a question (PostTypeId=1), find its answers via the ParentId of other rows, and store them in the DB.
What I understand is that you would like to build a database of questions and answers. Therefore, there is no reason to do the "grouping" on the XML level. Put all the relevant information into the database and do the grouping on the DB level - with DB commands (SQL ...).
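For illustration, here is a minimal sketch of such DB-level grouping, assuming hypothetical questions and answers tables (with answers.parent_id pointing at questions.id) and an existing PDO connection in $pdo - all names here are placeholders, not part of your setup:

$sql = "
    SELECT q.id   AS question_id,
           a.id   AS answer_id,
           a.body AS answer_body
      FROM questions q
      LEFT JOIN answers a ON a.parent_id = q.id
     ORDER BY q.id, a.id
";

foreach ($pdo->query($sql) as $row) {
    // one line per question/answer pair; aggregate in PHP or with GROUP_CONCAT()
    printf("Q%s -> A%s\n", $row['question_id'], $row['answer_id'] ?: '-');
}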
What you have to do is use something like the "target parser" method, e.g. as in "High-performance XML parsing in Python with lxml" (even though it is for Python, it's a good start). This should be possible with XMLReader.
You also wrote:
I considered XMLReader, but as I see it, with XMLReader the program would read through the file a whole lot of times (find a question, look for its answers, repeat many times) and hence it is not viable. Am I wrong?
Yes, you are wrong. With XMLReader you decide yourself how often you want to traverse the file (you normally do it once). In your case I see no reason why you should not be able to insert the data 1:1 for each <row>
element. You can decide per attribute which database table you would like to insert into.
I normally suggest a set of iterators that makes traversing with XMLReader easier. It is called XMLReaderIterator and allows you to foreach over the XMLReader, so the code is often easier to read and write:
// requires the XMLReaderIterator library, which provides XMLElementIterator and XMLReaderNode
$xmlFile = 'Posts.xml';

$reader = new XMLReader();
$reader->open($xmlFile);

/* @var $posts XMLReaderNode[] - iterate over all <posts><row> elements */
$posts = new XMLElementIterator($reader, 'row');

foreach ($posts as $post)
{
    // a row with a ParentId attribute is an answer, otherwise it is a question
    $isAnswerInsteadOfQuestion = (bool)$post->getAttribute('ParentId');

    $importer = $isAnswerInsteadOfQuestion
        ? $importerAnswers
        : $importerQuestions;

    $importer->importRowNode($post);
}
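The two importers used above ($importerQuestions and $importerAnswers) are not defined in the snippet; the following is only a rough sketch of what such an importer could look like, assuming a PDO connection in $pdo and hypothetical questions/answers tables - table and column names are placeholders:

class RowImporter
{
    private $insert;

    public function __construct(PDO $pdo, $table)
    {
        // hypothetical schema: id, parent_id, body
        $this->insert = $pdo->prepare(
            "INSERT INTO {$table} (id, parent_id, body) VALUES (?, ?, ?)"
        );
    }

    public function importRowNode(XMLReaderNode $row)
    {
        $this->insert->execute(array(
            $row->getAttribute('Id'),
            $row->getAttribute('ParentId'), // empty for questions
            $row->getAttribute('Body'),
        ));
    }
}

$importerQuestions = new RowImporter($pdo, 'questions');
$importerAnswers   = new RowImporter($pdo, 'answers');

Both importers would of course have to be created before the foreach loop above runs.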
If you are concerned about the order (e.g. you might fear that some answers' parent questions are not yet available when the answers are imported), I would take care of that inside the importer layer, not inside the traversal.
Depending on whether that happens never, rarely, or often, I would use a different strategy. E.g. if it never happens, I would insert directly into database tables with foreign key constraints activated. If it happens often, I would create one insert transaction for the whole import in which the key constraints are lifted and re-activated at the end.
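A rough sketch of that second strategy, assuming MySQL (SET FOREIGN_KEY_CHECKS is MySQL-specific; other databases offer deferrable constraints instead) and again a PDO connection in $pdo:

$pdo->exec('SET FOREIGN_KEY_CHECKS = 0'); // lift the key constraints
$pdo->beginTransaction();

try {
    foreach ($posts as $post) {
        // ... import each row as shown above ...
    }
    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
} finally {
    $pdo->exec('SET FOREIGN_KEY_CHECKS = 1'); // re-activate the constraints
}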
Because the way you are processing this large file isn't sequential but requires direct access, I think the only viable option is to load the data into an XML database.