Parsing Large XML files w/ Ruby & Nokogiri

眼角桃花 · 2021-01-02 07:30

I have a large XML file (about 10K rows) that I need to parse regularly. It is in this format:


    <root>
        <summarysection>
            <totalcount>10000</totalcount>
        </summarysection>
        <items>
            <item>
                <cat>...</cat>
                ...
            </item>
            ...
        </items>
    </root>

5 Answers
  • 2021-01-02 08:09

    You can dramatically decrease your execution time by changing your code to the following. Just change the "99" to whatever category you want to check:

    require 'nokogiri'

    icount = 0
    # Parse the document once, then collect every <item> node up front
    xmlfeed = Nokogiri::XML(File.open("test.xml"))
    items = xmlfeed.xpath("//item")
    items.each do |item|
      # The first grandchild text node holds the item's category
      text = item.children.children.first.text
      icount += 1 if text =~ /99/
    end

    # Everything that isn't in the matched category
    othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount

    puts icount
    puts othercount
    

    This took about three seconds on my machine. I think a key error you made was iterating over the "items" node instead of collecting the individual "item" nodes. That made your iteration code awkward and slow.
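
    As a side note, `text =~ /99/` is a substring match, so it would also count categories such as "199". If each <item> keeps its category in a <cat> child (the structure used in the benchmark answer below), the whole count can be pushed into XPath; a hedged one-line variant:

    # Count items whose <cat> child equals '99' exactly
    icount = xmlfeed.xpath("//item[cat = '99']").length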

  • 2021-01-02 08:18

    I'd recommend using a SAX parser rather than a DOM parser for a file this large. Nokogiri has a nice SAX parser built in: http://nokogiri.org/Nokogiri/XML/SAX.html

    The SAX way of doing things is nice for large files simply because it doesn't build a giant DOM tree, which in your case is overkill; you can build up your own structures when events fire (for counting nodes, for example).
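
    To make that concrete, here is a minimal sketch of the event-driven style (hedged: the 'item' element name and 'test.xml' filename are assumptions matching the question):

    require 'nokogiri'

    # A SAX document only receives events; no tree is ever built
    class ItemCounter < Nokogiri::XML::SAX::Document
      attr_reader :count
      def initialize
        @count = 0
      end
      # Called once for every opening tag the parser streams past
      def start_element(name, attrs = [])
        @count += 1 if name == 'item'
      end
    end

    counter = ItemCounter.new
    Nokogiri::XML::SAX::Parser.new(counter).parse_file('test.xml')
    puts counter.count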

  • 2021-01-02 08:21

    You may like to try this out - https://github.com/amolpujari/reading-huge-xml

    HugeXML.read xml, elements_lookup do |element|
      # => element{ :name, :value, :attributes}
    end
    

    I also tried using Ox.
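
    For comparison, an Ox SAX handler for the same job would look roughly like this (a sketch, assuming the question's <item> elements; note that Ox yields element names as symbols):

    require 'ox'

    class ItemCounter < Ox::Sax
      attr_reader :count
      def initialize
        @count = 0
      end
      def start_element(name)
        # Ox passes the element name as a symbol
        @count += 1 if name == :item
      end
    end

    handler = ItemCounter.new
    File.open('test.xml') { |f| Ox.sax_parse(handler, f) }
    puts handler.count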

  • 2021-01-02 08:30

    Here's an example comparing a SAX-based count with a DOM-based count, tallying 500,000 <item>s that each carry one of seven categories. First, the output:

    Create XML file: 1.7s
    Count via SAX: 12.9s
    Create DOM: 1.6s
    Count via DOM: 2.5s

    Both techniques produce the same hash counting the number of each category seen:

    {"Cats"=>71423, "Llamas"=>71290, "Pigs"=>71730, "Sheep"=>71491, "Dogs"=>71331, "Cows"=>71536, "Hogs"=>71199}
    

    The SAX version takes 12.9s to count and categorize, while the DOM version takes only 1.6s to create the DOM elements and 2.5s more to find and categorize all the <cat> values. The DOM version is around 3x as fast!

    …but that's not the entire story. We have to look at RAM usage as well.

    • For 500,000 items SAX (12.9s) peaks at 238MB of RAM; DOM (4.1s) peaks at 1.0GB.
    • For 1,000,000 items SAX (25.5s) peaks at 243MB of RAM; DOM (8.1s) peaks at 2.0GB.
    • For 2,000,000 items SAX (55.1s) peaks at 250MB of RAM; DOM (???) peaks at 3.2GB.

    I had enough memory on my machine to handle 1,000,000 items, but at 2,000,000 I ran out of RAM and had to start using virtual memory. Even with an SSD and a fast machine I let the DOM code run for almost ten minutes before finally killing it.

    It is very likely that the long times you are reporting are because you are running out of RAM and hitting the disk continuously as part of virtual memory. If you can fit the DOM into memory, use it, as it is FAST. If you can't, however, you really have to use the SAX version.

    Here's the test code:

    require 'nokogiri'
    
    CATEGORIES = %w[ Cats Dogs Hogs Cows Sheep Pigs Llamas ]
    ITEM_COUNT = 500_000
    
    def test!
      create_xml
      sleep 2; GC.start # Time to read memory before cleaning the slate
      test_sax
      sleep 2; GC.start # Time to read memory before cleaning the slate
      test_dom
    end
    
    def time(label)
      t1 = Time.now
      yield.tap{ puts "%s: %.1fs" % [ label, Time.now-t1 ] }
    end
    
    def test_sax
      item_counts = time("Count via SAX") do
        counter = CategoryCounter.new
        # Use parse_file so we can stream data from disk instead of flooding RAM
        Nokogiri::HTML::SAX::Parser.new(counter).parse_file('tmp.xml')
        counter.category_counts
      end
      # p item_counts
    end
    
    def test_dom
      doc = time("Create DOM"){ File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) } }
      counts = time("Count via DOM") do
        counts = Hash.new(0)
        doc.xpath('//cat').each do |cat|
          counts[cat.children[0].content] += 1
        end
        counts
      end
      # p counts
    end
    
    class CategoryCounter < Nokogiri::XML::SAX::Document
      attr_reader :category_counts
      def initialize
        @category_counts = Hash.new(0)
      end
      def start_element(name, attrs = [])
        @count = (name == 'cat')
      end
      def characters(str)
        if @count
          @category_counts[str] += 1
          @count = false
        end
      end
    end
    
    def create_xml
      time("Create XML file") do
        File.open('tmp.xml','w') do |f|
          f << "<root>
          <summarysection><totalcount>10000</totalcount></summarysection>
          <items>
          #{
            ITEM_COUNT.times.map{ |i|
              "<item>
                <cat>#{CATEGORIES.sample}</cat>
                <name>Name #{i}</name>
                <value>Value #{i}</value>
              </item>"
            }.join("\n")
          }
          </items>
          </root>"
        end
      end
    end
    
    test! if __FILE__ == $0
    

    How does the DOM counting work?

    If we strip away some of the test structure, the DOM-based counter looks like this:

    # Open the file on disk and hand the IO object to Nokogiri so it can stream-read it;
    # better than doc = Nokogiri.XML(IO.read('tmp.xml')),
    # which loads one huge string into memory just to parse it
    doc = File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) }
    
    # Create a hash with default '0' values for any 'missing' keys
    counts = Hash.new(0) 
    
    # Find every `<cat>` element in the document (assumes one per <item>)
    doc.xpath('//cat').each do |cat|
      # Get the child text node's content and use it as the key to the hash
      counts[cat.children[0].content] += 1
    end
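
    As a small equivalent variant: Nokogiri's Node#text concatenates a node's text descendants, and each <cat> here holds a single text node, so the loop can also be written as:

    doc.xpath('//cat').each { |cat| counts[cat.text] += 1 }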
    

    How does the SAX counting work?

    First, let's focus on this code:

    class CategoryCounter < Nokogiri::XML::SAX::Document
      attr_reader :category_counts
      def initialize
        @category_counts = Hash.new(0)
      end
      def start_element(name, attrs = [])
        @count = (name == 'cat')
      end
      def characters(str)
        if @count
          @category_counts[str] += 1
          @count = false
        end
      end
    end
    

    When we create a new instance of this class we get an object that has a Hash that defaults to 0 for all values, and a couple of methods that can be called on it. The SAX Parser will call these methods as it runs through the document.

    • Each time the SAX parser sees a new element it calls the start_element method on this class. When that happens, we set a flag recording whether the element is named "cat", so the upcoming text callback knows whether it is reading a category name.

    • Each time the SAX parser slurps up a chunk of text it calls the characters method of our object. When that happens, we check whether the last element we saw was a category (i.e. whether @count was set to true); if so, we use the value of this text node as the category name and add one to our counter. (See the caveat just below about chunked text.)
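
    One caveat worth flagging: SAX parsers may split a single text node across several characters calls. The generated test file doesn't trigger this, but a more defensive counter (a sketch using the same Nokogiri API) buffers the text and tallies it in end_element:

    class BufferedCategoryCounter < Nokogiri::XML::SAX::Document
      attr_reader :category_counts
      def initialize
        @category_counts = Hash.new(0)
      end
      def start_element(name, attrs = [])
        # Start buffering when we enter a <cat> element
        @buffer = '' if (@in_cat = (name == 'cat'))
      end
      def characters(str)
        # Text may arrive in several chunks; accumulate them all
        @buffer << str if @in_cat
      end
      def end_element(name)
        if name == 'cat'
          @category_counts[@buffer] += 1
          @in_cat = false
        end
      end
    end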

    To use our custom object with Nokogiri's SAX parser we do this:

    # Create a new instance, with its empty hash
    counter = CategoryCounter.new
    
    # Create a new parser that will call methods on our object, and then
    # use `parse_file` so that it streams data from disk instead of flooding RAM
    Nokogiri::HTML::SAX::Parser.new(counter).parse_file('tmp.xml')
    
    # Once that's done, we can get the hash of category counts back from our object
    counts = counter.category_counts
    p counts["Pigs"]
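
    One note: this example drives Nokogiri's HTML SAX parser. Nokogiri also ships an XML SAX parser with the same parse_file interface, which is arguably the better fit for strict XML:

    Nokogiri::XML::SAX::Parser.new(counter).parse_file('tmp.xml')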
    
  • 2021-01-02 08:31

    Check out Greg Weber's version of Paul Dix's sax-machine gem: http://blog.gregweber.info/posts/2011-06-03-high-performance-rb-part1

    Parsing a large file with plain SaxMachine seems to load the whole file into memory.

    sax-machine makes the code much, much simpler; Greg's variant makes it stream.
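
    For flavor, the sax-machine DSL for the question's format would look roughly like this (a sketch: the class names are made up, and whether parsing streams or buffers depends on which variant of the gem you use):

    require 'sax-machine'

    class Item
      include SAXMachine
      element :cat
    end

    class Feed
      include SAXMachine
      elements :item, as: :items, class: Item
    end

    feed = Feed.parse(File.read('test.xml'))
    counts = Hash.new(0)
    feed.items.each { |item| counts[item.cat] += 1 }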
