Question:
I get an "Out of memory" error while parsing a large (100 MB) XML file:
use strict;
use warnings;
use XML::Twig;

my $data = XML::Twig->new
                    ->parsefile("divisionhouserooms-v3.xml")
                    ->simplify( keyattr => [] );
my @good_division_numbers = qw( 30 31 32 35 38 );
foreach my $property ( @{ $data->{DivisionHouseRoom} } ) {
    my $house_code = $property->{HouseCode};
    print $house_code, "\n";

    my $amount_of_bedrooms = 0;
    foreach my $division ( @{ $property->{Divisions}->{Division} } ) {
        next unless grep { $_ eq $division->{DivisionNumber} } @good_division_numbers;
        $amount_of_bedrooms += $division->{DivisionQuantity};
    }

    open my $fh, ">>", "Result.csv" or die $!;
    print $fh join( "\t", $house_code, $amount_of_bedrooms ), "\n";
    close $fh;
}
What can I do to fix this error?
Answer 1:
Handling large XML files that don't fit in memory is something that XML::Twig advertises:
"One of the strengths of XML::Twig is that it lets you work with files that do not fit in memory (BTW storing an XML document in memory as a tree is quite memory-expensive, the expansion factor being often around 10). To do this you can define handlers, that will be called once a specific element has been completely parsed. In these handlers you can access the element and process it as you see fit (...)"
The code posted in the question isn't making use of this strength of XML::Twig at all (using the simplify method makes it little better than XML::Simple).

What's missing from the code are the twig_handlers or twig_roots options, which essentially tell the parser to focus only on the relevant portions of the XML document, keeping memory use low.
It's difficult to say without seeing the XML whether processing the document chunk-by-chunk or just selected parts is the way to go, but either one should solve this issue.
So the code should look something like the following (chunk-by-chunk demo):
use strict;
use warnings;
use XML::Twig;
use List::Util 'sum';   # To make life easier
use Data::Dump 'dump';  # To see what's going on

my %bedrooms;           # Data structure to store the wanted info

my $xml = XML::Twig->new(
    twig_roots => {
        DivisionHouseRoom => \&count_bedrooms,
    }
);

$xml->parsefile('divisionhouserooms-v3.xml');

sub count_bedrooms {
    my ( $twig, $element ) = @_;

    my @divParents = $element->children('Divisions');
    my $id         = $element->first_child_text('HouseCode');

    for my $divParent (@divParents) {
        my @divisions = $divParent->children('Division');

        # sum the DivisionQuantity of each Division under this parent
        my $total = sum map { $_->first_child_text('DivisionQuantity') } @divisions;
        $bedrooms{$id} = $total;
    }

    $element->purge;    # Free up memory
}

dump \%bedrooms;
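To get from there to the exact output the question builds (only divisions 30, 31, 32, 35 and 38 counted, one tab-separated line per house in Result.csv), the handler could be adapted roughly as follows; this is an untested sketch that reuses the element and field names from the question's code:

use strict;
use warnings;
use XML::Twig;
use List::Util 'sum0';    # sum0 (List::Util 1.26+) returns 0 for an empty list

my %good_division = map { $_ => 1 } qw( 30 31 32 35 38 );

open my $out, '>', 'Result.csv' or die $!;

my $twig = XML::Twig->new(
    twig_roots => {
        DivisionHouseRoom => sub {
            my ( $t, $elt ) = @_;
            my $house_code = $elt->first_child_text('HouseCode');

            # keep only the wanted division numbers, then add up their quantities
            my $bedrooms = sum0 map  { $_->first_child_text('DivisionQuantity') }
                                grep { $good_division{ $_->first_child_text('DivisionNumber') } }
                                $elt->descendants('Division');

            print {$out} join( "\t", $house_code, $bedrooms ), "\n";
            $t->purge;    # free the memory used by this element
        },
    },
);

$twig->parsefile('divisionhouserooms-v3.xml');
close $out;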
Answer 2:
See the "Processing an XML document chunk by chunk" section of the XML::Twig documentation; it specifically discusses how to process a document part by part, which is what makes working with large XML files possible.
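The pattern described there boils down to registering a handler for the repeated element and calling flush (or purge, when no output document is wanted) after each one, so memory use stays bounded regardless of file size. A rough sketch, with Record and big.xml as placeholder names:

use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        Record => sub {    # called once each <Record> has been fully parsed
            my ( $t, $elt ) = @_;
            # ... inspect or modify $elt here ...
            $t->flush;     # print what has been parsed so far, then release it
        },
    },
);
$twig->parsefile('big.xml');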
Source: https://stackoverflow.com/questions/7293687/out-of-memory-while-parsing-large-100-mb-xml-file-using-perl