Question
I have a Perl routine that is causing frequent "out of memory" errors on the system.
The script does 3 things:
1> Get the output of a command into an array (@arr = `$command` --> the array holds about 13 MB of data after the command runs)
2> Use a large regex to match the contents of individual array elements.
The regex is something like this:
if($new_element =~ m|([A-Z0-9\-\._\$]+);\d+\s+([0-9]+)-([A-Z][A-Z][A-Z])-([0-9][0-9][0-9][0-9])\s+([0-9]+)\:([0-9]+)\:([0-9]+)|io)
<put to hash>
3> Put the array elements into a persistent hash map.
$hash_var{$arr[0]} = "Some value"
Edit: Sample data processed by the regex:
Z4:[newuser.newdir]TESTOPEN_ERROR.COM;4
8-APR-2014 11:14:12.58
Z4:[newuser.newdir]TEST_BOC.CFG;5
5-APR-2014 10:43:11.70
Z4:[newuser.newdir]TEST_BOC.COM;20
5-APR-2014 10:41:01.63
Z4:[newuser.newdir]TEST_NEWRT.COM;17
4-APR-2014 10:30:56.11
There are about 10,000 lines like these.
I started by suspecting that the array and hash together may be consuming too much memory. However, I have started to think the regex might have something to do with the out-of-memory errors as well.
Is the Perl regex (with the 'io' options!) really the main culprit behind the out-of-memory errors?
Answer 1:
This has nothing to do with regexes.
If you are operating in a memory-constrained environment, you should process data records one at a time rather than fetching all of them at once. Let's assume you pull your data like:
my @data = `some command`;
for my $line (@data) {
... # process the line
}
This is incredibly wasteful because you need storage for both the raw data and the output of your processing (in your case: the hash).
Instead, process the input line by line. We can use the open function instead of backticks for this:
open my $cmd, '-|', 'some', 'command' or die "Can't run some command: $!";
while (my $line = <$cmd>) {
... # process the line
}
There is no need for an array here, which saves us 13MB of memory that we can now put to other use.
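To make the line-by-line approach concrete, here is a minimal sketch that parses each record as it is read and keeps only the extracted fields. The helper name parse_listing is hypothetical, and the record layout (file spec, spaces, then the date on the same line) is an assumption based on the sample data in the question. The demo reads from an in-memory filehandle so it runs anywhere; with a real command you would open a pipe as shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: read listing records from any filehandle, one line
# at a time, and keep only the parsed fields (never the whole input).
sub parse_listing {
    my ($fh) = @_;
    my %seen;
    while (my $line = <$fh>) {
        # Assumed layout: FILESPEC;VERSION <spaces> D-MON-YYYY HH:MM:SS.cc
        next unless $line =~ m/^(\S+);(\d+)\s+(\d+)-([A-Z]{3})-(\d{4})\s+(\d+):(\d+):(\d+)/;
        $seen{"$1;$2"} = "$3-$4-$5 $6:$7:$8";
    }
    return \%seen;
}

# Demo input (assumed format). For a real command, replace this with:
#   open my $fh, '-|', 'some', 'command' or die "Can't run some command: $!";
my $sample = join '',
    "Z4:[newuser.newdir]TEST_BOC.CFG;5     5-APR-2014 10:43:11.70\n",
    "Z4:[newuser.newdir]TEST_BOC.COM;20    5-APR-2014 10:41:01.63\n";
open my $fh, '<', \$sample or die "Can't open in-memory handle: $!";
my $files = parse_listing($fh);
close $fh;

print "$_ => $files->{$_}\n" for sort keys %$files;
```

Only one line of input is held in memory at any time; the hash grows with the number of records, not with the raw size of the command output.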
Answer 2:
What problem are you really trying to solve? Use your words... not Perl.
Something like: "The script is picking apart the output from an OpenVMS DIRECTORY command, and the objective is to report the number of files and the oldest date, ordered by directory"
First question is WHY keep the array. Will the script 'walk' it again? If not, just process each line there and then in a for loop.
The regex seems to pick out a file name and a date. That's been done before. It is not hard, and can be simplified by trusting the OpenVMS directory format. Something like this reads better imho:
if($new_element =~ m|](.*);\d+\s+(\d+)-(\w+)-(\d+)\s+(\d+):(\d+):(\d+)|)
: $hash_var{$arr[0]} =
Hmmm, that suggests to me that a whole line from the array is used as a key value, with all 50+ spaces. So those 10,000 lines turn into 1,000,000+ bytes just for raw key bytes. A lot, but not crazy. Now, we know that the first word on the line MUST be unique, so why not exploit that: $hash_var{$1} = xxx if /(\S+)/;
The program may also want to exploit the fact that the leading strings are highly repetitive, and substitute everything before the "]" with an ever-increasing directory number, maintained in a 'look-aside' array and/or hash.
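A rough sketch of that directory-number idea follows. The helper names compact_key and expand_key and the "number:name" key format are made up for illustration; they are not from the original script.

```perl
use strict;
use warnings;

my @dir_names;     # look-aside array: directory number -> directory string
my %dir_number;    # reverse lookup:   directory string -> directory number

# Hypothetical helper: replace everything up to and including "]" with a
# small directory number, e.g. "Z4:[a.b]X.COM;1" becomes "0:X.COM;1".
sub compact_key {
    my ($file_spec) = @_;
    my ($dir, $name) = $file_spec =~ m/^(.*\])(.*)$/
        or return $file_spec;    # no directory part: keep the key as-is
    if (!exists $dir_number{$dir}) {
        push @dir_names, $dir;
        $dir_number{$dir} = $#dir_names;
    }
    return $dir_number{$dir} . ':' . $name;
}

# Hypothetical helper: rebuild the full file spec from a compact key.
sub expand_key {
    my ($key) = @_;
    my ($n, $name) = $key =~ m/^(\d+):(.*)$/
        or return $key;
    return $dir_names[$n] . $name;
}

my %hash_var;
for my $spec ('Z4:[newuser.newdir]TEST_BOC.CFG;5',
              'Z4:[newuser.newdir]TEST_BOC.COM;20') {
    $hash_var{ compact_key($spec) } = "Some value";
}
print expand_key($_), "\n" for sort keys %hash_var;
```

Each distinct directory string is stored once in the look-aside structures, so the per-file keys shrink to a short number plus the file name. On re-access, the same compact_key call has to be used to build the lookup key.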
Personally I would drop /NOHEAD from the command, and use a regex to pick up the directories as they come by on their own lines.
Or use a SUBSTR or whatever... of course you'd need to construct a similar key on re-access.
In the related topic, there is debugging output printed. Perhaps include the line number in the array for your own understanding?
Perl encounters "out of memory" in openvms system
Good luck! Hein
Source: https://stackoverflow.com/questions/23378580/perl-regex-using-too-much-memory