I am processing a large directory every night. It accumulates around 1 million files each night, half of which are .txt files that I need to move to a different directory according to their contents.
Each .txt file is pipe-delimited and contains only 20 records. Record 6 is the one that contains the information I need to determine which directory to move the file to.
Example Record:
A|CHNL_ID|4
In this case the file would be moved to /out/4.
This script is processing at a rate of 80,000 files per hour.
Are there any recommendations on how I could speed this up?
opendir(DIR, $dir) or die "$!\n";
while ( defined( my $txtFile = readdir DIR ) ) {
    next if ( $txtFile !~ /\.txt$/ );
    $cnt++;
    local $/;                                    # slurp mode: read the whole file at once
    open my $fh, '<', $txtFile or die $!, $/;
    my $data = <$fh>;
    my ($channel) = $data =~ /A\|CHNL_ID\|(\d+)/i;
    close($fh);
    move( $txtFile, "$outDir/$channel" ) or die $!, $/;
}
closedir(DIR);
Try something like:
print localtime()."\n";    # to find where time is spent
opendir(DIR, $dir) or die "$!\n";
my @txtFiles = map "$dir/$_", grep /\.txt$/, readdir DIR;
closedir(DIR);
print localtime()."\n";

my %fileGroup;
for my $txtFile (@txtFiles) {
    # local $/ = "\n";    # \n or other record separator
    open my $fh, '<', $txtFile or die $!;
    local $_ = join "", map { scalar <$fh> } 1..6;    # read 6 records, not the whole file
    close($fh);
    push @{ $fileGroup{$1} }, $txtFile
        if /A\|CHNL_ID\|(\d+)/i or die "No channel found in $_";
}

for my $channel (sort keys %fileGroup) {
    moveGroup( @{ $fileGroup{$channel} }, "$outDir/$channel" );
}
print localtime()." finito\n";

sub moveGroup {
    my $dir = pop @_;
    print localtime()." <- start $dir\n";
    move($_, $dir) for @_;    # or something else if each move spawns a sub-process
    # rename($_, $dir) for @_;
}
This splits the job into three main parts, so you can time each one and find where most of the time is spent.
You are being hurt by the sheer number of files in a single directory.
I created 80_000 files and ran your script, which completed in 5.2 seconds. This is on an older laptop with CentOS 7 and Perl v5.16. But with half a million files† it takes nearly 7 minutes. Thus the problem is not the performance of your code per se (though it can also be tightened).
So one simple solution is to run the script from cron, say every hour, as the files come in. While you move the .txt files, also move the others elsewhere, and there will never be too many files; the script will always run in seconds. In the end you can move those other files back, if needed.
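As a rough sketch of that housekeeping pass, assuming a hypothetical /in incoming directory and an /in/hold holding area (adjust the names to your setup):
use strict;
use warnings;
use File::Copy qw(move);

my $dir     = '/in';         # hypothetical incoming directory
my $holdDir = '/in/hold';    # hypothetical holding area for non-.txt files

opendir my $dh, $dir or die "Can't open $dir: $!";
while ( defined( my $file = readdir $dh ) ) {
    next if $file eq '.' or $file eq '..' or -d "$dir/$file";
    if ( $file !~ /\.txt$/ ) {
        # park everything else so the directory stays small
        move( "$dir/$file", $holdDir ) or warn "Can't move $file: $!";
        next;
    }
    # ... read record 6 and move the .txt file as in the original script ...
}
closedir $dh;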
Another option is to store these files on a partition with a different filesystem, say ReiserFS. However, this doesn't at all address the main problem of having way too many files in a directory.
Another partial fix is to replace
while ( defined( my $txtFile = readdir DIR ) )
with
while ( my $path = <"$dir/*txt"> )
which results in a 1m12s run (as opposed to nearly 7 minutes). Don't forget to adjust the file naming, since <> above returns the full path to the file. Again, this doesn't really deal with the problem.
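Roughly, the adjusted loop would be something like this (a sketch only; it keeps the rest of your logic and just accounts for the full path that the glob returns):
use File::Copy qw(move);

while ( my $path = <"$dir/*txt"> ) {
    $cnt++;
    local $/;                                   # slurp the file
    open my $fh, '<', $path or die "Can't open $path: $!";
    my $data = <$fh>;
    close $fh;
    my ($channel) = $data =~ /A\|CHNL_ID\|(\d+)/i;
    move( $path, "$outDir/$channel" ) or die "Can't move $path: $!";
}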
If you had control over how the files are distributed, you would want a directory structure about three levels deep, with the directories named using the files' MD5; that would result in a very balanced distribution.
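For instance, a path could be derived from the MD5 of the file name. A sketch using the core Digest::MD5 and File::Path modules; the base directory, file name, and 2-hex-digit bucketing are just illustrative choices:
use Digest::MD5 qw(md5_hex);
use File::Path  qw(make_path);
use File::Copy  qw(move);

sub bucketed_path {
    my ($base, $filename) = @_;
    my $md5 = md5_hex($filename);                  # e.g. "9e107d9d372bb6826bd81d3542a419d6"
    my ($a, $b, $c) = $md5 =~ /^(..)(..)(..)/;     # three 2-hex-digit levels
    return "$base/$a/$b/$c";
}

my $dest = bucketed_path('/out/spool', 's12345.txt');    # hypothetical base dir and file
make_path($dest);
move('s12345.txt', "$dest/s12345.txt") or die "move failed: $!";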
† File names and their content were created as
perl -MPath::Tiny -wE'
path("dir/s".$_.".txt")->spew("A|some_id|$_\n") for 1..500_000
'
This is the sort of task that I often perform. Some of these suggestions were already mentioned in various comments. None of them is special to Perl, and the biggest wins will come from changing the environment rather than the language.
Segment files into separate directories to keep the directories small. Larger directories take longer to read (sometimes exponentially). This happens in whatever produces the files. The file path would be something like .../ab/cd/ef/filename.txt where the ab/cd/ef come from some function that has unlikely collisions. Or maybe it's like .../2018/04/01/filename.txt.
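A date-based layout is even simpler to generate at creation time. A sketch using only core modules; the base directory is hypothetical:
use POSIX      qw(strftime);
use File::Path qw(make_path);

my $base = '/data/incoming';                      # hypothetical base directory
my $day  = strftime('%Y/%m/%d', localtime);       # e.g. "2018/04/01"
my $dir  = "$base/$day";
make_path($dir);
open my $fh, '>', "$dir/filename.txt" or die $!;  # the producer writes here instead of one flat dir
print {$fh} "A|CHNL_ID|4\n";
close $fh;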
You probably don't have much control over the producer, but if you do, I'd investigate making it append lines to a single file; something else can split that into separate files later.
Run more often and move processed files somewhere else (again, possibly with hashing).
Run continually and poll the directory periodically to check for new files.
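A minimal polling loop might look like this; the one-minute interval and the process_file routine are placeholders:
use strict;
use warnings;

my $dir = '/in';                       # hypothetical incoming directory
while (1) {
    for my $path ( glob("$dir/*.txt") ) {
        process_file($path);           # placeholder for the existing per-file logic
    }
    sleep 60;                          # poll once a minute
}

sub process_file { my ($path) = @_; print "would process $path\n" }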
Run the program in parallel. If you have a lot of idle cores, get them working on it too. You'd need something to decide who gets to work on what.
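One way to carve up the work, assuming the Parallel::ForkManager module from CPAN, is to hand each child its own slice of the file list (a sketch, not a drop-in replacement):
use strict;
use warnings;
use Parallel::ForkManager;

my @files   = glob('/in/*.txt');              # hypothetical incoming directory
my $workers = 4;
my $pm      = Parallel::ForkManager->new($workers);

# Deal the files out round-robin so each child gets a disjoint slice.
for my $w (0 .. $workers - 1) {
    $pm->start and next;                      # parent continues; child runs the body
    for my $i (grep { $_ % $workers == $w } 0 .. $#files) {
        # process $files[$i] here (read record 6, move the file)
    }
    $pm->finish;
}
$pm->wait_all_children;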
Instead of creating files, shove them into a lightweight data store, such as Redis. Or maybe a heavyweight data store.
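If the producer could be pointed at Redis instead of the filesystem, the consumer side with the Redis module from CPAN might look roughly like this (the list names are made up):
use strict;
use warnings;
use Redis;

my $redis = Redis->new(server => 'localhost:6379');

# The producer would push each record set onto a list, e.g. "incoming".
# The consumer pops records and routes them by channel id.
while ( my $record = $redis->rpop('incoming') ) {
    my ($channel) = $record =~ /A\|CHNL_ID\|(\d+)/i or next;
    $redis->rpush("channel:$channel", $record);    # hypothetical per-channel list
}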
Don't actually read the file contents. Use File::Map instead. This is often a win for very large files, but I haven't played with it much on large collections of small files.
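A sketch of the idea with File::Map, which maps the file into a scalar without an explicit read (the file name is hypothetical, and the pattern is the record 6 match from the question):
use strict;
use warnings;
use File::Map qw(map_file);

my $path = 's12345.txt';                      # hypothetical file name
map_file my $map, $path;                      # memory-map the file read-only
my ($channel) = $map =~ /A\|CHNL_ID\|(\d+)/i;
print "would move $path to /out/$channel\n" if defined $channel;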
Get faster spinning disks or maybe an SSD. I had the misfortune of accidentally creating millions of files in a single directory on a slow disk.
I don't think anyone has brought it up, but have you considered running a long-term process that uses filesystem notifications as near-realtime events, instead of processing in batch? I'm sure CPAN will have something for Perl 5; there is a built-in class in Perl 6 that illustrates what I mean: https://docs.perl6.org/type/IO::Notification. Perhaps someone else can chime in on what a good module to use in P5 is?
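On Linux, one CPAN option is Linux::Inotify2. A sketch of watching the incoming directory for files as they finish being written (the directory name is hypothetical):
use strict;
use warnings;
use Linux::Inotify2;

my $dir     = '/in';                                  # hypothetical incoming directory
my $inotify = Linux::Inotify2->new or die "inotify: $!";

$inotify->watch($dir, IN_CLOSE_WRITE | IN_MOVED_TO, sub {
    my $event = shift;
    my $path  = $event->fullname;
    return unless $path =~ /\.txt$/;
    # read record 6 and move the file, as in the original script
    print "new file ready: $path\n";
});

1 while $inotify->poll;                               # block and dispatch events forever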
Source: https://stackoverflow.com/questions/49332466/perl-program-to-efficiently-process-500-000-small-files-in-a-directory