I'm having a bit of trouble splitting a large text file into multiple smaller ones. The syntax of my text file is the following:
You can also try split -p "^$" (the -p option exists in BSD/macOS split, not in GNU split).
Perl has a useful feature called the input record separator, $/. This is the 'marker' for separating records when reading a file.
So:
#!/usr/bin/env perl
use strict;
use warnings;
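# Treat a blank line (two consecutive newlines) as the record separator.
local $/ = "\n\n";
my $count = 0;
# <> reads from files named on the command line, or from STDIN.
while ( my $chunk = <> ) {
    open ( my $output, '>', "filename_".$count++ ) or die $!;
    print {$output} $chunk;
    close ( $output );
}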
Just like that. The <> is the 'magic' filehandle, in that it reads piped data or reads from files specified on the command line (it opens them and reads them). This is similar to how sed or grep work.
This can be reduced to a one-liner:
perl -00 -pe 'open ( $out, ">", "filename_".++$n ); select $out;' yourfilename_here
In case you get a "too many open files" error such as the following:
awk: whatever-18.txt makes too many open files
input record number 18, file file.txt
source line number 1
you may need to close each newly created file before creating the next one, as follows:
awk -v RS= '{close("whatever-" i ".txt"); i++}{print > ("whatever-" i ".txt")}' file.txt
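A quick way to see what this command produces is to run it on a throwaway file. This is only a sketch: the scratch directory awk_close_demo and the sample contents are assumptions made up for the demonstration.

```shell
# Hypothetical scratch directory and sample input: three blocks separated by empty lines.
mkdir -p awk_close_demo
printf 'alpha\nbeta\n\ngamma\n\ndelta\nepsilon\n' > awk_close_demo/file.txt

# RS= (empty record separator) puts awk in paragraph mode: each block is one record.
# close() releases the previous output file before the next one is written,
# so only one output file is open at any time.
( cd awk_close_demo &&
  awk -v RS= '{close("whatever-" i ".txt"); i++}{print > ("whatever-" i ".txt")}' file.txt )

# Show what was written.
for f in awk_close_demo/whatever-*.txt; do
    printf '== %s ==\n' "$f"
    cat "$f"
done
```

On the very first record close() is called on a file that was never opened; awk just returns an error status for that call and carries on.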
You could use the csplit command:
csplit \
--quiet \
--prefix=whatever \
--suffix-format=%02d.txt \
--suppress-matched \
infile.txt /^$/ {*}
POSIX csplit only uses short options and doesn't know --suffix-format and --suppress-matched, so this requires GNU csplit.
This is what the options do:
--quiet: suppress output of file sizes
--prefix=whatever: use whatever instead of the default xx filename prefix
--suffix-format=%02d.txt: append .txt to the default two-digit suffix
--suppress-matched: don't include the lines matching the pattern on which the input is split
/^$/ {*}: split on pattern "empty line" (/^$/) as often as possible ({*})

You can use this awk:
awk 'BEGIN{file="content"++i".txt"} !NF{file="content"++i".txt";next} {print > file}' yourfile
(OR)
awk 'BEGIN{i++} !NF{++i;next} {print > ("filename" i ".txt")}' yourfile
More readable format:
BEGIN {
    file="content"++i".txt"
}
!NF {
    file="content"++i".txt";
    next
}
{
    print > file
}
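To see how the !NF rule behaves, the program can be tried on a small made-up input. This is a sketch under assumptions: the directory awk_nf_demo and the sample contents are hypothetical, and the counter is written as (++i) with parentheses purely to make the string concatenation unambiguous.

```shell
# Hypothetical scratch directory and sample input: three blocks separated by empty lines.
mkdir -p awk_nf_demo
printf 'alpha\nbeta\n\ngamma\n\ndelta\nepsilon\n' > awk_nf_demo/yourfile

# NF is the number of fields on a line, so !NF is true for empty (or whitespace-only) lines.
( cd awk_nf_demo && awk '
BEGIN { file = "content" (++i) ".txt" }        # start with content1.txt
!NF   { file = "content" (++i) ".txt"; next }  # blank line: switch to the next file name
      { print > file }                         # any other line goes to the current file
' yourfile )

# Show what was written.
for f in awk_nf_demo/content*.txt; do
    printf '== %s ==\n' "$f"
    cat "$f"
done
```

Note that every blank line bumps the counter, so two blank lines in a row skip a file number; a file is only created once a non-blank line is printed to it.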
Since it's Friday and I'm feeling a bit helpful... :)
Try this. If the file is as small as you imply, it's simplest to just read it all at once and work in memory.
use strict;
use warnings;
# slurp file
local $/ = undef;
open my $fh, '<', 'test.txt' or die $!;
my $text = <$fh>;
close $fh;
# split on double new line
my @chunks = split(/\n\n/, $text);
# make new files from chunks
my $count = 1;
for my $chunk (@chunks) {
    open my $ofh, '>', "whatever$count.txt" or die $!;
    print $ofh $chunk, "\n";
    close $ofh;
    $count++;
}
The perl docs can explain any individual commands you don't understand, but at this point you should probably look into a tutorial as well.