Splitting large text file on every blank line

南方客 2020-12-05 07:53

I'm having a bit of trouble splitting a large text file into multiple smaller ones. The syntax of my text file is the following:



        
9 answers
  • 2020-12-05 08:31

    You can also try split -p "^$" (the -p pattern option comes from BSD/macOS split; GNU split doesn't have it).
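
    For reference, a full BSD/macOS invocation might look something like this (infile.txt and the whatever- prefix are just placeholder names):

    # BSD/macOS split: start a new output file whenever a line matches the pattern
    split -p '^$' infile.txt whatever-
    # produces whatever-aa, whatever-ab, whatever-ac, ...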

  • 2020-12-05 08:34

    Perl has a useful feature called the input record separator, $/.

    This is the 'marker' for separating records when reading a file.

    So:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # read in blank-line-delimited chunks rather than single lines
    local $/ = "\n\n";
    my $count = 0;

    while ( my $chunk = <> ) {
        # write each chunk to its own numbered file
        open ( my $output, '>', "filename_" . $count++ ) or die $!;
        print {$output} $chunk;
        close ( $output );
    }
    

    Just like that. <> is the 'magic' filehandle: it reads piped data or the files specified on the command line (it opens and reads them for you). This is similar to how sed or grep work.
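
    For instance, if the script above is saved as split_chunks.pl (the name is just for illustration), either of these would work:

    # read the file named on the command line
    perl split_chunks.pl yourfile.txt
    # or read from a pipe / standard input
    cat yourfile.txt | perl split_chunks.pl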

    This can be reduced to a one-liner (-00 turns on paragraph mode, so records are separated by blank lines, and -p prints each record to whichever filehandle is currently selected):

    perl -00 -pe 'open ( $out, ">", "filename_".++$n ); select $out;' yourfilename_here
    
  • 2020-12-05 08:34

    In case you get a "too many open files" error like the following...

    awk: whatever-18.txt makes too many open files
     input record number 18, file file.txt
     source line number 1
    

    You may need to close each newly created file before creating the next one, as follows.

    awk -v RS= '{close("whatever-" i ".txt"); i++}{print > ("whatever-" i ".txt")}' file.txt
    
  • 2020-12-05 08:34

    You could use the csplit command:

    csplit \
        --quiet \
        --prefix=whatever \
        --suffix-format=%02d.txt \
        --suppress-matched \
        infile.txt '/^$/' '{*}'
    

    POSIX csplit only uses short options and doesn't know --suffix-format and --suppress-matched, so this requires GNU csplit.

    This is what the options do:

    • --quiet – suppress output of file sizes
    • --prefix=whatever – use whatever instead of the default xx filename prefix
    • --suffix-format=%02d.txt – append .txt to the default two digit suffix
    • --suppress-matched – don't include the lines matching the pattern on which the input is split
    • /^$/ {*} – split on pattern "empty line" (/^$/) as often as possible ({*})
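
    With those options, a sample input containing three blank-line-separated records (purely for illustration) would be split into:

    whatever00.txt
    whatever01.txt
    whatever02.txt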
  • 2020-12-05 08:41

    You can use this awk,

    awk 'BEGIN{file="content" (++i) ".txt"} !NF{file="content" (++i) ".txt"; next} {print > file}' yourfile
    

    (OR)

    awk 'BEGIN{i++} !NF{++i; next} {print > ("filename" i ".txt")}' yourfile
    

    More readable format:

    BEGIN {
            # name of the first output file
            file = "content" (++i) ".txt"
    }
    !NF {
            # blank line (no fields): switch to the next output file
            file = "content" (++i) ".txt";
            next
    }
    {
            # non-blank lines go into the current output file
            print > file
    }
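
    As a quick illustration (the sample data here is made up), an input file like

        line one
        line two

        line three

    would end up as content1.txt holding the first two lines and content2.txt holding "line three".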
    
  • 2020-12-05 08:44

    Since it's Friday and I'm feeling a bit helpful... :)

    Try this. If the file is as small as you imply, it's simplest to just read it all at once and work in memory.

    use strict;
    use warnings;
    
    # slurp file
    local $/ = undef;
    open my $fh, '<', 'test.txt' or die $!;
    my $text = <$fh>;
    close $fh;
    
    # split on double new line
    my @chunks = split(/\n\n/, $text);
    
    # make new files from chunks
    my $count = 1;
    for my $chunk (@chunks) {
        open my $ofh, '>', "whatever$count.txt" or die $!;
        print $ofh $chunk, "\n";
        close $ofh;
        $count++;
    }
    

    The Perl docs can explain any individual commands you don't understand, but at this point you should probably look into a tutorial as well.
