I have a file which contains the text below.
#L_ENTRY
#LEX >
#ROOT >
#POS
#SUBCAT
Judging from this and your follow-up question, it looks like you already have the answer but haven't realised it.
As long as your blocks are separated by at least one blank line, you can use Perl's paragraph mode (set $/ to the empty string), which will hand you back the text in blocks.
Here's another, different example that I hope you understand. I've created a file called test.txt that contains the data that you posted, and opened it in paragraph mode.
The output is from Data::Dump, which I've used only to demonstrate that the resulting array contains exactly the four strings that you asked for.
Please add a comment to this solution if you need any more explanation.
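use strict;
use warnings 'all';
use autodie;

my $file = 'test.txt';

my @chunks = do {
    open my $fh, '<', $file;
    local $/ = '';    # paragraph mode: a record ends at one or more blank lines
    <$fh>;            # list context, so every block is read at once
};

use Data::Dump;
dd \@chunks;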
[
"#L_ENTRY <s_slash_1>\n#LEX </>\n#ROOT </>\n#POS <sp>\n#SUBCAT <slash>\n#S_LINK <>\n#BITS <>\n#WEIGHT <0.1>\n#SYNONYM <0>\n\n",
"#L_ENTRY <s_comma_1>\n#LEX <,>\n#ROOT <,>\n#POS <sp>\n#SUBCAT <comma>\n#S_LINK <>\n#BITS <>\n#WEIGHT <0.1>\n#SYNONYM <0>\n\n",
"#L_ENTRY <s_tilde_1>\n#LEX <~>\n#ROOT <~>\n#POS <sp>\n#SUBCAT <tilde>\n#S_LINK <>\n#BITS <>\n#WEIGHT <0.1>\n#SYNONYM <0>\n\n",
"#L_ENTRY <s_at_1>\n#LEX <\@>\n#ROOT <\@>\n#POS <sp>\n#SUBCAT <at>\n#S_LINK <>\n#BITS <>\n#WEIGHT <0.1>\n#SYNONYM <0>\n",
]
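If the trailing newlines on each chunk get in the way, chomp can strip them while paragraph mode is still in effect. The sketch below is only an illustration: read_paragraphs is a made-up helper name, and the sample text is read through an in-memory filehandle rather than test.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: read blank-line-separated blocks from an open filehandle.
sub read_paragraphs {
    my ($fh) = @_;
    local $/ = '';          # paragraph mode, restored when the sub returns
    my @blocks = <$fh>;
    chomp @blocks;          # in paragraph mode chomp removes all trailing newlines
    return @blocks;
}

# Sample data (made up for the demo), with two blank lines between the blocks.
my $sample = "#L_ENTRY <a>\n#LEX <x>\n\n\n#L_ENTRY <b>\n#LEX <y>\n";
open my $sample_fh, '<', \$sample or die "in-memory open failed: $!";

my @blocks = read_paragraphs($sample_fh);
print scalar(@blocks), " blocks\n";
```

Because $/ is localised inside the helper, code elsewhere in the program keeps reading line by line as usual.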
There are two ways to do it. First, you can set the "input record separator" special variable, $/ (documented in perlvar). In short, you are telling perl that a "line" is terminated not by a newline character but by a string of your choosing. In your case, you could set it to "#SYNONYM <0>\n". Then each read returns everything up to and including the next occurrence of that tag; if the tag is not there, you get whatever is left in the file. So, for input data that looks like this:
#L_ENTRY <s_slash_1>
#LEX </>
#ROOT </>
#POS <sp>
#SUBCAT <slash>
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
#L_ENTRY <s_comma_1>
#LEX <,>
#ROOT <,>
#POS <sp>
#SUBCAT <comma>
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
If you run this:
use v5.14;
use warnings;
my $filename = "data.txt" ;
open(my $fh, '<', $filename) or die "$filename: $!" ;
local $/ = "#SYNONYM <0>\n" ;
my @chunks = <$fh> ;
say $chunks[0] ;
say '---' ;
say $chunks[1] ;
You get:
#L_ENTRY <s_slash_1>
#LEX </>
#ROOT </>
#POS <sp>
#SUBCAT <slash>
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
---
#L_ENTRY <s_comma_1>
#LEX <,>
#ROOT <,>
#POS <sp>
#SUBCAT <comma>
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
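One thing to be aware of with this approach is that each chunk still ends with the separator itself. Here is a minimal sketch, assuming you would rather have the terminator line stripped. read_records is a hypothetical helper, and the data comes from an in-memory string purely for demonstration. chomp removes a trailing copy of whatever $/ currently holds, so a final chunk that lacks the terminator is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: split the input on a terminator string and drop the terminator.
sub read_records {
    my ($fh, $terminator) = @_;
    local $/ = $terminator;     # each read now stops after $terminator
    my @records = <$fh>;
    chomp @records;             # strips the trailing $terminator, if present
    return @records;
}

# Made-up sample: two terminated chunks followed by an unterminated one.
my $data = "#LEX </>\n#SYNONYM <0>\n#LEX <,>\n#SYNONYM <0>\n#LEX <~>\n";
open my $in, '<', \$data or die "in-memory open failed: $!";

my @records = read_records($in, "#SYNONYM <0>\n");
print "[$_]\n" for @records;
```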
A couple of notes about this;
To get more control, it's better to process the data line by line and use regexes to switch between "capture" mode and "don't capture" mode:
use v5.14;
use warnings;

my $filename = "data.txt" ;
open(my $fh, '<', $filename) or die "$filename: $!" ;

my $found_start_token = qr/ \s* \#L_ENTRY \s* /x;
my $found_stop_token  = qr/ \s* \#SYNONYM \s+ \<0\> \s* \n /x;

my @chunks ;
my $chunk ;
my $capture_mode = 0 ;

while ( <$fh> ) {
    $capture_mode = 1 if /$found_start_token/ ;
    $chunk .= $_ if $capture_mode ;
    if (/$found_stop_token/) {
        push @chunks, $chunk ;
        $chunk = '' ;
        $capture_mode = 0 ;
    }
}

say $chunks[0] ;
say '---' ;
say $chunks[1] ;

exit 0 ;
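To see the extra control in action, here is a rough variant of the same loop wrapped in a function and fed from an in-memory string. The name extract_blocks and the sample data are made up for the illustration, and the patterns are anchored a little more tightly than the ones above. Lines that fall outside a start/stop pair are simply dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: collect blocks running from a #L_ENTRY line to a
# "#SYNONYM <0>" line, skipping any lines that sit between blocks.
sub extract_blocks {
    my ($fh) = @_;
    my $start = qr/ \A \s* \#L_ENTRY \b /x;
    my $stop  = qr/ \A \s* \#SYNONYM \s+ <0> \s* \z /x;

    my @blocks;
    my $block     = '';
    my $capturing = 0;

    while ( my $line = <$fh> ) {
        $capturing = 1 if $line =~ $start;   # a start tag switches capture on
        next unless $capturing;              # ignore lines outside a block
        $block .= $line;
        if ( $line =~ $stop ) {              # a stop tag closes the block
            push @blocks, $block;
            $block     = '';
            $capturing = 0;
        }
    }
    return @blocks;
}

# Made-up sample with stray lines before and between the blocks.
my $input = "junk\n#L_ENTRY <a>\n#SYNONYM <0>\nnoise\n#L_ENTRY <b>\n#LEX <y>\n#SYNONYM <0>\n";
open my $input_fh, '<', \$input or die "in-memory open failed: $!";

my @found = extract_blocks($input_fh);
print scalar(@found), " blocks found\n";
```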
A couple of notes:
- Each line, $_, is appended on to $chunk only if we're in capture mode.
- The regexes use the /x modifier. This allows adding whitespace to the regex for easier reading.

If you set the input record separator variable to the empty string, then perl will work in paragraph mode and return one block at a time, where blocks are separated by one or more blank lines in the input data.
use strict;
use warnings 'all';

local $/ = '';

my $n;

while ( <DATA> ) {
    printf "Block %d:\n<<%s>>\n\n", ++$n, $_;
}

__DATA__
A
B
C
D
E
F

A
B
C
D
E
F
Block 1:
<<A
B
C
D
E
F

>>

Block 2:
<<A
B
C
D
E
F
>>
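Paragraph mode is not quite the same as setting the separator to "\n\n". According to perlvar, the empty string treats a run of blank lines as a single separator, whereas "\n\n" matches exactly two newlines. Here is a small sketch of the difference, using a hypothetical count_records helper and an in-memory string with three blank lines between the blocks:

```perl
use strict;
use warnings;

# Hypothetical helper: count how many records a given separator yields.
sub count_records {
    my ($text, $separator) = @_;
    open my $fh, '<', \$text or die "in-memory open failed: $!";
    local $/ = $separator;
    my @records = <$fh>;
    return scalar @records;
}

# Made-up sample: two blocks separated by three blank lines.
my $text = "A\nB\n\n\n\nC\nD\n";

printf "paragraph mode: %d records\n", count_records($text, '');
printf "two newlines:   %d records\n", count_records($text, "\n\n");
```

With the empty string this should report 2 records, and 3 with "\n\n", because the leftover pair of newlines becomes a record of its own.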