How can I read multiple lines of a file into blocks in Perl?


I have a file which contains the text below.

#L_ENTRY    
#LEX        
#ROOT       
#POS        
#SUBCAT


                      
3 Answers

情话喂你  2021-01-29 06:15


From this and your follow-up question, it looks like you already have the answer but are unaware of it.

As long as your blocks are separated by at least one blank line, you can use Perl's paragraph mode, which hands the text back to you one block at a time.

Here's a different example that I hope makes this clear. I've created a file called test.txt that contains the data you posted, and opened it in paragraph mode by setting $/ to the empty string.

The output is from Data::Dump, which I've used only to demonstrate that the resulting array contains exactly the four strings you asked for.

Please add a comment to this solution if you need any more explanation.

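use strict;
use warnings 'all';
use autodie;    # open dies with a useful message on failure

use Data::Dump;

my $file = 'test.txt';

# Read the whole file as a list of blank-line-separated blocks
my @chunks = do {
    open my $fh, '<', $file;
    local $/ = '';    # paragraph mode, restored when the block exits
    <$fh>;
};

dd \@chunks;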


output

[
  "#L_ENTRY    <s_slash_1>\n#LEX        </>\n#ROOT       </>\n#POS        <sp>\n#SUBCAT     <slash>\n#S_LINK           <>\n#BITS    <>\n#WEIGHT      <0.1>\n#SYNONYM     <0>\n\n",
  "#L_ENTRY    <s_comma_1>\n#LEX        <,>\n#ROOT       <,>\n#POS        <sp>\n#SUBCAT     <comma>\n#S_LINK           <>\n#BITS    <>\n#WEIGHT      <0.1>\n#SYNONYM     <0>\n\n",
  "#L_ENTRY    <s_tilde_1>\n#LEX        <~>\n#ROOT       <~>\n#POS        <sp>\n#SUBCAT     <tilde>\n#S_LINK           <>\n#BITS    <>\n#WEIGHT      <0.1>\n#SYNONYM     <0>\n\n",
  "#L_ENTRY    <s_at_1>\n#LEX        <\@>\n#ROOT       <\@>\n#POS        <sp>\n#SUBCAT     <at>\n#S_LINK           <>\n#BITS    <>\n#WEIGHT      <0.1>\n#SYNONYM     <0>\n",
]
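
If you would rather not keep the trailing blank lines on each block, chomp can strip them, because in paragraph mode it removes every trailing newline. The following is only a minimal sketch: the sample text in $data is made up and stands in for test.txt, and it is read through an in-memory filehandle so that the example runs without any external file.

```perl
use strict;
use warnings;

# Hypothetical sample data (not the poster's file): two blocks
# separated by several blank lines.
my $data = "a 1\na 2\n\n\n\nb 1\nb 2\n";

my @chunks = do {
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    local $/ = '';                 # paragraph mode
    chomp(my @blocks = <$fh>);     # strips all trailing newlines while $/ is ''
    @blocks;
};

print scalar(@chunks), " chunks\n";
print "[$_]\n" for @chunks;
```

Note that chomp has to be called while $/ is still the empty string, which is why it sits inside the do block.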

                                                                        
                                                        
            
            
              
                
轮回少年  2021-01-29 06:22

There are two ways to do it. Firstly, you can set the "input record separator" special variable, $/ (documented in perlvar). In short, you are telling Perl that a record ends at a string of your choosing rather than at a newline character. In your case, you could set it to "#SYNONYM     <0>\n". The separator is a literal string, not a regex, so its whitespace has to match the file exactly. Each read then returns everything up to and including the next occurrence of that tag; if the tag does not occur again, you get whatever is left in the file. So, for input data that looks like this:

#L_ENTRY        <s_slash_1>
#LEX         </>
#ROOT        </>
#POS         <sp>
#SUBCAT      <slash>
#S_LINK            <>
#BITS     <>
#WEIGHT      <0.1>
#SYNONYM     <0>

#L_ENTRY        <s_comma_1>
#LEX         <,>
#ROOT        <,>
#POS         <sp>
#SUBCAT      <comma>
#S_LINK            <>
#BITS     <>
#WEIGHT      <0.1>
#SYNONYM     <0>


if you run this:

use v5.14;
use warnings;

my $filename = "data.txt" ;
open(my $fh, '<', $filename) or die "$filename: $!" ;
local $/ = "#SYNONYM     <0>\n" ;
my @chunks = <$fh> ;
say $chunks[0] ;
say '---' ;
say $chunks[1] ;


You get:

#L_ENTRY        <s_slash_1>
#LEX         </>
#ROOT        </>
#POS         <sp>
#SUBCAT      <slash>
#S_LINK            <>
#BITS     <>
#WEIGHT      <0.1>
#SYNONYM     <0>

---

#L_ENTRY        <s_comma_1>
#LEX         <,>
#ROOT        <,>
#POS         <sp>
#SUBCAT      <comma>
#S_LINK            <>
#BITS     <>
#WEIGHT      <0.1>
#SYNONYM     <0>


A couple of notes about this:


Any extra data between your records is going to "get caught in the net" and end up at the start of the following record. That is why the second chunk above starts with a blank line.
The record separator itself is still part of the data and is at the end of each record.
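
If the separator should not stay on the end of each record, chomp will remove it, since chomp strips a trailing copy of whatever $/ currently holds. A minimal sketch, assuming a made-up two-record sample in $data (read via an in-memory filehandle) rather than your real file:

```perl
use strict;
use warnings;

# Hypothetical sample in the same shape as the data above.
my $data = "#L_ENTRY <one>\n#SYNONYM     <0>\n\n"
         . "#L_ENTRY <two>\n#SYNONYM     <0>\n";

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @chunks = do {
    local $/ = "#SYNONYM     <0>\n";   # literal separator, spacing must match
    chomp(my @records = <$fh>);        # removes the trailing separator
    @records;
};

# Drop the blank line that was "caught in the net" before later records.
s/\A\s+// for @chunks;

print "[$_]\n" for @chunks;
```

Bear in mind that here the separator is itself a line of data, so chomping it discards the #SYNONYM field from every record.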


To get more control, it's better to process the data line by line and use regexes to switch between "capture" mode and "don't capture" mode:

use v5.14;
use warnings;

my $filename = "data.txt" ;
open(my $fh, '<', $filename) or die "$filename: $!" ;

# A record starts at an #L_ENTRY line and ends at a "#SYNONYM <0>" line
my $found_start_token = qr/ \s* \#L_ENTRY \s* /x;
my $found_stop_token  = qr/ \s* \#SYNONYM \s+ \<0\> \s* \n /x;

my @chunks ;
my $chunk = '' ;
my $capture_mode = 0 ;

while ( <$fh> )  {
    $capture_mode = 1 if /$found_start_token/ ;
    $chunk .= $_ if $capture_mode ;    # collect lines only inside a record
    # Ignore a stop token that appears outside a record
    if ($capture_mode && /$found_stop_token/) {
        push @chunks, $chunk ;
        $chunk = '' ;
        $capture_mode = 0 ;
    }
}
say $chunks[0] ;
say '---' ;
say $chunks[1] ;
exit 0 ;


A couple of notes:


The program works by string concatenation of the current line, $_, onto $chunk if we're in capture mode.
Capture mode is turned on and off using regexes in "extended mode", /x. This allows adding whitespace to the regex for easier reading.
Extra data between records will not appear in the chunks.
It produces the same records as before, except that the second chunk no longer starts with the blank line that separated the records.

                                                                        
                                                        
            
            
              
                
逝去的感伤  2021-01-29 06:23


If you set the input record separator variable, $/, to the empty string, then Perl works in paragraph mode and returns one block at a time, where blocks are separated by one or more blank lines in the input data.

use strict;
use warnings 'all';

local $/ = '';


my $n;
while ( <DATA> ) {
    printf "Block %d:\n<<%s>>\n\n", ++$n, $_;
}

__DATA__
A
B
C
D
E
F

A
B
C
D
E
F


output

Block 1:
<<A
B
C
D
E
F

>>

Block 2:
<<A
B
C
D
E
F

>>
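
Paragraph mode is not the same as setting $/ to "\n\n": the empty string treats any run of blank lines as a single separator, while "\n\n" is matched literally, so extra blank lines show up as records of their own. A small sketch of the difference, where read_records and the sample text are hypothetical names and data made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical sample: two blocks with three blank lines between them.
my $data = "A\nB\n\n\n\nC\nD\n";

# Hypothetical helper: read $text as records using the given separator.
sub read_records {
    my ($text, $separator) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = $separator;
    my @records = <$fh>;
    return @records;
}

my @paragraph = read_records($data, '');       # paragraph mode
my @literal   = read_records($data, "\n\n");   # literal two-newline separator

printf "paragraph mode: %d records\n", scalar @paragraph;
printf "literal mode:   %d records\n", scalar @literal;
```

With this sample, paragraph mode yields two records, while the literal separator yields three because the leftover pair of newlines becomes a record by itself.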

                                                                        
                                                        
            
            
              
                