How can I read multiple lines of a file into blocks in Perl?

悲哀的现实 2021-01-29 05:57

I have a file which contains the text below.

#L_ENTRY    
#LEX        
#ROOT       
#POS        
#SUBCAT     

        
3 Answers
  • 2021-01-29 06:15

    Judging from this and your follow-up question, it looks like you already have the answer but are unaware of it.

    As long as your blocks are separated by at least one blank line, you can use Perl's paragraph mode, which hands the text back to you one block at a time.

    Here's another example that I hope makes this clear. I've created a file called test.txt that contains the data you posted, and opened it in paragraph mode by setting the input record separator $/ to the empty string.

    The output is from Data::Dump, which I've used only to demonstrate that the resulting array contains exactly the four strings that you asked for.

    Please add a comment to this solution if you need any more explanation.

    use strict;
    use warnings 'all';
    use autodie;
    
    my $file = 'test.txt';
    
    my @chunks = do {
        open my $fh, '<', $file;
        local $/ = '';
        <$fh>;
    };
    
    use Data::Dump;
    dd \@chunks;
    

    Output:

    [
      "#L_ENTRY    <s_slash_1>\n#LEX        </>\n#ROOT       </>\n#POS        <sp>\n#SUBCAT     <slash>\n#S_LINK           <>\n#BITS    <>\n#WEIGHT      <0.1>\n#SYNONYM     <0>\n\n",
      "#L_ENTRY    <s_comma_1>\n#LEX        <,>\n#ROOT       <,>\n#POS        <sp>\n#SUBCAT     <comma>\n#S_LINK           <>\n#BITS    <>\n#WEIGHT      <0.1>\n#SYNONYM     <0>\n\n",
      "#L_ENTRY    <s_tilde_1>\n#LEX        <~>\n#ROOT       <~>\n#POS        <sp>\n#SUBCAT     <tilde>\n#S_LINK           <>\n#BITS    <>\n#WEIGHT      <0.1>\n#SYNONYM     <0>\n\n",
      "#L_ENTRY    <s_at_1>\n#LEX        <\@>\n#ROOT       <\@>\n#POS        <sp>\n#SUBCAT     <at>\n#S_LINK           <>\n#BITS    <>\n#WEIGHT      <0.1>\n#SYNONYM     <0>\n",
    ]
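
    If you then want the individual fields, each block can be split into tag/value pairs. The following is only a sketch: parse_block is a hypothetical helper name, the data is an abbreviated in-memory stand-in for test.txt, and the regex assumes every line looks like "#TAG   <value>" as in the dump above.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: turn one block of "#TAG   <value>" lines into a hash ref.
    sub parse_block {
        my ($block) = @_;
        my %record;
        for my $line (split /\n/, $block) {
            # Tag name after '#', then the value between the outermost angle brackets
            next unless $line =~ /^#(\w+)\s+<(.*)>\s*$/;
            $record{$1} = $2;
        }
        return \%record;
    }

    # Two abbreviated records standing in for test.txt (assumption: same layout)
    my $data = "#L_ENTRY    <s_slash_1>\n#LEX        </>\n#POS        <sp>\n\n"
             . "#L_ENTRY    <s_at_1>\n#LEX        <\@>\n#POS        <sp>\n";

    my @records = do {
        open my $fh, '<', \$data or die "in-memory open failed: $!";
        local $/ = '';                      # paragraph mode
        map { parse_block($_) } <$fh>;
    };

    # One line per record: entry name and its LEX value
    printf "%s => %s\n", $_->{L_ENTRY}, $_->{LEX} for @records;
    ```

    With this sample data it prints "s_slash_1 => /" and "s_at_1 => @". Lines that do not match the pattern are skipped silently, which may or may not be what you want for real data.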
    
  • 2021-01-29 06:22

    There are two ways to do it. First, you can set the "input record separator" special variable, $/ (documented in perlvar). In short, you are telling Perl that a record ends with a string of your choosing rather than with a newline character. In your case, you could set it to "#SYNONYM     <0>\n". The separator is matched literally, so its spacing must match the file exactly. Each read then returns everything up to and including the next occurrence of that string; if the string does not occur again, you get whatever is left in the file. So, for input data that looks like this:

    #L_ENTRY        <s_slash_1>
    #LEX         </>
    #ROOT        </>
    #POS         <sp>
    #SUBCAT      <slash>
    #S_LINK            <>
    #BITS     <>
    #WEIGHT      <0.1>
    #SYNONYM     <0>
    
    #L_ENTRY        <s_comma_1>
    #LEX         <,>
    #ROOT        <,>
    #POS         <sp>
    #SUBCAT      <comma>
    #S_LINK            <>
    #BITS     <>
    #WEIGHT      <0.1>
    #SYNONYM     <0>
    

    if you run this:

    use v5.14;
    use warnings;
    
    my $filename = "data.txt" ;
    open(my $fh, '<', $filename) or die "$filename: $!" ;
    local $/ = "#SYNONYM     <0>\n" ;
    my @chunks = <$fh> ;
    say $chunks[0] ;
    say '---' ;
    say $chunks[1] ;
    

    you get:

    #L_ENTRY        <s_slash_1>
    #LEX         </>
    #ROOT        </>
    #POS         <sp>
    #SUBCAT      <slash>
    #S_LINK            <>
    #BITS     <>
    #WEIGHT      <0.1>
    #SYNONYM     <0>
    
    ---
    
    #L_ENTRY        <s_comma_1>
    #LEX         <,>
    #ROOT        <,>
    #POS         <sp>
    #SUBCAT      <comma>
    #S_LINK            <>
    #BITS     <>
    #WEIGHT      <0.1>
    #SYNONYM     <0>
    

    A couple of notes about this:

    1. Any extra data between your records is going to "get caught in the net" and end up at the start of the following record (this is why the second chunk above begins with a blank line).
    2. The record separator itself is still part of the data and sits at the end of each record.
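
    Both points can be softened a little. As a sketch, with read_records as a hypothetical helper name and shortened in-memory sample data: chomp removes whatever $/ currently holds, so it strips the custom separator, and a small substitution drops the whitespace left over between records. Putting local $/ inside a sub also keeps the change from leaking into the rest of the program.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: read records that end with $separator from an open handle
    sub read_records {
        my ($fh, $separator) = @_;
        local $/ = $separator;      # restored automatically when the sub returns
        my @records = <$fh>;
        chomp @records;             # chomp removes the current $/, not just "\n"
        s/\A\s+// for @records;     # drop the blank line left over between records
        return grep { length } @records;
    }

    # Shortened sample data (assumption: single spaces instead of the real padding)
    my $data = "#L_ENTRY <s_slash_1>\n#SYNONYM <0>\n\n"
             . "#L_ENTRY <s_comma_1>\n#SYNONYM <0>\n";

    open my $fh, '<', \$data or die "in-memory open failed: $!";
    my @records = read_records($fh, "#SYNONYM <0>\n");
    print "[$_]\n" for @records;
    ```

    Note that the #SYNONYM line is gone from each record after chomp, so this only suits cases where the separator carries no information you need.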

    To get more control, it's better to process the data line by line and use regexes to switch between "capture" mode and "don't capture" mode:

    use v5.14;
    use warnings;
    
    my $filename = "data.txt" ;
    open(my $fh, '<', $filename) or die "$filename: $!" ;
    
    my $found_start_token = qr/ \s* \#L_ENTRY \s* /x;
    my $found_stop_token  = qr/ \s* \#SYNONYM \s+ \<0\> \s* \n /x;
    
    my @chunks ;
    my $chunk  ;
    my $capture_mode = 0 ;
    
    while ( <$fh> )  {
        $capture_mode = 1 if /$found_start_token/ ;
        $chunk .= $_ if $capture_mode ;
        if (/$found_stop_token/) {
            push @chunks, $chunk ;
            $chunk = '' ;
            $capture_mode = 0 ;
        }
    }
    say $chunks[0] ;
    say '---' ;
    say $chunks[1] ;
    exit 0
    

    A couple of notes:

    1. The program works by appending the current line, $_, to $chunk while we're in capture mode.
    2. Capture mode is turned on and off using regexes in "extended mode", /x. This allows whitespace inside the regex for easier reading, which is also why the literal # has to be escaped.
    3. Extra data between records will not appear in the chunks.
    4. It produces the same output as before, except that the blank line between records no longer appears at the start of the second chunk.
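
    To see note 3 in action, here is a variant of the same loop wrapped in a sub and fed data with stray lines around the records. It is a sketch rather than a drop-in replacement: capture_chunks is a hypothetical name, the sample data is made up and shortened, and the patterns are anchored to the start of the line, which the original regexes are not.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper wrapping the capture loop
    sub capture_chunks {
        my ($fh) = @_;
        my $start = qr/ ^ \s* \#L_ENTRY \b /x;
        my $stop  = qr/ ^ \s* \#SYNONYM \s+ <0> /x;

        my (@chunks, $chunk);
        my $capturing = 0;
        while ( my $line = <$fh> ) {
            if ( !$capturing && $line =~ $start ) {
                $capturing = 1;         # a record begins on this line
                $chunk     = '';
            }
            next unless $capturing;     # ignore anything outside a record
            $chunk .= $line;
            if ( $line =~ $stop ) {
                push @chunks, $chunk;   # record complete
                $capturing = 0;
            }
        }
        return @chunks;
    }

    # Made-up sample: two short records with unrelated lines around them
    my $data = join "\n",
        'stray header line',
        '#L_ENTRY <s_slash_1>',
        '#SYNONYM <0>',
        '',
        'junk between records',
        '#L_ENTRY <s_comma_1>',
        '#SYNONYM <0>',
        '';

    open my $fh, '<', \$data or die "in-memory open failed: $!";
    my @chunks = capture_chunks($fh);
    print scalar(@chunks), " chunks\n";
    print "---\n$_" for @chunks;
    ```

    Neither the header line nor the junk line ends up in a chunk. A record that never reaches its stop token is dropped in this sketch; whether that is acceptable depends on your data.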
  • 2021-01-29 06:23

    If you set the input record separator variable, $/, to the empty string, Perl works in paragraph mode and returns one block at a time, where blocks are separated by one or more blank lines in the input data.

    use strict;
    use warnings 'all';
    
    local $/ = '';
    
    
    my $n;
    while ( <DATA> ) {
        printf "Block %d:\n<<%s>>\n\n", ++$n, $_;
    }
    
    __DATA__
    A
    B
    C
    D
    E
    F
    
    A
    B
    C
    D
    E
    F
    

    Output:

    Block 1:
    <<A
    B
    C
    D
    E
    F
    
    >>
    
    Block 2:
    <<A
    B
    C
    D
    E
    F
    
    >>
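
    The "one or more" part is what distinguishes paragraph mode from setting $/ to "\n\n". A small sketch of the difference, using made-up data with three blank lines between two blocks (read_with_separator is a hypothetical helper name):

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: read every record from a string using the given separator
    sub read_with_separator {
        my ($text, $separator) = @_;
        open my $fh, '<', \$text or die "in-memory open failed: $!";
        local $/ = $separator;
        my @records = <$fh>;
        return @records;
    }

    # Three blank lines between the two blocks
    my $text = "A\nB\n\n\n\nC\nD\n";

    my @paragraph = read_with_separator($text, '');      # paragraph mode
    my @literal   = read_with_separator($text, "\n\n");  # exactly two newlines

    printf "paragraph mode: %d records\n", scalar @paragraph;
    printf "literal \\n\\n:   %d records\n", scalar @literal;
    ```

    Paragraph mode reports two records, while the literal separator reports three, because the extra newlines come back as a record of their own.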
    