processing text from a non-flat file (to extract information as if it *were* a flat file)

Submitted by 。_饼干妹妹 on 2019-12-03 20:09:17

This is what Python generators are all about.

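def read_as_flat( someFile ):
    # subjectNameSet is assumed to be a set of the subject names to keep,
    # defined elsewhere (it is not defined in this snippet).
    line_iter = iter(someFile)
    time_header = None
    for line in line_iter:
        words = line.split()
        if not words:
            continue  # skip blank lines
        if words[0] == 'time':
            # the "time" line is followed by a description line
            description = next(line_iter, '').strip()
            time_header = [ ' '.join(words[1:]), description ]
        elif words[0] in subjectNameSet and time_header is not None:
            # the variables are on the line after the subject line
            data = next(line_iter, '').split()
            yield time_header + data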

You can use this like a standard Python iterator:

for time, description, var1, var2, var3 in read_as_flat( someFile ):
    ...  # process one flat row here

If all you want is var1, var2, var3 upon matching a particular subject then you could try the following command:


  grep -A 1 'subjectB'

The -A 1 command line argument instructs grep to print out the matched line and one line after the matched line (and in this case the variables come on a line after the subject).

You might want to use the -E option to make grep search for a regular expression and anchor the subject search to the beginning-of-line (e.g. grep -A 1 -E '^subjectB').

The output now consists of the subject line and the variable line that follows it. You may want to hide the subject line:


  grep -A 1 'subjectB' |grep -v 'subjectB'

And you may wish to process the variable line:


  grep -A 1 'subjectB' |grep -v 'subjectB' |perl -pe 's/ /,/g'
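If you would rather stay in Python, the same "take the line after the match" idea can be sketched as below. This is only a rough equivalent of the pipeline above, not a drop-in replacement: the function name lines_after_subject and the sample data are made up for illustration, and the sample layout (a subject line followed by one variable line) is an assumption about your file.

```python
import io

def lines_after_subject(lines, subject):
    """Yield the line that follows each `subject` line, comma-separated."""
    line_iter = iter(lines)
    for line in line_iter:
        words = line.split()
        # Exact match on the first word (stricter than grep '^subjectB',
        # which would also match e.g. 'subjectB2').
        if words and words[0] == subject:
            # Like `grep -A 1`: the variables are on the next line.
            variables = next(line_iter, "")
            # Like the perl substitution, but collapsing runs of whitespace.
            yield ",".join(variables.split())

# Hypothetical sample in the assumed layout.
sample = io.StringIO(
    "time t1\n"
    "first description\n"
    "subjectA nameA\n"
    "1 2 3\n"
    "subjectB nameB\n"
    "4 5 6\n"
)
rows = list(lines_after_subject(sample, "subjectB"))
print(rows)
```

Because the subject line is consumed inside the function, there is no need for a second filtering step to hide it.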

The best option would be to modify the computer simulation to produce rectangular output. Assuming you can't do that, here's one approach:

In order to be able to use the data in R, SQL, etc. you need to convert it from hierarchical to rectangular one way or another. If you already have a parser that can convert the entire file into a rectangular data set, you are most of the way there. The next step is to add additional flexibility to your parser, so that it can filter out unwanted data records. Instead of having a file converter, you'll have a data extraction utility.

The example below is in Perl, but you can do the same thing in Python. The general idea is to maintain a clean separation between (a) parsing, (b) filtering, and (c) output. That way, you have a flexible environment, making it easy to add different filtering or output methods, depending on your immediate data-crunching needs. You can also set up the filtering methods to accept parameters (either from command line or a config file) for greater flexibility.

use strict;
use warnings;

read_file($ARGV[0], \&check_record);

sub read_file {
    my ($file_name, $check_record) = @_;
    open(my $file_handle, '<', $file_name) or die $!;
    # A data structure to hold an entire record.
    my $rec = {
        time => '',
        desc => '',
        subj => '',
        name => '',
        vars => [],
    };
    # A code reference to get the next line and do some cleanup.
    my $get_line = sub {
        my $line = <$file_handle>;
        return unless defined $line;
        chomp $line;
        $line =~ s/^\s+//;
        return $line;
    };
    # Start parsing the data file.
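    # Test with defined() so that a blank line or a line containing only "0"
    # does not end the loop early.
    while ( defined( my $line = $get_line->() ) ){
        next if $line eq '';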
        if ($line =~ /^time (\w+)/){
            $rec->{time} = $1;
            $rec->{desc} = $get_line->();
        }
        else {
            ($rec->{subj}, $rec->{name}) = $line =~ /(\w+) +(\w+)/;
            $rec->{vars} = [ split / +/, $get_line->() ];

            # OK, we have a complete record. Now invoke our filtering
            # code to decide whether to export record to rectangular format.
            $check_record->($rec);
        }
    }
}

sub check_record {
    my $rec = shift;
    # Just an illustration. You'll want to parameterize this, most likely.
    write_output($rec)
        if  $rec->{subj} eq 'subjectB'
        and $rec->{time} eq 't1'
    ;
}

sub write_output {
    my $rec = shift;
    print join("\t", 
        $rec->{time}, $rec->{subj}, $rec->{name},
        @{$rec->{vars}},
    ), "\n";
}
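For anyone who prefers Python, here is a rough sketch of the same parse / filter / output separation. It is not a line-for-line port of the Perl above: the function names (parse_records, keep_record), the dict keys, and the sample data are all my own choices for illustration, and the file layout is assumed to be the one the Perl code expects (a "time" line, a description line, then pairs of subject and variable lines).

```python
import io
import sys

def parse_records(lines):
    """Parsing: yield one dict per subject block, carrying the current time header."""
    line_iter = (line.strip() for line in lines)
    time, desc = "", ""
    for line in line_iter:
        if not line:
            continue  # skip blank lines
        words = line.split()
        if words[0] == "time" and len(words) > 1:
            time = words[1]
            desc = next(line_iter, "")  # description is on the next line
        else:
            subj = words[0]
            name = words[1] if len(words) > 1 else ""
            variables = next(line_iter, "").split()  # variables follow the subject line
            yield {"time": time, "desc": desc, "subj": subj,
                   "name": name, "vars": variables}

def keep_record(rec, subj="subjectB", time="t1"):
    """Filtering: a parameterized test, kept separate from the parser."""
    return rec["subj"] == subj and rec["time"] == time

def write_output(rec, out=sys.stdout):
    """Output: one tab-separated row per record."""
    out.write("\t".join([rec["time"], rec["subj"], rec["name"], *rec["vars"]]) + "\n")

# Hypothetical sample in the assumed layout.
sample = io.StringIO(
    "time t1\n"
    "first description\n"
    "subjectA nameA\n"
    "1 2 3\n"
    "subjectB nameB\n"
    "4 5 6\n"
    "time t2\n"
    "second description\n"
    "subjectB nameB\n"
    "7 8 9\n"
)
buffer = io.StringIO()
for rec in parse_records(sample):
    if keep_record(rec):
        write_output(rec, buffer)
print(buffer.getvalue(), end="")
```

Since the parser is a generator, swapping in a different filter or a different writer (CSV, SQL inserts, and so on) does not require touching the parsing code.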

If you are lazy and have enough RAM, I would keep the files on a RAM disk instead of the regular file system for as long as you need immediate access to them.
I do not think that Perl or awk will be faster than Python if you are just recoding your current algorithm into a different language.

awk '/time/{f=0}/subjectB/{f=1;next}f' file
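
In case the awk one-liner is hard to read: it keeps a flag that is switched off on any line containing "time", switched on (and the line itself skipped) on any line containing "subjectB", and it prints every line seen while the flag is on. A rough Python rendering of that logic is below; the function name lines_following and the sample data are invented for illustration.

```python
import io

def lines_following(lines, start="subjectB", stop="time"):
    """Yield lines seen after a `start` line, until a line containing `stop`."""
    printing = False
    for line in lines:
        if stop in line:
            printing = False   # awk: /time/{f=0}
        if start in line:
            printing = True    # awk: /subjectB/{f=1;next}
            continue           # the subject line itself is not printed
        if printing:
            yield line.rstrip("\n")  # awk: f (print while the flag is set)

# Hypothetical sample in an assumed layout.
sample = io.StringIO(
    "time t1\n"
    "first description\n"
    "subjectA nameA\n"
    "1 2 3\n"
    "subjectB nameB\n"
    "4 5 6\n"
    "time t2\n"
    "second description\n"
    "subjectB nameB\n"
    "7 8 9\n"
)
selected = list(lines_following(sample))
print(selected)
```

Note that the flag stays on until the next line containing "time", so if other subjects follow subjectB inside the same time block, their lines are printed as well.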