Tips for reading in a complex file - Python

Question

I have complex, variable text files that I want to read into Python, but I'm not sure what the best strategy would be. I'm not looking for you to code anything for me, just some tips, such as which modules would best suit my needs.

The files look something like:

Program
Username: X    Laser: X     Em: X

exp 1
    sample 1
        Time: X    Notes: X
        Read 1 X data
        Read 2 X data
        # unknown number of reads
    sample 2
        Time: X    Notes: X
        Read 1 X data
        ...
    # Unknown number of samples

exp 2
    sample 1
    ...
# Unknown number of experiments, samples and reads
# The 4 spaces between certain words represent tabs

To analyse this data I need to get the data for each reading and know which sample and experiment it came from. Also, I can change the output file format but I think the way I have written it here is the easiest to read.

To read this file into Python, the best way I can think of would be to read it in row by row and search for keywords with regular expressions. For example, search the row for the "exp" keyword and record the number after it, then search for "sample" in the next line, and so on. However, this would of course not work if a keyword was used in the 'notes' section.

So, I'm kind of stumped as to what would best suit my needs (it's hard to use something if you don't know it exists!)

Thanks for your time.

Answer 1:

This is a typical task for a syntactic analyzer (parser). In this case, since

- lexical constructs do not cross line boundaries and there is a single construct ("statement") per line; in other words, each line is a single statement
- the full syntax of a single line can be covered by a set of regexes
- the structure of compounds (= entities connecting multiple "statements" into something bigger) is simple and straightforward

a (relatively) simple scannerless parser based on lines, a DFA (deterministic finite automaton) and the aforementioned set of regexes can be applied:

- set up the initial parser state (= current position relative to the various entities being tracked) and the parse tree (= a data structure representing the information from the file in a convenient way)
- for each line:
    - classify it, e.g. by matching against the regexes applicable to the current state
    - use the matched regex's groups to get the meaningful parts of the line's statement
    - using these parts, update the state and the parse tree
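As a rough illustration of those steps, here is a minimal sketch of such a line-based parser for the file layout shown in the question. The regexes, the function name `parse` and the shape of the resulting tree are all assumptions made for illustration; in particular, it guesses that everything after `Read N` is the reading's data and keeps it as a string. Because every pattern is anchored at the start of the line, a keyword such as "exp" inside a Notes field is not mistaken for a statement.

```python
import re

# Hypothetical per-line patterns, guessed from the sample file in the question.
# Each one is anchored at the start of the line (after optional indentation).
EXP_RE = re.compile(r'^exp\s+(\d+)\s*$')
SAMPLE_RE = re.compile(r'^\s+sample\s+(\d+)\s*$')
META_RE = re.compile(r'^\s+Time:\s*(.*?)\s+Notes:\s*(.*)$')
READ_RE = re.compile(r'^\s+Read\s+(\d+)\s+(.*)$')


def parse(lines):
    """Build a nested dict: experiment number -> sample number -> sample info."""
    # Parse tree (layout is an assumption, not something the file format dictates).
    tree = {'header': [], 'experiments': {}}
    # Parser state: the experiment and sample we are currently inside, if any.
    exp = None
    sample = None

    for lineno, raw in enumerate(lines, 1):
        line = raw.rstrip('\r\n')
        if not line.strip():
            continue  # blank lines carry no information

        m = EXP_RE.match(line)
        if m:
            # A new experiment starts; we are no longer inside any sample.
            exp = tree['experiments'].setdefault(int(m.group(1)), {})
            sample = None
            continue

        if exp is None:
            # Everything before the first "exp" line is treated as header text.
            tree['header'].append(line.strip())
            continue

        m = SAMPLE_RE.match(line)
        if m:
            sample = exp.setdefault(
                int(m.group(1)), {'time': None, 'notes': None, 'reads': {}})
            continue

        if sample is not None:
            # Time/Notes and Read lines are only valid inside a sample.
            m = META_RE.match(line)
            if m:
                sample['time'], sample['notes'] = m.groups()
                continue
            m = READ_RE.match(line)
            if m:
                sample['reads'][int(m.group(1))] = m.group(2)
                continue

        raise ValueError('line %d is not valid here: %r' % (lineno, line))

    return tree
```

Since an open file object yields its lines when iterated, it could be passed to `parse` directly; a reading would then be looked up as `tree['experiments'][exp_no][sample_no]['reads'][read_no]`. Raising an error on any line that does not fit the current state is a design choice of this sketch: it makes unexpected variations in the files visible instead of silently skipping them.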

See the question "get the path in a file inside {} by python" for an example. There, I do not construct a parse tree (it wasn't needed) and only track the current state.

Source: https://stackoverflow.com/questions/28476946/tips-for-reading-in-a-complex-file-python

Tags

python

python-2.7

text-parsing