Tips for reading in a complex file - Python

Posted by 孤街浪徒 on 2019-11-27 04:54:47

Question


I have complex, variable text files that I want to read into Python, but I'm not sure what the best strategy would be. I'm not asking anyone to write the code for me, just for some tips on which modules or approaches would best suit my needs.

The files look something like:

Program
Username: X    Laser: X     Em: X

exp 1
    sample 1
        Time: X    Notes: X
        Read 1 X data
        Read 2 X data
        # unknown number of reads
    sample 2
        Time: X    Notes: X
        Read 1 X data
        ...
    # Unknown number of samples

exp 2
    sample 1
    ...
# Unknown number of experiments, samples and reads
# The 4 spaces between certain words represent tabs

To analyse this data I need to get the data for each reading and know which sample and experiment it came from. Also, I can change the output file format but I think the way I have written it here is the easiest to read.

To read this file into Python, the best way I can think of is to read it line by line and search for keywords with regular expressions. For example, search the line for the "exp" keyword and record the number after it, then search for "sample" in the next line, and so on. However, this would not work if a keyword was used in the 'notes' section.

So, I'm kind of stumped as to what would best suit my needs (it's hard to use something if you don't know it exists!)

Thanks for your time.


Answer 1:


It's a typical task for a syntactic analyzer (a parser). In this case, since

  • lexical constructs do not cross line boundaries and each line holds exactly one construct ("statement")
  • the full syntax of a single line can be covered by a set of regexes
  • the structure of compounds (=entities connecting multiple "statements" into something bigger) is simple and straightforward

a (relatively) simple scannerless parser based on lines, a DFA and the aforementioned set of regexes can be applied:

  • set up the initial parser state (=current position relative to various entities to be tracked) and the parse tree (=data structure representing the information from the file in a convenient way)
  • for each line
    • classify it, e.g. by matching against the regexes applicable to the current state
    • use the matched regex's groups to get the line's statement's meaningful parts
    • using these parts, update the state and the parse tree
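The steps above can be sketched roughly as follows. This is a minimal illustration, not the only way to do it: the regex names, the `parse` function, the nested-dict layout of the parse tree and the made-up `DEMO` data are all assumptions, as is the guess that a read line has the form "Read <number> <data>" and that fields are tab-separated.

```python
import re

# Hypothetical patterns, one per kind of line. Each is anchored to the whole
# line, so a keyword inside the free-text Notes field is never misclassified.
HEADER = re.compile(r"^Username:\s*(.*?)\s+Laser:\s*(.*?)\s+Em:\s*(.*)$")
EXP = re.compile(r"^exp\s+(\d+)$")
SAMPLE = re.compile(r"^\s+sample\s+(\d+)$")
META = re.compile(r"^\s+Time:\s*(.*?)\s+Notes:\s*(.*)$")
READ = re.compile(r"^\s+Read\s+(\d+)\s+(.*)$")


def parse(lines):
    """Build a nested dict: experiments -> samples -> time, notes, reads."""
    tree = {"header": None, "experiments": {}}
    exp = None      # state: samples of the experiment currently being read
    sample = None   # state: the sample currently being read
    for lineno, raw in enumerate(lines, 1):
        line = raw.rstrip()
        if not line:
            continue
        m = EXP.match(line)
        if m:
            exp = tree["experiments"].setdefault(int(m.group(1)), {})
            sample = None
            continue
        if exp is None:
            # Before the first experiment only header lines are expected;
            # other lines (e.g. the program name) are ignored here.
            m = HEADER.match(line)
            if m:
                tree["header"] = dict(zip(("username", "laser", "em"), m.groups()))
            continue
        m = SAMPLE.match(line)
        if m:
            sample = exp.setdefault(
                int(m.group(1)), {"time": None, "notes": None, "reads": {}})
            continue
        if sample is not None:
            m = META.match(line)
            if m:
                sample["time"], sample["notes"] = m.groups()
                continue
            m = READ.match(line)
            if m:
                sample["reads"][int(m.group(1))] = m.group(2)
                continue
        raise ValueError("line %d: unexpected %r" % (lineno, line))
    return tree


# Made-up data in the layout from the question (tabs written as \t).
DEMO = (
    "Program\n"
    "Username: bob\tLaser: 488\tEm: 520\n"
    "\n"
    "exp 1\n"
    "\tsample 1\n"
    "\t\tTime: 10:30\tNotes: repeat of exp 2 sample 3\n"
    "\t\tRead 1 0.51\n"
    "\t\tRead 2 0.49\n"
    "exp 2\n"
    "\tsample 1\n"
    "\t\tTime: 11:00\tNotes: none\n"
    "\t\tRead 1 0.12\n"
)

tree = parse(DEMO.splitlines())
```

Because a line is classified by its whole shape rather than by searching for a keyword anywhere in it, the note "repeat of exp 2 sample 3" ends up in the sample's `notes` field and does not start a new experiment or sample.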

See the question "get the path in a file inside {} by python" for an example. There, I do not construct a parse tree (it wasn't needed) but only track the current state.
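For the file format in this question, the state-only variant (no parse tree) might look like the sketch below. It is not the code from the linked example: the `LINE` pattern, the `iter_reads` name and the assumed "Read <number> <data>" line shape are hypothetical. Each read is yielded together with the experiment and sample it belongs to.

```python
import re

# Hypothetical single pattern: optional indentation, a keyword, a number and,
# for reads, the rest of the line. Lines starting with anything else
# (header, Time/Notes, blanks) do not match and are skipped.
LINE = re.compile(r"^\s*(exp|sample|Read)\s+(\d+)(?:\s+(.*))?$")


def iter_reads(lines):
    """Yield (experiment, sample, read, data) tuples, tracking state only."""
    exp = sample = None
    for raw in lines:
        m = LINE.match(raw.rstrip())
        if not m:
            continue
        kind, number, rest = m.group(1), int(m.group(2)), m.group(3)
        if kind == "exp":
            exp, sample = number, None
        elif kind == "sample":
            sample = number
        else:
            yield exp, sample, number, rest
```

The flat tuples it produces can be passed on as they are, for instance to `csv.writer` or a `pandas.DataFrame`, when the nested structure is not needed.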



Source: https://stackoverflow.com/questions/28476946/tips-for-reading-in-a-complex-file-python
