How to parse unstructured table-like data?

问题

I have a text file that holds some result of an operation. The data is displayed in a human-readable format (like a table). How do I parse this data so that I can form a data structure such as dictionaries with this data?

An example of the unstructured data is shown below.

===============================================================
Title
===============================================================
Header     Header Header Header  Header       Header
1          2      3      4       5            6                   
---------------------------------------------------------------
1          Yes    No     6       0001 0002    True    
2          No     Yes    7       0003 0004    False    
3          Yes    No     6       0001 0001    True    
4          Yes    No     6       0001 0004    False    
4          No     No     4       0004 0004    True    
5          Yes    No     2       0001 0001    True    
6          Yes    No     1       0001 0001    False    
7          No     No     2       0004 0004    True

The data displayed in the above example is not tab-separated or comma separated. It always has a header and correspondingly may/may not have values along the column-like appearance.

I have tried using basic parsing techniques such as regex and conditional checks, but I need a more robust way to parse this data as the above shown example is not the only format the data gets rendered.

Update 1: There are many cases apart from the shown example such as addition of more columns, single cell having more than one instance (but shown visually in next row, whereas it belongs to the previous row).

Is there any python library to solve this problem?

Can machine learning techniques help in this problem without parsing? If yes, what type of problem would it be (Classification, Regression, Clustering)?

===============================================================
Title
===============================================================
Header     Key_1   Header Header  Header       Header
1          Key_2   3      4       5            6                   
---------------------------------------------------------------
1          Value1  No     6       0001 0002    True
           Value2    
2          Value1  Yes    7       0003 0004    False    
           Value2
3          Value1  No     6       0001 0001    True    
           Value2
4          Value1  No     6       0001 0004    False    
           Value2  
5          Value1  No     4       0004 0004    True    
           Value2  
6          Value1  No     2       0001 0001    True    
           Value2  
7          Value1  No     1       0001 0001    False    
           Value2  
8          Value1  No     2       0004 0004    True
           Value2

Update 2: Another example of what it might look like which involves a single cell having more than one instance (but shown visually in next row, whereas it belongs to the previous row).

回答1:

Say your example is 'sample.txt'.

import pandas as pd

df = pd.read_table('sample.txt', skiprows=[0, 1, 2, 3, 5], delimiter='\s\s+')

print(df)
print(df.shape)

   1    2    3  4          5      6
0  1  Yes   No  6  0001 0002   True
1  2   No  Yes  7  0003 0004  False
2  3  Yes   No  6  0001 0001   True
3  4  Yes   No  6  0001 0004  False
4  4   No   No  4  0004 0004   True
5  5  Yes   No  2  0001 0001   True
6  6  Yes   No  1  0001 0001  False
7  7   No   No  2  0004 0004   True
(8, 6)

You can change the data types of course. Please check tons of params of pd.read_table(). Also, there are method for xlsx, csv, html, sql, json, hdf, even clipboard, etc.

welcome to pandas...

回答2:

Try this, it should fully handle multi-row cells:

import re

def splitLine(line, delimiters):
    output = []
    for start, end in delimiters:
        output.append(line[start:end].strip())
    return output

with open("path/to/the/file.ext", "r") as f:
    _ = f.readline()
    _ = f.readline()
    _ = f.readline()

    headers = [f.readline()]
    next = f.readline()
    while(next[0] != "-"):
        headers.append(next)
        next = f.readline()

    starts = []
    columnNames = set(headers[0].split())
    for each in columnNames:
        starts.extend([i for i in re.finditer(each, headers[0])])
    starts.sort()
    delimiters = list(zip(starts, starts[1:] + [-1]))

    if (len(columnNames) - 1):
        rowsPerEntry = len(headers)
    else:
        rowsPerEntry = 1

    headers = [splitLine(header, delimiters) for header in headers]
    keys = []
    for i in range(len(starts)):
        if ("Header" == headers[0][i]):
            keys.append(headers[1][i])
        else:
            keys.append([])
            for header in headers:
                keys[-1].append(header[i])

    entries = []
    rows = []
    for line in f:
        rows.append(splitLine(line, delimiters))
        if (rowsPerEntry == len(rows)):
            if (1 == rowsPerEntry):
                entries.append(dict(zip(keys, rows[0])))
            else:
                entries.append({})
                for i, key in enumerate(keys):
                    if (str == type(key)):
                       entries[-1][key] = rows[0][i]
                    else:
                       k = "Column " + str(i+1)
                       entries[-1][k] = dict.fromkeys(key)
                       for j, subkey in enumerate(key):
                           entries[-1][k][subkey] = rows[j][i]
            rows = []

Explanation

We use the re module in order to find the appearances og "Header" in the 4th column.

The splitLine(line, delimiters) auxiliar function returns an array of the line splitted by columns as defined by the delimiters parameter. This parameter is a list of 2-items tuples, where the first item represents the starting position and the second one the ending position.

回答3:

i dont know what you want to do with the title so i will go ahead and skip all 6 lines...the space is not consistent so you need first to make the space consistent between records otherwise it will be hard to read it line by line. you can do some thing like this

import re

def read_file():
    with open('unstructured-data.txt', 'r') as f:
         for line in f.readlines()[6:]:
             line = re.sub(" +", " ", line)
             print(line)
             record = line.split(" ")
             print(record)
read_file()

which will give you something like this

1 Yes No 6 0001 0002 True

['1', 'Yes', 'No', '6', '0001', '0002', 'True', '\n']
2 No Yes 7 0003 0004 False 

['2', 'No', 'Yes', '7', '0003', '0004', 'False', '\n']
3 Yes No 6 0001 0001 True 

['3', 'Yes', 'No', '6', '0001', '0001', 'True', '\n']
4 Yes No 6 0001 0004 False 

['4', 'Yes', 'No', '6', '0001', '0004', 'False', '\n']
4 No No 4 0004 0004 True 

['4', 'No', 'No', '4', '0004', '0004', 'True', '\n']
5 Yes No 2 0001 0001 True 

['5', 'Yes', 'No', '2', '0001', '0001', 'True', '\n']
6 Yes No 1 0001 0001 False 

['6', 'Yes', 'No', '1', '0001', '0001', 'False', '\n']
7 No No 2 0004 0004 True

['7', 'No', 'No', '2', '0004', '0004', 'True\n']

来源：https://stackoverflow.com/questions/43224981/how-to-parse-unstructured-table-like-data

标签

python

parsing

text

machine-learning

text-parsing