Parse / extract a table from a messy .csv file?

Question

I am parsing an image (PNG) with Amazon Textract and extracting the tables. Here is an example of the resulting CSV when I open it with open(file_name, "r") and read its lines:

['Table: Table_1\n',
 '\n',
 'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
 'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
 'ALBUMIN ,4.3 ,,3.6-5.1 g/dL ,EN ,\n',
 'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
 'ALBUMIN/GLOBULIN RATIO ,2.4 ,,1.0-2.5 (calc) ,EN ,\n',
 'BILIRUBIN, TOTAL ,0.6 ,,0.2-1.2 mg/dL ,EN ,\n',
 'BILIRUBIN, DIRECT ,0.2 ,,< OR = 0.2 mg/dL ,EN ,\n',
 'BILIRUBIN, INDIRECT ,0.4 ,,0.2-1.2 mg/dL (calc) ,EN ,\n',
 'ALKALINE PHOSPHATASE ,61 ,,40-115 U/L ,EN ,\n',
 'AST ,27 ,,10-35 U/L ,EN ,\n',
 'ALT ,19 ,,9-46 U/L ,EN ,\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n']

I tried reading it with pandas read_csv, but I get errors because the format always varies: more or fewer spaces, and different lines before the header row. How can I extract the table from CSV files like this?

Answer 1:

Using a regex, you can parse each line of the file, look for a given pattern, and reject the lines that don't match. Creating groups in the regex lets you extract the values you need and store them in a list of tuples, which can then be used to construct a DataFrame:

import re
import pandas as pd

data = ['Table: Table_1\n',
        '\n',
        'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
        'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
        'ALBUMIN ,4.3 ,,3.6-5.1 g/dL ,EN ,\n',
        'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
        'ALBUMIN/GLOBULIN RATIO ,2.4 ,,1.0-2.5 (calc) ,EN ,\n',
        'BILIRUBIN, TOTAL ,0.6 ,,0.2-1.2 mg/dL ,EN ,\n',
        'BILIRUBIN, DIRECT ,0.2 ,,< OR = 0.2 mg/dL ,EN ,\n',
        'BILIRUBIN, INDIRECT ,0.4 ,,0.2-1.2 mg/dL (calc) ,EN ,\n',
        'ALKALINE PHOSPHATASE ,61 ,,40-115 U/L ,EN ,\n',
        'AST ,27 ,,10-35 U/L ,EN ,\n',
        'ALT ,19 ,,9-46 U/L ,EN ,\n',
        '\n',
        '\n',
        '\n',
        '\n',
        '\n']

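# Test name (non-digits, may contain commas), numeric result, then three more comma-terminated fields.
regex = re.compile(r'(\D+),\s*(\d+\.?\d*)\s*,\s*(.*?)\s*,\s*(.*?)\s*,\s*(.*?)\s*,')
result = []
for line in data:
    match = regex.search(line)
    # Lines that do not match (title, header, blank lines) are skipped.
    if match:
        result.append(match.groups())
df = pd.DataFrame(data=result, columns=('Test Name', 'Result', 'Flag', 'Reference Range', 'Lab'))
print(df)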

The result:

                                Test Name Result Flag       Reference Range  \
0  HEPATIC FUNCTION PANEL PROTEIN, TOTAL     6.1               6.1-8.1 g/dL   
1                                ALBUMIN     4.3               3.6-5.1 g/dL   
2                               GLOBULIN     1.8  LOW   1.9-3.7 g/dL (calc)   
3                 ALBUMIN/GLOBULIN RATIO     2.4             1.0-2.5 (calc)   
4                       BILIRUBIN, TOTAL     0.6              0.2-1.2 mg/dL   
5                      BILIRUBIN, DIRECT     0.2           < OR = 0.2 mg/dL   
6                    BILIRUBIN, INDIRECT     0.4       0.2-1.2 mg/dL (calc)   
7                   ALKALINE PHOSPHATASE      61                 40-115 U/L   
8                                    AST      27                  10-35 U/L   
9                                    ALT      19                   9-46 U/L   

  Lab  
0  EN  
1  EN  
2  EN  
3  EN  
4  EN  
5  EN  
6  EN  
7  EN  
8  EN  
9  EN
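As a possible refinement of the same regex idea, here is a minimal sketch that uses named groups and strips the surrounding whitespace from every field, since the question mentions that the amount of spacing varies. The function name parse_rows and the group names are made up for illustration, and the pattern still assumes that the Result column is always numeric and that only the test name can contain commas.

```python
import re

# Hypothetical pattern: same structure as the regex above, with named groups.
ROW = re.compile(
    r'^(?P<test_name>\D+?)\s*,'           # test name: no digits, may contain commas
    r'\s*(?P<result>\d+\.?\d*)\s*,'       # numeric result (assumed always present)
    r'\s*(?P<flag>[^,]*?)\s*,'            # flag, possibly empty
    r'\s*(?P<reference_range>[^,]*?)\s*,' # reference range
    r'\s*(?P<lab>[^,]*?)\s*,'             # lab code
)


def parse_rows(lines):
    """Return one dict per line that looks like a data row (hypothetical helper)."""
    rows = []
    for line in lines:
        match = ROW.match(line)
        if match:  # title, header and blank lines do not match and are skipped
            rows.append(match.groupdict())
    return rows


sample = [
    'Table: Table_1\n',
    '\n',
    'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
    'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
    'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
]
rows = parse_rows(sample)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows) if a DataFrame is needed.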

Answer 2:

I would suggest curating your data first and then loading the curated data into pandas as a list of lists. The problem with your sample is that the first field contains commas, which interfere with CSV parsing because the comma is also the field separator, so the data has to be cleaned up before it is split. My Python 3 source code is below:

data = ['Table: Table_1\n',
        '\n',
        'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
        'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
        'ALBUMIN ,4.3 ,,3.6-5.1 g/dL ,EN ,\n',
        'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
        'ALBUMIN/GLOBULIN RATIO ,2.4 ,,1.0-2.5 (calc) ,EN ,\n',
        'BILIRUBIN, TOTAL ,0.6 ,,0.2-1.2 mg/dL ,EN ,\n',
        'BILIRUBIN, DIRECT ,0.2 ,,< OR = 0.2 mg/dL ,EN ,\n',
        'BILIRUBIN, INDIRECT ,0.4 ,,0.2-1.2 mg/dL (calc) ,EN ,\n',
        'ALKALINE PHOSPHATASE ,61 ,,40-115 U/L ,EN ,\n',
        'AST ,27 ,,10-35 U/L ,EN ,\n',
        'ALT ,19 ,,9-46 U/L ,EN ,\n',
        '\n',
        '\n',
        '\n',
        '\n',
        '\n']



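import re
import pandas as pd

# Drop the newline from each line.
lines = [x.replace('\n', '') for x in data]

# Match an upper-case test name (optionally containing commas) up to the field separator.
p = re.compile(r'^[/A-Z ]+[,]*[/A-Z ]*,')
curated_lines = []
for l in lines:
    m = p.search(l)
    if m is not None:
        s = m.group(0)
        cs = s.replace(',', '')      # test name with its commas removed
        cl = l.replace(s, cs + ',')  # put back the single separating comma
        curated_lines.append(cl)

# Split on commas; the trailing comma leaves an empty last field, which is dropped.
frame_list_of_list = [l.split(',')[:-1] for l in curated_lines]

df = pd.DataFrame(frame_list_of_list, columns=['Test Name', 'Result', 'Flag', 'Reference Range', 'Lab'])
print(df)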

Which yields the following results:

                           Test Name Result  Flag        Reference Range  Lab
0  HEPATIC FUNCTION PANEL PROTEIN TOTAL    6.1                 6.1-8.1 g/dL   EN 
1                               ALBUMIN    4.3                 3.6-5.1 g/dL   EN 
2                              GLOBULIN    1.8   LOW    1.9-3.7 g/dL (calc)   EN 
3                ALBUMIN/GLOBULIN RATIO    2.4               1.0-2.5 (calc)   EN 
4                       BILIRUBIN TOTAL    0.6                0.2-1.2 mg/dL   EN 
5                      BILIRUBIN DIRECT    0.2             < OR = 0.2 mg/dL   EN 
6                    BILIRUBIN INDIRECT    0.4         0.2-1.2 mg/dL (calc)   EN 
7                  ALKALINE PHOSPHATASE     61                   40-115 U/L   EN 
8                                   AST     27                    10-35 U/L   EN 
9                                   ALT     19                     9-46 U/L   EN
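A different technique from the one in this answer, shown only as a sketch: because the stray commas appear only in the first field, each line can be split from the right with str.rsplit, which keeps the commas in the test name without any regex. The helper name split_from_right and the n_columns parameter are hypothetical. It assumes every table row ends with a trailing comma and has exactly five columns, and that no line before the header contains four or more commas.

```python
def split_from_right(lines, n_columns=5):
    """Split each line into n_columns fields, counting commas from the right.

    Hypothetical helper: lines that do not yield exactly n_columns fields
    (titles, blank lines) are skipped.
    """
    rows = []
    for line in lines:
        text = line.strip()
        if text.endswith(','):
            text = text[:-1]  # drop the trailing separator
        # Only the last n_columns - 1 commas act as separators, so any
        # commas inside the first field are left untouched.
        parts = [part.strip() for part in text.rsplit(',', n_columns - 1)]
        if len(parts) == n_columns:
            rows.append(parts)
    return rows


sample = [
    'Table: Table_1\n',
    '\n',
    'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
    'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
    'ALT ,19 ,,9-46 U/L ,EN ,\n',
    '\n',
]
# The first surviving row is the header; the rest are data rows.
header, *body = split_from_right(sample)
```

With pandas available, pd.DataFrame(body, columns=header) would then build the table. Unlike the curated version above, this keeps the original commas in names such as "BILIRUBIN, TOTAL".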

Source: https://stackoverflow.com/questions/61663415/parse-extract-table-from-a-messed-csv-file

Tags

python-3.x

pandas

amazon-textract