Question
I am parsing a PNG image with Amazon Textract and extracting its tables to CSV.
Here is an example of such a CSV when I open it with open(file_name, "r") and read its lines:
['Table: Table_1\n',
'\n',
'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
'ALBUMIN ,4.3 ,,3.6-5.1 g/dL ,EN ,\n',
'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
'ALBUMIN/GLOBULIN RATIO ,2.4 ,,1.0-2.5 (calc) ,EN ,\n',
'BILIRUBIN, TOTAL ,0.6 ,,0.2-1.2 mg/dL ,EN ,\n',
'BILIRUBIN, DIRECT ,0.2 ,,< OR = 0.2 mg/dL ,EN ,\n',
'BILIRUBIN, INDIRECT ,0.4 ,,0.2-1.2 mg/dL (calc) ,EN ,\n',
'ALKALINE PHOSPHATASE ,61 ,,40-115 U/L ,EN ,\n',
'AST ,27 ,,10-35 U/L ,EN ,\n',
'ALT ,19 ,,9-46 U/L ,EN ,\n',
'\n',
'\n',
'\n',
'\n',
'\n']
I tried reading it with pandas read_csv, but I get errors because the format always varies: more or fewer spaces, and different leading lines before the header row.
How can I extract the table from CSV files like this?
Answer 1:
Using a regex, you can parse each line in your file, look for a given pattern, and reject the lines that don't match. Capture groups in the regex let you extract the values you need and store them in a list of tuples, which can then be used to construct a DataFrame:
import re
import pandas as pd
data = ['Table: Table_1\n',
'\n',
'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
'ALBUMIN ,4.3 ,,3.6-5.1 g/dL ,EN ,\n',
'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
'ALBUMIN/GLOBULIN RATIO ,2.4 ,,1.0-2.5 (calc) ,EN ,\n',
'BILIRUBIN, TOTAL ,0.6 ,,0.2-1.2 mg/dL ,EN ,\n',
'BILIRUBIN, DIRECT ,0.2 ,,< OR = 0.2 mg/dL ,EN ,\n',
'BILIRUBIN, INDIRECT ,0.4 ,,0.2-1.2 mg/dL (calc) ,EN ,\n',
'ALKALINE PHOSPHATASE ,61 ,,40-115 U/L ,EN ,\n',
'AST ,27 ,,10-35 U/L ,EN ,\n',
'ALT ,19 ,,9-46 U/L ,EN ,\n',
'\n',
'\n',
'\n',
'\n',
'\n']
# Groups: test name, numeric result, flag, reference range, lab.
regex = re.compile(r'(\D+),\s*(\d+\.?\d*)\s*,\s*(.*?)\s*,\s*(.*?)\s*,\s*(.*?)\s*,')
result = []
for line in data:
    match = regex.search(line)
    # Title, header, and blank lines do not match and are skipped.
    if match:
        result.append(match.groups())
df = pd.DataFrame(data=result, columns=['Test Name', 'Result', 'Flag', 'Reference Range', 'Lab'])
print(df)
The result:
   Test Name                              Result  Flag  Reference Range       Lab
0  HEPATIC FUNCTION PANEL PROTEIN, TOTAL  6.1           6.1-8.1 g/dL          EN
1  ALBUMIN                                4.3           3.6-5.1 g/dL          EN
2  GLOBULIN                               1.8     LOW   1.9-3.7 g/dL (calc)   EN
3  ALBUMIN/GLOBULIN RATIO                 2.4           1.0-2.5 (calc)        EN
4  BILIRUBIN, TOTAL                       0.6           0.2-1.2 mg/dL         EN
5  BILIRUBIN, DIRECT                      0.2           < OR = 0.2 mg/dL      EN
6  BILIRUBIN, INDIRECT                    0.4           0.2-1.2 mg/dL (calc)  EN
7  ALKALINE PHOSPHATASE                   61            40-115 U/L            EN
8  AST                                    27            10-35 U/L             EN
9  ALT                                    19            9-46 U/L              EN
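With the pattern used in this answer, the test name keeps the space that precedes its comma, and Result is stored as text. A minimal sketch of one way to tidy both, assuming the same pattern and a shortened sample; the names sample, row_pattern, and clean_df are illustrative and not part of the original answer:

```python
import re

import pandas as pd

# A few lines in the same shape as the Textract output shown in the question.
sample = [
    'Table: Table_1\n',
    '\n',
    'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
    'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
    'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
    'AST ,27 ,,10-35 U/L ,EN ,\n',
]

# Same pattern as in the answer: name, numeric result, then three free-form fields.
row_pattern = re.compile(
    r'(\D+),\s*(\d+\.?\d*)\s*,\s*(.*?)\s*,\s*(.*?)\s*,\s*(.*?)\s*,'
)
columns = ['Test Name', 'Result', 'Flag', 'Reference Range', 'Lab']

rows = []
for line in sample:
    match = row_pattern.search(line)
    if match:
        # Strip every captured field so no padding survives.
        rows.append([field.strip() for field in match.groups()])

clean_df = pd.DataFrame(rows, columns=columns)
# Convert the Result column from text to numbers.
clean_df['Result'] = pd.to_numeric(clean_df['Result'])
print(clean_df)
```

Once Result is numeric, comparisons such as clean_df[clean_df['Result'] > 5] behave as expected. Note that this pattern only accepts rows whose Result is a plain number, so a row with a textual result would be skipped.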
Answer 2:
I would suggest curating your data and then loading the curated data into pandas as a list of lists. The problem with your sample is that the first field contains commas, which interfere with CSV parsing because the comma is also the field separator. The data therefore needs to be curated first. My Python 3 source code is below:
data = ['Table: Table_1\n',
'\n',
'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
'ALBUMIN ,4.3 ,,3.6-5.1 g/dL ,EN ,\n',
'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
'ALBUMIN/GLOBULIN RATIO ,2.4 ,,1.0-2.5 (calc) ,EN ,\n',
'BILIRUBIN, TOTAL ,0.6 ,,0.2-1.2 mg/dL ,EN ,\n',
'BILIRUBIN, DIRECT ,0.2 ,,< OR = 0.2 mg/dL ,EN ,\n',
'BILIRUBIN, INDIRECT ,0.4 ,,0.2-1.2 mg/dL (calc) ,EN ,\n',
'ALKALINE PHOSPHATASE ,61 ,,40-115 U/L ,EN ,\n',
'AST ,27 ,,10-35 U/L ,EN ,\n',
'ALT ,19 ,,9-46 U/L ,EN ,\n',
'\n',
'\n',
'\n',
'\n',
'\n']
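import re
import pandas as pd

# Drop the trailing newline from every line.
lines = [x.replace('\n', '') for x in data]
# Match the leading test name, which may contain a comma, plus its closing comma.
p = re.compile(r'^[/A-Z ]+[,]*[/A-Z ]*,')
curated_lines = []
for l in lines:
    m = p.search(l)
    if m is not None:
        s = m.group(0)
        # Remove the commas inside the name, then restore the field separator.
        cs = s.replace(',', '')
        cl = l.replace(s, cs + ',')
        curated_lines.append(cl)
# Split on commas and drop the empty field after the trailing comma.
frame_list_of_list = [l.split(',')[:-1] for l in curated_lines]
df = pd.DataFrame(frame_list_of_list, columns=['Test Name', 'Result', 'Flag', 'Reference Range', 'Lab'])
print(df)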
This yields the following result:
   Test Name                             Result  Flag  Reference Range       Lab
0  HEPATIC FUNCTION PANEL PROTEIN TOTAL  6.1           6.1-8.1 g/dL          EN
1  ALBUMIN                               4.3           3.6-5.1 g/dL          EN
2  GLOBULIN                              1.8     LOW   1.9-3.7 g/dL (calc)   EN
3  ALBUMIN/GLOBULIN RATIO                2.4           1.0-2.5 (calc)        EN
4  BILIRUBIN TOTAL                       0.6           0.2-1.2 mg/dL         EN
5  BILIRUBIN DIRECT                      0.2           < OR = 0.2 mg/dL      EN
6  BILIRUBIN INDIRECT                    0.4           0.2-1.2 mg/dL (calc)  EN
7  ALKALINE PHOSPHATASE                  61            40-115 U/L            EN
8  AST                                   27            10-35 U/L             EN
9  ALT                                   19            9-46 U/L              EN
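Because only the first field contains commas in this sample, another option is to split each line from the right, which keeps the commas in the test name. A minimal sketch, assuming the last four fields (Result, Flag, Reference Range, Lab) never contain a comma and that every data row ends with exactly one trailing comma; sample, rows, and split_df are illustrative names:

```python
import pandas as pd

# A few lines in the same shape as the Textract output shown in the question.
sample = [
    'Table: Table_1\n',
    '\n',
    'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
    'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
    'BILIRUBIN, DIRECT ,0.2 ,,< OR = 0.2 mg/dL ,EN ,\n',
    'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
    '\n',
]
columns = ['Test Name', 'Result', 'Flag', 'Reference Range', 'Lab']

rows = []
for line in sample:
    text = line.strip()
    # Remove exactly one trailing comma (the empty last field).
    if text.endswith(','):
        text = text[:-1]
    # Split at most 4 times from the right, so extra commas stay in the name.
    parts = [part.strip() for part in text.rsplit(',', 4)]
    # Skip blank lines, the "Table:" title, and the header row itself.
    if len(parts) != len(columns) or parts == columns:
        continue
    rows.append(parts)

split_df = pd.DataFrame(rows, columns=columns)
print(split_df)
```

This version does not depend on the test name being upper case, but a preamble line that happens to contain four or more commas would be treated as data, so it may need an extra check for other inputs.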
Source: https://stackoverflow.com/questions/61663415/parse-extract-table-from-a-messed-csv-file