Parse prettyprinted tabular data with pandas

Posted by 戏子无情 on 2020-01-02 04:36:06

Question


What is the best way to copy a table that contains different delimiters, spaces in column names, etc.? The function pd.read_clipboard() cannot manage this task on its own.

Example 1:

| Age Category | A | B  | C  | D |
|--------------|---|----|----|---|
| 21-26        | 2 | 2  | 4  | 1 |
| 26-31        | 7 | 11 | 12 | 5 |
| 31-36        | 3 | 5  | 5  | 2 |
| 36-41        | 2 | 4  | 1  | 7 |
| 41-46        | 0 | 1  | 3  | 2 |
| 46-51        | 0 | 0  | 2  | 3 |

Expected result:

 Age Category  A  B   C   D    
 21-26         2  2   4   1 
 26-31         7  11  12  5 
 31-36         3  5   5   2 
 36-41         2  4   1   7 
 41-46         0  1   3   2 
 46-51         0  0   2   3

EDIT:

Example 2:

+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+

Expected result:

   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson

I am looking for a universal approach that can be applied to the most common table types.


Answer 1:


One option is to bite the bullet and just preprocess your data. This isn't all that bad: there are only so many cases pd.read_csv can handle through its arguments, and if you want to handle them exhaustively you will eventually end up turning to regex.

To handle most of the common cases of prettyprinted tables, I'd just write a loop to filter out or replace characters in each line, then read the output with a relatively simple read_csv call.

import os

import pandas as pd

def load(filename):
    # Rewrite the table as comma-separated text in a temporary file
    with open(filename) as fin, open('temp.txt', 'w') as fout:
        for line in fin:
            if line.strip()[:2] not in {'|-', '+-'}: # filter step: skip rule lines
                fout.write(line.strip().strip('|').replace('|', ',')+'\n')

    # Commas may be padded with spaces, hence the regex separator
    df = pd.read_csv('temp.txt', sep=r'\s*,\s*', engine='python')
    os.unlink('temp.txt') # cleanup

    return df

df1 = load('data1.txt')
df2 = load('data2.txt')

df1

  Age Category  A   B   C  D
0        21-26  2   2   4  1
1        26-31  7  11  12  5
2        31-36  3   5   5  2
3        36-41  2   4   1  7
4        41-46  0   1   3  2
5        46-51  0   0   2  3

df2

   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson
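
The temporary file is not essential to this approach. Below is a minimal sketch of the same filter-and-replace idea done in memory with io.StringIO. The name load_text, and the choice to accept a string rather than a filename, are assumptions made for illustration and are not part of the original answer. Each cell is stripped before joining, so the default comma separator is enough.

```python
import io

import pandas as pd


def load_text(text):
    """Parse a pretty-printed table held in a string (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        row = line.strip()
        # Skip blank lines and rule lines such as |----|---| or +---+---+
        if not row or row[:2] in {'|-', '+-'}:
            continue
        # Drop the outer pipes, split on the inner ones, trim each cell.
        # Caveat: a cell that itself contains a comma would break this.
        cells = [cell.strip() for cell in row.strip('|').split('|')]
        rows.append(','.join(cells))
    return pd.read_csv(io.StringIO('\n'.join(rows)))


sample = """
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+
"""
df = load_text(sample)
```

Because the text never touches the disk, the same helper could be fed from a file's contents or from a pasted string.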



Answer 2:


The reason this is so complicated is that these types of ASCII tables are not really designed with data transfer in mind. Their true function is to depict the data in a visually pleasing way.

This doesn't mean they cannot be used to get data into pandas! Let's start with .read_clipboard():

df = pd.read_clipboard(sep='|').iloc[1:,1:-1]

Instead of the default separator (the regex '\s+', i.e. any run of whitespace), we define | to be the separator.

The .iloc[1:,1:-1] gets rid of the first row (-----------) and the first and last columns: because of the leading and trailing | on each line, pandas sees an 'empty' column at each end.

Now all that is left is to strip whitespace from the column names and values:

stripped_columns = []
for column_name in df.columns:
    df[column_name] = df[column_name].str.strip()
    stripped_columns.append(column_name.strip())
df.columns = stripped_columns
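
The same stripping can also be written with pandas' vectorized string methods. This is a sketch that assumes every column still holds strings at this point. The small frame is stand-in data taken from the first two rows of Example 1, so that the snippet runs without a clipboard.

```python
import pandas as pd

# Stand-in for the frame produced by read_clipboard(sep='|').iloc[1:, 1:-1]
df = pd.DataFrame({' Age Category ': [' 21-26        ', ' 26-31        '],
                   ' A ': [' 2 ', ' 7 ']})

# Strip padding from all column labels in one call
df.columns = df.columns.str.strip()
# Strip padding from every cell, one column at a time
df = df.apply(lambda col: col.str.strip())
```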

And if you want Age Category to be your index:

df.set_index('Age Category', inplace=True)

The last pass I would make is to ensure that all your columns now actually hold numbers and not strings:

df = df.astype('int')

Resulting in (output of df.info()):

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 21-26 to 46-51
Data columns (total 4 columns):
A    6 non-null int64
B    6 non-null int64
C    6 non-null int64
D    6 non-null int64
dtypes: int64(4)
memory usage: 400.0+ bytes
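
Note that astype('int') raises an error if any column holds text that is not an integer, as the firstName column of Example 2 would. A more forgiving sketch tries pd.to_numeric on each column and leaves the column untouched when the conversion fails. The frame below is stand-in data taken from Example 2.

```python
import pandas as pd

df = pd.DataFrame({'id': ['1', '2', '3'],
                   'firstName': ['Mark', 'Tom', 'Joshua']})

for name in df.columns:
    try:
        df[name] = pd.to_numeric(df[name])
    except (ValueError, TypeError):
        pass  # not numeric: keep the strings as they are
```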

I am not sure what your reason is for reading it from the clipboard. A slightly more elegant solution might be to paste it into a .csv file and use the more advanced features .read_csv() has to offer. The necessary transformations, however, would remain the same.
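
As a rough sketch of the read_csv route, assume the first rows of Example 1 have been saved as text (an in-memory StringIO stands in for the file here). skiprows drops the rule line and skipinitialspace trims the blanks that follow each |, but trailing padding still has to be stripped by hand.

```python
import io

import pandas as pd

text = """| Age Category | A | B  | C  | D |
|--------------|---|----|----|---|
| 21-26        | 2 | 2  | 4  | 1 |
| 26-31        | 7 | 11 | 12 | 5 |
"""

# Line 1 (0-based) is the |----| rule row, so skip it while reading
df = pd.read_csv(io.StringIO(text), sep='|', skiprows=[1], skipinitialspace=True)
# The leading and trailing | produce two empty, unnamed columns
df = df.iloc[:, 1:-1]
# skipinitialspace only removes leading blanks, so trim the rest
df.columns = df.columns.str.strip()
df['Age Category'] = df['Age Category'].str.strip()
```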




Answer 3:


Here is another potential solution using re.sub and io.StringIO:

from io import StringIO
import re

import pandas as pd

text1 = """
| Age Category | A | B  | C  | D |
|--------------|---|----|----|---|
| 21-26        | 2 | 2  | 4  | 1 |
| 26-31        | 7 | 11 | 12 | 5 |
| 31-36        | 3 | 5  | 5  | 2 |
| 36-41        | 2 | 4  | 1  | 7 |
| 41-46        | 0 | 1  | 3  | 2 |
| 46-51        | 0 | 0  | 2  | 3 |
"""

text2= """
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+
"""

df1 = pd.read_csv(StringIO(re.sub(r'[|+]|-{2,}', '  ', text1)), sep=r'\s{2,}', engine='python')
df2 = pd.read_csv(StringIO(re.sub(r'[|+]|-{2,}', '  ', text2)), sep=r'\s{2,}', engine='python')

[out]

df1

  Age Category  A   B   C  D
0        21-26  2   2   4  1
1        26-31  7  11  12  5
2        31-36  3   5   5  2
3        36-41  2   4   1  7
4        41-46  0   1   3  2
5        46-51  0   0   2  3

df2

   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson



Answer 4:


For this type of table, you can simply use:

df = pd.read_clipboard(sep='|')

Minimal cleanup is then needed:

df = df.drop(0)
df = df.drop(['Unnamed: 0','Unnamed: 6'], axis=1)
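
The names Unnamed: 0 and Unnamed: 6 depend on how many columns the table has. A sketch that avoids hard-coding them is to drop every column that is entirely empty. Here read_csv on a StringIO stands in for read_clipboard so that the snippet runs without a clipboard; that substitution is an assumption for illustration.

```python
import io

import pandas as pd

text = """| Age Category | A | B  | C  | D |
|--------------|---|----|----|---|
| 21-26        | 2 | 2  | 4  | 1 |
| 26-31        | 7 | 11 | 12 | 5 |
"""

df = pd.read_csv(io.StringIO(text), sep='|')
df = df.drop(0)                    # remove the |----| rule row
df = df.dropna(axis=1, how='all')  # remove the empty edge columns
```

The column labels and values still carry their padding after this step, so the whitespace stripping described in the other answers applies here as well.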

As for the "writing such a spreadsheet" question... I don't see how anything could be more convenient than the plain presentation, but here's bad code for it, given the above cleaned df:

df1 = pd.concat([df, pd.DataFrame({i:['-'*len(i)] for i in df.columns})]).sort_index() #adding the separator to column titles (DataFrame.append was removed in pandas 2.0)
df2 = pd.DataFrame({str(i)+'|':['|']*len(df1) for i in range(len(df1.columns))})
df3 = df1.join(df2)
col_order = [j for i in [[df1.columns[x], df2.columns[x]] for x in range(len(df1.columns))] for j in i]
df3.index = ['|']*len(df3.index)

Then:

df3[col_order]

    Age Category  0|   A  1|   B   2|   C   3|   D  4|
|  --------------  |  ---  |  ----  |  ----  |  ---  |
|   21-26          |   2   |   2    |   4    |   1   |
|   26-31          |   7   |   11   |   12   |   5   |
|   31-36          |   3   |   5    |   5    |   2   |
|   36-41          |   2   |   4    |   1    |   7   |
|   41-46          |   0   |   1    |   3    |   2   |
|   46-51          |   0   |   0    |   2    |   3   |
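
For comparison, similar output can be produced with plain string formatting instead of DataFrame joins. This is only a sketch: to_pipe_table is a hypothetical helper name, not a pandas function, and the sample frame reuses the first two rows of Example 1.

```python
import pandas as pd


def to_pipe_table(df):
    """Render a DataFrame as a pipe-delimited text table (hypothetical helper)."""
    header = [str(c) for c in df.columns]
    body = [[str(v) for v in row] for row in df.itertuples(index=False)]
    # Each column is as wide as its longest cell, header included
    widths = [max(len(cells[i]) for cells in [header] + body)
              for i in range(len(header))]

    def fmt(cells):
        # Left-align every cell and pad it to its column width
        return '| ' + ' | '.join(c.ljust(w) for c, w in zip(cells, widths)) + ' |'

    rule = '|' + '|'.join('-' * (w + 2) for w in widths) + '|'
    return '\n'.join([fmt(header), rule] + [fmt(r) for r in body])


df = pd.DataFrame({'Age Category': ['21-26', '26-31'], 'A': [2, 7], 'B': [2, 11]})
print(to_pipe_table(df))
```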




Source: https://stackoverflow.com/questions/59211661/parse-prettyprinted-tabular-data-with-pandas
