Question
What is the best way to copy a table that contains different delimiters, spaces in column names, etc.? The function pd.read_clipboard() cannot manage this task on its own.
Example 1:
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
Expected result:
Age Category A B C D
21-26 2 2 4 1
26-31 7 11 12 5
31-36 3 5 5 2
36-41 2 4 1 7
41-46 0 1 3 2
46-51 0 0 2 3
EDIT:
Example 2:
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+
Expected result:
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
I am looking for a universal approach that can be applied to the most common table types.
Answer 1:
One option is to bite the bullet and just preprocess your data. This isn't all that bad: there are only so many cases pd.read_csv can handle through its arguments, and if you want to be exhaustive about the cases you handle, you'll eventually end up turning to regex.
To handle most of the common cases of pretty-printed tables, I'd just write a loop to filter out or replace characters in each line, then read in the output using a relatively simple read_csv call.
import os
import pandas as pd

def load(filename):
    with open(filename) as fin, open('temp.txt', 'w') as fout:
        for line in fin:
            if line.strip()[:2] not in {'|-', '+-'}:  # filter step: skip border/separator rows
                # drop the outer pipes, then turn the inner pipes into commas
                fout.write(line.strip().strip('|').replace('|', ',') + '\n')
    df = pd.read_csv('temp.txt', sep=r'\s*,\s*', engine='python')
    os.unlink('temp.txt')  # cleanup
    return df
df1 = load('data1.txt')
df2 = load('data2.txt')
df1
  Age Category  A   B   C  D
0        21-26  2   2   4  1
1        26-31  7  11  12  5
2        31-36  3   5   5  2
3        36-41  2   4   1  7
4        41-46  0   1   3  2
5        46-51  0   0   2  3
df2
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
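The same filter-and-replace idea can also be sketched without a temporary file by building the cleaned text in memory. The helper name load_text and the sample string below are hypothetical, and the sketch assumes that no cell contains a comma or a pipe character.

```python
import io

import pandas as pd


def load_text(text):
    """Parse a pretty-printed table held in a string (hypothetical helper)."""
    cleaned = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and border/separator rows such as |---| or +---+.
        if not line or line[:2] in {'|-', '+-'}:
            continue
        # Drop the outer pipes, split on the inner ones, and trim each cell.
        cells = [cell.strip() for cell in line.strip('|').split('|')]
        # Assumption: cells contain no commas, so a plain join is safe.
        cleaned.append(','.join(cells))
    return pd.read_csv(io.StringIO('\n'.join(cleaned)))


example = """
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+
"""
df = load_text(example)
```

Because every cell is trimmed before joining, the default comma separator is enough here and no regex separator is needed.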
Answer 2:
The reason this is so complicated is that these types of ASCII tables are not really designed with data transfer in mind. Their true function is to depict the data in a visually pleasing way.
This doesn't mean it is not possible to use them to transfer data into pandas! Let's start with .read_clipboard():
df = pd.read_clipboard(sep='|').iloc[1:,1:-1]
Instead of the default whitespace separator (\s+), we define | to be the separator.
The .iloc[1:,1:-1] gets rid of the first row (-----------) and the first and last columns: because of the | at the beginning and end of each line, pandas sees an 'empty' column there.
Now all that is left is to strip whitespace from the column names and values:
stripped_columns = []
for column_name in df.columns:
    df[column_name] = df[column_name].str.strip()
    stripped_columns.append(column_name.strip())
df.columns = stripped_columns
And if you want Age Category to be your index:
df.set_index('Age Category', inplace=True)
The last pass I would make is to ensure that all your columns now actually hold numbers and not strings:
df = df.astype('int')
Resulting in:
<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 21-26 to 46-51
Data columns (total 4 columns):
A 6 non-null int64
B 6 non-null int64
C 6 non-null int64
D 6 non-null int64
dtypes: int64(4)
memory usage: 400.0+ bytes
I am not sure what your reason is for reading it from the clipboard. A slightly more elegant solution might be to paste it into a .csv file and use the more advanced features .read_csv() has to offer. The necessary transformations, however, would remain the same.
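As a rough sketch of that file-based route, the snippet below uses an in-memory string as a stand-in for the pasted file (the name pasted and the shortened sample are assumptions). It relies on skiprows to drop the dashed row and reads every cell as text so the stripping step behaves the same for all columns.

```python
import io

import pandas as pd

# Stand-in for the contents of the pasted file (first rows of Example 1).
pasted = """| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
"""

# skiprows=[1] drops the |---| row; dtype=str keeps every cell as text for now.
raw = pd.read_csv(io.StringIO(pasted), sep='|', skiprows=[1], dtype=str)

# The outer pipes create an empty first and last column; drop them.
df = raw.iloc[:, 1:-1].copy()

# Strip the padding from the column names and from every value.
df.columns = df.columns.str.strip()
df = df.apply(lambda col: col.str.strip())

# Use the category as the index and convert the remaining columns to integers.
df = df.set_index('Age Category').astype(int)
```

With a real file, the io.StringIO(pasted) argument would simply be replaced by the file path.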
Answer 3:
Here is another potential solution using re.sub and io.StringIO:
from io import StringIO
import re
import pandas as pd
text1 = """
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
"""
text2= """
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+
"""
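# Replace pipes, plus signs and dash runs with two spaces, so that even unpadded cells
# (e.g. id|firstName) end up separated by at least two spaces for the \s{2,} separator.
df1 = pd.read_csv(StringIO(re.sub(r'[|+]|-{2,}', '  ', text1)), sep=r'\s{2,}', engine='python')
df2 = pd.read_csv(StringIO(re.sub(r'[|+]|-{2,}', '  ', text2)), sep=r'\s{2,}', engine='python')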
[out]
df1
Age Category A B C D
0 21-26 2 2 4 1
1 26-31 7 11 12 5
2 31-36 3 5 5 2
3 36-41 2 4 1 7
4 41-46 0 1 3 2
5 46-51 0 0 2 3
df2
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
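To see why the substitution makes the table parseable, here is a minimal sketch applied to single lines. It assumes a two-space replacement string, so that cells with no padding around the delimiter still end up at least two spaces apart; the sample lines and variable names are made up for illustration.

```python
import re

PATTERN = r'[|+]|-{2,}'  # pipes, plus signs, and runs of two or more hyphens

# A header with a space inside one column name and no padding around the pipes.
header = '| Age Category|A|B|'
cleaned = re.sub(PATTERN, '  ', header)
# Splitting on two or more whitespace characters keeps "Age Category" intact.
fields = re.split(r'\s{2,}', cleaned.strip())

# A border row collapses to whitespace only, so the parser can skip it as blank.
border = re.sub(PATTERN, '  ', '+---+---------+').strip()

# A single hyphen inside a value is not matched by -{2,} and survives.
value = re.sub(PATTERN, '  ', '| 21-26 |').strip()
```

The approach would break for cells that themselves contain two consecutive spaces or two consecutive hyphens, since both would be treated as structure rather than data.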
Answer 4:
For this type of table, you can simply use:
df = pd.read_clipboard(sep='|')
Minimal cleanup is then needed:
df = df.drop(0)
df = df.drop(['Unnamed: 0','Unnamed: 6'], axis=1)
As for the "writing such a spreadsheet" question... I don't see how anything could be more convenient than the plain presentation, but here's bad code for it, given the above cleaned df:
df1 = pd.concat([df, pd.DataFrame({i: ['-' * len(i)] for i in df.columns})]).sort_index()  # add the separator row under the column titles (pd.concat replaces df.append, which was removed in pandas 2.0)
df2 = pd.DataFrame({str(i)+'|':['|']*len(df1) for i in range(len(df1.columns))})
df3 = df1.join(df2)
col_order = [j for i in [[df1.columns[x], df2.columns[x]] for x in range(len(df1.columns))] for j in i]
df3.index = ['|']*len(df3.index)
Then:
df3[col_order]
Age Category 0| A 1| B 2| C 3| D 4|
| -------------- | --- | ---- | ---- | --- |
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
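An alternative way to write such a table is to build the text directly with string formatting instead of joining helper DataFrames. This is only a sketch: to_pipe_table is a hypothetical helper, the sample frame is a shortened version of Example 1, and the index is deliberately not written out.

```python
import pandas as pd


def to_pipe_table(df):
    """Render df as a pipe-delimited text table (hypothetical helper)."""
    header = [str(c) for c in df.columns]
    rows = [[str(v) for v in row] for row in df.itertuples(index=False)]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(cell) for cell in col) for col in zip(header, *rows)]

    def fmt(cells):
        # Left-justify every cell to its column width and wrap it in pipes.
        return '| ' + ' | '.join(c.ljust(w) for c, w in zip(cells, widths)) + ' |'

    # The dashed row spans the cell width plus the one-space padding on each side.
    separator = '|' + '|'.join('-' * (w + 2) for w in widths) + '|'
    return '\n'.join([fmt(header), separator] + [fmt(r) for r in rows])


df = pd.DataFrame({'Age Category': ['21-26', '26-31'], 'A': [2, 7], 'B': [2, 11]})
table = to_pipe_table(df)
print(table)
```

Building the string by hand keeps the column widths under direct control and avoids the extra numbered separator columns that appear in the joined-DataFrame output.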
Source: https://stackoverflow.com/questions/59211661/parse-prettyprinted-tabular-data-with-pandas