Question
I'm using the following to ensure a dataframe column has the correct data type before I proceed with operations:
>>> cfun = lambda x: float(x)
>>> df = pd.read_excel(xl, converters={'column1': cfun})
I'm using converters instead of dtype so that the traceback tells me explicitly which value caused the issue:
ValueError: could not convert string to float: '100%'
What I would like to do is take that information (that the string "100%" was the problem) and tell the user where it occurred in the dataframe/file. How can I get that information from the exception in order to get a row index and, say, print the entire row?
Note: Adding the percent sign isn't the only mistake my users make, otherwise I'd just replace any '%' with ''.
Answer 1:
I think you can check by first reading in the file and then finding which rows won't convert. This locates them all at once, instead of one at a time via the ValueError. Just remember that pandas numbers rows from 0 and doesn't count the header, so the row indices of the df will be offset from those in the spreadsheet (by 1 or 2).
import pandas as pd
df = pd.read_excel(xl)
# Example df
column1 column2
0 100 A
1 100% B
2 112,312 C
3 171 D
4 123.123 E
5 NaN F
df['column1_num'] = pd.to_numeric(df.column1, errors='coerce')
bad_mask = (df.column1_num.isnull()) & ~(df.column1.astype('str').str.lower().isin(['nan']))
bad_rows = df[bad_mask].index.values
#array([1, 2], dtype=int64)
df[bad_mask]
# column1 column2 column1_num
#1 100% B NaN
#2 112,312 C NaN
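Putting the pieces together, here is a self-contained sketch of how you might report the offending rows back to the user. It builds the example data inline rather than reading an Excel file, and it assumes missing cells arrive as real NaN values (so a simple notnull() check suffices; the isin(['nan']) variant above additionally handles the literal string 'NaN'). The "+ 2" offset assumes a single header row and Excel's 1-based numbering.

```python
import pandas as pd

# Hypothetical data standing in for the Excel file; column names follow the answer.
df = pd.DataFrame({
    'column1': ['100', '100%', '112,312', '171', '123.123', float('nan')],
    'column2': list('ABCDEF'),
})

# Coerce: unparseable strings become NaN instead of raising.
df['column1_num'] = pd.to_numeric(df['column1'], errors='coerce')

# Flag rows that failed to parse, ignoring genuinely missing values.
bad_mask = df['column1_num'].isnull() & df['column1'].notnull()

# Report each bad value with its spreadsheet row (index + 2: one header row
# plus Excel's 1-based numbering).
for idx, value in df.loc[bad_mask, 'column1'].items():
    print(f"Excel row {idx + 2}: could not parse {value!r}")
```

This prints one line per bad cell (here rows 3 and 4, holding '100%' and '112,312'), which answers the original question of telling the user exactly where in the file the problem occurred.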
I updated the mask because float is able to handle the 'NaN' string, so it won't actually show up as an issue in your read, though pd.to_numeric still coerces it to NaN.
float('NaN')
#nan
pd.to_numeric('NaN')
#ValueError: Unable to parse string "NaN" at position 0
Source: https://stackoverflow.com/questions/49902930/access-specifics-of-valueerror-in-pandas-read-excel-converters