I'm trying to analyse the weather records for New York, using the daily data taken from here: http://cdiac.ornl.gov/epubs/ndp/ushcn/daily_doc.html
I'm loading the data with:
tf = pandas.read_fwf(io.open('state30_NY.txt'), widths=widths, names=names, na_values=['-9999'])
Where:
>>> widths
[6, 4, 2, 4, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1, 5, 1, 1, 1]
>>> names
['COOP', 'YEAR', 'MONTH', 'ELEMENT', 'VALUE1', 'MFLAG1', 'QFLAG1', 'SFLAG1', 'VALUE2', 'MFLAG2', 'QFLAG2', 'SFLAG2', 'VALUE3', 'MFLAG3', 'QFLAG3', 'SFLAG3', 'VALUE4', 'MFLAG4', 'QFLAG4', 'SFLAG4', 'VALUE5', 'MFLAG5', 'QFLAG5', 'SFLAG5', 'VALUE6', 'MFLAG6', 'QFLAG6', 'SFLAG6', 'VALUE7', 'MFLAG7', 'QFLAG7', 'SFLAG7', 'VALUE8', 'MFLAG8', 'QFLAG8', 'SFLAG8', 'VALUE9', 'MFLAG9', 'QFLAG9', 'SFLAG9', 'VALUE10', 'MFLAG10', 'QFLAG10', 'SFLAG10', 'VALUE11', 'MFLAG11', 'QFLAG11', 'SFLAG11', 'VALUE12', 'MFLAG12', 'QFLAG12', 'SFLAG12', 'VALUE13', 'MFLAG13', 'QFLAG13', 'SFLAG13', 'VALUE14', 'MFLAG14', 'QFLAG14', 'SFLAG14', 'VALUE15', 'MFLAG15', 'QFLAG15', 'SFLAG15', 'VALUE16', 'MFLAG16', 'QFLAG16', 'SFLAG16', 'VALUE17', 'MFLAG17', 'QFLAG17', 'SFLAG17', 'VALUE18', 'MFLAG18', 'QFLAG18', 'SFLAG18', 'VALUE19', 'MFLAG19', 'QFLAG19', 'SFLAG19', 'VALUE20', 'MFLAG20', 'QFLAG20', 'SFLAG20', 'VALUE21', 'MFLAG21', 'QFLAG21', 'SFLAG21', 'VALUE22', 'MFLAG22', 'QFLAG22', 'SFLAG22', 'VALUE23', 'MFLAG23', 'QFLAG23', 'SFLAG23', 'VALUE24', 'MFLAG24', 'QFLAG24', 'SFLAG24', 'VALUE25', 'MFLAG25', 'QFLAG25', 'SFLAG25', 'VALUE26', 'MFLAG26', 'QFLAG26', 'SFLAG26', 'VALUE27', 'MFLAG27', 'QFLAG27', 'SFLAG27', 'VALUE28', 'MFLAG28', 'QFLAG28', 'SFLAG28', 'VALUE29', 'MFLAG29', 'QFLAG29', 'SFLAG29', 'VALUE30', 'MFLAG30', 'QFLAG30', 'SFLAG30', 'VALUE31', 'MFLAG31', 'QFLAG31', 'SFLAG31']
Now, the issue I have is that when reading in the data, there seem to be a lot of inf
values, and those shouldn't be in the source data (the nearest thing in the data are -9999
values, which represent invalid data).
Normally, if I were using lists
or the like, I would print out the whole thing to find alignment errors, and work out which rows are affected, then look at the source file to see what's happening. I'd like to know how to do the equivalent in pandas so I can figure out where these inf
values are coming from.
Here's the code which shows me inf
s:
>>> tf[tf['ELEMENT']=='TMIN'].min()
COOP 300023
YEAR 1876
MONTH 1
ELEMENT TMIN
VALUE1 -38
MFLAG1 inf
QFLAG1 inf
SFLAG1 inf
VALUE2 -34
MFLAG2 inf
QFLAG2 inf
SFLAG2 inf
VALUE3 -38
MFLAG3 inf
QFLAG3 inf
...
MFLAG28 inf
QFLAG28 inf
SFLAG28 inf
VALUE29 -46
MFLAG29 inf
QFLAG29 inf
SFLAG29 inf
VALUE30 -57
MFLAG30 inf
QFLAG30 inf
SFLAG30 inf
VALUE31 -40
MFLAG31 inf
QFLAG31 inf
SFLAG31 inf
Length: 128, dtype: object
Edit: Corrected column widths. Problem still remains.
First, let's mock up some data:
import numpy as np
import pandas
df = pandas.DataFrame(
np.random.normal(size=(5,5)),
index='rA,rB,rC,rD,rE'.split(','),
columns='cA,cB,cC,cD,cE'.split(',')
)
df[df > 1] = np.inf
df
That, for examples, should give something like this:
cA cB cC cD cE
rA -1.202383 -0.625521 inf -0.888086 -0.215671
rB 0.537521 -1.149731 0.841687 0.190505 inf
rC -1.447124 -0.607486 -1.268923 inf 0.438190
rD -0.275085 0.793483 0.276376 -0.095727 -0.050957
rE -0.095414 0.048926 0.591899 0.298865 -0.308620
So now I can use fancy indexing to isolate all the inf
s.
print(df[np.isinf(df)].to_string())
cA cB cC cD cE
rA NaN NaN inf NaN NaN
rB NaN NaN NaN NaN inf
rC NaN NaN NaN inf NaN
rD NaN NaN NaN NaN NaN
rE NaN NaN NaN NaN NaN
But that's not really useful. So on top of finding the inf
s, we should stack the column index into the rows (unpivot, if you will) then drop all the NaN
values. This will give us a nice summary of the rows/columns with inf
s.
df[np.isinf(df)].stack().dropna()
rA cC inf
rB cE inf
rC cD inf
dtype: float64
np.isinf
will fail if you have object
dtypes in you dataframe.
To overcome this:
with pd.option_context('mode.use_inf_as_null', True):
is_bad_data = df.isnull()
来源:https://stackoverflow.com/questions/20935264/finding-bogus-data-in-a-pandas-dataframe-read-with-read-fwf