Question
I'm using numpy's genfromtxt, and I need to identify both missing data and bad data. Depending on user input, I may want to drop bad values or raise an error. Essentially, I want to treat missing and bad data as the same thing.
Say I have a file like this, where the columns are of data types "date, int, float"
date,id,value
2017-12-4,0, # BAD. missing data
2017-12-4,1,XYZ # BAD. value should be float, not string.
2017-12-4,2,1.0 # good
2017-12-4,3,1.0 # good
2017-12-4,4,1.0 # good
I would like to detect both, so I do this:
dtype=(np.dtype('<M8[D]'), np.dtype('int64'), np.dtype('float64'))
result = np.genfromtxt(filename, delimiter=',', dtype=dtype, names=True, usemask=True, usecols=('date', 'id', 'value'))
And the result is this:
masked_array(data=[(datetime.date(2017, 12, 4), 0, --),
(datetime.date(2017, 12, 4), 1, nan),
(datetime.date(2017, 12, 4), 2, 1.0),
(datetime.date(2017, 12, 4), 3, 1.0),
(datetime.date(2017, 12, 4), 4, 1.0)],
mask=[(False, False, True), (False, False, False),
(False, False, False), (False, False, False),
(False, False, False)],
fill_value=('NaT', 999999, 1.e+20),
dtype=[('date', '<M8[D]'), ('id', '<i8'), ('value', '<f8')])
I thought the whole point of masked_array is that it can handle missing data AND bad data. But here, it's only handling missing data.
result['value'].mask
returns
array([ True, False, False, False, False])
The "bad" data actually still got into the array, as nan. I was hoping the mask would give me True True False False False
.
For me to realize we have a bad value in the 2nd row, I need to do additional work, like checking for nan.
another_mask = np.isnan(result['value'])
good_result = result['value'][~another_mask]
Finally, this returns
masked_array(data=[1.0, 1.0, 1.0],
mask=[False, False, False],
fill_value=1e+20)
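A sketch of the same workaround, written so that the NaN check is folded back into the mask itself (np.ma.masked_invalid masks NaN/inf entries on top of whatever is already masked):
value = np.ma.masked_invalid(result['value'])  # keeps the existing mask, also masks the nan entry
value.mask                          # should now be [True, True, False, False, False]
good_result = value.compressed()    # drops every masked entry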
Either way, that works, but I feel like I'm doing something wrong. The whole point of masked_array is to find missing AND bad data, but I'm somehow only using it to find missing data, and I need my own check to find bad data. It feels ugly and not Pythonic.
Is there a way to find both at the same time?
Answer 1:
Playing around with a simple input:
In [143]: txt='''1,2
...: 3,nan
...: 4,foo
...: 5,
...: '''.splitlines()
In [144]: txt
Out[144]: ['1,2', '3,nan', '4,foo', '5,']
By specifying a specific string as 'missing' (it may be a list?), I can 'mask' it, along with blanks:
In [146]: np.genfromtxt(txt,delimiter=',', missing_values='foo',
usemask=True, usecols=1)
Out[146]:
masked_array(data=[2.0, nan, --, --],
mask=[False, False, True, True],
fill_value=1e+20)
It looks like it converted all values with float, but generated the mask based on the strings (or lack thereof):
In [147]: _.data
Out[147]: array([ 2., nan, nan, nan])
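To follow up on the "(it may be a list?)" aside: from a quick read of the genfromtxt source, a plain list seems to be matched one entry per column, while a comma-separated string (applied to all columns) or a dict keyed by column index/name is the way to flag several bad tokens in one column. A sketch along those lines, untested here:
# treat both 'foo' and the literal string 'nan' as missing in every column
np.genfromtxt(txt, delimiter=',', usecols=1, usemask=True,
              missing_values='foo,nan')
# or target the bad tokens at a specific column with a dict
np.genfromtxt(txt, delimiter=',', usecols=1, usemask=True,
              missing_values={1: ['foo', 'nan']})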
I can replace both types of 'missing' with a specific value. Since it's doing a float conversion, the fill has to be 100 or '100':
In [151]: np.genfromtxt(txt,delimiter=',', missing_values='foo',
usecols=1, filling_values=100)
Out[151]: array([ 2., nan, 100., 100.])
In a more complex case I can imagine writing a converter for the column. I've only dabbled in that feature.
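For example, a per-column converter could turn anything unparsable into nan, and the NaN check from the question then folds that into the mask. This is only a rough sketch of the idea (reusing the txt and np from above), not something I've run:
def safe_float(field):
    # genfromtxt hands the converter the raw field (bytes or str depending on encoding);
    # anything that doesn't parse as a float becomes nan
    try:
        return float(field)
    except ValueError:
        return np.nan

col = np.genfromtxt(txt, delimiter=',', usecols=1, usemask=True,
                    converters={1: safe_float})
col = np.ma.masked_invalid(col)   # mask the nan entries on top of the blanks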
The documentation for these parameters is slim, so figuring out what combinations work, and in what order, is a matter of trial and error (or a lot of code digging).
More details in the follow-up question: numpy genfromtxt - how to detect bad int input values
Source: https://stackoverflow.com/questions/65298097/numpy-genfromtxt-missing-data-vs-bad-data