Question
Here is a trivial example of passing a bad int value to numpy.genfromtxt. For some reason, I can't detect this bad value: it shows up as a valid int, -1.
>>> bad = '''a,b
0,BAD
1,2
3,4'''.splitlines()
My input here has two columns of ints, named a and b. Column b has a bad value: the string "BAD" where an integer is expected. However, when I call genfromtxt, I cannot detect this bad value.
>>> out = np.genfromtxt(bad, delimiter=',', dtype=(np.dtype('int64'), np.dtype('int64')), names=True, usemask=True, usecols=tuple('ab'))
>>> out
masked_array(data=[(0, -1), (1, 2), (3, 4)],
             mask=[(False, False), (False, False), (False, False)],
       fill_value=(999999, 999999),
            dtype=[('a', '<i8'), ('b', '<i8')])
>>> out['b'].data
array([-1, 2, 4])
I print column 'b' from my output, and I'm shocked to see a -1 where the string "BAD" is supposed to be. The user has no idea that there was bad input. In fact, if you only look at the output, this is totally indistinguishable from the following input:
>>> bad2 = '''a,b
0,-1
1,2
3,4'''.splitlines()
I feel like I must be using genfromtxt wrong. How is it possible that it can't detect bad input?
Answer 1:
In np.lib._iotools I found this method of StringConverter:
def _loose_call(self, value):
    try:
        return self.func(value)
    except ValueError:
        return self.default
When genfromtxt converts the rows it has read, it does:
if loose:
    rows = list(
        zip(*[[conv._loose_call(_r) for _r in map(itemgetter(i), rows)]
              for (i, conv) in enumerate(converters)]))
where loose is a genfromtxt input parameter (True by default). So in the case of the int converter it tries int(astring), and if that raises a ValueError it returns the default value (e.g. -1) instead of raising an error. The same happens for float, with np.nan as the default.
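Since the substitution happens only on the loose path, one way to surface the bad token is to switch that path off. A minimal sketch, assuming a NumPy version where genfromtxt with loose=False raises a ValueError for a token the column converter cannot parse. load_strict and is_rejected are hypothetical helper names, and dtype=int is a simplification of the call in the question:

```python
import numpy as np

bad = ["a,b", "0,BAD", "1,2", "3,4"]
good = ["a,b", "0,-1", "1,2", "3,4"]


def load_strict(lines):
    # Hypothetical helper: loose=False asks genfromtxt to raise on tokens
    # that cannot be converted, instead of substituting the default.
    return np.genfromtxt(lines, delimiter=",", dtype=int, names=True,
                         loose=False)


def is_rejected(lines):
    # True if strict parsing raises ValueError for these lines.
    try:
        load_strict(lines)
    except ValueError as exc:
        print("rejected:", exc)
        return True
    return False


print(is_rejected(bad))
print(load_strict(good)["b"])
```

With this setting the input containing "BAD" should fail loudly, while the input with a genuine -1 still loads. Empty fields are a separate case: they are handled as missing values rather than as conversion errors.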
The usemask parameter is applied in:
if usemask:
    append_to_masks(tuple([v.strip() in m
                           for (v, m) in zip(values,
                                             missing_values)]))
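Because the mask is a plain membership test against missing_values, a token that is known in advance can be declared missing so that it gets masked. A small sketch, assuming the literal "BAD" is the only bad token expected (dtype=int is a simplification of the call in the question):

```python
import numpy as np

bad = ["a,b", "0,BAD", "1,2", "3,4"]

# "BAD" is added to the missing values of every column, so the
# membership test shown above marks that cell as masked.
out = np.genfromtxt(bad, delimiter=",", dtype=int, names=True,
                    usemask=True, missing_values="BAD")

print(out["b"].mask)
```

This only helps for tokens anticipated ahead of time. An unexpected string is still replaced by the default and left unmasked.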
Define 2 converters to give more information on what's processed:
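def myint(astr):
    # Report any token int() rejects, then fall back to a sentinel value.
    try:
        v = int(astr)
    except ValueError:
        print('err', astr)
        v = '-999'
    return v

def myfloat(astr):
    # Same idea for floats, with '-inf' as the sentinel.
    try:
        v = float(astr)
    except ValueError:
        print('err', astr)
        v = '-inf'
    return v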
A sample text:
txt='''1,2
3,nan
,foo
bar,
'''.splitlines()
And using the converters:
In [242]: np.genfromtxt(txt, delimiter=',', converters={0:myint, 1:myfloat})
err b''
err b'bar'
err b'foo'
err b''
Out[242]:
array([( 1, 2.), ( 3, nan), (-999, -inf), (-999, -inf)],
      dtype=[('f0', '<i8'), ('f1', '<f8')])
And to see what usemask does:
In [243]: np.genfromtxt(txt, delimiter=',', converters={0:myint, 1:myfloat}, usemask=True)
err b''
err b'bar'
err b'foo'
err b''
Out[243]:
masked_array(data=[(1, 2.0), (3, nan), (--, -inf), (-999, --)],
             mask=[(False, False), (False, False), ( True, False),
                   (False, True)],
       fill_value=(999999, 1.e+20),
            dtype=[('f0', '<i8'), ('f1', '<f8')])
A missing value is an empty string '', and int('') raises a ValueError just as int('bad') does. So for a converter, whether the default one or my custom ones, a missing value looks the same as a bad one. Your converter could make a distinction. But only 'missing' values set the mask.
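A minimal sketch of that distinction in plain Python, independent of genfromtxt. classify_int_token is a hypothetical helper, not part of NumPy; it separates an empty field from a token that int() rejects, and accepts bytes as well as str because the converters above were handed bytes:

```python
def classify_int_token(token):
    """Return ('ok', value), ('missing', None) or ('bad', token)."""
    if isinstance(token, bytes):
        # Some NumPy versions pass bytes to converters.
        token = token.decode()
    token = token.strip()
    if token == "":
        return ("missing", None)
    try:
        return ("ok", int(token))
    except ValueError:
        return ("bad", token)


txt = ["1,2", "3,nan", ",foo", "bar,"]

# Classify the first column of the sample text.
first_column = [classify_int_token(line.split(",")[0]) for line in txt]
print(first_column)
```

A pre-check like this can run over the raw lines before they are passed to genfromtxt, so bad tokens are reported with their position instead of being silently replaced.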
Source: https://stackoverflow.com/questions/65317590/numpy-genfromtxt-how-to-detect-bad-int-input-values