Question
Given a pandas.DataFrame with a column holding mixed datatypes, e.g.
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})
How can I obtain the datatypes of the individual objects in the column (Series)? Suppose I want to modify all entries in the Series that are of a certain type, for example multiply all integers by some factor.
I could iteratively derive a mask and use it in loc, like
m = np.array([isinstance(v, int) for v in df['mixed']])
df.loc[m, 'mixed'] *= 10
# df
# mixed
# 0 2020-10-04 00:00:00
# 1 9990
# 2 a string
That does the trick, but I was wondering if there is a more pandastic way of doing this?
Answer 1:
You still need to call type on each element:
m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
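To address the first half of the question directly, the per-element types can be listed with a single map call. This is a minimal sketch rather than part of the answer above; the variable names (type_names, exact_int, any_int) and the extra True row are illustrative assumptions. It also shows that isinstance(v, int) accepts bool values, because bool is a subclass of int, whereas an exact type comparison does not.

```python
import pandas as pd

# The fourth row (True) is an added assumption to illustrate the bool case
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string', True]})

# Name of the Python type of every element in the object column
type_names = df['mixed'].map(lambda v: type(v).__name__)
print(type_names.tolist())   # ['Timestamp', 'int', 'str', 'bool']

# Exact type match: only plain Python ints
exact_int = df['mixed'].map(lambda v: type(v) is int)

# isinstance also matches subclasses of int, which includes bool
any_int = df['mixed'].map(lambda v: isinstance(v, int))
```

Which of the two masks is appropriate depends on whether booleans in the column should count as integers.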
Answer 2:
One idea is to test whether each value is numeric by using to_numeric with errors='coerce' and selecting the non-missing values:
m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print (df)
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
Unfortunately it is slow. Here are timings for some other ideas:
N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})
In [29]: %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.26 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [30]: %timeit np.array([isinstance(v, int) for v in df['mixed']])
1.12 s ± 77.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [31]: %timeit pd.to_numeric(df['mixed'], errors='coerce').notna()
3.07 s ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [34]: %timeit ([isinstance(v, int) for v in df['mixed']])
909 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [35]: %timeit df.mixed.map(lambda x : type(x))=='int'
877 ms ± 8.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [36]: %timeit df.mixed.map(lambda x : type(x) =='int')
842 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [37]: %timeit df.mixed.map(lambda x : isinstance(x, int))
807 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
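By default, pandas cannot vectorize effectively here because the column holds mixed values, so element-wise approaches are necessary. Note that In [35] and In [36] compare a type object with the string 'int', which is always False, so they are meaningful only as timings and not as masks.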
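The need for element-wise work follows from the column's dtype: a mixed column is stored as object, that is, an array of references to arbitrary Python objects. Here is a minimal sketch (the custom index and the name mask are illustrative assumptions) that checks the dtype and wraps the list-comprehension mask in a Series sharing the column's index:

```python
import pandas as pd

df = pd.DataFrame(
    {'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']},
    index=['a', 'b', 'c'],   # hypothetical non-default index
)

# Mixed values force the generic object dtype
print(df['mixed'].dtype)   # object

# Element-wise check, wrapped in a Series that shares the column's index
mask = pd.Series([isinstance(v, int) for v in df['mixed']], index=df.index)

# Only the row holding a Python int is multiplied
df.loc[mask, 'mixed'] *= 10
```

Keeping the mask as an index-aligned Series is a design choice; a plain list or NumPy array, as used in the timings, is applied by position instead.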
Answer 3:
If you want to multiply all 'numbers', you can use the following.
Let's use pd.to_numeric with the parameter errors='coerce', together with fillna:
df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])
df
Output:
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
Now let's add a float to the column:
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string', 100.3]})
Using @BenYo's solution:
m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df
Output (note only the integer 999 is multiplied by 10):
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
3 100.3
Using @jezrael's solution, or similarly the solution from this answer:
m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print(df)
# Or this solution
# df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])
Output (note all numbers are multiplied by 10):
mixed
0 2020-10-04 00:00:00
1 9990
2 a string
3 1003
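One difference between the two kinds of mask that the comparison above does not cover is how they treat strings that look like numbers. The sketch below is an illustration under assumed data (the '5' entry and the names numeric_mask and type_mask are not from the answers): pd.to_numeric parses numeric strings, so a to_numeric-based mask selects them, while a type-based check does not.

```python
import pandas as pd

# '5' is a string that happens to look like a number (assumed example data)
s = pd.Series([999, '5', 'a string', 100.3], dtype=object)

# to_numeric parses the string '5' as a number, so it is selected too
numeric_mask = pd.to_numeric(s, errors='coerce').notna()

# A type-based check selects only real int/float objects (bool excluded)
type_mask = s.map(lambda v: isinstance(v, (int, float)) and not isinstance(v, bool))

print(numeric_mask.tolist())   # [True, True, False, True]
print(type_mask.tolist())      # [True, False, False, True]
```

With the to_numeric mask, multiplying the selected object values in place would act on the string '5' itself, and in Python '5' * 10 repeats the string rather than multiplying a number, so the fillna variant or a type-based mask may be the safer choice for such data.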
Answer 4:
If you do many calculations and have a little more memory to spare, I suggest adding a column that records the type of each value in mixed, for better efficiency. Once this column is constructed, the calculations are much faster.
Here's the code:
N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})
df["mixed_type"] = df.mixed.map(lambda x: type(x).__name__).astype('category')
m = df.mixed_type == 'int'
df.loc[m, "mixed"] *= 10
del df["mixed_type"] # after you finish all your calculation
The repr of the mixed_type column is:
0 Timestamp
1 int
2 str
3 Timestamp
4 int
...
2999995 int
2999996 str
2999997 Timestamp
2999998 int
2999999 str
Name: mixed, Length: 3000000, dtype: category
Categories (3, object): [Timestamp, int, str]
And here are the timings:
>>> %timeit df.mixed_type == 'int'
472 µs ± 57.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.12 s ± 87.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
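The benefit of the cached type column shows when it is reused for more than one operation. A small-scale sketch of that idea follows; the second operation (upper-casing the strings) and the names is_int and is_str are illustrative assumptions, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * 2})

# Compute the type labels once (a single element-wise pass) ...
df['mixed_type'] = df['mixed'].map(lambda v: type(v).__name__).astype('category')

# ... then reuse them for several cheap masks
is_int = df['mixed_type'] == 'int'
is_str = df['mixed_type'] == 'str'

df.loc[is_int, 'mixed'] *= 10
df.loc[is_str, 'mixed'] = df.loc[is_str, 'mixed'].str.upper()

# Drop the helper column once all calculations are done
df = df.drop(columns='mixed_type')
```

The labels describe the column as it was when they were computed, so they would need to be rebuilt after any operation that changes the type of an element.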
Answer 5:
For data frames that are not very long, I can suggest this approach as well:
df = df.assign(mixed = lambda x: x.apply(lambda s: s['mixed']*10 if isinstance(s['mixed'], int) else s['mixed'],axis=1))
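Since the condition only looks at one column, the same idea can be sketched with Series.map on that column, which avoids building a row object for every row as apply with axis=1 does. This is an alternative formulation offered as an assumption about what is wanted, not the answer author's code.

```python
import pandas as pd

df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})

# Apply the conditional directly to the column instead of row by row
df['mixed'] = df['mixed'].map(lambda v: v * 10 if isinstance(v, int) else v)
```

Like the other element-wise approaches in this thread, it still calls a Python function once per element.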
Source: https://stackoverflow.com/questions/64195782/python-pandas-how-to-obtain-the-datatypes-of-objects-in-a-mixed-datatype-column