Question
I have a fairly large 1D numpy array Xold with given values. These values shall be replaced according to the rule specified by a 2D numpy array Y. An example would be
Xold=np.array([0,1,2,3,4])
Y=np.array([[0,0],[1,100],[3,300],[4,400],[2,200]])
Whenever a value in Xold is identical to a value in Y[:,0], the new value in Xnew should be the corresponding value in Y[:,1]. This is accomplished by two nested for loops:
Xnew = np.zeros(len(Xold))
for i in range(len(Xold)):
    for j in range(len(Y)):
        if Xold[i] == Y[j, 0]:
            Xnew[i] = Y[j, 1]
With the given example, this yields Xnew=[0,100,200,300,400].
However, for large data sets this procedure is quite slow. What is a faster and more elegant way to accomplish this task?
Answer 1:
SELECTING THE FASTEST METHOD
Answers to this question provided a nice assortment of ways to replace elements in a numpy array. Let's check which one is the quickest.
TL;DR: Numpy indexing is the winner
def meth1():  # suggested by @Slam
    for old, new in Y:
        Xold[Xold == old] = new

def meth2():  # suggested by myself; convert y_dict = dict(Y) first
    [y_dict[i] if i in y_dict.keys() else i for i in Xold]

def meth3():  # suggested by @Eelco Hoogendoorn; import numpy_indexed as npi first
    npi.remap(Xold, keys=Y[:, 0], values=Y[:, 1])

def meth4():  # suggested by @Brad Solomon; import pandas as pd first
    pd.Series(Xold).map(pd.Series(Y[:, 1], index=Y[:, 0])).values

# suggested by @jdehesa; create Xnew = Xold.copy() and the index
# idx = np.searchsorted(Xold, Y[:, 0]) first (this assumes Xold is sorted)
def meth5():
    Xnew[idx] = Y[:, 1]
The results are not so surprising:
In [39]: timeit.timeit(meth1, number=1000000)
Out[39]: 12.08
In [40]: timeit.timeit(meth2, number=1000000)
Out[40]: 2.87
In [38]: timeit.timeit(meth3, number=1000000)
Out[38]: 55.39
In [12]: timeit.timeit(meth4, number=1000000)
Out[12]: 256.84
In [50]: timeit.timeit(meth5, number=1000000)
Out[50]: 1.12
So, the good old list comprehension is the second fastest, and the winning approach is numpy indexing combined with searchsorted().
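For reproducibility, the comparison can be re-run with a small self-contained harness along these lines. The array sizes, the seeded random data, and the restriction to two of the methods are my own choices, not the original benchmark setup:

```python
import timeit
import numpy as np

# Hypothetical test data: 1000 keys mapped to 100x their value, rows shuffled.
rng = np.random.default_rng(0)
Xold = rng.integers(0, 1000, size=1000)
Y = np.stack([np.arange(1000), np.arange(1000) * 100], axis=1)
rng.shuffle(Y)

y_dict = dict(Y)

def meth2():  # list comprehension over a dict
    return [y_dict[i] if i in y_dict else i for i in Xold]

def meth5():  # argsort + searchsorted (also handles unsorted Xold)
    sidx = Y[:, 0].argsort()
    return Y[sidx[np.searchsorted(Y[:, 0], Xold, sorter=sidx)], 1]

assert np.array_equal(meth2(), meth5())  # both produce the same mapping
for f in (meth2, meth5):
    print(f.__name__, timeit.timeit(f, number=100))
```

Absolute timings will of course differ from the numbers above depending on machine and array size.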
Answer 2:
We can use np.searchsorted for the generic case when the data in the first column of Y is not necessarily sorted -
sidx = Y[:,0].argsort()
out = Y[sidx[np.searchsorted(Y[:,0], Xold, sorter=sidx)],1]
Sample run -
In [53]: Xold
Out[53]: array([14, 10, 12, 13, 11])
In [54]: Y
Out[54]:
array([[ 10, 0],
[ 11, 100],
[ 13, 300],
[ 14, 400],
[ 12, 200]])
In [55]: sidx = Y[:,0].argsort()
...: out = Y[sidx[np.searchsorted(Y[:,0], Xold, sorter=sidx)],1]
In [56]: out
Out[56]: array([400, 0, 200, 300, 100])
If not all elements have a corresponding mapping available, then we need to do a bit more work, like so -
sidx = Y[:,0].argsort()
sorted_indx = np.searchsorted(Y[:,0], Xold, sorter=sidx)
sorted_indx[sorted_indx==len(sidx)] = len(sidx)-1
idx_out = sidx[sorted_indx]
out = Y[idx_out,1]
out[Y[idx_out,0]!=Xold] = 0 # NA values as 0s
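As a quick check of that fallback path, here is the same code run on an input containing a value with no mapping (the 99 is my own example value):

```python
import numpy as np

Xold = np.array([14, 10, 99, 13, 11])  # 99 has no entry in Y
Y = np.array([[10, 0], [11, 100], [13, 300], [14, 400], [12, 200]])

sidx = Y[:, 0].argsort()
sorted_indx = np.searchsorted(Y[:, 0], Xold, sorter=sidx)
# searchsorted returns len(sidx) for values beyond the last key; clamp it
sorted_indx[sorted_indx == len(sidx)] = len(sidx) - 1
idx_out = sidx[sorted_indx]
out = Y[idx_out, 1]
out[Y[idx_out, 0] != Xold] = 0  # NA values as 0s

print(out)  # [400   0   0 300 100] -- the unmapped 99 comes out as 0
```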
Answer 3:
Here is one possibility:
import numpy as np
Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])
# Check every X value against every Y first value
m = Xold == Y[:, 0, np.newaxis]
# Check which elements in X are among Y first values
# (so values that are not in Y are not replaced)
m_X = np.any(m, axis=0)
# Compute replacement:
# Xold * ~m_X are the non-replaced values
# np.sum(Y[:, 1, np.newaxis] * m, axis=0) * m_X are the replaced values
# (note: `1 - m_X` on a boolean mask raises a TypeError in modern numpy,
# so logical negation ~m_X is used instead)
Xnew = Xold * ~m_X + np.sum(Y[:, 1, np.newaxis] * m, axis=0) * m_X
print(Xnew)
Output:
[ 0 100 200 300 400]
This method works for more or less every case (unsorted arrays, multiple repetitions of values in X, values in X not replaced, values in Y not replacing anything in X), except if you give two replacements for the same value in Y, which would be wrong anyway. However, its time and space complexity is proportional to the product of the sizes of X and Y. If your problem has additional constraints (data is sorted, no repetitions, etc.), it might be possible to do something better. For example, if X is sorted with no repeated elements and every value in Y replaces a value in X (like in your example), this would probably be faster:
import numpy as np
Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])
idx = np.searchsorted(Xold, Y[:, 0])
Xnew = Xold.copy()
Xnew[idx] = Y[:, 1]
print(Xnew)
# [ 0 100 200 300 400]
Answer 4:
The first improvement you can make is to use numpy boolean indexing, although you'll still have one loop:
for old, new in Y:
    Xold[Xold == old] = new
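A sketch of this approach on the question's example; the copy is my addition so the input array is left untouched. Because replacement happens in place, a new value that happens to equal a later key would itself be replaced again, so this is safest when the old and new value ranges do not overlap:

```python
import numpy as np

Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])

Xnew = Xold.copy()  # copy so the original array is not mutated
for old, new in Y:
    # the boolean mask replaces all occurrences of `old` in one vectorized step
    Xnew[Xnew == old] = new

print(Xnew)  # [  0 100 200 300 400]
```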
Answer 5:
You can use indexing in combination with the argsort method (note: the sort must be on the key column Y[:,0], and this only works when the values in Xold are integers usable as positions 0..len(Y)-1):
Xnew = Y[Y[:,0].argsort()][:, 1][Xold]
Output
array([ 0, 100, 200, 300, 400])
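Under those assumptions the one-liner behaves as a lookup table, as this small demonstration (my own example data, with an unsorted Xold) shows:

```python
import numpy as np

Xold = np.array([4, 2, 0, 3, 1])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])

# Sort rows by the key column and keep the value column -> a lookup table
lut = Y[Y[:, 0].argsort()][:, 1]
Xnew = lut[Xold]  # each Xold value indexes directly into the table

print(Xnew)  # [400 200   0 300 100]
```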
Answer 6:
Solution with pd.Series.map()
If you're open to using the Pandas library, you can also do this in a vectorized way with .map():
>>> import pandas as pd
>>> pd.Series(Xold).map(pd.Series(Y[:, 1], index=Y[:, 0]))
0 0
1 100
2 200
3 300
4 400
dtype: int64
>>> pd.Series(Xold).map(pd.Series(Y[:, 1], index=Y[:, 0])).values
array([ 0, 100, 200, 300, 400])
For the signature a.map(b), a looks for its corresponding entries in the index of b, and maps to the respective values in b. Here b is pd.Series(Y[:, 1], index=Y[:, 0]), which uses the 0th column as the index and the 1st column as the values that get mapped to.
Using pandas.core.algorithms directly
Under the hood, this will use .get_indexer() and the Cython-implemented take_1d():
indexer = mapper.index.get_indexer(values)
new_values = algorithms.take_1d(mapper._values, indexer)
Knowing that, if the arrays are really massive you could cut down some overhead like this:
from pandas.core import algorithms
indexer = pd.Index(Y[:, 0]).get_indexer(Xold)
mapped = algorithms.take_1d(Y[:, 1], indexer)
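A caveat: take_1d is a pandas internal and has been moved/removed in recent pandas versions, so the snippet above may not run as-is. A public-API variant of the same idea (my own rewording of it) combines get_indexer with plain numpy indexing:

```python
import numpy as np
import pandas as pd

Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])

# get_indexer gives each Xold value's position among the keys (-1 if absent)
indexer = pd.Index(Y[:, 0]).get_indexer(Xold)
# take the mapped values; keep the original value where there was no match
mapped = np.where(indexer >= 0, Y[:, 1][indexer], Xold)

print(mapped)  # [  0 100 200 300 400]
```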
Answer 7:
The numpy_indexed package (disclaimer: I am its author) contains an efficient vectorized function that solves the general problem:
import numpy_indexed as npi
Xnew = npi.remap(Xold, keys=Y[:, 0], values=Y[:, 1])
That is, this will work for any dtype, even when the keys and values to be replaced are themselves ndarrays, and you get a kwarg to specify how to react to missing elements.
Not sure how it compares to pandas performance-wise; but one of the design choices in this library is that performing elementary operations like this (or doing group-by etc) should not involve the creation of an entire new datatype like a Series or Table, which always bothered me about using pandas for this type of thing.
Answer 8:
You could convert Y to a dictionary with y = dict(Y) and then run the following list comprehension:
[y[i] if i in y.keys() else i for i in Xold]
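A small variant of the same idea: dict.get with a default does a single lookup instead of the membership test plus indexing, and handles missing keys in the same pass (wrapping the result back into an array is my addition):

```python
import numpy as np

Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])

y = dict(Y)  # rows of Y become key/value pairs
# y.get(i, i) returns i unchanged when no mapping exists
Xnew = np.array([y.get(i, i) for i in Xold])

print(Xnew)  # [  0 100 200 300 400]
```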
Source: https://stackoverflow.com/questions/53155749/replace-elements-in-numpy-array-avoiding-loops