Using numpy.unique on multiple columns of a pandas.DataFrame

Submitted by 时光毁灭记忆、已成空白 on 2019-12-23 17:27:46

Question


I am looking to use numpy.unique to obtain the reverse unique indexes of two columns of a pandas.DataFrame.

I know how to use it on one column:

u, rev = numpy.unique(df[col], return_inverse=True)

But I want to use it on multiple columns. For example, if the df looks like:

    0   1   
0   1   1
1   1   2
2   2   1
3   2   1
4   3   1

then I would like to get the reverse indexes:

[0,1,2,2,3]

Answer 1:


Approach #1

Here's one NumPy approach that converts each row to a scalar by treating the row as one indexing tuple on an n-dimensional grid (two-dimensional for 2 columns of data) -

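import numpy as np

def unique_return_inverse_2D(a): # a is a 2D array of non-negative integers
    # Weight each column by the product of the extents of the columns to its right
    a1D = a.dot(np.append((a.max(0)+1)[:0:-1].cumprod()[::-1],1))
    return np.unique(a1D, return_inverse=1)[1]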

If the data contains negative numbers, the per-column minimum is needed as well to compute those scalars. In that case, use a.max(0) - a.min(0) + 1 in place of a.max(0) + 1.
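The same row-to-scalar idea can be spelled out step by step. The following is only a sketch, not the answer's own code: the helper name row_ids_via_grid is hypothetical, and it swaps the dot product for np.ravel_multi_index after shifting every column to start at zero, which is one way to handle negative values. It assumes integer data whose grid size (the product of the per-column extents) fits in a platform integer.

```python
import numpy as np

def row_ids_via_grid(a):
    """Hypothetical helper: give identical rows of a 2-D integer array the same ID."""
    a = np.asarray(a)
    shifted = a - a.min(axis=0)                 # every column now starts at 0
    dims = tuple(int(d) for d in shifted.max(axis=0) + 1)  # grid extent per column
    # Each row is one index tuple on a grid of shape `dims`;
    # ravel_multi_index turns that tuple into a single linear index.
    linear = np.ravel_multi_index(tuple(shifted.T), dims)
    return np.unique(linear, return_inverse=True)[1]

a = np.array([[1, 1], [1, 2], [2, 1], [2, 1], [3, 1]])
ids = row_ids_via_grid(a)
print(ids)  # [0 1 2 2 3]

# Negative values are handled by the shift.
neg_ids = row_ids_via_grid([[-1, 5], [3, -2], [-1, 5]])
print(neg_ids)  # [0 1 0]
```

Because the linear index grows with the product of the column ranges, wide value ranges or many columns can exceed what an integer can hold, so this is best suited to small ranges.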

Approach #2

Here's another solution based on NumPy views, with a focus on performance, inspired by this smart solution by @Eric -

def unique_return_inverse_2D_viewbased(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    return np.unique(a.view(void_dt).ravel(), return_inverse=1)[1]

Sample runs -

In [209]: df
Out[209]: 
    0   1   2   3
0  21   7  31  69
1  62  75  22  62  # ----|
2  16  46   9  31  #     |==> Identical rows, so must have same IDs
3  62  75  22  62  # ----|
4  24  12  88  15

In [210]: unique_return_inverse_2D(df.values)
Out[210]: array([1, 3, 0, 3, 2])

In [211]: unique_return_inverse_2D_viewbased(df.values)
Out[211]: array([1, 3, 0, 3, 2])
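As a cross-check on the sample run above, the same IDs can be obtained with the axis argument of np.unique (available in NumPy 1.13 and later). This is not part of the original answer, just a sketch of an alternative; the variable names are illustrative.

```python
import numpy as np

a = np.array([
    [21,  7, 31, 69],
    [62, 75, 22, 62],
    [16, 46,  9, 31],
    [62, 75, 22, 62],
    [24, 12, 88, 15],
])

# axis=0 treats each row as one item; unique rows come back sorted lexicographically.
unique_rows, inverse = np.unique(a, axis=0, return_inverse=True)
# ravel() guards against NumPy versions that return a 2-D inverse here.
row_ids = np.asarray(inverse).ravel()

print(row_ids)  # [1 3 0 3 2]
```

Indexing unique_rows with row_ids reconstructs the original array, which is a quick way to confirm that the IDs are consistent.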



Answer 2:


I think you can convert columns to strings and then sum:

u, rev = np.unique(df.astype(str).values.sum(axis=1), return_inverse=True)
print (rev)
[0 1 2 2 3]

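As DSM pointed out (thank you), this is dangerous: concatenating the strings without a separator can make distinct rows collide. For example, the rows (1, 11) and (11, 1) both become '111'.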
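The collision is easy to reproduce. The sketch below uses plain Python string joins rather than the DataFrame call, with made-up rows, and shows that adding a separator (an assumption here, not something the answer proposes) keeps the rows distinct.

```python
import numpy as np

rows = [(1, 11), (11, 1), (1, 11)]  # rows 0 and 2 are equal, row 1 differs

# Concatenate the string forms with and without a separator.
no_sep = np.array(["".join(str(v) for v in row) for row in rows])
with_sep = np.array([",".join(str(v) for v in row) for row in rows])

rev_no_sep = np.unique(no_sep, return_inverse=True)[1]
rev_with_sep = np.unique(with_sep, return_inverse=True)[1]

print(no_sep)        # ['111' '111' '111']
print(rev_no_sep)    # [0 0 0]  -> all three rows wrongly share one ID
print(rev_with_sep)  # [0 1 0]  -> the differing row gets its own ID
```

Even with a separator, the IDs follow string order rather than numeric order, so they may be numbered differently from the numeric approaches.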

Another solution is to convert rows to tuples:

u, rev = np.unique(df.apply(tuple, axis=1), return_inverse=True)
print (rev)
[0 1 2 2 3]
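To make clear what the tuple approach computes, here is a plain-Python sketch of the same result on the question's data: sort the distinct tuples, number them, and look each row up. It is illustrative only and does not reflect how np.unique is implemented.

```python
rows = [(1, 1), (1, 2), (2, 1), (2, 1), (3, 1)]

# Distinct rows in sorted (lexicographic) order.
unique_rows = sorted(set(rows))
# Position of each distinct row in that order is its ID.
row_to_id = {row: i for i, row in enumerate(unique_rows)}
rev = [row_to_id[row] for row in rows]

print(rev)  # [0, 1, 2, 2, 3]
```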


Source: https://stackoverflow.com/questions/43167413/using-numpy-unique-on-multiple-columns-of-a-pandas-dataframe
