Find symmetric pairs quickly in numpy

后端 未结 6 1808
抹茶落季
抹茶落季 2020-12-03 17:44
from itertools import product
import pandas as pd

df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = \"c1 c2\".split()
df         


        
相关标签:
6条回答
  • 2020-12-03 18:02

    You can sort the values, then groupby:

    a= np.sort(df.to_numpy(), axis=1)
    df.groupby([a[:,0], a[:,1]], as_index=False, sort=False).first()
    

    Option 2: If you have a lot of pairs c1, c2, groupby can be slow. In that case, we can assign new values and filter by drop_duplicates:

    a= np.sort(df.to_numpy(), axis=1) 
    
    (df.assign(one=a[:,0], two=a[:,1])   # one and two can be changed
       .drop_duplicates(['one','two'])   # taken from above
       .reindex(df.columns, axis=1)
    )
    
    0 讨论(0)
  • 2020-12-03 18:03

    One way is using np.unique with return_index=True and use the result to index the dataframe:

    a = np.sort(df.values)
    _, ix = np.unique(a, return_index=True, axis=0)
    
    print(df.iloc[ix, :])
    
        c1  c2
    0    0   0
    1    0   1
    20   2   0
    3    0   3
    40   4   0
    50   5   0
    6    0   6
    70   7   0
    8    0   8
    9    0   9
    11   1   1
    21   2   1
    13   1   3
    41   4   1
    51   5   1
    16   1   6
    71   7   1
    ...
    
    0 讨论(0)
  • 2020-12-03 18:06

    I will do

    df[~pd.DataFrame(np.sort(df.values,1)).duplicated().values]
    

    From pandas and numpy tri

    s=pd.crosstab(df.c1,df.c2)
    s=s.mask(np.triu(np.ones(s.shape)).astype(np.bool) & s==0).stack().reset_index()
    
    0 讨论(0)
  • 2020-12-03 18:07

    frozenset

    mask = pd.Series(map(frozenset, zip(df.c1, df.c2))).duplicated()
    
    df[~mask]
    
    0 讨论(0)
  • 2020-12-03 18:15

    If this needs to be fast, and if your variables are integer, then the following trick may help: let v,w be the columns of your vector; construct [v+w, np.abs(v-w)] =: [x, y]; then sort this matrix lexicographically, remove duplicates, and finally map it back to [v, w] = [(x+y), (x-y)]/2.

    0 讨论(0)
  • 2020-12-03 18:20

    Here's one NumPy based one for integers -

    def remove_symm_pairs(df):
        a = df.to_numpy(copy=False)
        b = np.sort(a,axis=1)
        idx = np.ravel_multi_index(b.T,(b.max(0)+1))
        sidx = idx.argsort(kind='mergesort')
        p = idx[sidx]
        m = np.r_[True,p[:-1]!=p[1:]]
        a_out = a[np.sort(sidx[m])]
        df_out = pd.DataFrame(a_out)
        return df_out
    

    If you want to keep the index data as it is, use return df.iloc[np.sort(sidx[m])].

    For generic numbers (ints/floats, etc.), we will use a view-based one -

    # https://stackoverflow.com/a/44999009/ @Divakar
    def view1D(a): # a is array
        a = np.ascontiguousarray(a)
        void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
        return a.view(void_dt).ravel()
    

    and simply replace the step to get idx with idx = view1D(b) in remove_symm_pairs.

    0 讨论(0)
提交回复
热议问题