Check if pandas dataframe is subset of other dataframe

盖世英雄少女心 2021-01-11 11:49

I have two pandas DataFrames, A and B, with the same columns (but different data). I want to check whether A is a subset of B, that is, whether all rows of A are contained in

3 answers
  • 2021-01-11 12:09

    You can also try:

    import pandas as pd

    ex = pd.DataFrame({"col1": ["banana", "tomato", "apple"],
                       "col2": ["cat", "dog", "kangoo"],
                       "col3": ["tv", "phone", "ps4"]})
    ex2 = ex.iloc[0:2]  # first two rows of ex; index labels 0 and 1 are preserved
    ex2.isin(ex).all().all()
    

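    It returns True. Note that isin with a DataFrame argument compares values at matching index and column labels, so this works here because ex2 keeps the index labels it had in ex.

    If you swap some values, such as tv and phone, you get False: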

    ex2 = pd.DataFrame({"col1": ["banana", "tomato"],
                        "col2": ["cat", "dog"],
                        "col3": ["phone", "tv"]})
    ex2.isin(ex).all().all()
    >> False
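
One caveat is worth illustrating: because isin with a DataFrame argument matches on index and column labels, the check above depends on the rows of ex2 carrying the same index labels as in ex. The following is a minimal sketch under that assumption; the names `reordered`, `aligned_check`, and `row_check` are made up for this example. It reorders two rows of ex, resets the index, and contrasts the label-aligned check with a plain tuple-based row comparison.

```python
import pandas as pd

ex = pd.DataFrame({"col1": ["banana", "tomato", "apple"],
                   "col2": ["cat", "dog", "kangoo"],
                   "col3": ["tv", "phone", "ps4"]})

# Same two rows as ex.iloc[0:2], but in reverse order and with a fresh 0..1 index.
reordered = ex.iloc[[1, 0]].reset_index(drop=True)

# isin() with a DataFrame argument compares cell by cell at matching
# index and column labels, so the reordered rows no longer line up.
aligned_check = bool(reordered.isin(ex).all().all())

# Comparing whole rows as tuples ignores the index entirely.
ex_rows = set(ex.itertuples(index=False, name=None))
row_check = all(row in ex_rows
                for row in reordered.itertuples(index=False, name=None))

print(aligned_check, row_check)  # False True
```

So a False result from the isin approach may only mean that the rows are ordered or labelled differently, not that they are missing from the larger frame.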
    
  • 2021-01-11 12:19

    In the special case where you do not have any NaN/None values, you can use np.in1d combined with np.stack and np.all:

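    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(np.arange(16).reshape(4, 4))
    df2 = pd.DataFrame(np.arange(40).reshape(10, 4))

    # For each row of df1, test whether its values occur anywhere in df2.
    res = np.stack([np.in1d(df1.values[i], df2.values) for i in range(df1.shape[0])]).all()
    # True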
    

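    This does not account for duplicates: for example, two identical rows in df1 may both match a single row in df2. It is not clear whether that is an issue here. Note also that np.in1d tests each value against all values of df2, so it does not verify that the values of a row appear together in a single row of df2.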
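
If a strict row-by-row match is needed, one possible variant (a sketch, not the approach above) compares whole rows with NumPy broadcasting. The helper name `rows_contained` and the `scrambled` frame are hypothetical and exist only for illustration; like the original, this assumes no NaN/None values, and it builds an intermediate array of size rows(small) x rows(big) x columns, so it only suits modest sizes.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(16).reshape(4, 4))
df2 = pd.DataFrame(np.arange(40).reshape(10, 4))

# Values 0..3 all occur in df2, but never together in this order in one row.
scrambled = pd.DataFrame([[3, 2, 1, 0]])


def rows_contained(small, big):
    """Hypothetical helper: True if every row of `small` equals some row of `big`."""
    a = small.to_numpy()[:, None, :]  # shape (n_small, 1, n_cols)
    b = big.to_numpy()[None, :, :]    # shape (1, n_big, n_cols)
    # Compare every pair of rows, require all columns to be equal,
    # then require at least one matching row in `big` per row of `small`.
    return bool((a == b).all(axis=2).any(axis=1).all())


print(rows_contained(df1, df2))        # True
print(rows_contained(scrambled, df2))  # False

# The element-wise membership test accepts the scrambled row:
print(bool(np.isin(scrambled.to_numpy(), df2.to_numpy()).all()))  # True
```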

  • 2021-01-11 12:27

    DataFrame.merge(another_DF) joins on the intersection of the columns by default (all columns with the same names in both DataFrames) and uses how='inner'. So if A is a subset of B, we expect the inner join to have the same number of rows as A (provided neither DataFrame has duplicate rows):

    len(A.merge(B)) == len(A)
    

    PS: this will NOT work properly if either DataFrame has duplicated rows; see below for such cases.

    Demo:

    In [128]: A
    Out[128]:
       A  B  C
    0  1  2  3
    1  4  5  6
    
    In [129]: B
    Out[129]:
       A  B  C
    0  4  5  6
    1  1  2  3
    2  9  8  7
    
    In [130]: len(A.merge(B)) == len(A)
    Out[130]: True
    

    For data sets containing duplicates, we can remove the duplicates and use the same method:

    In [136]: A
    Out[136]:
       A  B  C
    0  1  2  3
    1  4  5  6
    2  1  2  3
    
    In [137]: B
    Out[137]:
       A  B  C
    0  4  5  6
    1  1  2  3
    2  9  8  7
    3  4  5  6
    
    In [138]: A.merge(B).drop_duplicates()
    Out[138]:
       A  B  C
    0  1  2  3
    2  4  5  6
    
    In [139]: len(A.merge(B).drop_duplicates()) == len(A.drop_duplicates())
    Out[139]: True
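
The duplicate-safe check can be wrapped in a small function. This is a minimal sketch and one possible variation, not the exact method shown above: the name `is_row_subset` is hypothetical, and it swaps the length comparison for a left join with `indicator=True`, which adds a `_merge` column marking whether each row of the left frame found a match. The frames reproduce the duplicate-containing A and B from the demo.

```python
import pandas as pd


def is_row_subset(a, b):
    """Hypothetical helper: True if every distinct row of `a` also appears in `b`."""
    # Drop duplicates on both sides so a repeated row cannot inflate the join.
    left = a.drop_duplicates()
    right = b.drop_duplicates()
    # A left join keeps every row of `left`; the `_merge` column added by
    # indicator=True records whether a match was found in `right`.
    joined = left.merge(right, how="left", indicator=True)
    return bool((joined["_merge"] == "both").all())


A = pd.DataFrame({"A": [1, 4, 1], "B": [2, 5, 2], "C": [3, 6, 3]})
B = pd.DataFrame({"A": [4, 1, 9, 4], "B": [5, 2, 8, 5], "C": [6, 3, 7, 6]})

print(is_row_subset(A, B))  # True
print(is_row_subset(B, A))  # False, the row (9, 8, 7) is not in A
```

As with the one-liner, the join uses all columns the two frames have in common, so both frames are assumed to share the same column names and compatible dtypes.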
    