Pandas - Conditional Probability of a given specific b

孤独总比滥情好 2021-01-03 02:32

I have a DataFrame with two columns, "a" and "b". How can I find the conditional probability of "a" given a specific "b"?

df.groupby('a').groupby(


        
5 Answers
  • 2021-01-03 03:00

    Answer:

    This can be done with the pandas crosstab function. Assuming the DataFrame is called df and has columns 'a' and 'b', the call

    pd.crosstab(df.a, df.b, normalize='columns')

    returns a DataFrame representing P(a | b).

    https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html

    Explanation:

    Consider the DataFrame:

    import pandas as pd

    df = pd.DataFrame({'a':['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                       'b':['1', '2', '3', '4','5', '1', '2', '3']})
    

    Looking at columns a and b

    df[["a", "b"]]

    We have

        a   b
    0   x   1
    1   x   2
    2   x   3
    3   y   4
    4   y   5
    5   y   1
    6   y   2
    7   z   3
    

    Then

    pd.crosstab(df.a, df.b)

    returns the frequency table of df.a and df.b, with the values of df.a as rows and the values of df.b as columns:

    b   1   2   3   4   5
    a                   
    x   1   1   1   0   0
    y   1   1   0   1   1
    z   0   0   1   0   0
    

    We can instead use the normalize keyword to get the table of conditional probabilities P(a | b):

    pd.crosstab(df.a, df.b, normalize='columns')

    This divides each column by its total, so the result is a DataFrame in which each column holds the conditional probabilities P(a | b=B) for one specific value B:

    b    1    2    3    4    5
    a
    x  0.5  0.5  0.5  0.0  0.0
    y  0.5  0.5  0.0  1.0  1.0
    z  0.0  0.0  0.5  0.0  0.0
    

    Notice, the columns sum to 1.

    If we would instead prefer to get P(b | a), we could normalize over the rows

    pd.crosstab(df.a, df.b, normalize='rows')

    To get

    b         1         2         3     4     5
    a
    x  0.333333  0.333333  0.333333  0.00  0.00
    y  0.250000  0.250000  0.000000  0.25  0.25
    z  0.000000  0.000000  1.000000  0.00  0.00
    

    Where the rows represent the conditional probabilities P(b | a=A) for specific values of A. Notice, the rows sum to 1.
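Since the question asks about one specific value of b, here is a minimal sketch of pulling that single distribution out of the normalized table. It reuses the example data from this answer; the choice of b = '1' and the variable names are illustrative assumptions, and the boolean-mask line is only a cross-check.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b': ['1', '2', '3', '4', '5', '1', '2', '3']})

# Full table of P(a | b): one column per value of b.
cond = pd.crosstab(df['a'], df['b'], normalize='columns')

# P(a | b = '1') is the column labelled '1'.
p_a_given_b1 = cond['1']

# Cross-check: keep only the rows with b == '1', then take
# the relative frequencies of a among them.
direct = df.loc[df['b'] == '1', 'a'].value_counts(normalize=True)

print(p_a_given_b1)
print(direct)
```

The crosstab column lists every value of a, including those that never occur with the chosen b (probability 0), while the filtered version only lists values that actually appear.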

  • 2021-01-03 03:03

    You could try this function:

    import pandas as pd

    def conprob(pd1, pd2, transpose=1):
        # Rows of the crosstab come from its first argument, columns from its second.
        if transpose == 0:
            table = pd.crosstab(pd1, pd2)
        else:
            table = pd.crosstab(pd2, pd1)
        # Divide each column by its total to get P(row | column).
        out = table / table.sum()
        # Append the marginal probability of each row value as column 'p'.
        out['p'] = table.sum(axis=1) / table.sum().sum()
        return out
    

    This returns the conditional probability P(row | column), with an extra column 'p' holding the marginal probability of each row value.
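A comparable table can probably be built without a helper function. The sketch below assumes the question's column names a and b and reuses the small example data from the first answer; it is one possible shortcut, not the answer author's method.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b': ['1', '2', '3', '4', '5', '1', '2', '3']})

# P(a | b): each column of the crosstab is divided by its own total.
cond = pd.crosstab(df['a'], df['b'], normalize='columns')

# Marginal P(a), aligned to the rows of the table by label.
cond['p'] = df['a'].value_counts(normalize=True)

print(cond)
```

Keeping the marginal next to the conditionals makes it easy to see where knowing b changes the probability of a.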

  • 2021-01-03 03:11

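    You can pass a list to groupby to get the joint count of each (a, b) pair. Use size() rather than count(): when the DataFrame has only the columns a and b, count() has no remaining columns to count.

    df.groupby(['a','b']).size()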
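Grouping by both columns gives joint counts, which still need to be divided by the totals for each b to become P(a | b). A minimal sketch of that step, assuming the question's column names and the example data from the first answer:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z'],
                   'b': ['1', '2', '3', '4', '5', '1', '2', '3']})

# Joint counts for every (a, b) pair that occurs in the data.
counts = df.groupby(['a', 'b']).size()

# Total count for each value of b, broadcast back onto the (a, b) index.
totals_per_b = counts.groupby(level='b').transform('sum')

# P(a | b) = count(a, b) / count(b)
cond = counts / totals_per_b

print(cond)
```

The result is a Series indexed by (a, b); pairs that never occur are absent rather than listed with probability 0.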
    
  • 2021-01-03 03:16

    Consider the DataFrame that Maxymoo suggested:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                       'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                       'C': np.random.randn(8),
                       'D': np.random.randn(8)})
    
    df
         A      B         C         D
    0  foo    one  0.229206 -1.899999
    1  bar    one  0.174972  0.328746
    2  foo    two -1.384699 -1.691151
    3  bar  three -1.008328 -0.915467
    4  foo    two -0.065298 -0.107240
    5  bar    two  1.871916  0.798135
    6  foo    one  1.589609 -1.682237
    7  foo  three  2.292783  0.639595
    

    Let's assume we want the probability that y = foo given x = one, where y is column A and x is column B: P(y=foo|x=one) = ?

    Approach 1:

    df.groupby('B')['A'].value_counts()/df.groupby('B')['A'].count()
    B         
    one    foo    0.666667
           bar    0.333333
    three  foo    0.500000
           bar    0.500000
    two    foo    0.666667
           bar    0.333333
    dtype: float64
    

    So the answer is: 0.6667

    Approach 2:

    Probability of x = one: 0.375

    df['B'].value_counts()/df['B'].count()
    one      0.375
    two      0.375
    three    0.250
    dtype: float64
    

    Probability of y = foo: 0.625

    df['A'].value_counts()/df['A'].count()
    foo    0.625
    bar    0.375
    dtype: float64
    

    Probability of (x=one|y=foo): 0.4

    df.groupby('A')['B'].value_counts()/df.groupby('A')['B'].count()
    A         
    bar  one      0.333333
         two      0.333333
         three    0.333333
    foo  one      0.400000
         two      0.400000
         three    0.200000
    dtype: float64
    

    Therefore: P(y=foo|x=one) = P(x=one|y=foo)*P(y=foo)/P(x=one) = 0.4 * 0.625 / 0.375 = 0.6667
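The Bayes' rule arithmetic above can be checked numerically. This is a minimal sketch using only columns A and B of the example DataFrame (C and D are random and play no role); the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

# Marginal probabilities: the mean of a boolean mask is a relative frequency.
p_x = (df['B'] == 'one').mean()   # P(x = one)
p_y = (df['A'] == 'foo').mean()   # P(y = foo)

# P(x = one | y = foo): restrict to the rows where A is foo first.
p_x_given_y = (df.loc[df['A'] == 'foo', 'B'] == 'one').mean()

# Bayes' rule: P(y | x) = P(x | y) * P(y) / P(x)
p_y_given_x_bayes = p_x_given_y * p_y / p_x

# Direct computation for comparison: restrict to the rows where B is one.
p_y_given_x_direct = (df.loc[df['B'] == 'one', 'A'] == 'foo').mean()

print(p_y_given_x_bayes, p_y_given_x_direct)
```

Both routes should agree up to floating-point rounding, which confirms that Approach 1 and Approach 2 compute the same quantity.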

  • 2021-01-03 03:17

    To count how often each value of b occurs within each value of a, use:

    df.groupby('a').b.value_counts()
    

    For example, create a DataFrame as below:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                       'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                       'C': np.random.randn(8),
                       'D': np.random.randn(8)})
    
         A      B         C         D
    0  foo    one -1.565185 -0.465763
    1  bar    one  2.499516 -0.941229
    2  foo    two -0.091160  0.689009
    3  bar  three  1.358780 -0.062026
    4  foo    two -0.800881 -0.341930
    5  bar    two -0.236498  0.198686
    6  foo    one -0.590498  0.281307
    7  foo  three -1.423079  0.424715
    

    Then:

    df.groupby('A')['B'].value_counts()
    
    A
    bar  one      1
         two      1
         three    1
    foo  one      2
         two      2
         three    1
    

    To convert this to a conditional probability, you need to divide by the total size of each group.

    You can either do it with another groupby:

    df.groupby('A')['B'].value_counts() / df.groupby('A')['B'].count()
    
    A
    bar  one      0.333333
         two      0.333333
         three    0.333333
    foo  one      0.400000
         two      0.400000
         three    0.200000
    dtype: float64
    

    Or you can apply a lambda function to the groups (shown here with the question's lowercase column names):

    df.groupby('a').b.apply(lambda g: g.value_counts()/len(g))
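The division by group size can also be left to pandas, because the grouped value_counts accepts normalize=True. A minimal sketch on columns A and B of the example DataFrame (the random columns C and D are omitted since they are not needed):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

# Relative frequency of each B value within each A group, i.e. P(B | A).
cond = df.groupby('A')['B'].value_counts(normalize=True)

# Look up one specific conditional probability by its (A, B) label.
p_one_given_foo = cond.loc[('foo', 'one')]

print(cond)
print(p_one_given_foo)
```

Swapping the roles of the two columns, grouping by B and counting A, gives P(A | B) instead.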
    